Bonum Certa Men Certa

Techrights Coding Projects: Making the Web Light Again

A boatload of bytes that serve no purpose at all (99% of all the traffic sent from some Web sites)

Very bloated boat



Summary: Ongoing technical projects that improve access to information and better organise credible information, preceded by a depressing overview of the health of the Web (it's unbelievably bloated)

OVER the past few months (since spring) we've been working hard on coding automation and improving the back end in various ways. More than 100 hours were spent on this and it puts us in a better position to grow in the long run and also improve uptime. Last year we left behind most US/USPTO coverage to better focus on the European Patent Office (EPO) and GNU/Linux -- a subject neglected here for nearly half a decade (more so after we had begun coverage of EPO scandals).



As readers may have noticed, in recent months we were able to produce more daily links (and more per day). About a month ago we reduced the volume of political coverage in these links. Journalism is waning and the quality of reporting -- not to mention sites -- is rapidly declining.

To quote one of our guys, "looking at the insides of today's web sites has been one of the most depressing things I have experienced in recent decades. I underestimated the cruft in an earlier message. Probably 95% of the bytes transmitted between client and server have nothing to do with content. That's a truly rotten infrastructure upon which society is tottering."

We typically gather and curate news using RSS feed readers. These keep sites light and tidy. They help us survey the news without wrestling with clickbait, ads, and spam. It's the only way to keep up with quality while leaving out cruft and FUD (and Microsoft's googlebombing). A huge amount of effort goes into this and it takes a lot of time. It's all done manually.

"I've been letting wget below run while I am mostly outside painting part of the house," said that guy, having chosen to survey/assess the above-stated problem. "It turns out that the idea that 95% of what web servers send is crap was too optimistic. I spidered the latest URL from each one of the unique sites sent in the links from January through July and measured the raw size for the individual pages and their prerequisites. Each article, including any duds and 404 messages, averaged 42 objects [3] per article. The median, however, was 22 objects. Many had hundreds of objects, not counting cookies or scripts that call in scripts.

"I measured disk space for each article, then I ran lynx over the same URLs to get the approximate size of the content. If one counts everything as content then the lynx output is on average 1% the size of the raw material. If I estimate that only 75% or 50% of the text rendered is actual content then that number obviously goes down proportionally.

"I suppose that means that 99% of the electricity used to push those bits around is wasted as well. By extension, it could also mean that 99% of the greenhouse gases produced by that electricity is produced for no reason.

"The results are not scientifically sound but satisfy my curiosity on the topic, for now.

"Eliminating the dud URLs will produce a much higher object count.

"Using more mainstream sites and fewer tech blogs will drive up the article sizes greatly.

"The work is not peer reviewed or even properly planned. I just tried some spur of the minute checks on article sizes in the first way I could think of," said the guy. We covered this subject before in relation to JavaScript bloat and sites' simplicity, but here we have actual numbers to present.

"The numbers depend on the quality of the data," the guy added, "that is to say the selection of links and the culling the results of 404's, paywall messages, and cookie warnings and so on.

"As mentioned I just took the latest link from each of the sites I have bookmarked this year. That skews it towards lean tech blogs. Though some publishers which should know very much better are real pigs:




$ wget --continue --page-requisites --timeout=30 --directory-prefix=./test.a/ https://www.technologyreview.com/s/614079/what-is-geoengineering-and-why-should-you-care-climate-change-harvard/
. . .

$ lynx --dump https://www.technologyreview.com/s/614079/what-is-geoengineering-and-why-should-you-care-climate-change-harvard/ > test.b

$ du -bs ./test.?
2485779	./test.a
35109	./test.b



"Trimming some of the lines of cruft from the text version for that article, I get close to two orders of magnitude difference between the original edition versus the trimmed text edition:

$ du -bs ./test.?
2485779	./test.a
35109	./test.b
27147	./test.c


"Also the trimmed text edition is close to 75% of the size of the automated text edition. So, at least for that article, the guess of 75% content may be about right. However, given the quick and dirty approach of this survey, not much can be said conclusively except 1) there is a lot of waste, 2) there is an opportunity for someone to do an easy piece of research."

Based on links from 2019-08-08 and 2019-08-09, we get one set of results (extracted all URLs saved from January 2019 through July 2019; http and https only, eliminated PDF and other links to obviously non-html material). Technical appendices and footnotes are below for those wishing to explore further and reproduce.







+ this only retrieves the first layer of javascript, far from all of it
+ some sites gave wget trouble, should have fiddled the agent string, --user-agent=""
+ too many sites respond without proper HTTP response headers, slows collection down intolerably
+ the pages themselves often contain many dead links
+ serial fetching is slow and because the sites are unique
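One of the caveats above is that the agent string should have been set with --user-agent. A minimal sketch of what that might look like, assuming a browser-like string is enough to satisfy the sites that gave wget trouble. The UA value, the --tries=2 limit and the fetch_page helper are illustrative choices of ours, not what was used in the survey. Setting DRY_RUN=1 prints the command instead of running it.

```shell
# Hypothetical browser-like agent string; not the one used in the survey.
UA='Mozilla/5.0 (X11; Linux x86_64)'

# fetch_page URL DIR: fetch a page plus its requisites into DIR.
# With DRY_RUN set to a non-empty value the wget command is only printed.
fetch_page() {
    ${DRY_RUN:+echo} wget --continue --page-requisites \
        --timeout=30 --tries=2 \
        --user-agent="$UA" \
        --directory-prefix="$2" "$1"
}
```

Usage would be along the lines of: fetch_page "$url" ./test.a/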

$ find . -mindepth 1 -maxdepth 1 -type d -print | wc -l
91
$ find . -mindepth 1 -type f -print | wc -l
4171

which is an average of about 46 objects per "article" (4171 files across 91 directories)

+ some sites were tech blogs with lean, hand-crafted HTML; mainstream sites are much heavier, so the above average is skewed towards being too light

Quantity and size of objects associated with articles, does not count cookies nor secondary scripts:

$ find . -mindepth 1 -type f -printf '%s\t%p\n' \
| sort -k1,1n -k2,2 \
| awk '$1>10{
    sum+=$1; c++; s[c]=$1; n[c]=$2
  }
  END{
    printf "%10s\t%10s\n","Bytes","Measurement";
    printf "%10d\tSMALLEST\n",s[1];
    for (i in s){
      if(i==int(c/2)){
        printf "%10d\tMEDIAN SIZE\n",s[i];
      }
    };
    printf "%10d\tLARGEST\n",s[c];
    printf "%10d\tAVG SIZE\n",sum/c;
    printf "%10d\tCOUNT\n",c;
  }'

     Bytes	File Size
        13	SMALLEST
     10056	MEDIAN SIZE
  32035328	LARGEST
     53643	AVG SIZE
     38164	COUNT
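The one-liner above takes element int(c/2) of the sorted sizes as the median. That is the lower of the two middle values for an even count and one position below the middle for an odd count. With tens of thousands of files the difference is negligible, but for small samples a stricter median may be preferable. A minimal sketch, assuming one number per input line; the median helper is our own name, not part of the original survey.

```shell
# median: read one number per line on stdin, print the median.
# For an even count the two middle values are averaged.
median() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}
```

It could be fed file sizes directly, for example: find . -mindepth 1 -type f -printf '%s\n' | median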









Overall article size [1] including only the first layer of scripts:

     Bytes	Article Size
      8442	SMALLEST
    995476	MEDIAN
  61097209	LARGEST
   2319854	AVG
       921	COUNT

Estimated content [2] size including links, headers, navigation text, etc:

+ deleted files with errors or warnings, probably a mistake as that skews the results for lynx higher

     Bytes	Article Size
       929	SMALLEST
     18782	MEDIAN
    244311	LARGEST
     23997	AVG
       889	COUNT

+ lynx returns all text within the document not just the main content, at 75% content the figures are more realistic for some sites:

     Bytes	Measurement
       697	SMALLEST
     14087	MEDIAN
    183233	LARGEST
     17998	AVG
       889	COUNT

at 50% content the figures are more realistic for other sites:

       465	SMALLEST
      9391	MEDIAN
    122156	LARGEST
     11999	AVG
       889	COUNT
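The 75% and 50% figures are simply the lynx sizes multiplied by the assumed content fraction and rounded to whole bytes. A small sketch of that arithmetic, assuming half-up rounding (which reproduces the numbers shown); the scale helper is a name we made up for illustration.

```shell
# scale BYTES FRACTION: BYTES * FRACTION, rounded half up to a whole byte.
scale() {
    awk -v n="$1" -v f="$2" 'BEGIN { print int(n * f + 0.5) }'
}

# Median and average lynx output sizes from the tables above,
# printed alongside their 75% and 50% estimates.
for n in 18782 23997; do
    printf '%s\t%s\t%s\n' "$n" "$(scale "$n" 0.75)" "$(scale "$n" 0.50)"
done
```

The headline figure of about 1% is the same kind of division: the average lynx output (23997 bytes) over the average raw article size (2319854 bytes).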






       


$ du -bs * \
| sort -k1,1n -k2,2 \
| awk '$2!="l" && $1 {
    c++; s[c]=$1; n[c]=$2; sum+=$1
  }
  END {
    for (i in s){
      if(i==int(c/2)){ m=i };
      printf "% 10d\t%s\n", s[i],n[i]
    };
    printf "% 10s\tArticle Size\n","Bytes";
    printf "% 10d\tSMALLEST %s\n",s[1],n[1];
    printf "% 10d\tMEDIAN %s\n",s[m],n[m];
    printf "% 10d\tLARGEST %s\n",s[c],n[c];
    printf "% 10d\tAVG\n", sum/c;
    printf "% 10d\tCOUNT\n",c;
  }' OFS=$'\t'









[1]

$ time bash -c 'count=0;
  shuf l \
  | while read u; do
      echo $u;
      wget --continue --page-requisites --timeout=30 "$u" &
      echo $((count++));
      if ((count % 5 == 0)); then wait; fi;
    done;'
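The loop in [1] starts five downloads and then waits for all five to finish, so one slow site holds up the whole batch. A possible alternative, sketched here on the assumption that GNU xargs is available (its -P and -d options are extensions), keeps a fixed number of fetches running at all times. The fetch_parallel name is ours and this is not how the survey was actually run.

```shell
# fetch_parallel N CMD [ARGS...]: read one URL per line on stdin and run
# CMD ARGS URL for each, keeping at most N commands running at once.
# Relies on GNU xargs (-d and -P are not in POSIX).
fetch_parallel() {
    njobs=$1
    shift
    xargs -d '\n' -n 1 -P "$njobs" "$@"
}
```

With the same URL list file as above, that might be invoked as: shuf l | fetch_parallel 5 wget --continue --page-requisites --timeout=30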









[2]

$ count=0; time for i in $(cat l); do echo;echo $i; lynx -dump "$i" > $count; echo $((count++)); done;








[3]

$ find . -mindepth 1 -maxdepth 1 -type d -print | wc -l
921

$ find . -mindepth 1 -type f -print | wc -l
38249









[4]

$ find . -mindepth 1 -type f -print \
| awk '{sub(/^\.\//,"");sub("/.*","");print;}' \
| uniq -c \
| sort -k1,1n -k2,2 \
| awk '$1{c++;s[c]=$1;sum+=$1;}
  END{
    for(i in s){ if(i == int(c/2)){ m=s[i]; } };
    print "MEDIAN: ",m;
    print "AVG", sum/c;
    print "Quantity",c;
  }'









[5]

$ find . -mindepth 1 -type f -name '*.js' -exec du -sh {} \; | sort -k1,1rh | head
16M	./www.icij.org/app/themes/icij/dist/scripts/main_8707d181.js
3.4M	./europeanconservative.com/wp-content/themes/Generations/assets/scripts/fontawesome-all.min.js
1.8M	./www.9news.com.au/assets/main.f7ba1448.js
1.8M	./www.technologyreview.com/_next/static/chunks/commons.7eed6fd0fd49f117e780.js
1.8M	./www.thetimes.co.uk/d/js/app-7a9b7f4da3.js
1.5M	./www.crossfit.com/main.997a9d1e71cdc5056c64.js
1.4M	./www.icann.org/assets/application-4366ce9f0552171ee2c82c9421d286b7ae8141d4c034a005c1ac3d7409eb118b.js
1.3M	./www.digitalhealth.net/wp-content/plugins/event-espresso-core-reg/assets/dist/ee-vendor.e12aca2f149e71e409e8.dist.js
1.2M	./www.fresnobee.com/wps/build/webpack/videoStory.bundle-69dae9d5d577db8a7bb4.js
1.2M	./www.ft.lk/assets/libs/angular/angular/angular.js






[6] About page bloat, one can pick just about any page and find from one to close to two orders of magnitude difference between the lynx dump and the full web page. For example,




$ wget --continue --page-requisites --timeout=30 \
    --directory-prefix=./test.a/ \
    https://www.newsweek.com/saudi-uae-war-themselves-yemen-1453371
. . .

$ lynx --dump \
    https://www.newsweek.com/saudi-uae-war-themselves-yemen-1453371 \
    > test.b

$ du -bs ./test.?
250793	./test.a
15385	./test.b
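The "orders of magnitude" wording can be checked directly from the du figures. A rough sketch, using only byte counts that appear in this article; the bloat helper is a hypothetical name of ours.

```shell
# bloat RAW TEXT: print how many times larger RAW is than TEXT,
# followed by the same ratio as a base-10 order of magnitude.
bloat() {
    awk -v raw="$1" -v txt="$2" 'BEGIN {
        r = raw / txt
        printf "%.1f %.2f\n", r, log(r) / log(10)
    }'
}

bloat 250793 15385     # Newsweek page vs. its lynx dump
bloat 2485779 27147    # Technology Review page vs. its trimmed text
```

For the Newsweek page that works out to roughly 16 times, a little over one order of magnitude. For the Technology Review page against its trimmed text it is roughly 92 times, just under two orders.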
