Bonum Certa Men Certa

Techrights Coding Projects: Making the Web Light Again

A boatload of bytes that serve no purpose at all (99% of all the traffic sent from some Web sites)

Very bloated boat



Summary: Ongoing technical projects that improve access to information and better organise credible information preceded by a depressing overview regarding the health of the Web (it's unbelievably bloated)

OVER the past few months (since spring) we've been working hard on coding automation and improving the back end in various ways. More than 100 hours were spent on this and it puts us in a better position to grow in the long run and also improve uptime. Last year we left behind most US/USPTO coverage to better focus on the European Patent Office (EPO) and GNU/Linux -- a subject neglected here for nearly half a decade (more so after we had begun coverage of EPO scandals).



As readers may have noticed, in recent months we were able to produce more daily links (and more per day). About a month ago we reduced the volume of political coverage in these links. Journalism is waning and the quality of reporting -- not to mention sites -- is rapidly declining.

"As readers may have noticed, in recent months we were able to produce more daily links (and more per day)."To quote one of our guys, "looking at the insides of today's web sites has been one of the most depressing things I have experienced in recent decades. I underestimated the cruft in an earlier message. Probably 95% of the bytes transmitted between client and server have nothing to do with content. That's a truly rotten infrastructure upon which society is tottering."

We typically gather and curate news using RSS feed readers. These keep sites light and tidy. They help us survey the news without wrestling with clickbait, ads, and spam. It's the only way to keep up with quality while leaving out cruft and FUD (and Microsoft's googlebombing). A huge amount of effort goes into this and it takes a lot of time. It's all done manually.

"We typically gather and curate news using RSS feed readers. These keep sites light and tidy. They help us survey the news without wrestling with clickbait, ads, and spam.""I've been letting wget below run while I am mostly outside painting part of the house," said that guy, having chosen to survey/assess the above-stated problem. "It turns out that the idea that 95% of what web severs send is crap was too optimistic. I spidered the latest URL from each one of the unique sites sent in the links from January through July and measured the raw size for the individual pages and their prerequisites. Each article, including any duds and 404 messages, averaged 42 objects [3] per article. The median, however, was 22 objects. Many had hundreds of objects, not counting cookies or scripts that call in scripts.

"I measured disk space for each article, then I ran lynx over the same URLs to get the approximate size of the content. If one counts everything as content then the lynx output is on average 1% the size of the raw material. If I estimate that only 75% or 50% of the text rendered is actual content then that number obviously goes down proportionally.

"I suppose that means that 99% of the electricity used to push those bits around is wasted as well. By extension, it could also mean that 99% of the greenhouse gases produced by that electricity is produced for no reason.

"The results are not scientifically sound but satisfy my curiosity on the topic, for now.

"Eliminating the dud URLs will produce a much higher object count.

“The results are not scientifically sound but satisfy my curiosity on the topic, for now.”
      --Anonymous
"Using more mainstream sites and fewer tech blogs will drive up the article sizes greatly.

"The work is not peer reviewed or even properly planned. I just tried some spur of the minute checks on article sizes in the first way I could think of," said the guy. We covered this subject before in relation to JavaScript bloat and sites' simplicity, but here we have actual numbers to present.

"The numbers depend on the quality of the data," the guy added, "that is to say the selection of links and the culling the results of 404's, paywall messages, and cookie warnings and so on.

"As mentioned I just took the latest link from each of the sites I have bookmarked this year. That skews it towards lean tech blogs. Though some publishers which should know very much better are real pigs:




$ wget --continue --page-requisites --timeout=30 --directory-prefix=./test.a/ https://www.technologyreview.com/s/614079/what-is-geoengineering-and-why-should-you-care-climate-change-harvard/ . . .

$ lynx --dump https://www.technologyreview.com/s/614079/what-is-geoengineering-and-why-should-you-care-climate-change-harvard/ > test.b

$ du -bs ./test.? 2485779 ./test.a 35109 ./test.b



"Trimming some of the lines of cruft from the text version for that article, I get close to two orders of magnitude difference between the original edition versus the trimmed text edition:

$ du -bs ./test.?
2485779	./test.a
35109	./test.b
27147	./test.c


"Also the trimmed text edition is close to 75% the size of the automated text edition. So, at least for that article, the guess of 75% content may be about right. However, given the quick and dirty approach, of this survey, not much can be said conclusively except 1) there is a lot of waste, 2) there is an opportunity for someone to do an easy piece of research."

Based on links from 2019-08-08 and 2019-08-09, we get one set of results (extracted all URLs saved from January 2019 through July 2019; http and https only, eliminated PDF and other links to obviously non-html material). Technical appendices and footnotes are below for those wishing to explore further and reproduce.







+ this only retrieves the first layer of javascript, far from all of it + some site gave wget trouble, should have fiddled the agent string, --user-agent="" + too many sites respond without proper HTTP response headers, slows collection down intolerably + the pages themselves often contain many dead links + serial fetching is slow and because the sites are unique

$ find . -mindepth 1 -maxdepth 1 -type d -print | wc -l 91 $ find . -mindepth 1 -type f -print | wc -l 4171 which is an average of 78 objects per "article"

+ some sites were tech blogs with lean, hand-crafted HTML, mainstream sites are much heavier, so the above average is skewed towards being too light

Quantity and size of objects associated with articles, does not count cookies nor secondary scripts:

$ find . -mindepth 1 -type f -printf '%s\t%p\n' \ | sort -k1,1n -k2,2 \ | awk '$1>10{ sum+=$1; c++; s[c]=$1; n[c]=$2 } END{ printf "%10s\t%10s\n","Bytes","Measurement"; printf "%10d\tSMALLEST\n",s[1]; for (i in s){ if(i==int(c/2)){ printf "%10d\tMEDIAN SIZE\n",s[i]; } }; printf "%10d\tLARGEST\n",s[c]; printf "%10d\tAVG SIZE\n",sum/c; printf "%10d\tCOUNT\n",c; }'

Bytes File Size 13 SMALLEST 10056 MEDIAN SIZE 32035328 LARGEST 53643 AVG SIZE 38164 COUNT









Overall article size [1] including only the first layer of scripts,

Bytes Article Size 8442 SMALLEST 995476 MEDIAN 61097209 LARGEST 2319854 AVG 921 COUNT

Estimated content [2] size including links, headers, navigation text, etc:

+ deleted files with errors or warnings, probably a mistake as that skews the results for lynx higher

Bytes Article Size 929 SMALLEST 18782 MEDIAN 244311 LARGEST 23997 AVG 889 COUNT

+ lynx returns all text within the document not just the main content, at 75% content the figures are more realistic for some sites:

Bytes Measurement 697 SMALLEST 14087 MEDIAN 183233 LARGEST 17998 AVG 889 COUNT

at 50% content the figures are more realistic for other sites:

465 SMALLEST 9391 MEDIAN 122156 LARGEST 11999 AVG 889 COUNT






       


$ du -bs * \ | sort -k1,1n -k2,2 \ | awk '$2!="l" && $1 { c++; s[c]=$1; n[c]=$2; sum+=$1 } END { for (i in s){ if(i==int(c/2)){ m=i }; printf "% 10d\t%s\n", s[i],n[i] }; printf "% 10s\tArticle Size\n","Bytes"; printf "% 10d\tSMALLEST %s\n",s[1],n[1]; printf "% 10d\tMEDIAN %s\n",s[m],n[m]; printf "% 10d\tLARGEST %s\n",s[c],n[c]; printf "% 10d\tAVG\n", sum/c; printf "% 10d\tCOUNT\n",c; }' OFS=$'\t'









[1]

$ time bash -c 'count=0; shuf l \ | while read u; do echo $u; wget --continue --page-requisites --timeout=30 "$u" & echo $((count++)); if ((count % 5 == 0)); then wait; fi; done;'









[2]

$ count=0; time for i in $(cat l); do echo;echo $i; lynx -dump "$i" > $count; echo $((count++)); done;








[3]

$ find . -mindepth 1 -maxdepth 1 -type d -print | wc -l 921

$ find . -mindepth 1 -type f -print | wc -l 38249









[4]

$ find . -mindepth 1 -type f -print \ | awk '{sub("\./","");sub("/.*","");print;}' | uniq -c | sort -k1,1n -k2,2 | awk '$1{c++;s[c]=$1;sum+=$1;} END{for(i in s){if(i == int(c/2)){m=s[i];}}; print "MEDIAN: ",m; print "AVG", sum/c; print "Quantity",c; }'









[5]

$ find . -mindepth 1 -type f -name '*.js' -exec du -sh {} \; | sort -k1,1rh | head 16M ./www.icij.org/app/themes/icij/dist/scripts/main_8707d181.js 3.4M ./europeanconservative.com/wp-content/themes/Generations/assets/scripts/fontawesome-all.min.js 1.8M ./www.9news.com.au/assets/main.f7ba1448.js 1.8M ./www.technologyreview.com/_next/static/chunks/commons.7eed6fd0fd49f117e780.js 1.8M ./www.thetimes.co.uk/d/js/app-7a9b7f4da3.js 1.5M ./www.crossfit.com/main.997a9d1e71cdc5056c64.js 1.4M ./www.icann.org/assets/application-4366ce9f0552171ee2c82c9421d286b7ae8141d4c034a005c1ac3d7409eb118b.js 1.3M ./www.digitalhealth.net/wp-content/plugins/event-espresso-core-reg/assets/dist/ee-vendor.e12aca2f149e71e409e8.dist.js 1.2M ./www.fresnobee.com/wps/build/webpack/videoStory.bundle-69dae9d5d577db8a7bb4.js 1.2M ./www.ft.lk/assets/libs/angular/angular/angular.js






[6] About page bloat, one can pick just about any page and find from one to close to two orders of magnitude difference between the lynx dump and the full web page. For example,




$ wget --continue --page-requisites --timeout=30 \ --directory-prefix=./test.a/ \ https://www.newsweek.com/saudi-uae-war-themselves-yemen-1453371 . . .

$ lynx --dump \ https://www.newsweek.com/saudi-uae-war-themselves-yemen-1453371 \ > test.b

$ du -bs ./test.? 250793 ./test.a 15385 ./test.b

Recent Techrights' Posts

Investigative Journalism Protects Society From Corruption, Crimes Against Women, Assaults on Civil Society
"what is the point of men doing military practice to defend a system that is so rotten?"
Swiss pimp usurping reputation of legendary Tissot boss Francois Thiébaud from France (BaselWorld, SWATCH Group SA)
Reprinted with permission from Daniel Pocock
Paris 'Love Nest' & Debian Outreachy: from Lycée Lakanal to ENS Cachan, Cr@ns, nepotism
Reprinted with permission from Daniel Pocock
Richard Stallman to Give Public Talk in 3 Hours, Then in the Technical University of Munich (Germany) Next Week
Richard Stallman at TUM on 21.10.2025 18:00, MW2001
Leaks and Whistleblowers: Our Plan for Today
Society simply cannot advance when too many people self-censor
 
The EPO's War on Techrights Was a Massive Mistake
The EPO started the SLAPPs after we had published a few hundreds of articles; we've since then published close to 6,000 because the attacks on us emboldened insiders to help us
General-Purpose Computers to Become Growing Area of Coverage
Without them, we have little left for controlling our lives
"They missed a great opportunity to shut up." -Jacques Chirac
Brett Wilson LLP has been trying to cheat the legal system many times
Harassment evidence: Switzerland, overcrowded fitness and yoga centers, incompetence and racism in accident response
Reprinted with permission from Daniel Pocock
Vincent Danjean & Debian NXIVM collateral, blackmail risks
Reprinted with permission from Daniel Pocock
In Sweden This Past Friday Richard Stallman Explained Why Copyleft is Important
And he didn't have to 'bash' BSDs, either
Dystopian Trends in Technology Make Richard Stallman More Relevant Than Ever
It's good to see him attracting vast audiences
IBM Layoffs Due to a Lack of Money and Company Debt Rising by Almost 10 Billion Dollars in 6 Months
IBM didn't buy Red Hat for any ideological reasons; it was a fast "cash grab" for revenue
Forbes Already Stopped Being a News Sites. Now It's a Spam and Propaganda Platform for "Paying Partners" (Companies).
news from Forbes became very scarce
Is the Second-Largest Institution in Europe (EPO) Gradually Becoming More Like a Sweatshop?
Underpaid, unqualified, inexperienced and incompatible people are already recruited to replace veteran examiners
The Register MS Has No FOSS Coverage Anymore
The Editor in Chief is like a Microsoft plant
Links 13/10/2025: "Toasty Subwoofer" and WiFi Speakers "Are About To Go Dumb"
Links for the day
Gemini Links 13/10/2025: iNaturalist and Tove Jansson’s Moominpappa at Sea
Links for the day
Microsoft Does Not Deny That Large Retailers Like Walmart, Costco and Target Are Giving Up on XBox (and Not Stocking It)
No doubt XBox is in trouble and rumours suggest that more mass layoffs are imminent
We'll Encourage Richard Stallman to Talk About Software Patents at the EPO Next Week When He Visits Munich (EPO Headquarters)
Go listen to Richard Stahlmann
Arnaud Parreaux lost case defending rogue employer
Reprinted with permission from Daniel Pocock
Mathieu Elias Parreaux declared bankrupt in Switzerland
Reprinted with permission from Daniel Pocock
Breakdown of the Rule of Law and Patent Law in the European Union (EU)
The EPO cannot recruit suitably qualified patent examiners this way, let alone retain them
Gemini Links 13/10/2025: Good Films, Wizard of Earthsea, Upgrading the Steam Controller's Stick
Links for the day
It's Not Justice When One Side Denies the Other Side the Ability to Even Speak
At this stage, Brett Wilson LLP is in my humble opinion acting in contempt of the Court
Links 13/10/2025: Australian Catholic University Uses Slop to Libel Students, Canada Threatens to Kill Beluga Whales
Links for the day
How Not to Silence Tux Machines (It'll Only Backfire, Badly)
defending Microsoft while attacking this site
Slopwatch: UbuntuPIT and Google News
It seems abundantly clear that Google News and Google in general participates in the slop epidemic
Vincent Danjean (not INTERPOL), Claire Bardel & Debian pregnancy cluster
Reprinted with permission from Daniel Pocock
Christmas lynchings: Martin Krafft (madduck), Penny Leach (mjollnir) & Debian pregnancy cluster
Reprinted with permission from Daniel Pocock
Gemini Links 13/10/2025: Birthdays and "Committee Unable to Contact Nobel Prize Winner"
Links for the day
Over at Tux Machines...
GNU/Linux news for the past day
IRC Proceedings: Sunday, October 12, 2025
IRC logs for Sunday, October 12, 2025
The Same People Who Attacked Richard Stallman (RMS) Are Attacking Daniel Pocock to Discourage People From Listening to His Information
Pocock is being demonised for the same reasons and by the same people who attack RMS
Your Typical Anti-Richard Stallman (RMS) Cancellist
"About the RMS cancellation"
Richard Stallman (RMS) Has Announced His Talk in Rome Less Than 20 Hours in Advance (and on a Sunday)
Why did he wait until the night before?
We Are Safe in a Modern "Tech" Society, Right?
People are safer if they control their own computing
GNU Tools Cauldron Event in Portugal: Videos Now Available via Invidious
Go have a look
The Way Things Are Going, They May Soon Stop Saying "Web Address" and Instead Say "Chrome Address"
The Web isn't built or based around open Web standards anymore. It's centered around user-agent.
Microsoft as a Golden Cage
"I was laid off by Microsoft and can't find a job. I'm weeks away from giving up my apartment and moving across the country to live with family."
Weekend Discussion About How IBM's Bluewashing of Red Hat Will Cause "Enshittification" for Users
"I worked at a software company that was acquired by IBM so I knew it was game over for RedHat the day they were acquired"
Brett Wilson LLP Getting Sued by Its Very Own Clients, a Legal Story That Has Made the Mainstream News (Law360)
Law360 or Law.com are about as mainstream as one can get in that "sector" (litigation 'industry')
Slopwatch: GNU/Linux Sites That Became Slopfarms and Spamfarms
The Web is a mess and "Linux" or "Ubuntu" sites became part of the problem
Richard Stallman's Talk 25 Hours Away, Aula Magna Palazzo del Rettorato (CU001), Sapienza Università di Roma (Piazzale Aldo Moro, 5)
The talk is 25 hours away and we see some QR code for it
Gemini Links 12/10/2025: Watches, the Depression of 2026, Gamboling with Odds
Links for the day
Links 12/10/2025: 'False' DMCA Claims and Slop Facing Perils Again (the Hype Wears Off)
Links for the day
Microsoft Has Just Lost Privacy Case in Austria and Its Latest Moves Make a Complete Ban Seem Imperative
Microsoft is not a software company, it's a spying agency that uses software to collect data
The Register MS: Microsoft is the Security Expert, Not the Prime Culprit, So Buy More Microsoft
This front page feature is devoid of any actual substance, it's just Microsoft copypasta
Stefano Zacchiroli (Zack) & Debian pregnancy cluster
Reprinted with permission from Daniel Pocock
Lucas Nussbaum & Debian pregnancy cluster
Reprinted with permission from Daniel Pocock
Gemini Links 12/10/2025: "Palm Computering", Further Exploration of Slide Rules, and Key Takeaways from The Well-Grounded Rubyist
Links for the day
Over at Tux Machines...
GNU/Linux news for the past day
IRC Proceedings: Saturday, October 11, 2025
IRC logs for Saturday, October 11, 2025
Tomorrow: Founder of the Free Software Foundation and of GNU/Linux, Richard Stallman, Speaks in Roma (Rome), Italy at 4PM
GNU/Linux is more important than ever in this dystopian world
Microsoft and Apple Are Rare Topics in Geminispace
in Geminispace it's rather safe to assume everyone is into BSD, GNU/Linux, and sometimes retro
Qualcomm and Manchester United Appear to Have Dumped Microsoft (Qualcomm Now Invests More in Linux, Apparently)
It's a relief to no longer see Microsoft logos and brands on a local football club's gear (I'm not a Manchester United fan, but not a foe either)
As Guest of Honour in Rome, Founder of the Free Software Foundation to Speak ("Distinguished Lecture") After Introduction by Leonardo Querzoni
Happy hacking...
All Things Open is Proprietary
The OSI has become a front group of proprietary software openwashers, led and sponsored by proprietary giants
When Microsoft Lays Off Lots of Workers They Say It "Invests in AI" (a Lie), Now It's "Reshuffles" or "Microsoft Tightens"
Microsoft "news" by bots
"I saw Richard Stallman give a talk in the mid 80s, which began my fear and loathing of software patents" and "Richard Stallman was always right."
"By betraying the legacy of our ancestors, we’ve set ourselves on a path toward self-destruction — moral, intellectual, economic, and ultimately biological."
There Were Several Waves of Microsoft Shanghai Layoffs in 2025, Western Media Continues to Turn a Blind Eye to Chinese Layoffs of an Epic Scale
Sometimes select Taiwanese news sites (published in English) or automated translations are all we have
Brett Wilson LLP Spreads Trumpism to the United Kingdom, Looking to Profit From 'Legal Colonialism' (Overriding Sovereignty)
There's growing recognition of this conundrum worldwide
The Demise of Shopping in Person
In a world like this, how valued is the customer?
This Past Friday, "Nearly 700 People Came to Listen to RMS!" (Richard Stallman)
"Nearly 700 people came to listen to RMS!"
Distinguished Lecture by Richard Stallman This Coming Monday in Rome
After "Free software, Crucial for Freedom in a Digital World"
Slopwatch: UbuntuPIT Churning Out Plagiarism and the Slopfarm LinuxSecurity Turns to Pseudonyms
Our hunch is, UbuntuPIT will sooner or later realise that this toxic approach is just harming UbuntuPIT and tainting the reputation of past articles
The Lawsuit by Clients of Brett Wilson LLP Against Brett Wilson LLP is Officially On, It is Progressing, The 'Experts' Pick Outside Law Firms (RPC and Mills & Reeve) to Spare Them From Litigants in Person
So it is probably quite potent
Gemini Links 11/10/2025: Nyctography, Gerrymandering, and Lurking
Links for the day
The 'Culture Wars' in Free Software Have Gone Out of Control
Social control media amplifies such utterly infantile discourse
Teaser: To Compensate for the Fact Our Clients Are Terrible Human Beings Who Strangle Women (While on Microsoft's Payroll) and We Get Paid by Mystery Parties We Bombard You and Your Wife With Almost 10 Kilograms of Legal Papers
If you can't win an argument, then drown the other side with papers?
Links 11/10/2025: World Mental Health Day 2025, Another European Legal Defeat for Microsoft 360
Links for the day
MIT Technology Review is Part-Time SPAMfarm of Billionaires and Mega-Corporations
Does MIT operate its own "b2b" SPAMfarm?
Open Source Initiative Executive Director Leaves, Replacement Sought by Monopolists, Not the Community or OSI Members
Serves to show who runs this show...
Links 11/10/2025: China-US Tensions Grow Again, "Hey Hi" More Widely Recognised as Bubble Made of Capital That Doesn't Exist
Links for the day
Now Confirmed in Western Media: Microsoft Azure Layoffs This Month
Affirmed by more sources moments ago
Peter O'Callaghan QC represented grandparents, Westernport Hotel, at Liquor Royal Commission
Reprinted with permission from Daniel Pocock
Either The Register MS Divests From FOSS Coverage or Liam Proven is on Long Holiday
Publishers perish when their audience loses trust in them
Microsoft Cancelling Another Datacentre is a Sign of Financial Trouble and Lack of Growth
The debt continues to grow
Gemini Links 11/10/2025: An Evening at the Fair and Fast Fourier Friday
Links for the day
Over at Tux Machines...
GNU/Linux news for the past day
IRC Proceedings: Friday, October 10, 2025
IRC logs for Friday, October 10, 2025