Techrights Coding Projects: Making the Web Light Again

Dr. Roy Schestowitz

2019-08-10 16:05:17 UTC
Modified: 2019-08-10 16:05:17 UTC

A boatload of bytes that serve no purpose at all (99% of all the traffic sent from some Web sites)

Very bloated boat

Summary: Ongoing technical projects that improve access to information and better organise credible information, preceded by a depressing overview of the health of the Web (it's unbelievably bloated)

OVER the past few months (since spring) we've been working hard on coding automation and improving the back end in various ways. More than 100 hours were spent on this and it puts us in a better position to grow in the long run and also improve uptime. Last year we left behind most US/USPTO coverage to better focus on the European Patent Office (EPO) and GNU/Linux -- a subject neglected here for nearly half a decade (more so after we had begun coverage of EPO scandals).

As readers may have noticed, in recent months we were able to publish daily links more often (and more links per day). About a month ago we reduced the volume of political coverage in these links. Journalism is waning and the quality of reporting -- not to mention sites -- is rapidly declining.

To quote one of our guys, "looking at the insides of today's web sites has been one of the most depressing things I have experienced in recent decades. I underestimated the cruft in an earlier message. Probably 95% of the bytes transmitted between client and server have nothing to do with content. That's a truly rotten infrastructure upon which society is tottering."

We typically gather and curate news using RSS feed readers. These keep sites light and tidy. They help us survey the news without wrestling with clickbait, ads, and spam. It's the only way to keep up with quality while leaving out cruft and FUD (and Microsoft's googlebombing). A huge amount of effort goes into this and it takes a lot of time. It's all done manually.

"I've been letting wget below run while I am mostly outside painting part of the house," said that guy, having chosen to survey/assess the above-stated problem. "It turns out that the idea that 95% of what web servers send is crap was too optimistic. I spidered the latest URL from each one of the unique sites sent in the links from January through July and measured the raw size for the individual pages and their prerequisites. Each article, including any duds and 404 messages, averaged 42 objects [3] per article. The median, however, was 22 objects. Many had hundreds of objects, not counting cookies or scripts that call in scripts.

"I measured disk space for each article, then I ran lynx over the same URLs to get the approximate size of the content. If one counts everything as content then the lynx output is on average 1% the size of the raw material. If I estimate that only 75% or 50% of the text rendered is actual content then that number obviously goes down proportionally.

"I suppose that means that 99% of the electricity used to push those bits around is wasted as well. By extension, it could also mean that 99% of the greenhouse gases produced by that electricity is produced for no reason.

"The results are not scientifically sound but satisfy my curiosity on the topic, for now.

"Eliminating the dud URLs will produce a much higher object count.

"Using more mainstream sites and fewer tech blogs will drive up the article sizes greatly.

"The work is not peer reviewed or even properly planned. I just tried some spur of the moment checks on article sizes in the first way I could think of," said the guy. We covered this subject before in relation to JavaScript bloat and sites' simplicity, but here we have actual numbers to present.

"The numbers depend on the quality of the data," the guy added, "that is to say the selection of links and the culling of 404s, paywall messages, cookie warnings and so on from the results.

"As mentioned I just took the latest link from each of the sites I have bookmarked this year. That skews it towards lean tech blogs. Though some publishers which should know very much better are real pigs:






$ wget --continue --page-requisites --timeout=30 \
    --directory-prefix=./test.a/ \
    https://www.technologyreview.com/s/614079/what-is-geoengineering-and-why-should-you-care-climate-change-harvard/
. . .





$ lynx --dump \
    https://www.technologyreview.com/s/614079/what-is-geoengineering-and-why-should-you-care-climate-change-harvard/ \
    > test.b





$ du -bs ./test.?
2485779	./test.a
35109	./test.b

"Trimming some of the lines of cruft from the text version for that article, I get close to two orders of magnitude difference between the original edition versus the trimmed text edition:

$ du -bs ./test.?
2485779	./test.a
35109	./test.b
27147	./test.c

"Also the trimmed text edition is close to 75% the size of the automated text edition. So, at least for that article, the guess of 75% content may be about right. However, given the quick and dirty approach of this survey, not much can be said conclusively except 1) there is a lot of waste, 2) there is an opportunity for someone to do an easy piece of research."

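The ratios quoted above can be checked with a few lines of shell. This is only a sketch: `size_ratio` is a hypothetical helper name, not something used in the survey, and the byte counts are the ones from the du output shown earlier.

```shell
#!/bin/sh
# Sketch: compare two byte counts such as those reported by `du -bs`.
# size_ratio is a hypothetical helper, not part of the original survey.
size_ratio() {
	# $1 = larger size in bytes, $2 = smaller size in bytes
	# prints "<how many times larger>x (<smaller as % of larger>%)"
	awk -v a="$1" -v b="$2" 'BEGIN {
		printf "%.1fx (%.1f%%)\n", a / b, 100 * b / a
	}'
}

size_ratio 2485779 35109   # full page (test.a) vs. lynx dump (test.b)
size_ratio 2485779 27147   # full page (test.a) vs. trimmed text (test.c)
size_ratio 35109 27147     # lynx dump (test.b) vs. trimmed text (test.c)
```

For that one article the trimmed text works out to roughly 77% of the lynx dump, which is in line with the 75% guess.
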
Based on links from 2019-08-08 and 2019-08-09, we get one set of results (all URLs saved from January 2019 through July 2019 were extracted; http and https only, with PDF and other links to obviously non-HTML material eliminated). Technical appendices and footnotes are below for those wishing to explore further and reproduce the results.






+ this only retrieves the first layer of javascript, far from all of it
+ some sites gave wget trouble, should have fiddled the agent string,
	--user-agent=""
+ too many sites respond without proper HTTP response headers,
	slows collection down intolerably
+ the pages themselves often contain many dead links
+ serial fetching is slow and because the sites are unique
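
The agent-string caveat above could be handled along these lines. This is a sketch only: the agent string and the `--tries=2` value are assumptions rather than settings used in the survey, and `fetch_cmd` is a hypothetical helper that prints the wget command instead of running it, so it can be inspected first.

```shell
#!/bin/sh
# Sketch only: UA and --tries=2 are assumed values, not the survey's settings.
UA="Mozilla/5.0 (X11; Linux x86_64)"   # placeholder browser-like agent string

fetch_cmd() {
	# $1 = URL; prints the command rather than fetching anything
	printf 'wget --continue --page-requisites --timeout=30 --tries=2 --user-agent="%s" "%s"\n' \
		"$UA" "$1"
}

fetch_cmd "https://example.org/article"
```

Piping the printed lines to `sh` would run them; limiting retries is one way to keep unresponsive sites from stalling the whole collection.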





$ find . -mindepth 1 -maxdepth 1 -type d -print | wc -l
91
$ find . -mindepth 1 -type f -print | wc -l
4171
which is an average of about 46 objects per "article"





+ some sites were tech blogs with lean, hand-crafted HTML,
	mainstream sites are much heavier,
	so the above average is skewed towards being too light





Quantity and size of objects associated with articles,
does not count cookies nor secondary scripts:





$ find . -mindepth 1 -type f -printf '%s\t%p\n' \
| sort -k1,1n -k2,2 \
| awk '$1>10{			# skip files of 10 bytes or fewer
		sum+=$1;
		c++;
		s[c]=$1;	# sizes arrive in ascending order
		n[c]=$2
	}
	END{
		printf "%10s\t%10s\n","Bytes","Measurement";
		printf "%10d\tSMALLEST\n",s[1];
		# middle element of the sorted sizes
		printf "%10d\tMEDIAN SIZE\n",s[int((c+1)/2)];
		printf "%10d\tLARGEST\n",s[c];
		printf "%10d\tAVG SIZE\n",sum/c;
		printf "%10d\tCOUNT\n",c;
	}'





     Bytes      File Size
        13      SMALLEST
     10056      MEDIAN SIZE
  32035328      LARGEST
     53643      AVG SIZE
     38164      COUNT
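
For reference, a median can also be taken with a small standalone helper. The following is a minimal sketch, assuming one number per line on standard input; `median` is a hypothetical name and the sample values are made up for illustration. It averages the two middle values when the count is even, which the script above does not do.

```shell
#!/bin/sh
# Sketch: median of a list of numbers, one per line on stdin.
# median is a hypothetical helper, not part of the original survey.
median() {
	sort -n | awk '
		{ v[NR] = $1 }
		END {
			if (NR == 0) exit 1		# no input, no median
			if (NR % 2) print v[(NR + 1) / 2]
			else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
		}'
}

# Illustrative values only
printf '%s\n' 300 100 200 | median
printf '%s\n' 4 1 3 2 | median
```

With tens of thousands of files the difference between the two approaches is negligible; it only matters for small samples.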






Overall article size [1] including only the first layer of scripts:





     Bytes      Article Size
      8442      SMALLEST
    995476      MEDIAN
  61097209      LARGEST
   2319854      AVG
       921      COUNT





Estimated content [2] size including links, headers, navigation text, etc:





+ deleted files with errors or warnings,
	probably a mistake as that skews the results for lynx higher





     Bytes      Article Size
       929      SMALLEST
     18782      MEDIAN
    244311      LARGEST
     23997      AVG
       889      COUNT





+ lynx returns all text within the document not just the main content,
	at 75% content the figures are more realistic for some sites:





     Bytes      Measurement
       697	SMALLEST
     14087	MEDIAN
    183233	LARGEST
     17998	AVG
       889	COUNT





	at 50% content the figures are more realistic for other sites:





     Bytes      Measurement
       465	SMALLEST
      9391	MEDIAN
    122156	LARGEST
     11999	AVG
       889	COUNT
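
The 75% and 50% tables are just the lynx figures scaled by a guessed content fraction and rounded. A minimal sketch of that arithmetic follows, assuming tab-separated "bytes, label" lines as input; `scale_sizes` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: scale byte counts by an assumed content fraction.
# scale_sizes is a hypothetical helper; 0.75 and 0.50 are guesses, not measurements.
scale_sizes() {
	# $1 = fraction; stdin = "bytes<TAB>label" lines
	# adding 0.5 before the integer conversion rounds to the nearest byte
	awk -F '\t' -v f="$1" '{ printf "%10d\t%s\n", $1 * f + 0.5, $2 }'
}

# lynx figures from the table above (COUNT is left out, it is not a size)
printf '%s\t%s\n' 929 SMALLEST 18782 MEDIAN 244311 LARGEST 23997 AVG \
| scale_sizes 0.75

printf '%s\t%s\n' 929 SMALLEST 18782 MEDIAN 244311 LARGEST 23997 AVG \
| scale_sizes 0.50
```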

       




$ du -bs * \
| sort -k1,1n -k2,2 \
| awk '$2!="l" && $1 {
		c++;
		s[c]=$1;
		n[c]=$2;
		sum+=$1
	}
	END {
		for (i in s){
			if(i==int(c/2)){
				m=i
			};
			printf "% 10d\t%s\n", s[i],n[i]
		};
		printf "% 10s\tArticle Size\n","Bytes";
		printf "% 10d\tSMALLEST %s\n",s[1],n[1];
		printf "% 10d\tMEDIAN %s\n",s[m],n[m];
		printf "% 10d\tLARGEST  %s\n",s[c],n[c];
		printf "% 10d\tAVG\n", sum/c;
		printf "% 10d\tCOUNT\n",c;
	}' OFS=$'\t'






[1]





$ time bash -c 'count=0;
shuf l \
| while read u; do
	echo $u;
	wget --continue --page-requisites --timeout=30 "$u" &
	echo $((count++));
	if ((count % 5 == 0)); then
		wait;
	fi;
	done;'






[2]





$ count=0;
time for i in $(cat l); do
	echo;echo $i;
	lynx -dump "$i" > $count;
	echo $((count++));
	done;






[3]





$ find . -mindepth 1 -maxdepth 1 -type d -print | wc -l
921





$ find . -mindepth 1 -type f -print | wc -l
38249
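
Dividing the two counts gives the average number of objects per article. A small sketch of that step, where `avg_objects` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: average number of fetched objects per article directory.
# avg_objects is a hypothetical helper, not part of the original survey.
avg_objects() {
	# $1 = number of files, $2 = number of article directories
	awk -v f="$1" -v d="$2" 'BEGIN { printf "%.1f\n", f / d }'
}

# Counts from the two find commands above
avg_objects 38249 921
```

That comes to about 41.5, which rounds to the 42 objects per article mentioned in the text.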






[4]





$ find . -mindepth 1 -type f -print \
| awk '{sub(/^\.\//,"");sub("/.*","");print;}' \
| uniq -c \
| sort -k1,1n -k2,2 \
| awk '$1{c++;s[c]=$1;sum+=$1;}
	END{m=s[int(c/2)];
		print "MEDIAN: ",m; print "AVG", sum/c; print "Quantity",c; }'






[5] 





$ find . -mindepth 1 -type f -name '*.js' -exec du -sh {} \; \
| sort -k1,1rh \
| head
16M     ./www.icij.org/app/themes/icij/dist/scripts/main_8707d181.js
3.4M    ./europeanconservative.com/wp-content/themes/Generations/assets/scripts/fontawesome-all.min.js
1.8M    ./www.9news.com.au/assets/main.f7ba1448.js
1.8M    ./www.technologyreview.com/_next/static/chunks/commons.7eed6fd0fd49f117e780.js
1.8M    ./www.thetimes.co.uk/d/js/app-7a9b7f4da3.js
1.5M    ./www.crossfit.com/main.997a9d1e71cdc5056c64.js
1.4M    ./www.icann.org/assets/application-4366ce9f0552171ee2c82c9421d286b7ae8141d4c034a005c1ac3d7409eb118b.js
1.3M    ./www.digitalhealth.net/wp-content/plugins/event-espresso-core-reg/assets/dist/ee-vendor.e12aca2f149e71e409e8.dist.js
1.2M    ./www.fresnobee.com/wps/build/webpack/videoStory.bundle-69dae9d5d577db8a7bb4.js
1.2M    ./www.ft.lk/assets/libs/angular/angular/angular.js

[6] About page bloat, one can pick just about any page and find from one to close to two orders of magnitude difference between the lynx dump and the full web page. For example,






$ wget --continue --page-requisites --timeout=30 \
    --directory-prefix=./test.a/ \
    https://www.newsweek.com/saudi-uae-war-themselves-yemen-1453371
. . .





$ lynx --dump \
    https://www.newsweek.com/saudi-uae-war-themselves-yemen-1453371 \
    > test.b





$ du -bs ./test.?
250793	./test.a
 15385	./test.b
