What the LLM Scrapers Are Doing to Tux Machines

posted by Roy Schestowitz on Feb 27, 2025

Crossposted from Tux Machines

Earlier this month Jonathan Corbet published "Fighting the AI scraperbot scourge" in LWN (Linux Weekly News). The article became freely accessible to everybody earlier today. Corbet wrote: "LWN content-management system contains over 750,000 items (articles, comments, security alerts, etc) dating back to the adoption of the 'new' site code in 2002. We still have, in our archives, everything we did in the over four years we operated prior to the change as well. In addition, the mailing-list archives contain many hundreds of thousands of emails. All told, if you are overcome by an irresistible urge to download everything on the site, you are going to have to generate a vast amount of traffic to obtain it all. If you somehow feel the need to do this download repeatedly, just in case something changed since yesterday, your traffic will be multiplied accordingly. Factor in some unknown number of others doing the same thing, and it can add up to an overwhelming amount of traffic."

We have almost 250,000 pages and perhaps 300,000 objects. Some years ago scrapers became a pain in the arse (PITA), so we started converting everything to static pages. The transition was completed in September 2023.

At today's pace it looks like we'll have served about 1.5 million requests by midnight. That's more than 60,000 per hour, or over 1,000 per minute. The server can cope with that, but for ordinary users the site feels slower because the queue of requests keeps growing and is almost never empty.
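For anyone who wants to check the arithmetic, here is a back-of-the-envelope sketch in Python. The only input is the approximate daily total quoted above; the function name and the assumption that requests are spread evenly across the day are ours, purely for illustration (real traffic is bursty, so peaks will be higher).

```python
# Rough request rates, assuming the daily total is spread evenly over 24 hours.
REQUESTS_PER_DAY = 1_500_000  # approximate figure quoted in the text


def request_rates(requests_per_day: int) -> dict[str, float]:
    """Convert a daily request count into average hourly, per-minute and per-second rates."""
    per_hour = requests_per_day / 24
    per_minute = per_hour / 60
    per_second = per_minute / 60
    return {
        "per_hour": per_hour,
        "per_minute": per_minute,
        "per_second": per_second,
    }


if __name__ == "__main__":
    for unit, rate in request_rates(REQUESTS_PER_DAY).items():
        print(f"{unit}: {rate:,.1f}")
```

On average that works out to roughly 17 requests every second, around the clock.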

Blacklisting offending IP addresses or address blocks might be the last resort. As an associate put it last night, "the bots are killing dynamically generated sites. They are written by maliciously incompetent bumbling idiots with no regard for their impact on sites in any way. That includes complete disregard for copyright and other legal aspects."
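To make the idea of blacklisting addresses and blocks concrete, here is a minimal sketch using Python's standard `ipaddress` module. It is not how this site is configured; in practice such rules usually live in the firewall or the Web server. The names `BLOCKED_NETWORKS` and `is_blocked` are hypothetical, and the ranges are reserved documentation networks rather than real offenders.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges (RFC 5737),
# used here as stand-ins; they are not real scraper networks.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),       # a whole /24 block
    ipaddress.ip_network("198.51.100.128/25"),  # only the upper half of a /24
]


def is_blocked(address: str, networks=BLOCKED_NETWORKS) -> bool:
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "198.51.100.5", "198.51.100.200"):
        print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```

Blocking a whole block is blunt: it stops the scrapers in that range, but also any legitimate reader who happens to share it, which is one reason to treat this as a last resort.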
