Bonum Certa Men Certa

What the LLM Scrapers Are Doing to Tux Machines

posted by Roy Schestowitz on Feb 27, 2025

Mist over Niagara Falls

Crossposted from Tux Machines

Earlier this month Jonathan Corbet published "Fighting the AI scraperbot scourge" in LWN (Linux Weekly News). The article became freely accessible to everybody earlier today. Corbet said that the "LWN content-management system contains over 750,000 items (articles, comments, security alerts, etc) dating back to the adoption of the 'new' site code in 2002. We still have, in our archives, everything we did in the over four years we operated prior to the change as well. In addition, the mailing-list archives contain many hundreds of thousands of emails. All told, if you are overcome by an irresistible urge to download everything on the site, you are going to have to generate a vast amount of traffic to obtain it all. If you somehow feel the need to do this download repeatedly, just in case something changed since yesterday, your traffic will be multiplied accordingly. Factor in some unknown number of others doing the same thing, and it can add up to an overwhelming amount of traffic."

We have almost 250,000 pages and perhaps 300,000 objects. Some years ago scrapers became a pain in the arse (PITA), so we started converting everything to static pages. The transition was completed in September 2023.

So far today it looks like we'll have served about 1.5 million requests by midnight. That's more than 50,000 per hour, or 1,000 per minute. The server can cope with that, but for ordinary users the site feels slower because the queue of requests keeps growing and is almost never empty.
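For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the load is spread evenly across the day, which real traffic never is, so treat the results as averages rather than peaks. The function and constant names are made up for illustration.

```python
# Figure taken from the post: roughly 1.5 million requests in one day.
REQUESTS_PER_DAY = 1_500_000

HOURS_PER_DAY = 24
MINUTES_PER_DAY = HOURS_PER_DAY * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60


def average_rates(requests_per_day: int) -> dict[str, float]:
    """Return average request rates, assuming an even spread over 24 hours."""
    return {
        "per_hour": requests_per_day / HOURS_PER_DAY,
        "per_minute": requests_per_day / MINUTES_PER_DAY,
        "per_second": requests_per_day / SECONDS_PER_DAY,
    }


rates = average_rates(REQUESTS_PER_DAY)
for name, value in rates.items():
    print(f"{name}: {value:,.1f}")
# per_hour: 62,500.0
# per_minute: 1,041.7
# per_second: 17.4
```

So "more than 50,000 per hour or 1,000 per minute" is, if anything, an understatement: the average works out to about 17 requests every second, around the clock.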

Blacklisting offending IP addresses or address blocks might be the last resort. As an associate put it last night, "the bots are killing dynamically generated sites. They are written by maliciously incompetent bumbling idiots with no regard for their impact on sites in any way. That includes complete disregard for copyright and other legal aspects."
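To illustrate what blacklisting by address or block amounts to, here is a rough sketch using Python's standard ipaddress module. It is not how Tux Machines does it; in practice this sort of filtering would more likely live in the firewall or the web server. The blocklist entries and the is_blocked helper are hypothetical, and the ranges shown are reserved documentation ranges, not real offenders.

```python
import ipaddress

# Hypothetical blocklist: a whole /24 block, one single address, and an
# IPv6 block. All are reserved documentation ranges, used here as stand-ins.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(
        network.version == address.version and address in network
        for network in BLOCKED_NETWORKS
    )


for ip in ("192.0.2.55", "198.51.100.18", "2001:db8::1"):
    print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Blocking a whole block catches scrapers that rotate through neighbouring addresses, at the cost of possibly shutting out legitimate readers in the same range, which is one reason to leave it as a last resort.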
