Bonum Certa Men Certa

What the LLM Scrapers Are Doing to Tux Machines

posted by Roy Schestowitz on Feb 27, 2025

Crossposted from Tux Machines

Earlier this month Jonathan Corbet published "Fighting the AI scraperbot scourge" in LWN (Linux Weekly News). The article became freely accessible to everybody earlier today. Corbet wrote: "LWN content-management system contains over 750,000 items (articles, comments, security alerts, etc) dating back to the adoption of the "new" site code in 2002. We still have, in our archives, everything we did in the over four years we operated prior to the change as well. In addition, the mailing-list archives contain many hundreds of thousands of emails. All told, if you are overcome by an irresistible urge to download everything on the site, you are going to have to generate a vast amount of traffic to obtain it all. If you somehow feel the need to do this download repeatedly, just in case something changed since yesterday, your traffic will be multiplied accordingly. Factor in some unknown number of others doing the same thing, and it can add up to an overwhelming amount of traffic."

We have almost 250,000 pages and perhaps 300,000 objects. Some years ago scrapers became a pain in the arse (PITA), so we started converting everything to static pages. The transition was completed in September 2023.

So far today it looks like we'll have served about 1.5 million requests by midnight. That's more than 60,000 per hour, or over 1,000 per minute. The server can cope with that, but for ordinary users the site feels slower because the queue of requests keeps growing and is almost never empty.
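
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the requests are spread evenly across 24 hours, which real traffic is not, and the function name is only illustrative.

```python
def request_rates(requests_per_day: int) -> tuple[float, float]:
    """Return (requests per hour, requests per minute).

    Assumption: traffic is spread evenly over the day. Real traffic
    is burstier, so peak rates will be higher than these averages.
    """
    per_hour = requests_per_day / 24
    per_minute = per_hour / 60
    return per_hour, per_minute


# About 1.5 million requests in a day, the figure quoted in the post.
per_hour, per_minute = request_rates(1_500_000)
print(f"{per_hour:,.0f} per hour, {per_minute:,.0f} per minute")
```

These are averages only; they say nothing about how many of the requests come from scrapers rather than from ordinary readers.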

Blacklisting offending IP addresses or address blocks might be the last resort. As an associate put it last night, "the bots are killing dynamically generated sites. They are written by maliciously incompetent bumbling idiots with no regard for their impact on sites in any way. That includes complete disregard for copyright and other legal aspects."
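
To illustrate what blacklisting address blocks amounts to, here is a rough sketch using Python's standard `ipaddress` module. It is not how this site is configured. The list name, the function name, and the networks are all hypothetical: the blocks below are reserved documentation ranges, not real offenders. In practice such blocking would more likely be done in the firewall or the Web server than in application code.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges,
# used here as placeholders rather than as real scraper networks.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        # Only compare addresses and networks of the same IP version.
        addr.version == net.version and addr in net
        for net in BLOCKED_NETWORKS
    )


print(is_blocked("203.0.113.77"))  # inside a blocked /24
print(is_blocked("192.0.2.1"))     # not on the list
```

Blocking whole blocks is a blunt tool, since ordinary readers may share a range with scrapers.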
