What the LLM Scrapers Are Doing to Tux Machines
Earlier this month Jonathan Corbet published "Fighting the AI scraperbot scourge" in LWN (Linux Weekly News). The article became freely accessible to everybody earlier today. Corbet wrote: "The LWN content-management system contains over 750,000 items (articles, comments, security alerts, etc) dating back to the adoption of the "new" site code in 2002. We still have, in our archives, everything we did in the over four years we operated prior to the change as well. In addition, the mailing-list archives contain many hundreds of thousands of emails. All told, if you are overcome by an irresistible urge to download everything on the site, you are going to have to generate a vast amount of traffic to obtain it all. If you somehow feel the need to do this download repeatedly, just in case something changed since yesterday, your traffic will be multiplied accordingly. Factor in some unknown number of others doing the same thing, and it can add up to an overwhelming amount of traffic."
We have almost 250,000 pages and perhaps 300,000 objects. Some years ago scrapers became a pain in the arse (PITA), so we started converting everything to static pages; the transition was fully completed in September 2023.
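The point of going static is that every page is rendered once, ahead of time, and written to disk, so each request costs only a file read rather than a trip through CMS code and a database. The sketch below merely illustrates that approach; the item list, template, and output directory are invented for the example and are not our actual pipeline.

```python
#!/usr/bin/env python3
# Illustrative pre-rendering sketch: turn a list of items into static HTML
# files so the web server never runs application code per request.
# The items, template and output directory are placeholders, not real data.
import html
import pathlib

OUTPUT_DIR = pathlib.Path("static-site")  # assumed output directory
TEMPLATE = "<html><head><title>{title}</title></head><body>{body}</body></html>"

# Stand-in for whatever a CMS would normally fetch from its database.
items = [
    {"slug": "example-post", "title": "Example post", "body": "Hello, world."},
]

OUTPUT_DIR.mkdir(exist_ok=True)
for item in items:
    page = TEMPLATE.format(
        title=html.escape(item["title"]),
        body=html.escape(item["body"]),
    )
    # One file per page; heavy scraping then costs disk reads, not CPU.
    (OUTPUT_DIR / f"{item['slug']}.html").write_text(page, encoding="utf-8")
```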
Based on today's traffic so far, it looks like we'll have served about 1.5 million requests by midnight. That's more than 60,000 per hour, or over 1,000 per minute. The server can cope with that, but for ordinary users the site feels slower because the queue of pending requests keeps growing and is almost never empty.
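Before blocking anything, the obvious first step is to see which clients account for the bulk of that traffic. Here is a minimal sketch of tallying requests per client address, assuming a web server log in the common/combined format; the log path and reporting threshold are assumptions for illustration, not details of our setup.

```python
#!/usr/bin/env python3
# Minimal sketch: tally requests per client IP from an access log in the
# common/combined format (the IP is the first whitespace-separated field).
# The log path and reporting threshold are assumptions for illustration.
from collections import Counter

LOG_PATH = "access.log"   # assumed location of the web server log
THRESHOLD = 10_000        # report clients above this many requests

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        fields = line.split(maxsplit=1)
        if fields:
            counts[fields[0]] += 1

# Print the heaviest clients first; these are candidates for blocking.
for ip, hits in counts.most_common():
    if hits < THRESHOLD:
        break
    print(f"{hits:>10}  {ip}")
```

The heaviest offenders in such a tally would be the natural candidates for the blacklisting discussed below.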
Blacklisting the offending IP addresses or address blocks might be the last resort. As an associate put it last night, "the bots are killing dynamically generated sites. They are written by maliciously incompetent bumbling idiots with no regard for their impact on sites in any way. That includes complete disregard for copyright and other legal aspects." █