LLM Scrapers Are a Nuisance, But They're Also a Reminder It's Time to Make Your Site Static
Perhaps the best protection is the ability to endure surges
Do not use Clownflare. Do not believe the hype. Do not impose JavaScript "apps" (that exclude many legitimate users) to "protect" the site. There is another way!
Take a moment to read our story, based on bitter experiences dealing with rogue bots.
A year and a half ago we changed the back end of all the sites and left the US (where we had hosted since 2006; or since 2004 in the case of Tux Machines).
No more PHP.
No more MySQL/MariaDB.
WordPress? Sorry, wrong address. Wrong "department". Same for Drupal and MediaWiki.
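In practice the switch boils down to pre-rendering everything. Here is a minimal sketch of the idea in Python (to be clear, this is not the pipeline these sites actually use; the pages/ and public/ directories and the "first line is the title" page format are assumptions for illustration):

```python
#!/usr/bin/env python3
# Minimal sketch: render every page to plain HTML once, at publish
# time, so the web server only ever ships files. The directories and
# the "first line is the title" source format are assumptions.
from html import escape
from pathlib import Path

TEMPLATE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{title}</title></head>
<body><h1>{title}</h1>
{body}
</body></html>"""

SRC = Path("pages")    # hypothetical plain-text sources
OUT = Path("public")   # hypothetical document root, served as-is

OUT.mkdir(parents=True, exist_ok=True)
for page in sorted(SRC.glob("*.txt")):
    text = page.read_text(encoding="utf-8")
    title, _, body = text.partition("\n")
    paragraphs = "\n".join(
        f"<p>{escape(p.strip())}</p>"
        for p in body.split("\n\n") if p.strip()
    )
    (OUT / f"{page.stem}.html").write_text(
        TEMPLATE.format(title=escape(title), body=paragraphs),
        encoding="utf-8",
    )
```

Because the output is plain files, any web server can ship them straight from the filesystem (or its cache); there is no database or interpreter for a surge of requests to knock over.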
Since then we've had almost no hassle associated with heavy loads. Heavy traffic... yes. But it didn't cripple the sites or deny access to them. At worst it slowed things down a bit... temporarily.
In terms of CPU cycles, we estimate that since making the switch we use about a hundredth of what we used before. It's not trivial to quantify, because the routers still do a lot of work, but those are external to the server.
There has been a lot of debate this past month about LLM scrapers (chasing the bubble out of FOMO) taking their toll on sites. Many technical bloggers wrote about it, Wikipedia moaned about it, and today we even saw this new video about it. The tide is fast turning against the "usefulness" of such bots; people won't tolerate them anymore. They won't pay (in man-hours, hosting bills and so on) for someone else's Ponzi scheme.
Identifying LLM bots isn't as simple as one might imagine; they don't identify themselves as such... and moreover they use "reputable" IP blocks "on the cloud", so one might see "AWS" or some Chinese conglomerate, not some dodgy address in North Korea.
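For what it's worth, a crude fallback that doesn't depend on the User-Agent at all is to look at behaviour instead: count requests per client address in the access log and eyeball the outliers. A rough sketch, assuming a common/combined-format log at access.log and an arbitrary threshold (both assumptions; a rate check alone will also flag some legitimate heavy readers):

```python
#!/usr/bin/env python3
# Rough sketch: since scrapers rarely announce themselves, flag
# client addresses with an abnormal request count instead. The log
# path, its format and the threshold are all assumptions.
import re
from collections import Counter

LOG = "access.log"   # hypothetical log in common/combined format
THRESHOLD = 1000     # hypothetical: requests per address per log file

# The first whitespace-delimited field of a common-format log line
# is the client address.
ADDR = re.compile(r"^(\S+)")

counts = Counter()
with open(LOG, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = ADDR.match(line)
        if match:
            counts[match.group(1)] += 1

for addr, n in counts.most_common(20):
    flag = "  <- possible scraper" if n > THRESHOLD else ""
    print(f"{n:8d}  {addr}{flag}")
```

Crude as this is, it catches the pattern that actually hurts: one "reputable" cloud address range hammering thousands of pages in a short window.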
One preventive measure, then, is changing the architecture of one's site or other services. Make them more robust to surges; simplify everything. The work associated with such a migration will "pay for itself" (over time). █