Bonum Certa Men Certa

The LLM Ouroboros Phenomenon

posted by Roy Schestowitz on May 19, 2025,
updated May 19, 2025

An ouroboros in a 1478 drawing in an alchemical tract

Ancient Greek mythology came up with this concept of an ouroboros, wherein some animal - typically a snake for a feasible "IRL" (in real life) metaphor - eats itself by eating its own tail. We would not be the first to point out the analogy here for LLMs because an ouroboros is a good parable. This morning we catalogued two BSD and Linux sites complaining about desperate LLM scrapers staging a DDoS attack in pursuit of original, as in human-written, code or words. This isn't a new problem for us and in the past few days we served about half a million pages in Gemini Protocol, likely due to to LLM scrapers. It's obnoxious to say the least, but distinguishing benign from malicious (or worthless junk) requests is hard and a "moving target" (it's never enough as parasites learn to adapt).

This morning in IRC we made an assertion about LLMs and fake (slop) images. We also made several observations. Fact #1: over time slop gets worse (training set is like some blurry JPEG). Fact #2: People's "smell" for slop improves over time, as they 'train' on slop and can detect it based on prior encounters. Put 1 and 2 together.

Are LLMs bound to not only get worse but also more easily detectable by an increasingly sceptical general public? TheLayoff.com has just responded to this.

An associate opines that fact #1 (that slop gets worse over time) is exacerbated by the flood of slop on the Net being snarfed up by newer bots and mistaken for training data. "Thus the feedback loop I mentioned a long time back and which Andy wrote about in depth." (He was referring to Dr. Farnell's good writings about this dilemma - as he did several times in The CyberShow's blog)

To a certain extent my Ph.D. thesis (dissertation) covered this about two decades ago. The associate says that it's a "well-known problem from days of old".

There are several unique aspects to this, including validation bias. To me it seemed a bit related to but not the same as over-training because, as an associate explains, "overtraining is something else: too much data and the patterns become locked too tightly to the training set and less useful for new data".

For an LLM to scan online its own output serves to affirm the mistakes, or the errors, often euphemised as mere "hallucinations", which are innocent, not libellous, and by no means "intentional" and "harmful". Dr. Farnell and Dr. Kate Brown responded to this last October in "Radical disbelief and its causes".

In the context of my thesis (dissertation), a concern was raised about what we back then called "synthetic data" finding its way "back" into the training set. So when you check brain MRI scans (which is what we did back then) you must ensure you only ever deal with real data, not mock or manipulated data that can confirm your own biases and "fit into" the model that generated it in the first place (in generative mode). To use the analogy of text-based LLMs, your BS is "truth" if your input is your own BS (output/s) and it would be deemed accurate, based on you (opposite of the notion of peer review in science). The associate correctly points out, based on a scan of my thesis (dissertation), that the strings "overtraining" and "over-training" are not in the dissertation, but we used different terms back then.

A squat toilet (also known as an Eastern, Turkish, Iranian or Natural-Position toilet). This one is in Turkey

"An LLM Ouroboros of shit", as the associate dubs it, would be statistical models (such as PDMs or AAMs*) treating computer-generated images as something from "the real world".

The so-called "generative hey hi" (genAI) "bros" won't allow the media to talk about such issues, at least if they can downplay the issues and deny/misportray them (in the media). But it's a real and growing problem. Its magnitude likely grows quadratically, not linearly. Just like other bubbles (overabundance based around hype), don't expect linear implosions. When it's gone (poof!), it's gone.

____

* PDM and AAM need expansion in the explanatory sense, not just words (in the acronyms). PDMs go back several decades ago they were invented or pioneered by the people who tutored me. They use mathematical, statistical models to perform multidimensional analysis of data variations, based upon principal component analysis (PCA). AAMs are an extension but with textures, not only points. This is really old stuff; even AAMs are over 23 years old; now the mainstream media pretends those are some kind of "revolution".

Other Recent Techrights' Posts

Who Imitates Who? Plagiarist as Client (From Microsoft), 'Plagiarism' at the Law Firm?
let's revisit the subject
Links 10/06/2025: Jaws at 50 and US Democracy Crushed Very Rapidly (Martial Law Seems Imminent)
Links for the day
Abuse Inside the Polish Patent Office (UPRP) - Part VII: Washing Their Hands After Corruption and Abuse
"Tragedy or comedy?"
Culling Bad RSS Feeds of Bad Sites
Not throwing out the baby with the bathwater
 
IBM's CEO Roasted, Sizzled and Grilled for Dumb and Inconsistent Vapourware Promises
It looks like being a chronic liar is what it takes to lead the company once synonymous with computing
IBM's Goal Is Not (and Never Was) Computer Users' Freedom
More than 1.5 decades ago I found IBM to be an "ally of convenience" because of OpenDocument Format (ODF)
Wayland Shows the IBM/Red Hat Way of Doing Things
IBM is trying to 'kill' X
GitHub is Proprietary, Controlled by Microsoft, and GPL Violation Warehouse
"IRS tax filing software [will be] released to the people as free software" ... In general this is good news
Slopfarm Catastrophe
Seems like BetaNews (or BetaNoise) has just suffered a major data loss and restored the site from a week-old backup
Abuse Inside the Polish Patent Office (UPRP) - Part VIII: Illegal Working Conditions
How many people need to die for these people to get their massive salaries?
Over at Tux Machines...
GNU/Linux news for the past day
IRC Proceedings: Tuesday, June 10, 2025
IRC logs for Tuesday, June 10, 2025
Links 10/06/2025: Apple Hype and Physical Attacks on Bloggers
Links for the day
Gemini Links 10/06/2025: Loon Lake, Farming, and Forth
Links for the day
If 'Microsoft v Techrights' is Dealt With by a 'Microsoft Court' (or a Court Outsourced to Microsoft)
More on that later
Over at Tux Machines...
GNU/Linux news for the past day
IRC Proceedings: Monday, June 09, 2025
IRC logs for Monday, June 09, 2025
Gemini Protocol Turns Six in 10 Days From Now
If you haven't tried it yet, then give it a go today
Live as You Preach
technology is fast becoming dysphoric
Gemini Links 09/06/2025: Addition Addiction and Nitride
Links for the day
Links 09/06/2025: Science, Hardware Projects, and Democracy Receding
Links for the day
Computers Got Smaller, So GNU/Linux Got Bigger
Many people here recognise the lack of urgency (or need) to get expensive new laptops
BetaNews is a Plagiarism and LLM Slop Hub, the Chief Editor Isn't Addressing This Problem Anymore
SS Fagioli is basically a parasite leeching off or exploiting other people's work
Links 09/06/2025: Chaos in Los Angeles and Hurricane Season
Links for the day
GNU/Linux Grows at Windows' Expense and Microsoft Trolls Infest and Maliciously Target Articles About It
Microsoft is - and has long been - organised crime
They Say I'm Mr. Bombastic
They didn't take good lawyers
Links 09/06/2025: Windows TCO and Many Data Breaches
Links for the day
Abuse Inside the Polish Patent Office (UPRP) - Part VI: Political Stunts by Former President Edyta Demby-Siwek and the Connection to Profound Corruption at EUIPO
it's like a money-laundering operation where one politician rewards another at taxpayers' expense
Gemini Links 09/06/2025: Pipelines and Splitgate
Links for the day
Over at Tux Machines...
GNU/Linux news for the past day
IRC Proceedings: Sunday, June 08, 2025
IRC logs for Sunday, June 08, 2025