Search in 2025 (Age of DDoS Attacks Under the Guise of "AI" "Innovation")
Still a work in progress
Earlier this year we spoke of a self-hosted, complete, Free software-based site search for Techrights (covering complete archives going back to 2006). We said completion by year's end would be feasible. That's still plausible.
Yesterday we tried hard to find an old video as we weren't able to remember the headline (it was from 2022). Then we tried Google, only to see it was "missing" pages from this site*; another search engine did help find it. We are also still developing and testing our own site-wide search, which has some remaining (and mysterious for now) bugs.
Despite the bugs, and in spite of it not being accessible to the public just yet (partly because of those bugs), we already use our search facility to search our archives and put together new articles. Example of a timely search:
It takes about 0.1 seconds to run this query against a large dataset and deliver all the results via the Web (from London to Manchester, hence relatively low network round-trip overhead). One common concern when things go "live" is that any random bot out there can execute queries, pumping up RAM and CPU usage, as happened when we used MediaWiki and WordPress. MediaWiki makes wrong assumptions about the nature of users, and WordPress is easy to bombard with complex queries transmitted to the back end in bulk (not limited to searches; JOIN operations can be expensive). We saw several news sites shutting down their search facility altogether in recent years; we can only guess that this was a factor, if not the main reason.
We don't want to be more exposed to DDoS attacks. Rate-limiting only works if there's a low diversity of IP addresses (a DDoS has many addresses, hence the first D in the acronym); global limits may mean you give ample room for bots, shutting out the actual humans (legitimate users). █
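To illustrate the rate-limiting problem described above, here is a minimal sketch in Python. It is not our actual search code; the class name, the limits (5 queries per address per minute, 100 queries per minute globally) and the addresses are all hypothetical and chosen only for the demonstration. It compares a single noisy address with a flood spread across many addresses, then shows a global cap being used up by bots before a human arrives.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` queries per key within each window of `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window index) -> queries counted

    def allow(self, key, now):
        slot = (key, int(now // self.window))
        if self.counts[slot] >= self.limit:
            return False
        self.counts[slot] += 1
        return True


def served(limiter, requests):
    """Count how many (address, timestamp) requests pass a per-address limiter."""
    return sum(1 for ip, t in requests if limiter.allow(ip, t))


# Hypothetical traffic: 1000 queries within 50 seconds.
# One noisy client, always the same address.
single = [("198.51.100.7", i * 0.05) for i in range(1000)]
# A distributed flood: every query comes from a different address.
spread = [(f"10.0.{i // 250}.{i % 250}", i * 0.05) for i in range(1000)]

single_served = served(FixedWindowLimiter(limit=5, window=60), single)
spread_served = served(FixedWindowLimiter(limit=5, window=60), spread)

# A global cap instead: every request shares one key, so the bots
# exhaust the budget and a human arriving later is turned away.
global_limiter = FixedWindowLimiter(limit=100, window=60)
for _ip, t in spread:
    global_limiter.allow("everyone", t)
human_allowed = global_limiter.allow("everyone", 55.0)

print(single_served, spread_served, human_allowed)
```

In this sketch the single address is cut off after 5 queries, the distributed flood gets all 1000 queries through the per-address limiter, and the global cap rejects the late human visitor because the bots already consumed the budget for that minute.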
____
* Google Search has gotten so much worse over time; it's not only full of (hence rewarding) SEO spam and easily manipulated by slopfarms. It also removed support for some of the most useful search features, cached pages are no longer accessible, and many "old" pages got dropped from the indices. In its genesis Google Search was useful for Linux-related searches (it had a portal devoted to them); nowadays it's a soup of garbage with a glorified brand and paid-for toolbar/search bar placements (many people still use Google Search because it's "already there", embedded in the GUI).

