Major Incident and Recovery
Creative Commons Attribution-No Derivative Works 4.0
THE good news is that Gemini is expanding faster than we predicted earlier this month. Lupa now counts just 11 capsules short of 2,000. Yesterday we received some e-mails asking about our Gemini downtime (we also got some inquiries over IRC), which means the Gemini capsule really matters to people.
So why was it down? The short story is that it was a hardware failure, not the fault of GNU/Linux or anything like that (in fact, credit to GNU/Linux for letting us fetch another complete backup of the entire system despite the whole file system being in read-only mode). There was no panic, just frustration, and based on what we have heard about systems that boot from MicroSD, such an error was inevitable and almost predictable. The latest backup (taken before the "emergency" one was initiated) was only a few days old, with contents at most a couple of days behind.
All the services are now back online, the operating system has been replaced by Debian 11, and the machine has twice as much storage space as before, which ought to permit us to do things we didn't even dare attempt when space was tight. To reduce future downtime I also bought a spare disk (a card, actually) and will work on reducing disaster recovery (D-R) time, as it's likely that a similar incident will happen later this year or next year. These things are inherently fragile; telling people to reduce the number of write operations is almost unreasonable because what good is a system you cannot use (or program) as you wish?
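Since the failure showed up as a file system stuck in read-only mode, one cheap way to notice a repeat early is to look for unexpected read-only mounts. What follows is only a rough sketch, not what we actually run: the function name `read_only_mounts` is made up for illustration, and it assumes input in the Linux `/proc/mounts` layout (device, mount point, type, options, two numeric fields).

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    Expects text in the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            read_only.append(mount_point)
    return read_only
```

On a Linux machine the function can be fed the contents of `/proc/mounts`. Some mounts are read-only on purpose, so in practice one would only watch the mount points that are supposed to be writable, such as `/`.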
Debian 11 is quite nice, but of course imperfect (perfection is an impossibility). This is the first time I have used Debian 11 (my wife, my sister and I all use Debian 10 on our laptops) and maybe I'll get to write some positive things about it later this year, once I gain more experience with it.
We're hoping that tonight and tomorrow we can make up for the lost time. I hardly slept yesterday (I stayed awake for about 20 hours straight, then got just 4 hours of sleep) and we have a bunch of things lined up that I never managed to publish, as restoring services (like IPFS and Gemini) was the more pressing task.
The hardest part (to me personally) was having to go to town for replacement components, knowing that few shops still exist (even fewer because of the pandemic) and the bigger shops are full of unmasked people who don't respect other people's personal space (it doesn't help that our government likes to pretend COVID-19 is just some past event).