Title: Holiday 2023 Outage Post-Mortem
Created: 2024-01-11
Modified: 2024-02-02

In the days leading up to the end of 2023, `zlg.space`'s primary hard drive began having errors and locking up the entire server. It began with simple unresponsiveness, and ended in a replacement drive, new OS, and a bumpy road of `/opt` self-compiled madness to bring the backup to the modern ages of PHP.

The server is now on bigger, faster storage and an OS that won't get in the way if/when it breaks again, but it didn't come without an arduous test of patience.

December 28th, 2023
===================

The problem happens for the third time since before Christmas. By the end of the year, I discovered via `smartctl` that the drive had experienced many hundreds of error events concerning the reallocation of damaged sectors. It would detect bad sectors, but either would not or could not tag them as bad and reallocate the data elsewhere.

The hard drive was in a laptop-sized enclosure and had not been thrown, dropped, stepped on, or otherwise physically disturbed since October. At this point I determined it was some sort of unrecoverable hard drive damage.

Before using SMART, I had tried to use `fsck` to repair filesystem errors, in case the problem was at that level. It didn't seem to help.

January 1st, 2024
=================

I decided to order a new storage device in the same form factor; the enclosure's controller could correctly read and interact with other drives that I inserted, so I knew the enclosure wasn't at fault. This new drive is an SSD, so even if I put it into the same environment, it will be less prone to vibrations.

January 9th, 2024
=================

The drive arrives and I start considering my options for moving forward. The primary candidates were Gentoo, KISS, or PiLFS.

January 11th, 2024
==================

I tried one last time to get the drive to work, so I could attempt to get data off of it.
I had performed a backup of the website's data a month or two prior, so not much data, if any, was lost. I guess there are perks to being the only active user of your own site, haha.

The server was formerly using Ubuntu Server, which uses systemd and cloud-init to provision your server as if it were part of a fleet. This is entirely overkill for a single Raspberry Pi server, and includes many more layers of complexity than I need. I spent more time trying to figure out the Ubuntu way than I did just using tools I already know. "Modern" Linux with its container app formats and cloud-init shit and XDG portals and incomplete Wayland specs is just annoying. Like it or not, those technologies are not the way forward for Linux.

After reconfiguring the Pi's firmware to boot directly from USB, I chose to go back to Gentoo Linux for my server. I was leaning toward KISS Linux, but their Pi ecosystem isn't quite ready for what I need. Gentoo gets out of the way, has the tooling to pick and choose, and ebuilds are hackable for anyone who knows shell scripting. I really wish distros could get the experience right for developers, server owners, power-users, etc.

January 12th, 2024
==================

Configuration and installation of Gentoo continued, and tweaks were put in place to reduce the number of writes that will be made to the SSD, to extend its life a little. Using a source-based distro is already going to be a mild problem when it comes to writes, but it can be mitigated.

January 13th, 2024
==================

I fought against a chicken-and-egg problem created by the proprietary nature of the Raspberry Pi's Wi-Fi and Bluetooth, so I had to fetch distfiles for a few packages to bootstrap Wi-Fi. I could have just finished prepping SSH instead, but I want this machine to be adaptable to networking conditions. Also, who doesn't like watching the first boot?

The data from the backup was only from `/srv`, the Nextcloud installation, and the Postgres database cluster.
System configs were not backed up before the old drive failed, but a backup from 2021 should make most of the restoration a smooth process. Now that this is all on one device, making backups should be trivial in the future.

Networking eventually was put into place, and after adding a few extras to get started, I now have a user accessible via SSH and the correct security needed for the admin account. We can begin the actual rebuilding!

January 14th, 2024
==================

*The real Gentoo Linux starts here.*

It feels wonderful to have a system ready to be hacked on! The process begins to put together a firewall, webserver, smarter log rotation, and install the software I was running before!

fail2ban and logrotate are the first goals, which went off without any major problems. Mostly just small configuration and boot-persistence kind of stuff.

The next major goals are lighttpd and certbot, which will open up most of this server's available software. I sure miss my Nextcloud...

January 15th, 2024
==================

I had to work and handle some errands and chores today, so not as much time to do big things. I installed ddclient and got started on Gopher.

Fun little story there. I went to install my Gopher server of choice, `sgopherd`, and learned that its only available version in Portage was out of date. "Hmm, weird. I guess I'll go ahead and update it," I thought. So I did what one does when you need to bump an ebuild, double-checked on current dev tools/standards, made sure things were good, and...

I'm aware it's in *maintainer-needed* status, but if I'm the only person who cares about it in five years, I'm not bothering to interface with the community. I don't mind maintaining my own ebuilds, having prior active experience with the distribution. I have no plans to publish a repository/overlay, however. My prior experiences weren't a net-positive, so I will not be sharing my work.

People don't think it be like it is, but it do.

January 18th, 2024
==================

It didn't take overnight; instead it got hung up on `dev-lang/rust-bin` due to insufficient space in `/var/tmp/portage`. No worries, I updated the `package.env` file to use the SSD instead of RAM, cleared `$TMPDIR`, and moved on. It wasn't going to take all night anyway.

Eventually, I orchestrated the dance of `lighttpd`, `lexicon`, my registrar's DNS servers, and `certbot` to get a new certificate through Let's Encrypt. Now I can hook it up to cron again and wash my hands of it. I'm glad that having a partial certificate chain and enough time between issuances was enough to get me back on track without any extra configuration or tweaking.

Later in the evening, `php-fpm` was brought up and, with it, the wiki is now online! Next in line are cgit/gitolite and mumble. Then we'll only have one service left to bring up...

January 23rd, 2024
==================

I did some admin configuration behind the scenes to formalize system structure. It was smaller than I thought it was, which is a sign that I'm doing things correctly.

Due to the way Gentoo packages webapps -- and my complete lack of need for virtual hosts and other crap their tooling does, like needing versioned directories -- I compiled cgit myself and just threw it in a folder somewhere. To be fair, Gentoo generally packages things in a decent manner, so when I have the time and patience I may revisit `webapp-config`. For now, I just set the vhost provider to lighttpd so it stopped yelling at me about apache2.

January 25th, 2024
==================

Cgit was installed and hooked up to Gitolite. I verified that I could push commits, and that Cgit was displaying the 'About' page correctly, after telling it to use my own `cm2html`, which calls `mistletoe` instead of `markdown`.
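
For the curious, an about filter of that kind can be very small. The following is only a sketch of the idea, not my actual `cm2html`: it assumes cgit pipes the about file to the filter on stdin and includes whatever the filter prints, and that the `mistletoe` package is installed. The `render` and `main` names and the `to_html` parameter are made up for illustration.

```python
#!/usr/bin/env python3
"""Illustrative stand-in for a cgit about-filter that renders Markdown."""
import sys


def render(text, to_html=None):
    """Convert Markdown text to an HTML fragment.

    `to_html` is a hypothetical hook for swapping the converter; by
    default the work is handed to mistletoe.markdown().
    """
    if to_html is None:
        import mistletoe  # third-party package, assumed to be installed
        to_html = mistletoe.markdown
    return to_html(text)


def main():
    # Assumption: the about file arrives on stdin and stdout is what
    # ends up on the About page.
    sys.stdout.write(render(sys.stdin.read()))
```

A real filter script would end by calling `main()` and be referenced from the `about-filter` setting in `cgitrc`.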

January 26th, 2024
==================

It felt good to make some progress with relatively little effort, so I restored the Postgres cluster from a `.sql` backup and verified the data was present and ready to be interacted with.

January 27th, 2024
==================

The database restore made MantisBT work almost immediately, so that was encouraging! The final task, Nextcloud, took a number of days before I succeeded.

Nextcloud 23 supports PHP 7.3, 7.4, and 8.0. The Gentoo install was on 8.1, and none of the 7.x series was available. So, I needed a way to run Nextcloud *juuuuust long enough* to get to a Nextcloud version that supports PHP 8.1.

So, I tried a third-party overlay to install PHP in an older SLOT, because Gentoo is cool and can do that. Sadly, that did not work.

I'm not shy on the command line, so the journey began to build OpenSSL, PHP, and APCu outside of the Portage tree (generally not recommended). I installed things to `/opt`, customized a few things, and... partial success? `occ` didn't quite work yet but a `phpinfo();` page worked mostly fine. We're getting there!

Across the 28th through the 30th, I wrestled with dependencies and other things that were needed, until eventually the `occ` command worked and I was able to upgrade to Nextcloud 24, which supports PHP 8.1!

January 31st, 2024
==================

The circus continues as I tried to mitigate each new tiny problem that came up. The funniest took me some digging. Across the 6 or 7 rebuilds of PHP (each taking at least two hours), I learned that Nextcloud stores passwords with different crypto functions, and missing any of those functions makes login break! Why this isn't covered by its dependencies, I'll never know. BUT, be sure to enable `argon2` in PHP if your Nextcloud logins don't work!

I ended up locking myself out and having to disable bruteforce detection for my local IP for a bit before I found and fixed the password issue.
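
If you hit the same wall, it helps to know which algorithm a stored hash actually needs. Here's a rough Python sketch that guesses it from the prefixes PHP's `password_hash()` produces (`$2y$` for bcrypt, `$argon2i$` and `$argon2id$` for Argon2). The `hash_algorithm` name is made up, the optional `N|` version prefix is an assumption about how Nextcloud stores the value, and the sample hashes are fake.

```python
def hash_algorithm(stored):
    """Guess which password_hash() algorithm produced a stored hash."""
    # Assumption: the value may carry a numeric "<version>|" prefix
    # in front of the crypt-style hash. Strip it if present.
    head, sep, rest = stored.partition("|")
    crypt = rest if sep and head.isdigit() else stored

    prefixes = (
        ("$argon2id$", "argon2id"),
        ("$argon2i$", "argon2i"),
        ("$2y$", "bcrypt"),
    )
    for prefix, name in prefixes:
        if crypt.startswith(prefix):
            return name
    return "unknown"


# Made-up sample values, for illustration only.
samples = [
    "3|$argon2id$v=19$m=65536,t=4,p=1$c2FsdA$aGFzaA",
    "1|$2y$10$abcdefghijklmnopqrstuv",
]
for value in samples:
    print(hash_algorithm(value))
```

If that reports an Argon2 variant and your PHP build was compiled without `argon2`, verification can't succeed no matter how correct the password is.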

In the end, **all data was okay and the server is in better shape now than it was before the event**. But, because Nextcloud itself took nearly a week to get working, that is the first piece of software I'm targeting for removal should I have to do this dance again. There's no excuse for something like a webapp to be that difficult to put together and maintain as an admin.

Conclusion
==========

Ultimately, I could not determine the root cause of the SMART errors. The drive had operated for around 30,000 hours across 3.5 years. I'm not sure if that's a good run or not, for a dinky little laptop drive. Vibrations from loud speakers *could* create the physical damage needed to eventually render the drive useless, but without greater access to tooling and a means to test, I cannot form a confident conclusion.

A few days after I discovered the problem, the drive failed to spin up. I don't know what any next steps would be for data recovery, but I'm confident that if anyone lost files, it was me, and only files or changes since December 1st, 2023. I'll hold onto it until I get the gumption to take a hammer to it. If anyone has ideas on further inspection before I destroy it, hit me up.

We'll just go forward with what we have. I apologize for such a lengthy outage. I would have fixed it sooner if real life hadn't been so adamant about getting in the way of things that are important to me. Keeping Nextcloud upgraded will prevent me from needing to manually compile things in the future.

*The server is here, and that's all that matters.*