On 2004-05-04 (Tuesday), antah, skarnet.org's server, changed ISPs. Here is the e-mail I sent to skarnet.org's users a few days in advance:
Dear skarnet.org users,

The skarnet.org server is switching ISPs.

The current ISP is: Serveur Express (http://www.serveur-express.com/). It has been pretty reliable for the past three years, and working with them was a pleasure.

The future ISP will be: Lost Oasis (http://www.lost-oasis.fr/). They are basically offering me the exact services that I need - hosting the machine and providing IP connectivity, and nothing more - which is much cheaper than the Serveur Express standard pack. I know the Lost Oasis guys, they love their work, they know I'm a total PITA whenever something's not right - I trust them to provide at least the same level of reliability as you have been used to.

Schedule:

* 2004-05-04, 09:00 GMT+2 (Paris time with DST)
  The skarnet.org server gets unplugged from the Serveur Express storage bay. I'm bringing it home. My registrar's DNS database gets updated.

* 2004-05-04 to 2004-05-06
  Hardware cleaning, hardware fixing. (The main disk has had DMA problems from the start; it didn't seem to impact reliability or performance, so I didn't bother. This will be a good time to get it solved.) Kernel upgrade, and so on.

* 2004-05-06, 18:00 GMT+2
  The DNS changes have propagated to the UltraDNS database. The skarnet.org server is up and running in the Lost Oasis storage bay.

Consequences for you:

Every service provided by skarnet.org (Web and FTP server, mailing-lists, individual POP3 mail storage, secondary DNS service, ...) will be completely unavailable for three days.

Also, skarnet.org's IP addresses will change. The DNS database will be updated to fit the schedule as tightly as possible. Nevertheless, due to broken DNS caches all over the world, stale DNS data may still be lying around the world for 2 or 3 days after skarnet.org is up again. So you may experience some trouble during the first days. Everything should be back to normal after 2004-05-08.

Of course, I will be unable to receive mail, too. Do not expect to be able to contact me in any way, or to hear from me, during those days.

I apologize for the inconvenience, and hope that the change will allow me to keep skarnet.org running and meeting your needs for as long a time as you could wish. :)

Your faithful sysadmin,

-- Laurent
I had prepared for the worst by taking 3 days off and not making any plans for Tuesday and Wednesday night.
Boy, was I right.
At first, everything went according to plan. The folks at ClaraNet, Serveur Express's big hosting firm, were nice and helpful - real professionals. I brought Antah home without trouble. I was utterly amazed to see how little dust had got into the machine in three years. My home computers get more dust in one month! Gotta admit it: air conditioning in computer hosting rooms does work.
Then I proceeded to change the main disk, and all hell broke loose.
I've always had good karma with hardware. I've been in computers since 1995, and never had any serious hardware trouble. I've had numerous friends regularly complaining about how unreliable hardware is, and how much time they have to spend fixing some hardware deficiency, be it floppy disks, hard disks, RAM, CPUs, or anything else. Of course, if you want reliability, you shouldn't buy a PC; but it's true that PC hardware quality standards are surprisingly poor compared to what they should be, considering the knowledge that has been invested in the area. Anyway, I've gone through PC hardware history with few problems, to some people's amazement and envy. And on that day, the existence of my guardian angel was proven to me.
Antah's main disk was dead. Hopelessly, totally, definitively dead. I couldn't dd the partitions onto the new disk - the kernel would spew out an endless stream of I/O errors, and the machine wouldn't boot from the new disk. Then, after a reboot, I saw that the dynamic glibc had been corrupted and I couldn't use simple tools such as mount.
(Antah doesn't use the glibc in normal operation, and every automated task only uses static binaries. Nevertheless, I keep a glibc handy for maintenance operations. I have proven my point, which is that you don't need the glibc, or GNU software for that matter, to run a Linux-based server; now, I could go ahead and recompile everything using the diet libc, or even rewrite a lot of GNU software, but I don't think it's worth the time and effort. So I keep some convenience tools, glibc-based, just in case, and here was the case.)
Well. /bin, /sbin, /lib, /usr/bin, /usr/sbin and /usr/lib were toast. Ouch.
So, using only my static tools (how happy I was to have written minutils!), I managed to tar as much of the partitions as possible, and write the archives to the new disk. This took time - Wednesday morning and afternoon - because some tars had to be tried two or three times to get as many files as they could. Just after the last archive was made, the whole directory structure collapsed (cd started to yield I/O errors).
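For illustration, the "try it two or three times" salvage described above could be sketched in Python roughly as follows. This is not the tooling that was used on Antah (that was tar and static binaries); `salvage_tree`, its `passes` parameter and the injectable `copy` function are hypothetical names chosen for this sketch. The idea is simply to copy every file that can be read, remember the ones that fail with an I/O error, and retry only those on later passes.

```python
import os
import shutil


def salvage_tree(src, dst, passes=3, copy=shutil.copyfile):
    """Copy every file found under src into dst, retrying failed files.

    Returns the sorted list of relative paths that could not be copied
    after `passes` attempts. All names here are illustrative assumptions.
    """
    # os.walk silently skips directories it cannot list, which suits a
    # failing disk: we collect whatever is still reachable.
    pending = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            pending.append(os.path.relpath(os.path.join(root, name), src))

    for _ in range(passes):
        failed = []
        for rel in pending:
            target = os.path.join(dst, rel)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            try:
                copy(os.path.join(src, rel), target)
            except OSError:
                # Typically an I/O error from the dying disk. A truncated
                # file may be left at target; it is overwritten if a later
                # pass succeeds.
                failed.append(rel)
        pending = failed
        if not pending:
            break
    return sorted(pending)
```

A call such as `lost = salvage_tree("/mnt/old", "/mnt/new")` (both paths are placeholders) would return the files that never came through, which is the list worth checking against the backups.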
And now the killer: there were, of course, a lot of corrupted or missing files in the archives. But none of them was important. In particular, /home and /package were untouched. What I lost was mainly the whole Debian GNU/Linux base (which I couldn't care less about), two months' worth of logs, a copy of ipsvd I had downloaded a few days before to study the code, and a static binary of some vi clone. Antah's disk had been ill, and apparently very ill, for a long time, but it patiently waited for me to dump all important data before dying. I had backups, of course, but I don't make them regularly and I don't back up everything - my startup script system, for instance.
So, yes, I guess I am lucky with hardware. And the time I had planned was just enough to do everything that was needed.
On Friday, everything went smoothly. The people at RedBus (Lost Oasis's hosting house) weren't half as helpful and professional as the ones at ClaraNet. In fact, the guy at the desk was a total asshole. But they left me alone in the machine storage room, so I could quietly put Antah there following Lost Oasis's instructions, which worked seamlessly. Except that RedBus doesn't seem to provide power cables, and I didn't know, so I had to steal one that was hanging around unused. If skarnet.org goes down, it's because some RedBus guy read this page and got his power cable back.
The DNS change propagation took one more day than expected, though. Part of it was my fault, and part of it was bad interface design from Gandi, my registrar (which is otherwise a pretty good registrar, and we French people are lucky to have it). Actually, to change a name server address, you need to fill in two forms: one to modify the whois database, and one to send to the parent DNS server (UltraDNS in the .org case). I didn't know, and it wasn't mentioned in the Gandi instructions, so I only filled in the first form on Tuesday, and the whois information was updated all right. On Saturday May 7th at noon, though, the skarnet.org NS field in the UltraDNS database hadn't changed, and Nicolas George pointed out that the delay was unusually long. So I took a closer look at the Gandi interface, hit myself on the head, and filled in the second form. In a couple of hours, the UltraDNS database was updated, and everything went fine afterwards.
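The mismatch described above (whois updated, parent zone still delegating to the old name servers) is the kind of thing a small comparison can catch early. Below is a minimal sketch, assuming the NS names published by the parent zone have already been obtained by some other means, for instance by querying a parent-zone server by hand. The function names and the example host names are hypothetical; nothing here reflects the actual skarnet.org or Gandi setup.

```python
def normalize(name):
    """Compare DNS names case-insensitively and without the trailing dot."""
    return name.strip().rstrip(".").lower()


def delegation_diff(parent_ns, intended_ns):
    """Return (stale, missing) name servers.

    stale:   names the parent zone still lists but that should be gone
    missing: names that should be delegated but are not listed yet
    """
    parent = {normalize(n) for n in parent_ns}
    intended = {normalize(n) for n in intended_ns}
    return sorted(parent - intended), sorted(intended - parent)


# Hypothetical example: the parent zone still points at the old server.
stale, missing = delegation_diff(
    ["NS1.old.example.", "ns2.shared.example."],
    ["ns1.new.example", "ns2.shared.example"],
)
```

In this example `stale` is `['ns1.old.example']` and `missing` is `['ns1.new.example']`; both lists being empty would mean the parent zone already matches the intended delegation.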