Server Crash

I haven't written much on my blog lately because I have been too busy with work. The provider that houses our 19″ rack and supplies the electricity and bandwidth for our hosting platform decided to upgrade its power facilities. As a result, they would cut the power to our rack in the night of 24/25 April 2007, at 1:50 AM.

After we shut down all servers that night and turned them back on, a problem occurred with one of our web servers. It halted at the boot loader screen (actually just before the daemon is drawn). It seemed that the screen stopped showing anything from that moment on, but the server (FreeBSD 5.3) was still booting. After a while everything seemed to be working, but it appeared that the server had started in single-user mode. All services were down except ssh. After starting most of them manually, we tried to dig into the problem.

My colleague Bayu from Indonesia started recompiling userland and I waited for that. After it was recompiled and the server rebooted, the screen still did not show the startup procedure; however, after a while I got a login prompt. I went home, as it was already 6:00 AM, and went straight to bed when I got there at 6:35 AM.

At 9:00 I got a wake-up call from my boss saying that the server was no longer reachable. I went downstairs to check from my home PC and he was right: nothing was working anymore and we were not able to log in to the server, although it still responded to ping. I went to the datacenter, where the screen again stopped just before the daemon should appear, and I couldn't do much other than boot from the CD-ROM. Bayu told me the last thing he did was reinstall exim because of some problems we had with it earlier.

We were able to boot into fixit mode, but nothing really helped. We took out the second hard disk (RAID 1) so that we had a copy, and ran fsck on the first disk. It fixed some errors, and we tried booting from it, but without success. We tried copying back an old kernel, but none worked. We decided to make a spare backup and were able to mount a drive from our backup server over NFS, from fixit mode.

One very annoying problem was that fixit mode kept giving errors: once in a while commands would suddenly stop working with a segmentation fault (core dump), and the machine then had to be rebooted again and again.

It took very long to fix this. I went back home the next morning around 10 AM, straight to bed, while our guys from Indonesia kept working on restoring the backups that I had made onto a different machine (we had to switch to a spare machine running Debian 3.1). The complication was that the old machine ran BSD and the new one Linux, which could cause some compatibility issues.

By night we had everything back up and running. Since then I have been busy fixing small issues with mod_evasive, firehol, etc.
