[Announce] SYN/HAK Infrastructure Outage is Over [12-12 thru 12-15]

Ricky Elrod ricky at elrod.me
Thu Dec 15 21:42:06 EST 2011


Warning: This message is pretty technical, and assumes familiarity with
basic virtualization and general Linux technology.

SYN/HAK,
   As you might be aware, our website and mailing list infrastructure 
has been offline for most of this week. What should have been a simple
upgrade from RHEL 6.1 to 6.2 turned into a disaster when the physical
server that hosts our VM locked up in the middle of the upgrade.
   When the server came back, the VM's boot partition was corrupted 
because of the way RHEL kernels are installed. Rather than "upgrading" a 
kernel, the old one is removed, and the new one is installed. The lockup
happened at a time when the old one had been removed but the new one had
not yet been installed. Needless to say, this is a bad state to be in.
   So while I was working on getting the VM back to a sane state, 
@phuzion was working with the dom0 (physical server) host to get things 
cleared up on that end, and figure out why the box kept dying. Finally 
he was able to convince them to replace the server, while keeping the 
disks, and I spent the majority of last night working on bringing us 
back online.
   After about 4 hours of hacking away, I still wasn't able to get
pygrub and the kernel on the VM to see each other, meaning the domU
(virtual server) would not boot. At this point I asked an IRC friend if
he would mind giving it a once-over, to see if I had missed anything
obvious. I VNC'd to him and ssh'd from him up to the dom0, and he took a
look. He didn't find anything that I hadn't, and after some config
editing, we still couldn't get the VM to boot.

   Finally around 5AM, at my friend's suggestion, I gave up on that
idea, and proceeded to create a new VM to replace the old one. The plan
was to spend a lot of hours copying our config files over and making
things work again, when it hit me: only /boot is broken. My fix was
literally "take /boot from the new VM and move it to a safe place",
"take /boot from the former VM and move it somewhere", "copy over all
files from the old VM to the new one", "move the new /boot back to the
new VM's hard drive." This, combined with a lot of LVM hacking,
chrooting, and some config editing, brought us back online. Obviously
the disk UUID changed, so I had to boot the new VM back up as it was
originally (fresh RHEL install) and find its UUID info in /etc/fstab.
The virtual MAC address also changed, so I had to edit some network
files.
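
   For anyone curious about the UUID step, here is a rough sketch of the
idea in Python. It is not what was actually run during the recovery; it
only illustrates taking the UUID that a fresh install's /etc/fstab uses
for a mount point and substituting it into the fstab copied over from
the old VM. The function names and the UUID values in the usage note are
made up for illustration, and it assumes the affected entries are
identified by UUID= rather than by device path.

```python
def parse_uuid_mounts(fstab_text):
    """Map mount point -> UUID for every fstab entry that uses UUID=..."""
    mounts = {}
    for line in fstab_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and entries not identified by UUID.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        if fields[0].startswith("UUID="):
            mounts[fields[1]] = fields[0][len("UUID="):]
    return mounts


def retarget_fstab(old_fstab, fresh_fstab):
    """Return old_fstab with each UUID replaced by the fresh install's UUID
    for the same mount point. All other lines pass through unchanged."""
    fresh = parse_uuid_mounts(fresh_fstab)
    out = []
    for line in old_fstab.splitlines():
        fields = line.split()
        if (len(fields) >= 2 and fields[0].startswith("UUID=")
                and fields[1] in fresh):
            # Replace only the first field, keeping the line's own spacing.
            line = line.replace(fields[0], "UUID=" + fresh[fields[1]], 1)
        out.append(line)
    return "\n".join(out) + "\n"
```

   Calling retarget_fstab(old_text, fresh_text) with an old /boot entry
of UUID=aaaa-1111 and a fresh one of UUID=bbbb-2222 (both hypothetical
values) returns the old file with only that /boot line changed.
Entries mounted by device path, such as LVM volumes, are left alone.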

   As of right now, after 7+ hours of work last night, things are back 
online, and seem to be functioning fairly well. Some things are still in 
a bit of a "weird" or "confused" state, and these will be worked out as 
they appear, but the VM is functioning, mailing lists are back, the wiki 
is editable, and I am happy to report that no data was lost. The first 
step in all of this was copying the old VM's hard disk image off-site, 
and running an fsck to ensure that it was okay.

   I asked in #synhak last night, and am asking again: Those with access 
to the VM (currently Trever and Chris/phuzion), please try not to do
much editing/customizing on it directly over the next day or two, as I
want to make sure that we are in a sane and stable state. End user 
services can be used as normal, so as far as the wiki and mailing lists 
are concerned, edit (or send) away!

   If anyone notices anything weird, please feel free to let me know
either on IRC or by email, and I'll try to get it fixed right away. If
it's not sensitive, feel free to respond directly to this message. As
always, we gladly accept donations of hosted infrastructure. I would
love to get some redundancy here, so that something like this can't
happen again.

   One last thing - messages sent to the mailing lists during the outage
should have been queued and delivered once things came back up. If any
messages are missing for whatever reason, please re-send them, as they
are lost.

Hack on!
-re