[Announce] SYN/HAK Infrastructure Outage is Over [12-12 thru 12-15]
ricky at elrod.me
Thu Dec 15 21:42:06 EST 2011
Warning: This message is pretty technical, and assumes familiarity with
some basic virtualization and general Linux technology.
As you might be aware, our website and mailing list infrastructure
has been offline for most of this week. What would normally have been a
simple upgrade from RHEL 6.1 to 6.2 turned into a disaster when the
physical server that our VM is on locked up in the middle of the
upgrade.
When the server came back, the VM's boot partition was corrupted
because of the way RHEL kernels are installed. Rather than "upgrading" a
kernel, the old one is removed and the new one is installed. The lockup
happened after the old one had been removed but before the new one had
been installed. Needless to say, this is a bad state to be in.
So while I was working on getting the VM back to a sane state,
@phuzion was working with the dom0 (physical server) host to get things
cleared up on that end, and figure out why the box kept dying. Finally
he was able to convince them to replace the server, while keeping the
disks, and I spent the majority of last night working on bringing us
back online.
After about 4 hours of hacking away, I wasn't able to get pygrub and
the kernel on the VM to see each other. Meaning the DomU (virtual
server) would not boot. At this point I asked an IRC friend if he would
mind giving it a once-over, to see if I had missed anything obvious. I
VNC'd to him and ssh'd from his machine up to the dom0, and he took a
look. He wasn't able to find anything I hadn't, and after some config
editing, we still couldn't get the VM to boot.
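For context: pygrub runs on the dom0 and reads the GRUB configuration
and kernel out of the guest's own /boot, so with /boot corrupted there
was nothing for it to find. A Xen DomU config of the kind involved looks
roughly like this (names, paths, and values are illustrative, not our
actual config):

```
name       = "synhak"                   # illustrative guest name
bootloader = "/usr/bin/pygrub"          # reads grub.conf + kernel from the guest's /boot
disk       = ["phy:/dev/vg0/synhak-disk,xvda,w"]
vif        = ["mac=00:16:3e:xx:xx:xx"]
memory     = 1024
```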
Finally, around 5AM, at my friend's suggestion, I gave up on that
idea and proceeded to create a new VM to replace the old one. The plan
was to spend a lot of hours copying our config files over and making
things work again... when it hit me: only /boot was broken. My fix was
literally "take /boot from the new VM and move it to a safe place",
"take /boot from the former VM and move it somewhere", "copy over all
files from the old VM to the new one", "move the new /boot back to the
new VM's hard drive." -- this, combined with a lot of LVM hacking,
chrooting, and some config editing, brought us back online. Obviously the
disk UUID changed, so I had to boot the new VM back up as it was
originally (fresh RHEL install) and find its UUID info in /etc/fstab.
The virtual MAC address also changed, so I had to edit some network files.
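The /boot transplant described above can be sketched with plain file
operations; here it is simulated with temp directories (all names and
file contents are illustrative, not the real server's layout):

```shell
# Simulated layout: "newvm" has a working /boot from the fresh install,
# "oldvm" has a broken /boot but all of our real data.
work=$(mktemp -d)
mkdir -p "$work/newvm/boot" "$work/oldvm/boot" "$work/oldvm/etc"
echo "vmlinuz-2.6.32-220.el6" > "$work/newvm/boot/vmlinuz"  # working kernel
echo "corrupted"              > "$work/oldvm/boot/vmlinuz"  # broken /boot
echo "site-config"            > "$work/oldvm/etc/app.conf"  # data we want to keep

# 1. Set the new VM's working /boot aside in a safe place
mv "$work/newvm/boot" "$work/safe-boot"
# 2. Move the old VM's broken /boot out of the way
mv "$work/oldvm/boot" "$work/broken-boot"
# 3. Copy everything from the old VM onto the new one
cp -a "$work/oldvm/." "$work/newvm/"
# 4. Put the working /boot back on the new VM's disk
mv "$work/safe-boot" "$work/newvm/boot"

cat "$work/newvm/boot/vmlinuz"   # the working kernel survives
cat "$work/newvm/etc/app.conf"   # the old VM's data survives
```

On the real system the same shuffle happened across LVM volumes from a
chroot, which is where the "LVM hacking" came in.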
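The UUID and MAC fixups amount to editing two files. A sketch, using
throwaway files (every UUID, MAC, and value here is illustrative; on the
real VM the new UUID comes from the fresh install's own /etc/fstab or
from blkid, and the new MAC from the new virtual NIC):

```shell
# Stand-ins for /etc/fstab and /etc/sysconfig/network-scripts/ifcfg-eth0
fstab=$(mktemp)
ifcfg=$(mktemp)
printf 'UUID=aaaaaaaa-old / ext4 defaults 1 1\n' > "$fstab"
printf 'DEVICE=eth0\nHWADDR=00:16:3E:AA:AA:AA\nONBOOT=yes\n' > "$ifcfg"

new_uuid="bbbbbbbb-new"      # illustrative; really from `blkid` on the new disk
new_mac="00:16:3E:BB:BB:BB"  # illustrative; really the new VM's virtual MAC

# Point the root mount at the new disk's UUID
sed -i "s|^UUID=[^ ]*|UUID=$new_uuid|" "$fstab"
# Point eth0 at the new virtual MAC
sed -i "s|^HWADDR=.*|HWADDR=$new_mac|" "$ifcfg"

cat "$fstab"
cat "$ifcfg"
```

On RHEL 6 the old MAC can also be cached in
/etc/udev/rules.d/70-persistent-net.rules, which is worth checking if
eth0 comes up under a different name after a move like this.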
As of right now, after 7+ hours of work last night, things are back
online, and seem to be functioning fairly well. Some things are still in
a bit of a "weird" or "confused" state, and these will be worked out as
they appear, but the VM is functioning, mailing lists are back, the wiki
is editable, and I am happy to report that no data was lost. The first
step in all of this was copying the old VM's hard disk image off-site,
and running an fsck to ensure that it was okay.
I asked in #synhak last night, and am asking again: Those with access
to the VM (currently Trever and Chris/phuzion), please try not to do
much editing or customizing on it directly over the next day or two, as
I want to make sure that we are in a sane and stable state. End user
services can be used as normal, so as far as the wiki and mailing lists
are concerned, edit (or send) away!
If anyone notices anything weird, please feel free to let me know
either on IRC or in email, and I'll try to get it fixed right away. If
it's not sensitive, feel free to respond directly to this message. As
always, we gladly accept donations of hosted infrastructure. I would
love to get some redundancy here, so that something like this can't
happen again.
One last thing - messages sent to the mailing lists during the outage
should have been queued and sent already when things came back up. If
there are messages missing for whatever reason, please re-send them, as
they have been lost.