How Bad is it? Answer: Really Bad

Posted on November 21, 2007 at 11:04 pm by Jason Lee

Monday…
After another exciting day on the phone with tech support, I am left wondering: when do you say enough is enough?

The morning started at 7am, when I found out support didn’t open till 9am. I should have known it was going to be a long day.

The first 5 hours were spent calling every 30 minutes for updates from Intel.  While the support agents speak clear English, they weren’t exactly helpful.  They were only able to let me know that engineering was looking at the case and had requested remote access to our network.  Now they did note, engineering NEVER asks to remotely connect… wow wasn’t I excited to hear that.

A little after 1pm a support tech named Ruonan called my cell phone and said she would be taking the case.  I have to say she is probably one of the most pleasant tech support agents I have spoken with.  So Ruonan contacted Galen at LeftHand Networks and we started digging into the problem.  Galen connected to the SAN via SSH and noticed that the files that should be on the drives to create the configuration were missing… Hence we can’t find the data on the SAN.

Ruonan then instructs us to boot to Slax pocket Linux and browse for the files on the DOM.  So we load Slax on a CD and boot from disk to create a bootable jump drive…  One item learned: the Slax boot ISO doesn’t have the installer the instructions describe, so we can’t make the USB jump drive bootable.  So Jeremie downloads DSL (D*** Small Linux) and we load that onto the jump drive.  All this monkeying around takes about 3 hours….  We boot up the SAN to find that the files are missing on the old DOM too…

So where does this leave us?  According to Galen, there is an 80% chance of needing to rebuild the SAN.  So when do you wave good-bye to the data you know is on the SAN and start the backup recovery process?  Ugh, the thought of restoring 13 virtual servers including a SQL server doesn’t give me warm fuzzies.

So at 5 pm I make the dreaded call… we are going to restore the servers from backup.  We contact Mark Moreno our SAN channel partner and the rebuilding of the SAN begins.

For the rest of the night we build and boot up virtual server templates and start the restore process.  At 1 am it’s clear we aren’t going to be up and running by morning, so I tell Jeremie it’s time for us to head home.

Tuesday…
I arrive at the office at 7am and we continue to work on rebuilding the virtual servers. We did a system state restore to the print server and all the printers are back… but the server isn’t stable… So we take the info about the print shares etc. and start up a new template.  Around 9 am we have our print server back online (minus 2 printers)…

I do have to say the restore process from our SonicWall CDP is really nice… We install the CDP client on the new virtual servers, then right click on the directories and tell it to restore deleted directories, and the process is under way. Need I say, no searching for tapes.  As soon as the data is back online, the directory is a watched folder again.

Around 2 pm the SAN is configured and ready for us to start restoring data. One bonus of the whole process: we configure the SAN with load balancing this time to utilize both NICs on the VM_Hosts, and we configure the mirroring and array in RAID 5, gaining almost a full TB of storage space.

We attempted to restore our SQL databases to the newly built SQL server but no luck.  After a 2 hour call to Microsoft, a few tweaks to the permissions, and some work with the cliconfg tool, we are able to remotely access the SQL server.  But now the Desktop Authority application fails when we install… And of course DA support is closed.

While I am talking with MS about SQL, Jeremie starts a case with MS Support about restoring the system state of the file server.  We have the virtual server with the data restored but we need to use the system state to bring back the permissions and shares.  We called MS since we hadn’t had much luck with the system state restore on other servers earlier in the day, and our lives would be much easier if we could restore the file server properties.  So we snapshot the virtual server and MS helps restore the registry… No luck on reboot one, but suddenly after reboots 2 and 3 file shares appear.  A little tweaking of the registry and we are back up… but minus permissions.  The tech tells us that we will have to rebuild the permissions… So at least we got the shares back… So now that it’s 2 am I tell Jeremie it’s time for us to go home.

Wednesday…
At 7:15 am I call ScriptLogic and the tech points out that we just need to do a clean install of DA and all is well… And once again our CDP works well… right click on the database in the client application on the SQL server and magically the DB is restored.  Logging into the DA console, all our settings are there… we can now finish our print server since we know the printer share names we need.

It takes about 2 hours to restore the permissions to the file server but that is now back online.  So after 4 days we have our File, SQL, Print, Web, Antivirus servers back online.  The help-desk, SharePoint, Ghost and a few other less important servers will have to wait until after Thanksgiving weekend.

Our next steps: finish restoring the remaining servers, then review and evaluate the crisis.  This will include asking what we can do to prevent this from happening again.

We found some issues with our backup process, primarily documenting configurations.  You don’t realize how much stuff you store on your file server or in specific application databases until your file server and SQL server are MIA.

One thing we will seriously look into: a storage space and process to back up the whole virtual server.  If we had a backup of the whole VMDK (and other files) for each virtual server we could just restore the data to the offline server and this process would have been much quicker… Maybe a rack mounted NAS might be a good solution.  It needs to be off the SAN, but still on drives that are a RAID array.
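Here is a rough Python sketch of what that whole-server copy could look like. It assumes the virtual server is shut down (or snapshotted) so its files are not changing, and that the NAS is mounted as an ordinary directory. The function name `backup_vm_dir` and any paths you pass in are hypothetical; this is not a feature of any particular virtualization or backup product.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_vm_dir(vm_dir: Path, backup_root: Path) -> list:
    """Copy every file under vm_dir to backup_root/<vm_dir name>, verifying each copy.

    Returns the relative paths of the copied files. Raises OSError if a
    copy's checksum does not match its source.
    """
    target_dir = backup_root / vm_dir.name
    copied = []
    for source in sorted(p for p in vm_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(vm_dir)
        destination = target_dir / relative
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, destination)  # copies data and timestamps
        if sha256_of(source) != sha256_of(destination):
            raise OSError(f"checksum mismatch for {relative}")
        copied.append(relative.as_posix())
    return copied
```

The checksum pass reads every file twice, which is slow for big disk images, but it is the part that tells you the copy on the NAS is actually usable before you need it.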

One thing we have learned: how to back up the configuration from the SAN, and that we should.  The support at Intel said that recovery of the SAN after a DOM failure would probably work if we had a backup of the config, though they also said that restore sometimes fails too… But for the future, we have that config saved and archived just in case.
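Keeping that saved config archived can be as simple as a dated copy in a folder off the SAN. A minimal Python sketch, assuming the config has already been exported to a file (how you export it depends on the SAN vendor and is not shown here); `archive_config` and its parameters are made-up names for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def archive_config(exported_file: Path, archive_dir: Path, keep: int = 10,
                   now: Optional[datetime] = None) -> Path:
    """Store a timestamped copy of an exported config file and prune old copies.

    keep is the number of most recent copies to retain (must be at least 1).
    now exists only so the timestamp can be fixed in tests.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    name = f"{exported_file.stem}-{stamp}{exported_file.suffix}"
    destination = archive_dir / name
    shutil.copy2(exported_file, destination)

    # The timestamp format sorts lexicographically, so oldest copies come first.
    pattern = f"{exported_file.stem}-*{exported_file.suffix}"
    copies = sorted(archive_dir.glob(pattern))
    for old in copies[:-keep]:
        old.unlink()
    return destination
```

Run it after every SAN configuration change, and point `archive_dir` at storage that does not live on the SAN itself.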

I must say a few thanks for the prayers!  Several people, including some Northwoods’ staff, JP and Ed, called or emailed just to check in on us or say they were praying for us… very cool… this is what CITRT is all about.  I even had a vendor call and ask how they could help… Dean Lisenby from ACS is top notch in my book.  He called after his work day was done, on his way home, to remind me that I have his cell phone number and can call for any reason… even just to check in… Dean isn’t your normal vendor.

I have to give huge Thanks to my staff.  Each person on my team has responded extremely well during this crisis; no task was trivial no matter if it was laughing at me when my thoughts are less than clear when they come out of my mouth at 1 am, fixing my forgetting to run a Virtual server as a service rather than under the local login account, answering the question of “when will I be able to….” for the rest of the staff, or unlocking doors after I leave my keys in the server room or my office, or the new server room, or in the bathroom, or.. well you get the picture.  Thanks Jeremie, Jim and Linda you 3 are an awesome team!!!!

Now I must sleep; 10 hours of sleep since Monday morning leaves a sleepy IT Director.

Posted in Church IT

Commentary

  1. Clif Guy

    11.22.2007 10:14 am

    That’s a cautionary tale for the rest of us. Glad you got through it okay, but that’s really awful. I’m starting to freak out wondering if our backup strategy is all it SHOULD be!

  2. David Szpunar

    11.26.2007 11:54 am

    Wow. What a story; glad you were able to get enough stuff working by Thanksgiving to actually take time off! Ahh…backups. We’ll be beefing up ours soon as soon as I figure out the best solution. Sounds like yours were better than some!

  3. [...] of our recent SAN crash we had to rebuild many of our virtual servers, one of those was our second Domain Controller.  I [...]

  4. [...] learned our strategy is sufficient to have a complete failure and not loose data, but we are still tweaking the system for the future…  We learned in our recovery we had [...]
