MasterFlex Preventative Maintenance
Every other Monday night we hold an IT work night to service our network, and this Monday was the night scheduled to replace the mid-plane on MasterFlex. We had worked with support to identify the cause of the random alerts MasterFlex was sending out, and the engineers decided it was the sensors on the mid-plane.
So we powered down all the virtual servers and the hosts and took basically everything out of the chassis to prepare for the hardware swap.
The installation was fairly simple after we removed the 12 screws, installed the new mid-plane and installed the modules back in the chassis.
We restarted the chassis to find that the mid-plane had an older version of firmware and that the control module was downgrading all the components to a previous version… then everything went to chaos.
The StorageModule asked to enter safe mode to change the firmware (a process we had done before) but after the module cycled down it never came back online in the management console.
After several hours of tech support calls we identified that the module wasn’t powering on, so a part was dispatched. We continued to talk with support for a little while longer to see if there was any reset or other step that could get the module to cycle back on, but had no success.
Around 2 am we were told that the case was escalated to engineering and we would need to wait until engineering contacted us.
For the next 3 1/2 hours Jeremie and I worked to move around virtual servers and recover a few backups to bring online the majority of our services.
When we went home at 5:45, everything mission critical except ACS and our print server was online. We didn’t bring ACS back from backup because we would lose all of the contributions and other Monday AM processing that had happened after the previous night’s backup and before our Monday night backup, which had not yet happened. The print server didn’t restore from the system state.
So after a few hours of sleep we talked with Intel, and they dispatched a new StorageModule/Controller.
More to come as the process continues…
Kirt Manuel
06.18.2008 8:57 pm
You should note just for bragging rights that that’s 5:45 am. That’s pretty cool for somebody from Ohio.
Jason Lee » Kudos to ACS OnDemand
06.24.2008 11:05 am
[...] our recent outage Dean Lisenby and the Team at ACS came thru for us [...]