prk | Blade Center Chassis Redundancy

Over the past few years, the trend in data centre servers has been towards blades. Instead of buying each server as a self-contained unit (power supplies, network interfaces, etc.), you buy a blade centre chassis, put the modules you want in it (network, fibre channel, etc.), then just plug individual blades into it.

There are various advantages to blades - consolidating power & cooling, overall management, higher density of servers per rack, etc. There's plenty of info out there if you Google for it.

With blades, as you now have many servers (14 in the IBM Blade Centre H Chassis) relying on the chassis, you ensure it's as redundant as possible - redundant power supplies, power feeds, network switches, fibre channel switches, etc. In theory, the worst that can happen is you lose a switch, or a power supply, and everything keeps working, perhaps at reduced capacity.

In practice, things can get much worse.

On Thursday afternoon, we lost contact with a number of servers in one of our data centres. Shortly afterwards, we received alerts from the VESDA fire warning system. These alerts quickly went from "possible smoke detection" to "smoke detection" to "fire detection" to "Get the f#$k out of that room, we're triggering the fire suppression system".

At that stage the FM-200 fire suppression system triggered, flooding the data centre with suppressant gas, putting out the fire, and rendering the room inaccessible to humans for several hours.

After the all-clear was finally given by the fire department (they were in full self-contained breathing apparatus whilst checking the room and letting the gas disperse), our guys were able to go in and investigate.

An entire Blade Centre Chassis was offline, and had a burning plastic smell coming from it. That's not a good smell from electrical equipment.

Of the chassis' four redundant power supplies, one was completely dead and the other three were blinking orange, and wouldn't power on. The inbuilt fibre channel switches, ethernet switches & management modules were all dead.

Our guys pulled all of the blades out, and de-racked the chassis.

This blade was the culprit. You can see the toasted power feed in the top left, and the expansion card in the bottom centre: