Blade Center Chassis Redundancy
Over the past few years, the trend in data centre servers has been towards blades. Instead of buying each server as a self-contained unit (power supplies, network interfaces, etc.) you buy a blade centre chassis, put the modules you want in it (network, fibre channel, etc.), then just plug individual blades into it.
There are various advantages to blades - consolidating power & cooling, overall management, higher density of servers per rack, etc. There's plenty of info out there if you Google for it.
With blades, as you now have many servers (14 in the IBM Blade Centre H Chassis) relying on the chassis, you ensure it's as redundant as possible - redundant power supplies, power feeds, network switches, fibre channel switches, etc. In theory, the worst that can happen is you lose a switch, or a power supply, and everything keeps working, perhaps at reduced capacity.
In practice, things can get much worse.
On Thursday afternoon, we lost contact with a number of servers in one of our data centres. Shortly afterwards, we received alerts from the VESDA fire warning system. These alerts quickly went from "possible smoke detection" to "smoke detection" to "fire detection" to "Get the f#$k out of that room, we're triggering the fire suppression system".
At that stage the FM-200 fire suppression system triggered, flooding the data centre with suppressant gas, putting out the fire, and rendering the room inaccessible to humans for several hours.
After the all clear was finally given by the fire department (they were in full self-contained breathing apparatus whilst checking the room and letting the gas disperse), our guys were able to go in and investigate.
An entire Blade Centre Chassis was offline, and had a burning plastic smell coming from it. That's not a good smell from electrical equipment.
Of the chassis' four redundant power supplies, one was completely dead and the other three were blinking orange, and wouldn't power on. The inbuilt fibre channel switches, ethernet switches & management modules were all dead.
Our guys pulled all of the blades out, and de-racked the chassis.
This blade was the culprit. You can see the toasted power feed in the top left, and the expansion card in the bottom centre:

A close up of the power feed, top down view:

A close up of the power feed, view from behind:

Whilst it's not as easy to see, the chassis was also toasted, where the blade plugged in, and its neighbouring slots:

The guys had to replace the entire chassis, and carefully test a few of the blades which were furthest from the burnt one to ensure they worked, in order to bring back services running on that chassis.
So despite things you might read, such as: "Based-upon actual lab tests and a thorough tear-down of IBM’s BladeCenter chassis, Clabby Analytics concludes that the IBM BladeCenter design provides a superior reliability/availability design..."
In particular comments like: "The IBM BladeCenter chassis provides dual power paths through the midplane. This redundant power plane design means that in the event that some sort of short incapacitates one plane, IBM BladeCenter can failover to a secondary plane... Because IBM’s chassis has a second power path, if one power backplane fails, the other can be used to deliver continuous power ― helping result in none of IBM’s 14 servers crashing. In high availability environments, this could prove to be a very big deal..."
What appears to be a short on one of the blades blew that blade and the entire Blade Centre chassis.
In three months, this wouldn't have mattered: we'd have had the second chassis online in a separate data centre. The timing of the failure was impeccable.
Takeaway Messages:
IBM Blades can combust, spectacularly.
Whilst combusting, they can trigger data centre smoke alarms, fires and fire suppression systems.
While combusting, even with a suitable fire suppression system, they can take out an entire IBM Blade Centre Chassis, no matter how redundant you make it.
Do not rely on a single Blade Centre Chassis, no matter how redundant you spec it.
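To make that last point something you can check rather than just remember, here is a minimal sketch in Python. It assumes you can export an inventory of (service, chassis) pairs from whatever asset system you use; the function name, the service names and the chassis names below are all made up for illustration and are not from our environment.

```python
from collections import defaultdict


def single_chassis_services(placements):
    """Return, sorted, the services whose instances all live in one chassis.

    `placements` is an iterable of (service, chassis) pairs. This is a
    hypothetical inventory format, not the output of any real tool.
    """
    chassis_by_service = defaultdict(set)
    for service, chassis in placements:
        chassis_by_service[service].add(chassis)
    # A service is exposed if losing one chassis takes out every instance.
    return sorted(
        service
        for service, chassis_set in chassis_by_service.items()
        if len(chassis_set) < 2
    )


# Hypothetical example inventory.
placements = [
    ("web", "chassis-a"),
    ("web", "chassis-b"),
    ("db", "chassis-a"),
    ("db", "chassis-a"),
]

print(single_chassis_services(placements))
```

In this made-up inventory, "db" has two instances but both sit in chassis-a, so it is the one that would have gone down with our chassis. The same check works at the data centre level if you swap the chassis name for a site name.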
no subject
You have so much fun!