[personal profile] prk
Over the past few years, the trend in data centre servers has been towards blades. Instead of buying each server as a self-contained unit (power supplies, network interfaces, etc.), you buy a blade centre chassis, put the modules you want in it (network, fibre channel, etc.), then just plug individual blades into it.

There are various advantages to blades - consolidating power & cooling, overall management, higher density of servers per rack, etc. There's plenty of info out there if you Google for it.

With blades, as you now have many servers (14 in the IBM Blade Centre H Chassis) relying on the chassis, you ensure it's as redundant as possible - redundant power supplies, power feeds, network switches, fibre channel switches, etc. In theory, the worst that can happen is you lose a switch, or a power supply, and everything keeps working, perhaps at reduced capacity.



In practice, things can get much worse.

On Thursday afternoon, we lost contact with a number of servers in one of our data centres. Shortly afterwards, we received alerts from the VESDA fire warning system. These alerts quickly went from "possible smoke detection" to "smoke detection" to "fire detection" to "Get the f#$k out of that room, we're triggering the fire suppression system".

At that stage the FM-200 fire suppression system triggered, flooding the data centre with fire-suppressant gas, putting out the fire, and rendering the room inaccessible to humans for several hours.

After the all-clear was finally given by the fire department (they were in full self-contained breathing apparatus whilst checking the room and letting the gas disperse), our guys were able to go in and investigate.

An entire Blade Centre Chassis was offline, and had a burning plastic smell coming from it. That's not a good smell from electrical equipment.

Of the chassis' four redundant power supplies, one was completely dead and the other three were blinking orange and wouldn't power on. The inbuilt fibre channel switches, Ethernet switches & management modules were all dead.

Our guys pulled all of the blades out, and de-racked the chassis.

This blade was the culprit. You can see the toasted power feed in the top left, and the expansion card in the bottom centre:



A close up of the power feed, top down view:



A close up of the power feed, view from behind:



Whilst it's not as easy to see, the chassis was also toasted, where the blade plugged in, and its neighbouring slots:



The guys had to replace the entire chassis, and carefully test a few of the blades which were furthest from the burnt one to ensure they worked, in order to bring back services running on that chassis.

So despite things you might read, such as: "Based-upon actual lab tests and a thorough tear-down of IBM’s BladeCenter chassis, Clabby Analytics concludes that the IBM BladeCenter design provides a superior reliability/availability design..."

In particular, comments like: "The IBM BladeCenter chassis provides dual power paths through the midplane. This redundant power plane design means that in the event that some sort of short incapacitates one plane, IBM BladeCenter can failover to a secondary plane... Because IBM’s chassis has a second power path, if one power backplane fails, the other can be used to deliver continuous power ― helping result in none of IBM’s 14 servers crashing. In high availability environments, this could prove to be a very big deal..."

What appears to be a short on one of the blades blew that blade and the entire Blade Centre chassis.

In three months, this wouldn't have mattered - we'd have had the second chassis online in a separate data centre. The timing of the failure was impeccable.

Takeaway Messages:

IBM Blades can combust, spectacularly.
Whilst combusting, they can trigger data centre smoke alarms, fires and fire suppression systems.
While combusting, even with a suitable fire suppression system, they can take out an entire IBM Blade Centre Chassis, no matter how redundant you make it.
Do not rely on a single Blade Centre Chassis, no matter how redundant you spec it.
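
A rough back-of-the-envelope sketch of that last point, in Python. Everything here is an assumption for illustration: the function names are hypothetical, the failure probabilities are made up (not measured from this incident or published by IBM), and the two chassis are assumed to fail independently, e.g. because they sit in separate data centres.

```python
# Illustrative only: the probabilities used below are made-up assumptions,
# not measurements from this incident or figures from IBM.


def chassis_up(p_part: float, p_common: float) -> float:
    """Chance that one chassis stays up over some period.

    p_part:   chance that a single redundant part (PSU, switch) fails
    p_common: chance of a common-mode event (e.g. a blade shorting the
              midplane) that takes out the chassis no matter how many
              redundant parts it has
    """
    # A redundant pair only fails if both of its parts fail.
    pair_survives = 1 - p_part ** 2
    # The chassis stays up only if the pair survives AND no common-mode event occurs.
    return pair_survives * (1 - p_common)


def any_chassis_up(p_part: float, p_common: float, chassis: int) -> float:
    """Chance that at least one of several independent chassis stays up."""
    p_one_down = 1 - chassis_up(p_part, p_common)
    return 1 - p_one_down ** chassis


if __name__ == "__main__":
    p_part, p_common = 0.01, 0.001  # hypothetical values
    print(f"one chassis: {chassis_up(p_part, p_common):.6f}")
    print(f"two chassis: {any_chassis_up(p_part, p_common, 2):.6f}")
```

Under these assumptions, adding more redundant parts inside a chassis only shrinks the `p_part ** 2` term. The common-mode term is untouched, and it ends up dominating. Only a second, independent chassis reduces it.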

(no subject)

Date: 2009-12-05 10:41 am (UTC)
From: [personal profile] transcendancing
wow - that's rather spectacular!

(no subject)

Date: 2009-12-05 03:18 pm (UTC)
From: [personal profile] rdm
We've got a few of these at work... so that's a very scary story!

(no subject)

Date: 2009-12-05 03:28 pm (UTC)
From: [personal profile] discordia13
And to think the worst thing we have had go wrong with our servers is nests of spiders...

You have so much fun!

(no subject)

Date: 2009-12-06 12:24 am (UTC)
From: [identity profile] chrisb74.livejournal.com
Did the FM200 cause any other issues?

I miss all the fun. :)

(no subject)

Date: 2009-12-10 10:11 am (UTC)
From: [identity profile] chrisb74.livejournal.com
I don't remember... 2000-2001ish I guess.

(no subject)

Date: 2009-12-07 03:54 am (UTC)
From: [personal profile] japester
oh wow. That is a singularly impressive failure.

(no subject)

Date: 2015-01-16 11:18 am (UTC)
From: [personal profile] steven_m
Interesting read. I worked for State Government here in Australia for 9 years, serving in the IT department at one of Brisbane's largest hospitals (RBWH at Herston). We too had a similar incident, although not as bad as yours. Like you, we were running blade servers, but hundreds of them. In fact, maybe over a thousand at the time (I never counted them all), but our data centre was the size of about 3 full-size basketball courts. Our data centre was fitted with particle detectors - these things were super sensitive. I mean, if you walked near one and, say, brushed your arm, the digital readout on the front would change. Amazing technology.

Anyway, every now and then, the particle detectors would go off and sound the alarms. Being in a hospital, this meant an entire evacuation of the building. We'd be heading downstairs and the fire dept would be on their way up. This went on for a few weeks until we detected a faulty power board on one of the server racks. (Funnily enough, you could kind of smell it, but being in such a huge data centre, it was almost impossible to find.)

Anyway, in the end we managed to find it and replace it, but the entire operation took months, just because of one small electrical smell/burnout from a power board. Being a government department, they were VERY strict about OHS and put about 1200 people through a Fire Safety and Training course with this company Fire & Safety Aus. It must have cost them an absolute fortune, but it was surely worth it. I think after several weeks of everyone running up and down 13 flights of stairs we were well and truly over it!