[nylug-talk] rack mount UPS - need advice
alex@pilosoft.com
Fri Jan 13 04:03:08 EST 2006
On Thu, 12 Jan 2006, N.J. Thomas wrote:
> Perhaps this is opening a can of worms...but how "often" do data centers
> have power issues? I remember a few months ago there was a place
> somewhere near Wall St and Broadway that went down a couple of times,
> taking out either NYLUG or NYCBUG or the local PHP group for a couple of
> days. And off the top of my head, I remember Wikipedia's Florida data
> center going down and corrupting their MySQL database servers last year.
>
> I know that even a single outage is unacceptable, but when I walked into
> our colo after the first outage, I heard another person comment
> something along the lines of, "this happened twice a year ago as well".
Yep. Telehouse UPS#3 is cursed. Fails twice a year, at least. It fails
whenever it has to go from utility to bypass or generator. The circuit
monitoring voltage on bypass was miswired or broken - so the UPS saw
"overvoltage" or "undervoltage" on bypass and refused to go to bypass,
instead interrupting the load (which is correct behaviour if the bypass
was actually over/under - it is better to interrupt the load than to
expose it to over/under voltage). Hopefully, it's fixed now.
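To make that failure mode concrete, here is a minimal sketch of the
decision as described above. The function name, the 480 V nominal figure
and the 10% window are assumptions for illustration only, not values from
the actual Telehouse unit:

```python
def bypass_decision(sensed_volts, nominal_volts=480.0, tolerance=0.10):
    """Decide what to do when the UPS must leave inverter power.

    sensed_volts is what the monitoring circuit *reports* for the bypass
    feed, which is not necessarily what the feed actually carries.
    Nominal voltage and tolerance are illustrative assumptions.
    """
    low = nominal_volts * (1 - tolerance)
    high = nominal_volts * (1 + tolerance)
    if low <= sensed_volts <= high:
        return "bypass"
    # Out-of-window reading: protect the load by dropping it.
    return "drop_load"


# Healthy bypass, healthy sensor: transfer goes through.
print(bypass_decision(478.0))
# Healthy bypass, miswired sensor reporting nonsense: load is dropped anyway.
print(bypass_decision(0.0))
```

The logic itself is sound; it is only as good as the reading it is fed,
which is why a broken monitoring circuit turns a safety feature into an
outage.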
Now, whether this is *expected* is kind of tricky. There are outages
and there are outages. Anything caused by a natural event may be excusable
if the event is of such magnitude that it could not have been prepared for
without excessive cost. Outages due to *preventable* things are more
embarrassing - as they suggest a failure of management or policy. There are
always human factors involved, and some just can't be corrected (note the
EPO-caused failures). Also, while it may *look* simple to prevent some of
those outages, just remember that hindsight is always 20/20.
Also, keep in mind that usually (and unfortunately) facilities don't
publish detailed reasons for an outage (mainly because it is often very
embarrassing), and thus learning from other people's mistakes is much
harder.
Also, keep in mind that the 'possible prevention' items listed below are
just that: they *possibly* could have prevented outages, but it is
impossible to know now.
Let me list some outages and near-outages that I remember for posterity
and then you can make your own decisions:
*) Verio outage in Boca Raton, Florida, during Hurricane Wilma: After some
time in operation, the fuel pump feeding 5 separate generators failed
(technically, Verio doesn't own them; the genset farm is owned and run by
the company that manages the carrier hotel, and Verio is one of the
tenants). Reasons are unclear, but it appears the fuel pump (one or all?)
was outside and suffered environmental damage from the hurricane.
*) Telehouse outage after 9/11: On 9/13, after two days of operation,
the water pump cooling the genset failed, and the generators shut off due
to overheating. I was told that this overheating was partially caused by
debris/dust/etc from WTC.
*) ThePlanet, sometime in 2005: A faulty fuse in one of the redundant UPS's
caused the unit to trip. Load transferred to the backup UPS, but tripped
the breaker on the backup UPS in the process. Power was going on and off as
the UPS's went on and off bypass.
Possible cause: incorrectly sized breakers.
Possible prevention: testing of UPS failover with a load bank
simulating full load prior to putting units in operation.
*) Equinix Chicago, sometime in 2005: after utility power was lost due to
a transformer fire, cables going from the generator to the UPS overheated
and shorted. The UPS's were set up in a paralleling configuration, and both
UPS's tripped (reason unknown).
Possible prevention: load-testing of generator with load-bank
that is sized to be similar to actual load.
Note that in the last two cases, paralleling (which should add redundancy)
was actually the cause of an outage.
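One way to think about the ThePlanet-style failure is as a paper check
before the load-bank test: if any one unit drops out, can each surviving
unit's breaker carry its share of the full load? The sketch below is only
an illustration. The function name is made up, the even load split is a
simplification, and the 0.8 continuous-load derating is an assumed rule of
thumb, not something taken from any of the facilities above:

```python
def survives_single_failure(total_load_amps, breaker_ratings_amps, derate=0.8):
    """Return True if the load still fits after losing any one UPS.

    breaker_ratings_amps: one output breaker rating per UPS.
    derate: fraction of the rating allowed for continuous load (assumed).
    Assumes survivors share the load evenly, which is a simplification.
    """
    n = len(breaker_ratings_amps)
    if n < 2:
        return False  # no redundancy at all
    for failed in range(n):
        survivors = [r for i, r in enumerate(breaker_ratings_amps) if i != failed]
        share = total_load_amps / len(survivors)
        if any(share > rating * derate for rating in survivors):
            return False
    return True


print(survives_single_failure(300, [400, 400]))  # each breaker can carry it all
print(survives_single_failure(300, [400, 250]))  # the 250 A breaker would trip
```

A check like this does not replace the load-bank test; it only catches the
cases that are wrong on paper.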
*) Serverbeach Virginia, sometime in 2005: after utility power loss, the
ATS did not operate as expected, so the load was not transferred to the
generator. UPS batteries don't last long, resulting in an outage.
Possible prevention: regular testing of ATS
*) EV1, sometime in 2004: High power spike fried UPS rectifier, UPS went
to bypass (properly). Soon after that, brownouts caused PDUs to drop the
load in order to prevent damage due to undervoltage. Facility went to
generator power (but since UPS was inoperative, this resulted in another
quick power loss).
Possible prevention: eh, hard to hazard a guess.
*) Globix, sometime in 2002: EPO button (emergency power-off, required by
fire codes for every datacenter) was located very close to the door
release button. Hapless soul accidentally pressed the wrong button.
The exact same thing happened to Internap/Seattle in mid-2005 (and a couple
of other places/times that I just can't remember offhand). This is probably
*the* highest single cause of outages: fat engineers leaning on EPO
buttons ;)
In retrospect, prevention is simple: have glass-encased EPO (they are
permitted by fire code).
*) NAC, sometime in 2003: UPS caught fire after a condenser exploded,
resulting in FM200 release and interruption of the load.
Possible prevention: thermal imaging inspection. It may or may not have
helped.
Possible prevention: addition of another static transfer switch in front
of the UPS (additional cost).
Possible prevention: VESDA smoke-detection (additional cost, not sure if
it would have helped).
*) Near-outage: New Orleans, DirectNIC: After power was interrupted by the
hurricane, the datacenter remained on generator. Fuel was delivered shortly
before the generator would have run out of it.
Possible prevention: Eh, again, this is probably one of those that can't
really be prepared for.
*) Near-outage: EMC facility that is now owned by Pilosoft, after 9/11:
building water condensers failed due to dust/etc fallout from WTC. The
facility ran on spot-coolers and fans until power was restored. The
generator also had dust, and only timely intervention kept it from failing.
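For the fuel cases in the list above (DirectNIC in particular), the
arithmetic behind "how close was it" is simple: hours of fuel on hand
versus hours until the next delivery. A rough sketch, with hypothetical
function names and made-up numbers (none of these figures come from the
incidents above):

```python
def hours_of_fuel(gallons_on_hand, burn_gallons_per_hour):
    """Estimated generator runtime at the current burn rate."""
    if burn_gallons_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return gallons_on_hand / burn_gallons_per_hour


def delivery_margin_hours(gallons_on_hand, burn_gallons_per_hour, eta_hours):
    """Positive: fuel arrives in time. Negative: the genset runs dry first."""
    return hours_of_fuel(gallons_on_hand, burn_gallons_per_hour) - eta_hours


# Illustrative numbers only: 1000 gal in the tank, 50 gal/h, truck 18 h out.
print(delivery_margin_hours(1000, 50, 18))
```

The burn rate climbs with load, so a real estimate would use the rate at
the load the generator is actually carrying.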
Bottom line is, design of a datacenter is complex; some failures can be
prevented, some can't. That being said, 3 power outages in a month is
evidence of reckless cluelessness. ;)
> > e) such UPS is already in Telehouse, and waiting for someone to pick
> > it up. When HOSUE power tripped the second time around, I brought a
> > bunch of those UPS's from my office to help out friends who are in
> > Telehouse. One was a spare, and left near the door to site3 for people
> > to pick up (and to make fun of Telehouse's incompetence).
> >
> > http://www.flickr.com/photos/telehosue/
>
> Yeah, I saw that as soon as you put it out. I running around frantically
> that night, so I didn't have much time to appreciate the humor, but now
> I get a kick out of it when I see it. The last time I was there, it was
> still sitting near the entrance. =-)
You want it. :) Special, 300$ :) NO JOKE! :)