
Data Center Power Failures Mount

timothy posted about 5 years ago | from the send-money-drugs-and-sealed-lead-acid-batteries dept.

Power 100

1sockchuck writes "It was a bad week to be a piece of electrical equipment inside a major data center. There have been five major incidents in the past week in which generator or UPS failures have caused data center power outages that left customers offline. Generators were apparently the culprit in a Rackspace outage in Dallas and a fire at Fisher Plaza in Seattle (which disrupted e-commerce Friday), while UPS units were cited in brief outages at Equinix data centers in Sydney and Paris on Thursday and a fire at 151 Front Street in Toronto early Sunday. Google App Engine also had a lengthy outage Thursday, but it was attributed to a data store failure."


If only you had listened... (5, Funny)

BillyMays (1587805) | about 5 years ago | (#28601737)

I'm guessing that the majority of these were caused by leaks or spilled drinks. If only you guys had listened to me and gotten Zorbeez(tm)[SOAKS UP 10x ITS OWN WEIGHT!] [wikipedia.org] .

-B. Mays

Re:If only you had listened... (0, Offtopic)

sharkey (16670) | about 5 years ago | (#28602233)

Hell, the ShameWOW [youtube.com] has that beat.

Re:If only you had listened... (0)

Anonymous Coward | about 5 years ago | (#28602857)

Buy a ShamWOW! to fight Scientology. [skepchick.org] And my research (i.e., a quick Google search and watching a couple of product tests on the internet) shows that it works pretty well.

Re:If only you had listened... (0, Offtopic)

Puppet Master (19479) | about 5 years ago | (#28603297)

Billy, welcome back!!!

Re:If only you had listened... (0, Offtopic)

Stenchwarrior (1335051) | about 5 years ago | (#28603845)

First it was Ed, Farrah, Michael and then Billy. Now widespread power outages?? Not sure about you guys, but I smell Apocalypse brewing. Time to get right with Mr. Noodle and his Parmesan profits.

Re:If only you had listened... (1)

Schemat1c (464768) | about 5 years ago | (#28603911)

I'm guessing that the majority of these were caused by leaks or spilled drinks. If only you guys had listened to me and gotten Zorbeez(tm)[SOAKS UP 10x ITS OWN WEIGHT!] [wikipedia.org].

Even that wouldn't work. What you have here is your textbook Pepsi Syndrome and only a President in yellow booties can fix it.

Re:If only you had listened... (-1, Offtopic)

Anonymous Coward | about 5 years ago | (#28604989)

It was a bad week to be Billy Mays, too. He died.

"bad week to be a piece of electrical equipment" (0, Flamebait)

raddan (519638) | about 5 years ago | (#28601765)

Because out of all of the data centers in the world, there were problems at five? Riiiiight. Good reporting, Slashdot.

Can I sign up for broken water main notices here, too, or do I need to go to another website?

Re:"bad week to be a piece of electrical equipment (4, Funny)

Statecraftsman (718862) | about 5 years ago | (#28601799)

Yes, that's clearly Twitter territory.

Re:"bad week to be a piece of electrical equipment (1)

eln (21727) | about 5 years ago | (#28602125)

Not if the water company runs Linux!

Re:"bad week to be a piece of electrical equipment (0)

davek (18465) | about 5 years ago | (#28601899)

Quiet, troll. Slashdot has broken several stories over the years and most of them started as these little coincidences. Go back to reading CNN if you want your news filtered.

Re:"bad week to be a piece of electrical equipment (2, Funny)

Anonymous Coward | about 5 years ago | (#28602113)

Indeed, 18465. And we shall get off your lawn as well.

Re:"bad week to be a piece of electrical equipment (1)

stim (732091) | about 5 years ago | (#28609133)

If by "broken several stories" you mean "posted links to stories that were broken elsewhere" than you may have a very good point.

Re:"bad week to be a piece of electrical equipment (5, Interesting)

Anonymous Coward | about 5 years ago | (#28602595)

Because out of all of the data centers in the world, there were problems at five? Riiiiight. Good reporting, Slashdot.

Can I sign up for broken water main notices here, too, or do I need to go to another website?

100+ million people daily are "serviced" by these 5 data centers.

Companies such as authorize.net were COMPLETELY unavailable for payments to hundreds of thousands of webmasters' sites (ya know, the people who make money).

If you don't think this is serious news then you are still living at home.

Ya that's what I thought.

Re:"bad week to be a piece of electrical equipment (2, Insightful)

afidel (530433) | about 5 years ago | (#28603065)

authorize.net are apparently complete idiots; if they are that large and all their equipment is in one datacenter, that's bordering on insane. Heck, my little company of under 1k employees has two facilities. Anyone who's running a site with 100k+ customers should know better.

Re:"bad week to be a piece of electrical equipment (1)

Alpha830RulZ (939527) | about 5 years ago | (#28603779)

The Fisher Plaza story is big. I happened to be walking by right after it happened, noticed the generators running and went, 'Hm-m-m'. We've toured their facility in the past, and wanted to use them, but they didn't have capacity at the time. They seemed first rate. If a first tier provider can have this happen...

Re:"bad week to be a piece of electrical equipment (1)

ae1294 (1547521) | about 5 years ago | (#28605043)

by right after it happened, noticed the generators running and went, 'Hm-m-m'

You're some kind of witch aren't you? You broke my internet!

BURN THE WITCH!!!

check this out!!! (-1, Troll)

Anonymous Coward | about 5 years ago | (#28601867)

Eric felt his scrotum contract in its latest desperate attempt to keep his testicles warm. This hospital, wherever it was, was damned drafty.

It didn't help that the nurses on his floor, who had been treating Eric like a complete bitch, liked to keep the air conditioning cranked up. Or was it just his room? He noticed they pulled their cardigans and sweaters around them only when they came to see him.

"Nurse! Nurse!" Eric shouted. "Excuse me, nurse?!"

Eric heard a chair creak, followed by footsteps coming down the hall. They were quick around here, one of the only good things Eric had yet noticed. Perhaps it was because of his celebrity status.

"Yes?" the nurse said, crossing her goose-pimpled arms.

"Nurse, it's damn cold in here," Eric said. "And I think my pain medication is wearing off. Can I have some more pills?"

Her beady eyes, set atop wrinkled, puffy cheeks, lasered him in his bed. This was the sixth time Eric had shouted for her since her shift began. She didn't know him well but she was definitely starting to hate him.

"Oh! And my urinal needs emptied!" Eric added.

The nurse pursed her lips and folded her arms without breaking eye contact, "get fucked" in body language.

Eric smiled a crooked, leering grin at her and winked in a bid to charm her into emptying his piss. The nurse wondered if he was about to have another seizure.

She picked up Eric's chart, flipped through it, and replaced it.

"Mr. Raymond," the nurse said, "you're not due for more pain medication for two more hours."

Eric's mustache, orange and drooping, twitched.

"Do you need your bandages looked at?"

Eric shifted in his bed, stiff and uncomfortable. He slowly, awkwardly, stretched his hospital gown down over his knees.

"Nooo, no, no I don't," Eric said. "My bandages are just fine."

"Fine then," the nurse said. "I'll get your urinal. Do you need anything else?"

Eric watched as the nurse lifted his urinal carefully off of his lunch tray. It was completely full: 1,000 cubic centimeters, one full quart of piss, mounding at the top.

The nurse stifled a gag as she slowly made her way into the restroom.

"This damn IV has me swimming!" Eric called after her with a quick laugh.

He heard her pouring his urine into the toilet and felt the urge to go again. It had been dark brown, viscous, and smelled to high heaven like sick wet meat. He really hoped whatever they had him on was working.

She returned from the restroom and replaced Eric's urinal.

"I'll be back when it's time for your medication," she said. "Dinner is in an hour."

With that she left until, she knew too well, the next time Eric grew bored or irritated.

Feeling as anxious as ever, Eric reached for billywig [catb.org] , his blueberry iBook [apple.com] , which had finally charged. He hit the start button and watched Yellow Dog Linux [fixstars.com] slowly crawl off of the hard drive into RAM.

Thank god this hospital had wifi. Thank god he had an Airport card in his iBook.

http://www.google.com/search?q=brown+piss [google.com]

"Nope."

http://www.google.com/search?q=my+piss+is+brown [google.com]

"Hmm Nope."

http://www.google.com/search?q=my+piss+is+brown+std [google.com]

"Nope."

http://www.google.com/search?q=my+piss+is+brown+and+smells+like+rotting+meat+std [google.com]

Eric was having no luck. The more he optimized his Google searches, he noted with alarm, the less relevant his search hits became.

foul smelling like decay meat and at times like grated yam. this odor ... and fifth day i see dirth brown dischargeAbnormal discharge from the nipple .... the air asking what that rotten meat smell was...and the consequent search ... So, my UA (urine analysis) came back abnormal

"Jesus Christ!" Eric muttered to himself as he squinted at his iBook's twelve inch screen. "I don't think I have anything coming out of my nipples!"

Making sure his iBook was steady, he gingerly squeezed his left pectoral.

"Nope."

Eric command-tabbed back to vi, where he was typing "RFI on brown piss that smells like rotting meat" to post to his blog [ibiblio.org] , when there was a knock at the door.

"Mr. Raymond?"

It was the nurse.

"There's someone here to see you."

Finally, company! A hacker mind like Eric's was not used to boredom. He needed plenty of Iranian hackers [trollaxor.com] to chat with, a cave full of LARP buddies, or, optimally, a Linux party [trollaxor.com] . Not the sanitation of a lonely, well-lit hospital.

A second later the door opened again and in walked not Eric's LARP troop or Linux party, but something far less arousing: a New Jersey state police officer.

"Eric Raymond?" the officer asked. He was 6'2" and built like the Mack trucks he probably ticketed on a daily basis.

"Yes, sir, that's me, officer," Eric stammered. He hated being dominated.

"You're under arrest for lewd conduct, public indecency, and conspiracy to solicit," the officer said. The tone in his voice told Eric not to interrupt. "You have the right to remain silent. Anything you say"

Eric's mind wandered. He had to call his wife. She was his attorney and had dealt with this sort of thing before. He had to keep this quiet.

Eric decided then and there to be as cooperative as possible.

"Do you understand these rights, Mr. Raymond?"

"Yeah, sure," Eric said. "But I'd like to share info about the other party involved in this incident."

"Go ahead?" the officer said, not expecting Eric's offer.

"The other party," Eric said, "is a man named Emad, an Iranian hacker, quite possible in this country illegally. His email address is emad.opensores@gmail.com [mailto] and his AIM handle is iran2hax0rc0ck [aim] ."

"Any idea who the other parties involved were?" the trooper asked, taking his notepad out.

"Other parties? There were no other parties. Just Emad and I."

"Mr Raymond," the trooper said, "you were the victim of sexual assault last night."

Eric's left eye twitched. It was usually him, with his Glock and Jägermeister, in charge of the proceedings. Not the other way around. He felt so powerless.

"You'll be arraigned upon your release from the hospital. Do you understand that?"

"Sure," Eric said, "but why do you think there were other parties? It was just Emad and I the entire time."

"Mr. Raymond," the trooper said while replacing his notebook, "our crime lab extracted the DNA of two other people from your wounds."

Eric sweated, cold and salty, and his world spun. Who else had been there?

"Also," the trooper said, producing a plastic bag, "do you know what this is?"

He handed the object to Eric, who turned it back and forth. It reflected the room's lights weakly through the baggie.

"It's Ubuntu," Eric said softly.

"Ubuntu? What's that?" the trooper said.

"It's a Linux distribution," Eric said unhelpfully. "Where did you get it?"

He noticed the version number on the CD face as he passed it back to the trooper. 9.10, Karmic Koala.

The trooper looked away before he spoke.

"The doctors removed it from deep inside your ass."

Correlation proves causation (0)

Anonymous Coward | about 5 years ago | (#28601915)

Installing a generator or UPS causes an accident sooner than you'd experience without having a generator or UPS. Safety measures cause accidents. See, not(correlation does not imply causation). We slashdotters have always known this to be true. Heretics beware!

Wrong (1)

GameboyRMH (1153867) | about 5 years ago | (#28606953)

Safety measures drastically reduce the chance of accidents, while being unprepared, especially if it's just a brief period of unpreparedness, greatly increases the chance of an accident. This makes you wonder if the safety measures were really worth it, but at least you won't have any accidents as long as you remain prepared.

Long live the cloud...long live the cloud.... (-1, Troll)

Anonymous Coward | about 5 years ago | (#28601917)

Just keep having your stupid cloud fantasies, you pathetic dweebs waiting for the Singularity to lift you from your humdrum jobs. Maybe if you focused at least some of your life around what it means to be an individual sentient mind and be content with that--instead of slaving away at the abstractions of 1s and 0s--the need for 'the Cloud' would evaporate and we wouldn't even consider this a story.

Damn you Michael Bay! (4, Funny)

StaticEngine (135635) | about 5 years ago | (#28601939)

"A blown transformer appears to be the culprit"

I'd heard the new movie was crude, but I didn't realize how crude it actually was!

Re:Damn you Michael Bay! (1)

Neanderthal Ninny (1153369) | about 5 years ago | (#28602021)

The good quote from the Transformers movie character Ratchet: Wow... that was tingly!

Re:Damn you Michael Bay! (1)

tnk1 (899206) | about 5 years ago | (#28602161)

I guess Megan Fox's character is upgrading.

Re:Damn you Michael Bay! (1)

jd2112 (1535857) | about 5 years ago | (#28602847)

Well, one review (CNN, I think) described Transformers as "Robot Porn".

Re:Damn you Michael Bay! (1)

shentino (1139071) | about 5 years ago | (#28604755)

Didn't ThePlanet recently have an outage for the exact same reason?

Methinks that electrical standards are falling behind the demand created by computing resources.

XO Communications Genesis Hosting (1)

newgalactic (840363) | about 5 years ago | (#28602005)

We had an outage today. Our servers are hosted with Genesis Hosting, which suffered an outage from their ISP; XO Communications in Chicago. Anyone know what happened?

Re:XO Communications Genesis Hosting (3, Funny)

socsoc (1116769) | about 5 years ago | (#28602453)

Sure, I heard that Genesis Hosting suffered an outage from their ISP; XO Communications in Chicago.

Re:XO Communications Genesis Hosting (1)

newgalactic (840363) | about 5 years ago | (#28609971)

My sweet, sweet inter-tubes. You are a cruel and fickle Mistress

Outages (2, Interesting)

Solokron (198043) | about 5 years ago | (#28602049)

Outages happen more than that. We have been in several data centers; ThePlanet and The Fortress have both had major outages in the last two years, which has affected business.

Re:Outages (4, Interesting)

JWSmythe (446288) | about 5 years ago | (#28602421)

    I've had equipment and/or worked in many datacenters over the last decade or so. I've worked with even more clients who have had equipment in other datacenters.

    I've only experienced 3 power related outages that I can think of.

    One was a brownout in that area, which cooked the contactors that switched between grid power and their own DC room.

    One was an accident, where a contractor accidentally shorted out a subpanel, and took out about a row of cabinets. I was there for that one. I saw the flash out of the corner of my eye, and by the time I turned my head, he was just flying into the row of cabinets.

    One was a mistake in the colo, where there was a mislabeled circuit, so they cut power to 1/3 of one of our racks.

    There have been even more outages related to connectivity problems. With one major provider who was just terrible (and is now out of business), they had a fault about once a week or less. Every time we called, they said "there was a train derailment that cut a section of fiber in [arbitrary state], which affected their whole network." It was funny at first, but annoying when we started questioning them about why there was no news about all these train derailments. We had to make up our own excuses for the customers, because we couldn't keep telling them the BS story the provider gave. We were smart about it though, and at least had decent excuses, and the whole staff knew which BS story to give for a particular day. The sad part was, we had a T3, and that was huge at the time.

    At my last job, they wanted a full post-mortem done on any fault. If a customer across the country suffered bad latency or packet loss, it was our job to find out why and "fix" it. The management wouldn't accept that there are 3rd party providers who handle some of the transit. So, we'd call our provider demanding it be fixed (which they couldn't do), then call the broken provider (who hung up since we weren't their customer), and then get reamed by the boss because we couldn't fix it. Delay tactics worked best after a while. If you're "investigating" a problem long enough, and hold the phone up to your ear enough, the problem will likely be fixed by those who really can. We'd still log a ticket with our provider, because the boss would eventually call the provider referencing the ticket number, and find out there was still nothing that could be done.

    There's pretty much guaranteed to be a fault of some sort between two points on the Internet every day. All anyone can really do is make sure it isn't with your own equipment. That's something I always did before calling to complain about anything. It's embarrassing to hear "did you reboot your router?" and that turns out to really be the problem.

    The only real solution to this is redundancy. Not just in one facility, but across multiple facilities. If you spread things out enough, sure, an isolated problem will affect some people, but not everyone. If you want a service to be reliable, redundant machines in each datacenter are the only way to go. When I was running the network (and everything technical) at one job, a datacenter outage wasn't a concern, it was just a minor annoyance. I filed a trouble ticket, and told them to call me when it was fixed. We'd demand reimbursement on the outage time, and made them handle the difference on our 95th percentile bandwidth charges at the end of the month. I wasn't going to take a hit on the bill just because they had an outage in a city, and my other cities had to take the traffic during the outage. When your bill is measured in multiple Gb/s, you have a little more say in how they handle the billing. :)

Re:Outages (1)

Muad'Dave (255648) | about 5 years ago | (#28606845)

... which cooked the contactors that switched between grid power and their own DC room.

I read that as contractors. I apparently saw 'contractor' in the next sentence and did the switcheroo. I was going to call you callous for using the term 'cooked'.

FYI, arc flash [wikipedia.org] is not something to be taken lightly (no pun intended). It's dangerous as all get-out in high voltage panels that have a lot of available fault current. A typical 480V, 20kA fault [wikipedia.org] can release the same amount of energy as 1.5 lbs of TNT.
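
For anyone who wants to check that TNT figure, here is a crude back-of-the-envelope estimate in Python (the 1/3-second clearing time and the TNT energy constant are assumptions; a real arc-flash study, such as one per IEEE 1584, also accounts for arcing current, electrode gap, and clearing time):

    # Order-of-magnitude sanity check only -- not an arc-flash hazard calculation.
    volts = 480.0                 # nominal bus voltage
    fault_amps = 20_000.0         # available fault current
    clearing_time_s = 0.33        # assume roughly a 20-cycle (1/3 s) clearing time
    energy_joules = volts * fault_amps * clearing_time_s
    tnt_joules_per_lb = 1.9e6     # ~1.9 MJ per pound of TNT
    print(f"{energy_joules / 1e6:.1f} MJ, about {energy_joules / tnt_joules_per_lb:.1f} lb of TNT")
    # -> 3.2 MJ, about 1.7 lb of TNT -- in the ballpark of the figure above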

Re:Outages (1)

JWSmythe (446288) | about 5 years ago | (#28613323)

    I figured some people would make that mistake. Some others (like you) actually know the difference. Ya, I'd prefer to not have a cooked contractor. They kinda smell, and it tends to make other contractors not want to work with you. :)

    I was very happy to have not been at the site when it happened, but I did have to fly up to check all of our equipment.

    About half our servers wouldn't come back online, either due to power or networking faults. The network switch somehow lost its saved configuration, and due to that, all my normal means of getting back in remotely were gone. It was a mess that took a few hours to clean up. Being redundant though, it wasn't catastrophic; I just don't like leaving sites down, just in case the next disaster happens.

    When I got there, a few people told me it hadn't been pretty. I guess there was a good bit of chaos between the time it happened, and the time I arrived. They were very pleased that I wasn't screaming. :) What did I care? We remained operational with one of our DCs completely off the map for a couple hours, and half operational for the following 8 hours. I got on the only direct evening flight, so I could get up there and assess the damage. My return was the morning direct flight back. Ahh, nothing like flying across the country, spending the night in the DC, and flying home in the morning. I was actually done by 1am, so I had to entertain myself for several hours until I could catch my flight home.

Re: DC Outages (0)

Anonymous Coward | about 5 years ago | (#28604313)

cheap, fast, reliable

pick one

Re: DC Outages (1)

rbrausse (1319883) | about 5 years ago | (#28605125)

It seems you had some bad experiences lately.

Under normal circumstances you can always have 2 out of these 3 - regardless of which topic we are speaking of (datacenters, code quality, cars*, ... - just name it)

*) and not even a bad analogy :)

Re: DC Outages (1)

Colourspace (563895) | about 5 years ago | (#28605211)

where is badanalogyguy when you need him?

Re:Outages (0)

Anonymous Coward | about 5 years ago | (#28612051)

Outages happen more than that. We have been in several data centers; ThePlanet and The Fortress have both had major outages in the last two years, which has affected business.

I've been a hosting customer since 2000 with a large national firm.

We've had exactly one outage that resulted in downtime of more than 30-60 minutes. They released extensive information about the incident, what they were doing to fix it, and what steps were being put in place to prevent future incidents.

Be Redundant! (5, Insightful)

drewzhrodague (606182) | about 5 years ago | (#28602063)

Anyone seriously concerned about their web applications will have redundant sites, and a way to share the load. Few people pay attention to the fact that DNS requires geographically disparate DNS servers, such that even in the event of a datacenter fire (or nuclear attack), there will still be an answer for your zone. Couple this with a few smaller server farms in separate places, and there won't be any problems. I went to look it up on Wikipedia, but didn't find out where it is required for authoritative DNS servers to be in separate geographic regions. Where did I read this, DNS and BIND?

Re:Be Redundant! (5, Informative)

W3bbo (727049) | about 5 years ago | (#28602333)

The DNS RFCs advise that zone nameservers should be in separate subnets. Specifically, RFC 2182 recommends that secondary DNS services be spread around geographically.
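
A quick way to sanity-check a zone against that advice is to resolve each nameserver and see whether any of them share an obvious network prefix. A minimal sketch in Python, assuming dig is installed and using a shared /24 as a crude same-subnet proxy (example.com is just a placeholder):

    import socket
    import subprocess

    def nameserver_ips(domain):
        # List the zone's NS records with `dig +short NS <domain>`, then resolve each name.
        out = subprocess.run(["dig", "+short", "NS", domain],
                             capture_output=True, text=True).stdout
        hosts = [line.strip().rstrip(".") for line in out.splitlines() if line.strip()]
        return {host: socket.gethostbyname(host) for host in hosts}

    def same_prefix_groups(ips, octets=3):
        # Group nameservers that share their first `octets` octets.
        groups = {}
        for host, ip in ips.items():
            prefix = ".".join(ip.split(".")[:octets])
            groups.setdefault(prefix, []).append(host)
        return {p: hosts for p, hosts in groups.items() if len(hosts) > 1}

    ips = nameserver_ips("example.com")
    print(ips)
    print("Nameservers that look co-located:", same_prefix_groups(ips))

It says nothing about geography, of course, but two nameservers in the same /24 are almost certainly not the separation RFC 2182 has in mind.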

Re:Be Redundant! (2, Funny)

Firehed (942385) | about 5 years ago | (#28602357)

In the event of a nuclear attack, you probably have more pressing issues to deal with than your server uptime.

Re:Be Redundant! (0)

Anonymous Coward | about 5 years ago | (#28604399)

In the event of a nuclear attack, you probably have more pressing issues to deal with than your server uptime.

I'd rather spend my time keeping my servers up than worrying about a whole bunch of crap I can't control.

Re:Be Redundant! (1)

rdnetto (955205) | about 5 years ago | (#28604491)

But then how will you know who is attacking you, and where to go? Not to mention how to best shield yourself from radiation...

Re:Be Redundant! (4, Interesting)

JWSmythe (446288) | about 5 years ago | (#28602505)

    Be nice, people don't read the books or the RFCs any more.

    At the biggest operation I ran, I had redundant servers in multiple cities, and DNS servers in each city. If we lost a city, it was never a big deal, other than the others needing to handle the load. With say 3 cities, a one-city outage only accounted for a 16.6% increase in the other two. Each city was set up to handle >100% of the typical peak day traffic, so it was never a big deal. I don't think we ever suffered a two-city simultaneous failure, even though we simulated them by shutting down a city for a few minutes. Testing days were always my favorite. I loved to prove what we could or couldn't do. I peaked out one provider in a city once. We had the capacity as far as the lines went, but they couldn't handle the bandwidth. It was entertaining when they argued, so I dumped the other two cities to the one in question, and they were begging me to stop. "Oh, so there is a fault. Care to fix it?"

    I could quantify anything (and everything) at that place. I could tell you a month or so in advance what the peak bandwidth would be on a given day, and how many of which class of servers we needed to have operating to handle it. I classed servers by CPU and memory, which in turn gave how many users and how much bandwidth each could do. I only wanted our machines to ever peak out at 80%, but sometimes it was fun to run them up through 100%. I set the limits a little low, so we could run at say 105% without a failure.

    Such information let us know we had a server problem before we knew we did. I'd notice a server was running 10% low, and that really meant it was going to fail. We'd watch for a little while, and it would. :) We'd power it down, and leave it in the datacenter until we had another scheduled site visit.
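
That "running 10% low means it's about to fail" heuristic is easy to mechanize. A minimal sketch (the class names, per-class capacities and the metrics feed are assumptions, not taken from the post):

    # Hypothetical per-class peak capacity in Mb/s.
    EXPECTED_MBPS = {"old": 100, "not_so_old": 200, "new": 300}

    def flag_suspect_servers(measurements, tolerance=0.10):
        # measurements: iterable of (hostname, server class, observed peak Mb/s).
        suspects = []
        for host, cls, observed in measurements:
            expected = EXPECTED_MBPS[cls]
            if observed < expected * (1.0 - tolerance):
                suspects.append((host, observed, expected))
        return suspects

    sample = [("web01", "new", 295), ("web02", "new", 262), ("web03", "old", 97)]
    print(flag_suspect_servers(sample))   # -> [('web02', 262, 300)]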

Re:Be Redundant! (1)

Henry V .009 (518000) | about 5 years ago | (#28604245)

With say 3 cities, a one-city outage only accounted for a 16.6% increase in the other two.

You're mistaken. Each of the other two cities would see their load increase 50%.

Re:Be Redundant! (1)

JWSmythe (446288) | about 5 years ago | (#28612087)

Normal Operations:
City 1 - 33.3%
City 2 - 33.3%
City 3 - 33.3%

City 1 stops:
City 1 - 0%
City 2 - 33.3%
City 3 - 33.3%

Other cities take up the slack:
City 1 - 0%
City 2 - 33.3% + 16.6% = 49.9% (mol)
City 3 - 33.3% + 16.6% = 49.9% (mol)

If you are really bent about the missing 0.2%, you can work it out in fractions instead. :)

City 1 - 0/3
City 2 - 1/3 + 1/6 = 3/6 = 1/2
City 3 - 1/3 + 1/6 = 3/6 = 1/2

I have an easier time adding and rounding decimals in my head. Since nothing on the Internet is actually perfect, 1% off on an estimate isn't catastrophic. :) In real life, we'd frequently end up with something like a 33% 32% 35% split or a 48% 52% split on a city failure. There were plenty of days where it came in right on 33.3% 33.4% 33.3% and 50% 50%. There were a whole stack of factors involved that were out of our control. I could encourage things around a little, but trying to force it rather than encouraging it could lead to trouble; we already had our ways to mitigate that (lots of redundancy) :)

Re:Be Redundant! (1)

Henry V .009 (518000) | about 5 years ago | (#28621323)

Yes. I knew how you calculated it. You're just confused about what you're calculating.

"City 2 - 33.3% + 16.6%" -- note that the load on City 2 just increased 50%. If it were making 33 widgets and had to make another 16 widgets, it's now making 50% more widgets.

To be more explicit, each city in your analysis is taking up 16.6% extra of the total load, which means that each city is individually seeing a 50% increase in its own load.

Re:Be Redundant! (1)

JWSmythe (446288) | about 5 years ago | (#28624033)

    Well, ya, 50% more than it had been handling before. I see.

    My concern was the percentage of totals. That was divided up among the available servers, and each server's share was calculated from what it could handle. We generally had 3 classes of servers. They could be considered "old", "not so old", and "new". :) We didn't quite class them like that, but it would be a decent evaluation. If "old" could handle 1x, "not so old" could handle 2x, and "new" could handle 3x. It wasn't uncommon for us to run a "new" server as 2x, and the lower two as 1x, but the capacity was there for the top two classes, should we need it.

    It gets complicated until you see it working. I don't know if they're still using my methodology. I know things have changed since I left, and they aren't asking me any questions. Either I documented it well enough, it was intuitive enough, or they just gutted it and started over.
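
Both ways of stating the math in this exchange are consistent; a small worked example in Python makes the two figures explicit (pure arithmetic, assuming equal load sharing):

    def failover_load(n_cities, n_failed=1):
        # Equal shares before the failure; the survivors split the total evenly afterwards.
        before = 1.0 / n_cities
        after = 1.0 / (n_cities - n_failed)
        return {
            "share_before": before,                     # each city's share of total load
            "share_after": after,                       # a survivor's share after the failure
            "extra_share_of_total": after - before,     # the "+16.6%" figure
            "relative_increase": after / before - 1.0,  # the "50% more" figure
        }

    print(failover_load(3))
    # -> shares go from 1/3 to 1/2: an extra 1/6 of the total, a 50% increase per surviving city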

Re:Be Redundant! (0)

Anonymous Coward | about 5 years ago | (#28605271)

I set the limits a little low, so we could run at say 105% without a failure.

Wow, so you could turn it up all the way to 11! Rockin', dude!

Re:Be Redundant! (1)

JWSmythe (446288) | about 5 years ago | (#28612225)

Hehe.. Umm.. Well.. Ya, I guess we could. :) We never referred to it as that though. :)

    I did turn them up to "let's see when smoke comes out", but somehow we never had the magic smoke released. Except that one New Year's, when an old server seemed like a great launch platform for fireworks. :) It was flat, hard, and cost a few thousand dollars. What better to put explosives on? :)

Re:Be Redundant! (1)

mcrbids (148650) | about 5 years ago | (#28602585)

It's required that you have two name servers when you register a domain name.

Physical separation is not required. It's just good practice. (I do, in separate cities on different ISP networks) Having separate nameservers in different geo regions is implicit because you have to register at least two for each domain name. I've seen some people game this by having a single nameserver with two IP addresses, which strikes me as the height of stupidity, but it's not happening on my watch.

Re:Be Redundant! (1)

raju1kabir (251972) | about 5 years ago | (#28605485)

I've seen some people game this by having a single nameserver with two IP addresses, which strikes me as the height of stupidity

If everything referenced by the DNS records (web and email services or whatever) is hosted on the same machine as the name server, then it isn't particularly stupid. It's just a small operation that has a single point of failure; redundant DNS isn't going to change that.

Re:Be Redundant! (1)

vlm (69642) | about 5 years ago | (#28605939)

If everything referenced by the DNS records (web and email services or whatever) is hosted on the same machine as the name server, then it isn't particularly stupid. It's just a small operation that has a single point of failure; redundant DNS isn't going to change that.

With the single exception I know of: incoming email will bounce with something like "domain not found" if there is no DNS response at all, whereas if there is DNS but the MX record servers can't be reached, it'll silently retry. Some totally brain-dead MTAs will bounce, but anything remotely usable will transparently retry later and no one will know it happened.

And it's not so much a "small operation" as a non-relevant risk. People have a certain expectation of how (un-)reliable email is, due to filtering, and/or just plain ole magic. As long as my colo'd server is dramatically more reliable than their expectation of email reliability, then increasing the server reliability by a microscopic amount at the expense of a more complicated design would be misprioritized, wasted effort. Put that effort into improving spam filtering instead, etc.

Re:Be Redundant! (1)

raju1kabir (251972) | about 5 years ago | (#28606095)

With the single exception I know of: incoming email will bounce with something like "domain not found" if there is no DNS response at all, whereas if there is DNS but the MX record servers can't be reached, it'll silently retry.

Common myth but quite untrue (try it for yourself). If there is no response from any DNS server then it will be considered a temporary failure and delivery attempts will continue at intervals in the background just as if the MX target(s) were not responding.

Only if a server can be reached and returns status NXDOMAIN will delivery abort immediately.

It has to be so. Otherwise any MTA that delivers outbound mail would start bouncing everything in its queue any time it suffered a connectivity interruption and could not reach its own DNS forwarder.
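
In code, the distinction is between an authoritative NXDOMAIN and everything else. A minimal sketch assuming the dnspython package (the exception names are dnspython's; the policy just mirrors the comment above):

    import dns.exception
    import dns.resolver

    def mx_lookup_disposition(domain):
        try:
            answers = dns.resolver.resolve(domain, "MX")
            return "deliver", sorted(str(r.exchange) for r in answers)
        except dns.resolver.NXDOMAIN:
            return "bounce", []            # authoritative "no such domain": permanent failure
        except dns.resolver.NoAnswer:
            return "fall back to A", []    # domain exists but publishes no MX record
        except (dns.resolver.NoNameservers, dns.exception.Timeout):
            return "retry", []             # no usable answer from any server: queue and try later

    print(mx_lookup_disposition("example.com"))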

Re:Be Redundant! (1)

Guspaz (556486) | about 5 years ago | (#28602801)

While geographic diversity is certainly an excellent goal, it's not always that simple. My ISP's network core was located in the Peer 1 suite at 151 Front (whose UPS caused the fire). Power was cut to Peer 1's suite, but not the rest of the building (151 Front has independent power/cooling/etc. per-suite to the extent where each tenant is responsible for getting their own solution).

Redundant power sources could have mitigated the issue had there not been a fire: running two independent circuits to critical equipment, passing through different UPSes, different PDUs, different generators, and different utilities.

Even for those on a budget, geographic diversity isn't necessarily difficult, even within the same company. Many companies have multiple locations; my VPS provider, Linode, has colo space in virtually all corners of the continent, about as far apart as you can get without going overseas. Getting a second VPS at a geographically distinct location could be a cheap way to provide failover if getting something from a different provider isn't financially feasible.
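
Even a very cheap second location only helps if something is actually watching both. A minimal active/standby health probe using only the Python standard library (the URLs are placeholders; wiring the result into DNS or a load balancer is the part left out):

    import urllib.request

    SITES = ["https://www.dc1.example.com/health", "https://www.dc2.example.com/health"]

    def first_healthy(urls, timeout=3):
        # Return the first endpoint answering 200, or None if every facility is unreachable.
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except OSError:   # URLError, HTTPError and socket timeouts all derive from OSError
                continue
        return None

    print(first_healthy(SITES))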

Re:Be Redundant! (0)

Anonymous Coward | about 5 years ago | (#28603371)

...whose UPS caused the fire...

Shouldn't that be a UFS?

Re:Be Redundant! (2, Interesting)

ls671 (1122017) | about 5 years ago | (#28603507)

Best solution for big outfits is to have at least this setup:

1) One party being the main contractor. This party doesn't do ANY hosting per se but only manages the fail-over strategy, doing the relevant testing once in a while.

2) A second party being involved in hosting and managing data centers.

3) A third party, completely independent from party 2, a competitor of 2 is preferable, which also does hosting and manages data centers.

It is the same principle when you bring redundant internet connectivity to a building :

1) Have the fiber from one provider come into the building from, say, the north side of the building.

2) Have a competitor, unrelated business wise, that doesn't use the same upstream providers bring his fibers in from the South side of the building.

Putting all your eggs in the same basket by dealing with only one business entity constitutes a less robust solution.

Re:Be Redundant! (1)

vlm (69642) | about 5 years ago | (#28605991)

1) Have the fiber from one provider come into the building from, say, the north side of the building.

2) Have a competitor, unrelated business wise, that doesn't use the same upstream providers bring his fibers in from the South side of the building.

3) Discover that both fiber runs connect to the same L.E.C. vault 100 feet away and then run parallel the whole way back to the same central office, and/or they both are carried on the same SONET ring just connected to different ADMs (which would at least give you ADM redundancy).

Seriously though, step 3) is get a copy of the DLR / CLR of the local loop, and have someone analyze them. Of course how the circuit is designed is not necessarily how it is actually routed, which is even funnier.

Everyone in the telco business has heard the story about the general/admiral/CEO being told by the sales weasel how everything is redundant and it'll never fail, so the totally disbelieving general/admiral/CEO walks up to the fiber frame with scissors/bolt cutter/wire cutter and then ... Now that is probably the only real way to know if it'll work.

Re:Be Redundant! (1)

jra (5600) | about 5 years ago | (#28608837)

President Jimmy Carter.

Nuclear attack evac test.

Lots of embarrassed people.

It is *epically* difficult to get and keep true physical diversity.

Re:Be Redundant! (1)

DNS-and-BIND (461968) | about 5 years ago | (#28604219)

It wasn't me!!

Re:Be Redundant! (1)

sjames (1099) | about 5 years ago | (#28612297)

On the other hand, if your servers are all down, is there a lot of point to people knowing their IP address?

One secondary DNS per location is just fine. If there is just one datacenter, then use 2 DNS servers there.

It is recommended, but not required. The server police won't take your hardware away if you don't do it.

No preventative maintenance? (5, Insightful)

Neanderthal Ninny (1153369) | about 5 years ago | (#28602201)

My wild guess is they are deferring preventative maintenance on these data centers, which is why we are seeing these major outages now. Fire suppression, UPS, transfer switches, generators, distribution panels, transformers, network gear, servers, storage devices and other gear will fail if you don't maintain them properly. As loads increase, the equipment will fail earlier, and my guess is that people have pushed this equipment beyond its load rating and lifespan.

Re:No preventative maintenance? (0)

Anonymous Coward | about 5 years ago | (#28602473)

Microsoft and Oracle licenses aren't cheap! Try to convince the boss that you need to spend those extra millions of dollars for hardware and licenses that are there just in case your datacenter that hasn't had a problem in over 10 years needs disaster recovery.

Re:No preventative maintenance? (2, Insightful)

ls671 (1122017) | about 5 years ago | (#28603589)

I would not try to convince him. Just write a memo describing the issues without sounding alarmist. It is up to the boss to evaluate the risks and to make the decision. Once you have written your memo, you are basically covered.

Now could be a good time to write this memo; just remember not to sound alarmist, just describe the possible issues, even though the risk is slim. You could say that you have been inspired by recent events in big data centers ;-))

As for licensing issues, call your Oracle/MS representatives; they offer special deals for fail-over sites. This will be a good point to mention in your memo (cost).

Power Fail Often (2, Interesting)

blantonl (784786) | about 5 years ago | (#28603111)

Frankly, if data centers are going to proclaim their redundancy, they should test by power failing the entire data center once every two weeks at a minimum. A data center that goes down twice in a month would get ahead of any issue pretty fast. Lessons learned from the staff and the management are very valuable.

The marketing messaging:

"We power fail our data center every two weeks to ensure our backups work..."

Sound scary? Just think about the data center that has never been through this process. At that point, the wet paper bag you tried to market your way out of dried rather quickly, and you are now faced with the prospect of slapping around inside of a zip-lock.

Re:Power Fail Often (1)

aaarrrgggh (9205) | about 5 years ago | (#28606717)

Semi-monthly pull-the-plug tests would reduce reliability. Monthly load tests on generator and a battery monitoring system ensure electrical reliability quite effectively. Only the most inadequate facilities fail to do this.

The larger problems come from improper change control, a lack of scripting, or an abnormal failure mode. Lack of testing and maintenance is a real problem, and in data centers it is far too often caused by the IT team not understanding the risks of inaction. If you have an accumulation of cobwebs and dust in your switchgear due to lack of maintenance, it is only a matter of time before a failure.

Re:Power Fail Often (1)

Critical Facilities (850111) | about 5 years ago | (#28608767)

Frankly, if data centers are going to proclaim their redundancy, they should test by power failing the entire data center once every two weeks at a minimum.

I disagree. If you perform a full load test of your facility every 2 weeks (or heaven forbid more frequently) you will be buying LOTS and LOTS of UPS batteries. Not to mention putting additional wear and tear on your generators, transfer switches, UPS modules, control cabinets, etc.

You are correct, data centers should do "pull the plug" tests, but not as frequently as you suggest, otherwise they'll effectively be reducing their availability by introducing more risk to the equation.

Are they over drawing the power out the units? poo (0)

Joe The Dragon (967727) | about 5 years ago | (#28602241)

Are they overdrawing power out of the units? Poor batteries that blow up? Not having the right wire gauge? Not cooling the power buses and switches?

Downside to consolidation (5, Insightful)

Anonymous Coward | about 5 years ago | (#28602441)

Surprise surprise...there's a downside to consolidation. Hey morons, the internet was invented as a means to ensure redundant communications paths given nuclear warfare. The old central switch (physical switching) was seen as too cumbersome and vulnerable. Now that we have wonderfully redundant communications, and have done away with most of the downsides of physically distributed systems, morons are building logically centralized systems.

NEWSFLASH - Redundant communications and physical virtualization do very little for you if you build a logical mainframe.

Truly distributed systems must be physically AND logically DISTRIBUTED with redundant comms paths in order to gain the full benefits of decentralization. (e.g. Distributed isn't distributed if all your authentication is done at one site or all your traffic must pass through .)

Re:Downside to consolidation (1)

Narcocide (102829) | about 5 years ago | (#28602515)

Someone mod this coward up.

Re:Downside to consolidation (1)

ls671 (1122017) | about 5 years ago | (#28603691)

Done!! Oops, posting seems to have voided my mod ;-((

Re:Downside to consolidation (0)

Anonymous Coward | about 5 years ago | (#28604027)

Moron.

Re:Downside to consolidation (2, Insightful)

ls671 (1122017) | about 5 years ago | (#28603645)

>> "morons are building logically centralized systems"

I have worked with such a moron doing architecture on a big government project ;-)) unbelievable...

His argument was that "The government likes centralized systems" ;-))

Re:Downside to consolidation (1)

DNS-and-BIND (461968) | about 5 years ago | (#28604201)

1) The internet wasn't redundant, ARPANET was redundant. The internet hasn't been able to withstand a nuclear attack since it was put online.

Putting all your eggs in one basket is nothing new under the sun. You ever see Ma Bell's idea of a "redundant" circuit? Two wires in the same conduit. But at least Ma Bell was doing it out of thriftiness and laziness, not ignorance and superstition.

Re:Downside to consolidation (2, Insightful)

BBCWatcher (900486) | about 5 years ago | (#28605351)

No, I think you have it exactly backwards, or at least you're missing an important nuance. It's really, really expensive to duplicate everything across two (or more) data centers. And it's a full-scope increase in IT costs: most or all cost categories increase. We're talking more than double the costs, in round numbers. Beyond the cost, it's very hard technically to recover hundreds or thousands of servers simultaneously or even near-simultaneously, because you are typically trying to recover not hundreds or thousands of atomistic, independent servers but all the moment-in-time state and functional dependencies among servers. Very, very difficult, which also means hugely expensive and prone to error. Unfortunately, service interruptions are also extremely expensive. What to do?

You could just buy a pair of mainframes, one at site one and the other (configured with reserve capacity, which is lower cost) at site two. (More only if you need the capacity. Then they just operate like a single machine.) That all works really, really well. As in, credit card holders would have no clue that site #1 just burned to the ground -- the credit cards still keep working. That particular form of consolidation makes disaster recovery a relative breeze. DR is just thoroughly baked into the DNA of such equipment, and the very computing model itself supports rapid recovery. (Down to zero interruption effectively and zero data loss, if that's what you need. Or, in DR lingo, RPO and RTO of zero.)

The critical nuance here is if you only consolidate sites, which a lot of businesses have done, you're reducing business resiliency, ceteris paribus. Yes indeed, if you merely forklift your hundreds or thousands of servers into a smaller number of data centers and do basically nothing to consolidate applications, databases, operating system images, etc., onto better DR-protected assets, then disaster recovery will be much tougher and much more expensive. Site-wide disasters will be more disastrous. The game-changer (otherwise known as re-learning time-tested lessons :-)) is if you untangle the mess and do real consolidation onto a much smaller number of robust, well-protected servers with some decent DR investments and realistic rehearsals. That'd be mainframes and mainframe IT discipline, basically, or at least something that resembles mainframes (if such a thing exists).

Former critical power field engineer here... (5, Interesting)

asackett (161377) | about 5 years ago | (#28602519)

... saying that it's time to reconsider cost cutting measures. In 15 years in the field I never saw a well designed and well maintained critical power system drop its load. I saw many poorly designed and/or poorly maintained systems drop loads, even catching fire in the process. One such fire in a poorly designed and poorly maintained system took the entire building with it, data center and all. The fire suppression system in that one was never upgraded to meet the needs of the "repurposed space" which was originally a light industrial/office space.

Qld Health datacentre disaster (1, Interesting)

Anonymous Coward | about 5 years ago | (#28602549)

See story of Qld Health datacentre disaster on ZDnet recently:
http://www.zdnet.com.au/news/hardware/soa/Horror-story-Qld-Health-datacentre-disaster/0,130061702,339297206,00.htm

Even worse... (5, Informative)

Anonymous Coward | about 5 years ago | (#28602583)

I'm one of the guys that services the security system in Fisher Plaza. The damn sprinklers killed half my panels near the scene. Turns out they use gas suppression methods in the data centers, not so much in the utility closets. And the city of Seattle REQUIRES sprinklers throughout the building, even right over the precious, precious servers. In defense of the staff there however, they do not keep them all charged 24/7. Other than that, I have no more info, as they're pretty locked down.

Re:Even worse... (1)

DNS-and-BIND (461968) | about 5 years ago | (#28610113)

Stupid city of Seattle, prioritizing fire safety over property. The nerve! At least the staff knows better than fully trained fire professionals and judged that the system was unnecessary.

Re:Even worse... (1)

evil-merodach (1276920) | about 5 years ago | (#28610709)

We had the largest data center in Seattle and believe me we did NOT have sprinklers in our data center. Saying that the city required them sounds like a cop-out to me. Our disaster recovery plan was pretty solid, with off-site recovery several thousand miles away within minutes. Unfortunately, we did not have a disaster recovery plan for being seized by the federal government and sold to a competitor.

Good (1)

jd2112 (1535857) | about 5 years ago | (#28602817)

I work for a company that makes high-end datacenter power systems; this should be good for business once the trade rags the CxOs read report on the millions and millions in lost business.

Or at least it will keep the sales staff busy writing up quotes that will be rejected for being too expensive (although much less than the cost of a prolonged outage...)

Re:Good (1)

afidel (530433) | about 5 years ago | (#28603179)

Emerson or APC?

So the real question is... (3, Interesting)

Dirtside (91468) | about 5 years ago | (#28602833)

...what is the normal (historical) rate of data center power failures, and how does the recent spate compare? Five in a week sounds severe, but what's the normal worldwide average? I can imagine that with thousands of data centers around the globe, there's likely a serious failure occurring somewhere in the world once every couple of days.

Re:So the real question is... (0)

Anonymous Coward | about 5 years ago | (#28606205)

five in a week sounds severe? five in a week sounds like what caused that guy who babbles aloud incoherently to go off his nut.

The solution is Geographical diversity (1)

bdwarr6 (1592537) | about 5 years ago | (#28603037)

This is why you should look into companies with geographic diversity such as Ubiquity (http://www.ubiquityservers.com) or various other companies in the data center market.

Re:The solution is Geographical diversity (1)

symbolset (646467) | about 5 years ago | (#28603553)

An individual system will always eventually fail. A single part, a server, a power system, a communication system. We have an architecture of related systems that can survive the isolation of one group or large swaths of systems with no impact on the persistence or availability of data.

We call it life.

Been through too many of these. (4, Insightful)

Velox_SwiftFox (57902) | about 5 years ago | (#28603057)

"Major" data center or not, the one your company employing you at the time is using is the important one.
In my experiences, data center backups fail about a third the time power is interupted somewhere.

Servers in an Oakland, California center were the victim of the loss of one of three power phases, while the monitoring that would have switched over to the diesel generators was looking at the power level of the other phases. UPS systems ran out of power. An extra level of redundancy in the form of rack-mount UPSes allowed servers to shut down properly despite the data center's loss of routing.

Data center #2 was the victim of a simple power outage and immediate failure of the main data center UPS system. According to a security guard I talked to, "it exploded". The diesel backup never had a chance to start.

Then the doubly-sourced Power Distribution Unit supplying a rack at a third ISP failed in a way that turned off both sources supplying the servers.

Hint: Add an extra level of UPS redundancy and safe shutdown software daemons, at least. Multiple data centers if you need more nines.
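
On the "safe shutdown software daemons" part, the usual pattern is to poll the UPS and halt cleanly on low battery. A minimal sketch assuming Network UPS Tools is installed with a UPS configured under the (hypothetical) name rackups; apcupsd or a vendor agent would be used the same way:

    import subprocess
    import time

    def ups_status(ups="rackups@localhost"):
        # `upsc <ups> ups.status` prints tokens such as "OL", "OB DISCHRG", or "OB LB".
        out = subprocess.run(["upsc", ups, "ups.status"], capture_output=True, text=True)
        return out.stdout.strip()

    def main():
        while True:
            status = ups_status()
            if "LB" in status:                # low battery: halt cleanly before the power drops
                subprocess.run(["shutdown", "-h", "now"])
                return
            time.sleep(10 if "OB" in status else 60)   # poll faster while on battery

    if __name__ == "__main__":
        main()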

Rackspace in Dallas (4, Informative)

Thundersnatch (671481) | about 5 years ago | (#28603323)

We're a Rackspace customer in their DFW datacenter. This is the third power-related outage they've had in the last two years at that supposedly world-class facility.

The first wasn't really their fault: truck driver with a health condition runs into their transformers. Generators kick in, but chillers don't re-start quickly enough. Temps skyrocket in minutes, emergency shutdowns. Maybe the transformers should have had some $50 concrete pylons surrounding them?

The second outage was the result of a botched generator upgrade.

This latest outage was the result of a botched UPS maintenance.

None of the outages was long enough to trigger our failover policy to our DR site, but our customers definitely noticed.

While their messaging has been very open and honest about the problems, and the SLA credits have been immediate, we pay them nearly $20K per month. Needless to say, we are shopping, and looking into a "multiple cheap colos" architecture instead of "Tier-1 managed hosting". Nothing beats geographic redundancy.

Re:Rackspace in Dallas (5, Informative)

zonky (1153039) | about 5 years ago | (#28603659)

That isn't quite right, re: their 2007 outage.

It wasn't a power issue as such, but the way their chillers responded to two quick power fluctuations in succession:

This is what they said:

Without notifying us, the utility providers cut power, and at that exact moment we were 15 minutes into cycling up the data center's chillers. Our back up generators kicked in instantaneously, but the transfer to backup power triggered the chillers to stop cycling and then to begin cycling back up again, a process that would take on average 30 minutes. Those additional 30 minutes without chillers meant temperatures would rise to levels that could irreparably damage customers' servers and devices. We made the decision to gradually pull servers offline before that would happen. And I know we made the right decision, even if it was a hard one to make.

Re:Rackspace in Dallas (1)

Josh Wieder (1490355) | about 5 years ago | (#28610219)

Have you considered Atlantic.Net as your colocation provider? We have a data center in Central Florida and our pricing is competitive. Give me an email at joshw atlantic.net for a quote. Josh Wieder, Atlantic.Net

Re:Rackspace in Dallas (1)

RichINetU (1593073) | about 5 years ago | (#28611387)

We're a Rackspace customer in their DFW datacenter. This is the third power-related outage they've had in the last two years at that supposedly world-class facility.

The first wasn't really their fault: truck driver with a health condition runs into their transformers. Generators kick in, but chillers don't re-start quickly enough. Temps skyrocket in minutes, emergency shutdowns. Maybe the transformers should have had some $50 concrete pylons surrounding them?

The second outage was the result of a botched generator upgrade.

This latest outage was the result of a botched UPS maintenance.

None of the outages was long enough to trigger our failover policy to our DR site, but our customers definitely noticed.

While their messaging has been very open and honest about the problems, and the SLA credits have been immediate, we pay them nearly $20K per month. Needless to say, we are shopping, and looking into a "multiple cheap colos" architecture instead of "Tier-1 managed hosting". Nothing beats geographic redundancy.

Thundersnatch - I'm sorry to hear you've had the same types of troubles over the past few years at Rackspace. I can't blame you for feeling burned by it; you're paying a good amount to have many 9's of uptime because whatever you're running online is clearly critical to your business. It sounds like this is definitely something we can help you out with. The company I work for (INetU) focuses on working with businesses who run critical operations online. Do you have some time for a call? My contact info is:

Rich Giunta
Sr. Solutions Consultant
INetU Inc.
988-664-6388 x109
rgiunta@inetu.net

Best Regards,
Rich

it was my fault (1, Funny)

bgd73 (1300953) | about 5 years ago | (#28603585)

my pc. it is 3400mhz and all the data center host sites of my interest. for the first time in 21020 hours of perfect runtime the ups saved it in a series of northeast thunderstorms coinciding with the outages around the globe. the permeating physics were too much for the world, reverberating to an unintentional master of perfect float, hardly busy, waiting to send a message. Believe it or not.

Ugh. I just finished re-watching, "Die Hard 4". (0, Offtopic)

Fantastic Lad (198284) | about 5 years ago | (#28604751)

It was a terrible movie, not the kind I like to watch anyway, but for some reason I felt compelled to view the damned thing twice in two days.

The Big Bad Threat in the film was all about something called a Fire Sale, ("It All Has To Go"), where the population's fear level is spiked up into a panic by a group of bad guys deliberately crashing the national infrastructure by way of hacking all the most important computer systems. --All to create a giant distraction so that the stock market could be plundered by thievesssssss! The story is weirdly in keeping with the theme of this Slashdot article.

Consider: Everything is a metaphor in this big old world of ours where matter and energy are based on nothing more than space and the vague notion that there is something which exists. With no matter to speak of, the whole of reality is little more than a hologram, and that being the case, the power of thought and awareness holds about the same amount of substance, if not more. --The subconscious is connected quite well to the whole affair, and events of some magnitude like today's server outages will tend to send ripples through reality so that poor shmucks like me find themselves watching in fascination stupid movies they hate without knowing why.

All I know for sure is that Bruce Willis was a lot more fun to watch when he was playing opposite Cybill Shepherd.

-FL

Sunspots, Anyone? (3, Interesting)

Craig Milo Rogers (6076) | about 5 years ago | (#28605331)

All these data centers failed at roughly the same time as the sunspots returned [space.com] , but that's just a coincidence, right?

Buy Chinese; save money up front, but... (0)

Anonymous Coward | about 5 years ago | (#28606035)

pay down the road. Transformers going out? Guess where they were built? Until people quit buying inferior-made products, we will see more and more of these issues come up.

UPSs cause more failures than they prevent! (1)

nmg196 (184961) | about 5 years ago | (#28606159)

I'm almost thinking of taking the UPS out of the loop here. They cause nearly all the downtime we have. It would be better to just let the machines power off rather than allowing the UPSs to CAUSE the machines to be taken offline. At least if the UPS isn't in circuit, the machines power back up again when the power comes back, but if there's a fault with the UPS or its batteries, then the machines stay offline until the batteries have been replaced.

Why the hell the idiots that design UPSs seem to think it's a good idea to prevent them turning on if they sense a problem with the batteries is beyond me. Why not let the machines power back up but just make a loud beeping noise until the batteries are fixed? Don't they realise that most of the time the UPS will only properly test the batteries when there's an actual power cut? On APC units (and most others) the periodic self-test function uses your SERVERS as the test load! So if the batteries can't deliver the current, your servers get turned off just due to a routine TEST! Why can't they fit an internal dummy load like a small ceramic heater or something - it's only on for about 5 seconds so it won't even get hot.

Yes, APC, I'm talking to you. I've even switched suppliers thinking it must only affect APC units, but it seems all others I've tried have the same issues.

Re:UPSs cause more failures than they prevent! (1)

C_Kode (102755) | about 5 years ago | (#28608231)

The UPS isn't the issue, and taking it out of the loop would be absolutely dumb. The issue is proper maintenance, testing, and most of all the design of HOW it's installed (the scheme, if you will).

We had a power failure (UPS failure) at our backup facility (Peer1 in lower Manhattan). The problem was, they had a UPS, but if that UPS failed, there was nothing behind it. What happened was the UPS failed and the passthrough controller burnt up during the failover and took the entire wing down. They have replaced the old UPS (with two new, bigger daisy-chained UPSes) and have assured me that what happened won't happen again. (We shall see.)

Re:UPSs cause more failures than they prevent! (1)

Critical Facilities (850111) | about 5 years ago | (#28609105)

I'm almost thinking of taking UPS out of the loop here.

That would be suicide. If you think that all of the pieces of gear in a data center could go down and then come back up with no problem, I may have a bridge that you can buy for a very reasonable price.

Why the hell the idiots that design UPSs seem to think it's a good idea to prevent them turning on if they sense a problem with the batteries is beyond me. Why not let the machines power back up but just make a loud beeping noise until the batteries are fixed. Don't they realise that most of the time the UPS will only properly test the batteries when there's an actual power cut?

Oh where to start. First of all, enterprise level UPS Systems (not the little "shoe box" APC unit under your desk) do not shut down on battery issues. At worst, during a catastrophic failure, they will trip to bypass. If properly arranged in a 2N or 2N+1 configuration, your Critical Load will migrate to an alternate, redundant UPS System just as a precaution. If there are battery issues, the data center operators will know it long before the UPS modules register any alarms (or they're not doing their job). Battery PMs are just as important as generator, transfer switch, static switch, and other PMs.

You are also mistaken that the only time batteries can be tested is during an outage. If the preventive maintenance regimen is thorough, there should be full battery discharge testing in addition to quarterly and semi-annual battery PMs checking specific gravity, internal resistance, cell voltages, and various other parameters. In other words, there should be no surprises. True, you can't rule everything out, but you can reduce chance and surprises by a HUGE margin if you're vigilant and thorough.
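
The internal-resistance piece of that regimen is easy to automate once the readings exist; a sketch with made-up milliohm readings and an illustrative 25%-above-average threshold:

    def flag_weak_cells(readings_milliohm, threshold=1.25):
        # Flag cells whose internal resistance sits well above the string average --
        # rising resistance is the usual early warning of a failing VRLA cell.
        baseline = sum(readings_milliohm) / len(readings_milliohm)
        return [i for i, r in enumerate(readings_milliohm) if r > threshold * baseline]

    string = [0.42, 0.44, 0.41, 0.43, 0.61, 0.42]   # hypothetical readings, one weak cell
    print(flag_weak_cells(string))                   # -> [4]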

Re:UPSs cause more failures than they prevent! (1)

asackett (161377) | about 5 years ago | (#28616327)

Ain't it amazin' that all those UPS engineers whose workdays are nothing but thinking about the design of these systems all arrive at the same conclusion and refuse to start their machines if the battery has failed? It's just incredible that these guys with EE degrees come out of college smart enough to do the right things and somehow get it so wrong. Oh wait... I'm being sarcastic, which isn't very nice. I'd never suggest that the idiot isn't them, might instead be you, because that would be bad for my karma.

Re:UPSs cause more failures than they prevent! (1)

jabelli (1144769) | about 5 years ago | (#28616815)

That's because you keep buying the cheap "Back-UPS" home computer crap to run servers on. I'm using a 15-year-old Smart-UPS 600. I've replaced the batteries once. When the original set wore out, it didn't refuse to power up; it complained about the batteries with das blinkenlights and the warning beeper until they were replaced.
