
Car Hits Utility Pole, Takes Out EC2 Datacenter

timothy posted more than 3 years ago | from the nothing-can-go-wr dept.

The Internet

1sockchuck writes "An Amazon cloud computing data center lost power Tuesday when a vehicle struck a nearby utility pole. When utility power was lost, a transfer switch in the data center failed to properly manage the shift to backup power. Amazon said a "small number" of EC2 customers lost service for about an hour, but the downtime followed three power outages last week at data centers supporting EC2 customers. Tuesday's incident is reminiscent of a 2007 outage at a Dallas data center when a truck crash took out a power transformer."


Murphy's law (1, Redundant)

pwnies (1034518) | more than 3 years ago | (#32203336)

"Whatever can go wrong, will" rings pretty true here. Makes for an exciting day of work for them though I suppose; unlike yours truly.
*Goes back to reading /.*

Re:Murphy's law (0, Offtopic)

linuxgeek64 (1246964) | more than 3 years ago | (#32203354)

Semicolons are for connecting two separate but related independent clauses. Can "Unlike yours truly." stand on its own? No, because it's improper to begin a sentence with a conjunction. Nice try, but you still fail to grammar.

Re:Murphy's law (5, Funny)

turing_m (1030530) | more than 3 years ago | (#32203468)

Nice try, but you still fail to grammar.

This is why I long ago resolved to never, ever, ever correct someone else's grammar on slashdot. The risk in inadvertently failing to grammar is unacceptable.

Re:Murphy's law (0, Offtopic)

Fex303 (557896) | more than 3 years ago | (#32203628)

Um.... Whooosh?

Using 'grammar' as a verb is one of those linguistics jokes I love. (Actually, I love all linguistics jokes.) My usual explanation for when I've done a grammar edit on my posts (on forums which support it) is 'Edit: I'm don't grammar'.

A key clue to this being a joke is the use of the word 'fail' which these days is often associated with LOLcats. Those damn cats have raised the use of deliberately bad grammar to an artform in and of itself.

Re:Murphy's law (0, Offtopic)

onionman (975962) | more than 3 years ago | (#32203636)

Nice try, but you still fail to grammar.

This is why I long ago resolved to never, ever, ever correct someone else's grammar on slashdot. The risk in inadvertently failing to grammar is unacceptable.

Here's a wacky thing: the plural form of someone else is actually someone's else, which I only discovered one day when the spell checker kept underlining else's. I'm not correcting you, by the way. My own grammar and spelling are atrocious, so I nearly always fail to grammar. I just thought I'd point out an oddity of the language in case anyone else found it humorous.

Re:Murphy's law (1)

morty_vikka (1112597) | more than 3 years ago | (#32203730)

Here's a wacky thing: the plural form of someone else is actually someone's else.

Ah, I can see the reason for your disclaimer about not having good grammar. "Someone else's" isn't plural, it's possessive! Still an interesting fact though... does it mean the possessive form of someone else is someone's else? Looks pretty wrong to me...

Re:Murphy's law (1, Offtopic)

DarkTempes (822722) | more than 3 years ago | (#32203760)

I think automated spell checking is a poor way to learn grammar and that such tools are frequently wrong.

A quick review makes me suspect that the correct possessive form is still someone else's. (Sources: a dictionary [reference.com], a writing guide [essayinfo.com], and a Google test.)

Re:Murphy's law (0, Redundant)

The Hatchet (1766306) | more than 3 years ago | (#32204032)

Fine grammar is just a formality. Language is a wonderful, ever changing tool. We can use it however we please. Sure, some mistakes are terrible, and accidental, but that does not mean that grammar need be as valuable as gold. We doth need remember that language cannot be controlled without losing that which it is used to create.

We can say things like "I don't grammar" and they convey meaning just as well as saying "I don't pay great attention to or check my grammar." It might sound a bit off, but it does what language is meant to do: convey meaning. The faster and better we can convey meaning, the better we are language-ing. So indeed the phrase "I'm don't grammar" may be terribly flawed, but it conveys meaning more quickly and efficiently than the longer statement, so it can easily be said to have fulfilled its purpose better than the grammatically correct phrase.

Every time I meet a grammar nazi in person I spend about half an hour giving them a speech on why they should go to hell.

Also: I might note that /. comments are terrible for grammar correction, crappy comparisons, and crappy attempts at being condescending. There is so much of it that it often covers up or ignores the important points of a debate. It is just as bad as watching intelligent debates degrade into anger or moronic babble. I would seriously like to see more focus on what is important, and less of this kind of crap, as a general rule. Maybe then someone could learn something besides how to be better at being a useless, progress-impeding grammar nazi.

I suggest you cease and desist. Then we can all get on with our lives.

Re:Murphy's law (1)

The Hatchet (1766306) | more than 3 years ago | (#32203956)

It is efficient for money-making to centralize everything, especially consumer services and money. But when you put all your eggs in one basket, only one basket needs to be broken to scramble all the eggs.

Every day we put more in the cloud. Every day we have less power. What happens when everything is in the cloud? All you need is a truck and a utility pole and hell breaks loose. You can then do as you please while you wait for people to remember how to function without working electronic devices. Imagine the field day organized criminals will have when the police move to the cloud.

Re:Murphy's law (1)

fuzzyfuzzyfungus (1223518) | more than 3 years ago | (#32204078)

There are both economies and diseconomies to centralization. The real issue (in many "cloud" cases) is that some of the things that could be economies of centralization are being skipped in the mad rush for low costs, and since everything is hidden under a shiny layer of web APIs, people don't notice in time.

In this case, for instance, the cost per server, or per unit of work done, to have Real Serious Redundant Power (batteries, generators, multiple utility links, etc.) plummets as the number of servers in a given location increases. As long as people keep in mind that "cloud" equals "buzzword for a set of methods of reducing the transaction costs of outsourcing a variety of IT functions" rather than "magic place where computations are done by happy computrons and packets are carried by unicorns" and ask the appropriate questions, the average reliability of power, bandwidth and other useful stuff should go up. If they fail to keep that in mind and start falling into the stupid (but seductive, and not exactly discouraged by vendors) assumption that "the web API makes it look super easy, so it must be super reliable", market forces will quickly drive the lowest common denominator down to a pretty grim level of service.
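The "plummeting" per-server cost is just fixed-cost amortization. As an illustrative model (the symbols are mine, not the poster's): if Real Serious Redundant Power costs a fixed F per site (generators, transfer gear, extra utility feeds) plus a small variable cost v per server, then the per-server cost for N servers is

C(N) = \frac{F}{N} + v \;\longrightarrow\; v \quad (N \to \infty)

so the redundancy overhead per server shrinks toward the marginal cost as the facility grows.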

Re:Murphy's law (4, Insightful)

JWSmythe (446288) | more than 3 years ago | (#32204382)

Funny thing, I thought "cloud" computing meant that you're placed into an automatically redundant network of machines, so that a site-wide outage wouldn't interfere with operations.

Now I see that Amazon's definition of "cloud" simply means "hosting provider". I guess in this case it means a hosting provider without a DC power room, N+1 generators, or regular testing to ensure the fallback systems actually work.

That kind of reminds me of a company (which will remain nameless) that did tape backups but never verified the tapes. When the data was lost, a good percentage of the tapes didn't work.

I worked near a good datacenter. Out on smoke breaks late at night, you could hear them test-fire their generators once a week. I was in there helping someone one night during a thunderstorm that sounded like it would rip the roof off, when I heard the generators spin up. The inside of the datacenter didn't miss a beat. When I left an hour later, I saw that there was no power (street lights, traffic lights, normally illuminated buildings) for about half a mile around it. The power company had it fixed by morning, and when I came back, everything was fine. Well, except my workstation in the office, which didn't have redundant power.

Re:Murphy's law (0)

Anonymous Coward | more than 3 years ago | (#32204072)

So when the cloud fails... does it evaporate?

Farmville updates on Facebook stop (5, Insightful)

kriston (7886) | more than 3 years ago | (#32203340)

And, as a result, Farmville/Mafiawars updates on Facebook temporarily stop.
Nothing of value was lost.

Re:Farmville updates on Facebook stop (-1, Troll)

Anonymous Coward | more than 3 years ago | (#32203426)

Speak for yourself. I was unable to upload my self-bukkakke pics to Facebook for a whole hour!

Cmdr Taco

Put critical power infrastructure underground (-1, Offtopic)

Anonymous Coward | more than 3 years ago | (#32203348)

They need to put power lines and transformers that support data centers and possibly national security in locking waterproof underground vaults, as they do in new home developments in Southern California.

Where's your cloud now? (4, Funny)

TooMuchToDo (882796) | more than 3 years ago | (#32203352)

"The cloud" doesn't solve everything. Film at 11.

Re:Where's your cloud now? (1)

Anonymous Coward | more than 3 years ago | (#32203404)

Amazon EC2 is just a Xen VM, not true cloud computing. Discuss amongst yourselves.

Re:Where's your cloud now? (0)

Anonymous Coward | more than 3 years ago | (#32203568)

'cloud computing' is completely meaningless.

Re:Where's your cloud now? (0)

Anonymous Coward | more than 3 years ago | (#32203532)

"The cloud" doesn't solve everything. Film at 11.

Hey, you. Get off of my cloud.

Re:Where's your cloud now? (1)

GaryOlson (737642) | more than 3 years ago | (#32203680)

But a thick cloud with high density can cover up a lot of ugly infrastructure no one wants to see. Just ask the people who live in San Francisco.

The poll... (1)

Dayofswords (1548243) | more than 3 years ago | (#32203362)

The poll goes perfectly with the story.

The cloud is nice, but unreliable, it is.

Re:The poll... (1)

Jeian (409916) | more than 3 years ago | (#32204218)

It has nothing to do with "the cloud", other than that the datacenter affected happened to host one. It could've been a dedicated server and it would've had the same problem.

It's failure on multiple levels (5, Insightful)

GilliamOS (1313019) | more than 3 years ago | (#32203366)

Blame Amazon for not load-testing their emergency backup power on a regular basis and for not having more than one connection to the power grid, and blame the power grid for not having redundancies. Our aging power grid is really beginning to show its age on so many levels; this is going to become a lot more common over the coming years.

Re:It's failure on multiple levels (0)

Anonymous Coward | more than 3 years ago | (#32203546)

Civilized countries bury their cables underground.
Only the US seems to use poles anymore.

Re:It's failure on multiple levels (4, Interesting)

GaryOlson (737642) | more than 3 years ago | (#32203656)

Most Americans these days are over-pampered, self-absorbed malcontents. If the poles are not out in front where crews can service them without going onto the property -- or even when using predefined rights of way -- too many people complain or sue for negligible property damage.

Where I grew up, the power poles ran on the property lines behind and between the houses. Once, lightning took out the transformer on the power pole [great light show and high-speed spark ejection]; and people were willing to take down the fence, put the dogs in a kennel, and remove landscaping which had encroached on the power pole so the crew could replace the transformer and restore service. Today, I expect everyone shows up with a digital camera to document "property damage" and file for compensation for landscaping which has illegally encroached on the equipment.

In many places, various issues prevent burying the power cable: a high water table, daytime temperatures which never let the ground -- and the power cables -- cool down, or even fire ants.

Re:It's failure on multiple levels (0)

MichaelSmith (789609) | more than 3 years ago | (#32203716)

Civilized countries bury their cables underground.
Only the US seems to use poles anymore.

You need to get out more.

Re:It's failure on multiple levels (4, Insightful)

OnlineAlias (828288) | more than 3 years ago | (#32203552)

You said it. They failed to test. I design and run datacenters, and have had exactly this kind of thing happen recently. No outage, hardly anyone even noticed. My most critical stuff runs active/active out of multiple data centers... you could nuke one of them and everything would still be up.

I'm actually a little blown away that the all-powerful Amazon could let this kind of thing happen. They are supposed to be a pro team; a power failure is high school ball.
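For anyone wondering what "active/active" buys you in practice: both sites serve live traffic all the time, and a health check steers requests toward whatever is still answering. A minimal sketch in Python -- the site names, URLs, and threshold are hypothetical, not a description of the poster's setup:

import urllib.request

SITES = {
    "dc-east": "http://dc-east.example.com/health",
    "dc-west": "http://dc-west.example.com/health",
}

def healthy_sites(timeout=2.0):
    """Return the subset of sites answering their health check."""
    up = []
    for name, url in SITES.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    up.append(name)
        except OSError:
            pass  # timeouts and refused connections count as down
    return up

if __name__ == "__main__":
    up = healthy_sites()
    # With active/active, losing one site just shrinks this list;
    # traffic keeps flowing to whatever remains.
    print("serving from:", up or "NOWHERE - page someone")

A real deployment would wire this into DNS or a load balancer rather than a print statement, but the failure mode is the same: one site vanishing merely reduces capacity.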

Re:It's failure on multiple levels (1)

Itninja (937614) | more than 3 years ago | (#32203710)

No outage, hardly anyone even noticed.

So, how is this different? A teeny, tiny percentage of users even noticed this, and no data was lost. It's foolish to think one's data center is immune to outages (power or otherwise) from time to time, no matter how well it's designed. But apparently this is the latest of several outages over the past few weeks, which is kind of like amateur hour.

Failure is often not a boolean (5, Interesting)

mcrbids (148650) | more than 3 years ago | (#32204070)

For years, I co-located at the top-rated 365 Main data center in San Francisco, CA [365main.com], until they had a power failure a few years ago. Despite having 5x redundant power that was regularly tested, it apparently wasn't tested against a *brownout*. So when Pacific Gas and Electric had a brownout, it failed to trigger 2 of the 5 redundant generators. Unfortunately, the system was designed to survive the failure of any *one* of the redundant generators -- not two.

So power was in a brownout condition: the voltage dropped from the usual 120 volts or so down to 90. Many power supplies have brownout detectors and will shut off. Many did, until the total system load dropped to the point where normal power was restored. All of this happened within a few seconds, and the brownout was fixed in just a few minutes. But at the end of it all, perhaps 20% of all the systems in the building had shut down. The "24x7 hot hands" were beyond swamped. Techies all around the San Francisco area were pulled from whatever they were doing to converge on downtown SF. And I, a four-hour drive away, managed to restore our public-facing services on the one server (of four) that survived the voltage spikes before driving in. (Alas, my other servers had the "higher end" power supplies with brownout detection.)
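For scale (the cutoff behavior below is a general observation, not a figure from the story): the sag took the line to

\frac{90\ \text{V}}{120\ \text{V}} = 0.75

of nominal, i.e. 75%, while the undervoltage cutoffs on supplies with brownout protection typically sit noticeably higher than that -- which is why so many of them dropped out at once.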

And so it was a long chain of near-successes: well-tested, high-quality equipment failing in sequence because real life didn't happen to behave like the frequently performed tests.

When I did finally arrive, the normally quiet, meticulously clean facility was a shambles. Bits of network cable, boxes of freshly purchased computer equipment, pizza boxes, and other refuse were to be found in every corner. The aisles were crowded with techies performing disk checks and chattering tersely on cell phones. It was other-worldly.

All of my systems came up normally; simply pushing the power switch and letting the fsck run did the trick. We were fully back up, with all tests performed (and the system configuration returned to normal), in about an hour.

Upon reflection, I realized that even though I had some down time, I was really in a pretty good position:

1) I had backup hosting elsewhere, with a backup from the previous night. I could have switched over, but decided not to, because we had current data on one system and we figured it was better that nobody lose any data than that everybody lose the morning's work.

2) I had good quality equipment; the fact that none of my equipment was damaged from the event may have been partly due to the brownout detection in the power supplies of my servers.

3) At no point did I have fewer than two backups off site, in two different locations, so I had multiple recent data snapshots off site. Long as the daisy chains of failure can be, it would be freakishly rare to have all of these points go down at once.

4) Even with 75% of my hosting capacity taken offline, we were able to maintain uptime throughout all this because our configuration has full redundancy within our cluster - everything is stored in at least 2 places onsite.

Moral of the story? Never, EVER have all your eggs in one basket.

Re:It's failure on multiple levels (4, Informative)

fractalVisionz (989785) | more than 3 years ago | (#32203596)

It seems you didn't RTFM. Only one switch out of many failed, due to it being set up from the factory incorrectly. The rest of the system switched over properly. I would say that is pretty good considering the data center size and number of switches needed for redundancy.

Re:It's failure on multiple levels (4, Insightful)

TubeSteak (669689) | more than 3 years ago | (#32204252)

Only one switch out of many failed, due to it being set up from the factory incorrectly. The rest of the system switched over properly. I would say that is pretty good considering the data center size and number of switches needed for redundancy.

Sounds like Amazon's tech monkeys didn't do their job when they received the hardware from the factory.
Or is it normal to just plug in mission-critical hardware and not check that it is set up properly?

"We have already made configuration changes to the switch which will prevent it from misinterpreting any similar event in the future and have done a full audit to ensure no other switches in any of our data centers have this incorrect setting," Amazon reported.

I guess TFA answered that question.
If they're smart, they'll be creating policies for those types of audits to be done up front instead of after a failure.
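The up-front version of that audit is simple enough to automate. A hedged sketch in Python -- the file layout and config format are hypothetical, since TFA doesn't describe Amazon's tooling: diff every transfer-switch config against a known-good baseline at receiving time, not after an outage.

from pathlib import Path

BASELINE = set(Path("configs/baseline.conf").read_text().splitlines())

def audit(config_dir="configs/switches"):
    """Report any switch config that deviates from the golden baseline."""
    drift = {}
    for cfg in Path(config_dir).glob("*.conf"):
        lines = set(cfg.read_text().splitlines())
        delta = sorted(BASELINE ^ lines)  # settings missing or unexpected
        if delta:
            drift[cfg.name] = delta
    return drift

if __name__ == "__main__":
    for name, delta in audit().items():
        print(f"{name}: deviates from baseline: {delta}")

Run when the hardware arrives, this catches a factory misconfiguration before the switch ever sees a real transfer event.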

Re:It's failure on multiple levels (1)

ToasterMonkey (467067) | more than 3 years ago | (#32203732)

From the summary, I gathered the problem was with the mechanical switch that disconnects external power when the generators are brought online, not a lack of capacity. It still requires testing, but that isn't going to be done often, because isn't this exactly the result you risk whenever the power doesn't transition smoothly?

Re:It's failure on multiple levels (1)

profplump (309017) | more than 3 years ago | (#32204000)

It is, but you can test during pre-defined maintenance windows when downtime is expected, or you can migrate active services to other hosts and leave these running as backups during the test, so that a failure does not bring down the primary.

Re:It's failure on multiple levels (1)

omglolbah (731566) | more than 3 years ago | (#32204182)

It is much better to have a scheduled test, with people ready to take care of any issues that may or may not pop up, than to have a piece of equipment fail at a random time with few people prepared...

I sure as hell would rather have a blip every now and then than know that the system might fail catastrophically when something unexpected happens.

Re:It's failure on multiple levels (1)

DogDude (805747) | more than 3 years ago | (#32203906)

Our aging power grid is really beginning to show on so many levels that this is going to become a lot more common over the coming years.

That's why Google is locating all of their datacenters near natural power sources and is a registered utility agent whatchamajigger. I think that they agree with you.

Cloud is a poor metaphor anyway (1)

dbIII (701233) | more than 3 years ago | (#32203962)

It's also completely expected by those not sold on pure science fiction.
All it takes is a backhoe in the wrong place at the wrong time, or a dragging anchor cable, to cut you off from the very real one or two bits of infrastructure that people fantasise is their bit of the "cloud".

Obvious solution (4, Funny)

nebaz (453974) | more than 3 years ago | (#32203376)

Utility poles clearly need countermeasures. Hellfire missiles and such. That'll teach 'em to mess with a poor defenseless pole.

Re:Obvious solution (2, Informative)

binarylarry (1338699) | more than 3 years ago | (#32203394)

Think of the poor strippers man!

stupid mods, trickz are for kidz (0)

Anonymous Coward | more than 3 years ago | (#32203718)

Funny != insightful

Re:stupid mods, trickz are for kidz (3, Interesting)

Coopjust (872796) | more than 3 years ago | (#32203810)

Often, mods will give a funny post "insightful" instead of "funny" because it gives the user positive karma (whereas funny does not affect karma). Not a use intended by CmdrTaco, I'd imagine, but it's a common practice.

Re:Obvious solution (0)

Anonymous Coward | more than 3 years ago | (#32204174)

"poor defenseless pole" and not one reply about Hitler or Nazi Germany? for shame!

An untested DR plan is a worthless DR plan (3, Interesting)

realmolo (574068) | more than 3 years ago | (#32203418)

Seriously, Amazon screwed up in a fairly major way with this.

What's more upsetting is this: if Amazon doesn't have working disaster recovery, what do other websites/companies have?

Answer: Nothing. You'd be surprised how many US small-to-medium-sized businesses are one fire/tornado/earthquake/hurricane away from bankruptcy. I'd bet it's over 80% of them.

Re:An untested DR plan is a worthless DR plan (1)

FictionPimp (712802) | more than 3 years ago | (#32203524)

The place I work just had the exact same problem. DC caps went bad and nobody noticed. Power went out and the backup didn't have enough juice to let the batteries kick in and move to the generator. At least I don't feel really bad now, just bad.

Re:An untested DR plan is a worthless DR plan (3, Insightful)

Albanach (527650) | more than 3 years ago | (#32203712)

Seriously, Amazon screwed up in a fairly major way with this.

What's more upsetting is this: if Amazon doesn't have working disaster recovery, what do other websites/companies have?

What on earth leads you to suggest they don't have working disaster recovery? They experienced some disparate power outages and say they're implementing changes to improve their power distribution.

I've hosted in data centers where the UPS was regularly tested, yet in a real live incident the switchover failed. Even though the UPS did come up, there was a brief outage that shut down all the racks. Each rack then needs to be brought back online one at a time to prevent overloading. Immediately you're looking at significant downtime.

I've hosted in another data center where someone hit the BIG RED BUTTON underneath the plastic case, cutting off power to the floor.

I'm sure Amazon could have done things better and will learn lessons. That's life in a data center.

Nonetheless, Amazon allow you to keep your data at geographically diverse locations. As a customer you can pay the money and get geographic diversity that would have mitigated this. If you don't take advantage of that, you can hardly blame Amazon for your decision.

Re:An untested DR plan is a worthless DR plan (1)

crazybit (918023) | more than 3 years ago | (#32204216)

What on earth leads you to suggest they don't have working disaster recovery?

The fact that their service was partially cut by a power failure. We know accidents DO happen and power failures DO happen, like the explosion in The Planet's power control room [1]. The guys at Amazon's cloud should be prepared for predictable problems like a power outage, especially when one of their selling arguments is service continuity.

[1] [datacenterknowledge.com]

Re:An untested DR plan is a worthless DR plan (1)

TubeSteak (669689) | more than 3 years ago | (#32204284)

I've hosted in data centers where the UPS was regularly tested, yet in a real live incident the switchover failed. Even though the UPS did come up, there was a brief outage that shut down all the racks. Each rack then needs to be brought back online one at a time to prevent overloading. Immediately you're looking at significant downtime.

Doesn't the "U" in "UPS" stand for "Uninterruptible"?
Soooo... forgive my ignorance, but how does hardware hooked up to a UPS have "a brief outage"?

Re:An untested DR plan is a worthless DR plan (1)

AK Marc (707885) | more than 3 years ago | (#32204046)

Same with data backups. People just put in untested redundancy and a backup program that says "completed", and live happily. At least until the first time something fails.

Testing costs time and money. It's easier to point to a job status that says "completed", or an invoice for the right pieces, and say "it was the vendor's fault."
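The antidote to trusting "completed" is a periodic restore test. A minimal sketch in Python -- the archive format and paths are hypothetical; the point is only that verification means restoring and comparing, not reading a status field:

import hashlib
import tarfile
import tempfile
from pathlib import Path

def sha256(path):
    """Checksum a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(archive, originals_dir):
    """Restore the archive to a scratch dir and checksum every restored
    file against the live copy; return the paths that don't match."""
    mismatches = []
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        for restored in Path(scratch).rglob("*"):
            if restored.is_file():
                rel = restored.relative_to(scratch)
                live = Path(originals_dir) / rel
                if not live.exists() or sha256(live) != sha256(restored):
                    mismatches.append(str(rel))
    return mismatches

A job that runs this weekly and pages on a non-empty list is worth more than any number of "completed" log lines.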

Re:An untested DR plan is a worthless DR plan (1)

lena_10326 (1100441) | more than 3 years ago | (#32204144)

Amazon has an insane amount of redundancy with dozens of physical data centers spread over the world. They regularly perform game day disaster scenarios taking out entire data centers to test the recovery of the infrastructure and Amazon applications.

In this instance, you'll note only a few clients were impacted, because a switch had an incorrect configuration. There is not much you can do about some types of human error, which can come from all sorts of unexpected angles. Regardless, a number of EC2 nodes were lost, but they were replaceable with EC2 nodes in other data centers. If clients lost data, it was because those clients didn't follow the principle of building redundancy into their applications. Amazon can only implement redundancy in the infrastructure, not in client applications. Amazon advises EC2 customers not to build single-node dependencies into their apps. This could not be made clearer in their documentation and support.

You are very ignorant regarding your speculation about Amazon's infrastructure.
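What "no single-node dependencies" looks like in practice is spreading identical instances across availability zones, so that one data center failing leaves the rest serving. A hedged sketch using boto3 (a library that postdates this thread; the AMI ID and zone names are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# One small instance per availability zone: losing a zone (or the
# data center behind it) still leaves two of the three running.
for zone in ["us-east-1a", "us-east-1b", "us-east-1c"]:
    ec2.run_instances(
        ImageId="ami-00000000",   # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )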

UPS's (4, Interesting)

MichaelSmith (789609) | more than 3 years ago | (#32203446)

The classic in my last job was when we had a security contractor in on the weekend hooking something up and he looped off a hot breaker in the computer room, slipped, and shorted the white phase to ground. This blew the 100A fuses both before and after the UPS and somehow caused the generator set to fault so that while we had power from the batteries, that was all we had.

It also blew the power supply on an alphaserver and put a nice burn mark in the breaker panel. So the UPS guy comes out and he doesn't have two of the right sort of fuse. Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc. So we got going in the end.

Re:UPS's (1, Insightful)

e9th (652576) | more than 3 years ago | (#32203648)

Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc.

Strips of steel with holes in them? You're kidding, right?

Re:UPS's (2, Informative)

MichaelSmith (789609) | more than 3 years ago | (#32203694)

Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc.

Strips of steel with holes in them? You're kidding, right?

No. It would be 50 x 15 x 5 mm steel with a 10mm hole drilled in each end. A bolt goes through each hole into a threaded attachment point.

Now that you mention it, I recall that a four-inch nail is good for 100A slow-blow, but that's cylindrical, so it conducts nicely. You'd think the rectangular cross-section would not conduct quite as well (sharp corners, etc.) but maybe it is also tuned for the desired current. A little saw cut halfway between the holes would do that.

Re:UPS's (1)

dbIII (701233) | more than 3 years ago | (#32203990)

Steel is an incredibly bad conductor as metals go - just think of a blacksmith safely holding a bit of steel that is red hot at the other end if you've never done that yourself. Electrical conductivity is related to thermal conductivity.

Not really (4, Informative)

Sycraft-fu (314770) | more than 3 years ago | (#32203702)

All a fuse is, is a piece of metal that will melt fairly quickly when a given amount of current is passed through it, the idea being that it heats up and melts before the wires can. So the bigger the current, the more robust the metal connecting it. A 100A fuse is usually a fairly large strip of steel.

Now, I'll admit that just grabbing an approximate size of steel and dropping it in, as the GP did, isn't going to yield a nice precise fuse. It may have been rated for too high a current. However, it'd work for getting things running again and would probably provide a modicum of protection in the event of a short.
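There is even an old rule of thumb for sizing such improvised fuses: Preece's formula estimates the fusing current of a round wire as

I = a \, d^{3/2}

where d is the wire diameter in inches and a is a material constant (commonly tabulated at roughly 10,000 for copper and about 3,000 for iron -- treat these as ballpark figures). A hand-filed rectangular strip won't match the round-wire constants, which is exactly why such a repair gives only a modicum of protection.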

Re:UPS's (1)

grcumb (781340) | more than 3 years ago | (#32203776)

Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc.

Strips of steel with holes in them? You're kidding, right?

Yeah, so what? I mean what could possibly go wro

Re:UPS's (1)

fenix849 (1009013) | more than 3 years ago | (#32203782)

He must be, because all the other fuses are ceramic or glass, containing a wire of a specific thickness made from some sort of conductive material that will melt at a given amount of current.

The melting bit is optional if you don't mind being killed by your toaster.

Re:UPS's (1)

mystik (38627) | more than 3 years ago | (#32203802)

Funny -- something almost exactly like that happened at my datacenter last year. Some 'licensed' electrician accidentally shorted something in the main power junction, which took the whole damn thing offline -- generators, batteries, and all. We were down over an hour while the techs came on site to ensure that things were safe and could come back online. Meanwhile, a small group of admins with pitchforks (myself included) waited just outside for the all-clear to swarm the datacenter and get our equipment back online...

Re:UPS's (2, Informative)

omglolbah (731566) | more than 3 years ago | (#32204206)

Just be glad nobody got killed...

Shorting out something in a main power junction could easily have created a fairly nasty fire...

Re:UPS's (1)

seanadams.com (463190) | more than 3 years ago | (#32204194)

shorted the white phase to ground

What the hell is the "white phase"? Unless I am missing some newfangled data-center lingo, you are talking about the neutral, which is not a "phase" at all, and could never produce such a fault current when "shorted" to ground since it is already tied to ground at the panel. Am I missing something?

Re:UPS's (2, Informative)

MichaelSmith (789609) | more than 3 years ago | (#32204386)

shorted the white phase to ground

What the hell is the "white phase"? Unless I am missing some newfangled data-center lingo, you are talking about the neutral, which is not a "phase" at all, and could never produce such a fault current when "shorted" to ground since it is already tied to ground at the panel. Am I missing something?

You have three actives (red, white, dark blue here in .AU), a neutral, and an earth. The Wikipedia page says different countries have different colour codes, so maybe that is the confusion.

Unreasonable expectations (4, Interesting)

KGBear (71109) | more than 3 years ago | (#32203470)

I expect this is just a scaled-up version of the problems I deal with every day, and I'm sure I'm not the only one. Users have grown so dependent on system services, and management has grown so far from the trenches, that completely unreasonable expectations are the norm. Where I work, for instance, it's almost impossible to even *test* backup power and failover mechanisms and procedures, because users consider even minor outages in the middle of the night unacceptable, and managers either don't have the clout or don't understand the problem well enough to put limits on such expectations. As a result, often the only tests such systems get happen during real emergencies, when they are actually needed. I don't know how, but I feel we should start educating our users and managers better, not to mention being realistic about risks and expectations.

Re:Unreasonable expectations (1)

SheeEttin (899897) | more than 3 years ago | (#32203910)

users consider even minor [test] outages in the middle of the night unacceptable

...and this is why we have redundancy.
Test the backup hardware. Works? Switch over to it, test the main hardware. Works? All good, no (or negligible) downtime.

Re:Unreasonable expectations (1)

omglolbah (731566) | more than 3 years ago | (#32204214)

Yes, but what management probably worries about is: "What if the redundant system fails while you are testing the primary?"

So they won't let us lowly engineers do the test, opting instead for the chance of a disaster...

I'm glad I work in the oil business... safety is ALWAYS the most important thing, since any failure will be horribly expensive :-p

Hurrr, durrr (1, Informative)

planetoid (719535) | more than 3 years ago | (#32203504)

Stop building those things so fucking close to the roads, maybe?

Re:Hurrr, durrr (2, Insightful)

MichaelSmith (789609) | more than 3 years ago | (#32203516)

Stop building those things so fucking close to the roads, maybe?

What about your power supply? Is that not allowed to run along a road? I am all for underground power, BTW, but I know that if you operate a digger and want to find the owner of a cable, the easiest way is to break it and wait for the complaints.

Re:Hurrr, durrr (5, Funny)

plover (150551) | more than 3 years ago | (#32203952)

What about your power supply? Is that not allowed to run along a road? I am all for underground power, BTW, but I know that if you operate a digger and want to find the owner of a cable, the easiest way is to break it and wait for the complaints.

That's also the fastest way to get rescued off a desert island or out in the woods, and why you should always carry a piece of fiber in your pocket. Should you get stranded, you simply bury the fiber, and some asshole with a backhoe will be along in about five minutes to cut it. Ask him to rescue you.

Transfer switches suck? (2, Interesting)

pavera (320634) | more than 3 years ago | (#32203528)

The DC that my company colos a few racks in had the same thing happen about a year ago (not a car crash; a transformer just blew out). The transfer switch failed to switch to backup power, and the DC lost power for 3 hours.

What is up with these transfer switches? Do the DCs just not test them? Or is it the sudden loss of power that freaks them out, versus the controlled "OK, we're cutting to backup power now" that would occur during a test? Someone with more knowledge of DC power systems might enlighten me...

Re:Transfer switches suck? (1)

jroysdon (201893) | more than 3 years ago | (#32203944)

I don't get why you wouldn't have dual-redundant power supplies on all devices (routers, switches, servers), one on transfer switch A and the other on transfer switch B, each connecting to a different backup power source. Further, these should be tested on a regular basis (at least monthly): test transfer switch A on the 1st and transfer switch B on the 15th.

Seems like a design flaw here and/or someone was just being cheap.
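That alternating calendar is trivial to encode. A toy sketch in Python -- in real life this would open a maintenance ticket rather than flip breakers itself, and the 1st/15th dates are just the poster's example:

import datetime

def switch_under_test(today=None):
    """Return which transfer switch, if any, is due for a load test."""
    today = today or datetime.date.today()
    if today.day == 1:
        return "A"
    if today.day == 15:
        return "B"
    return None

if __name__ == "__main__":
    due = switch_under_test()
    if due:
        print(f"scheduled: load-test transfer switch {due} tonight")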

Re:Transfer switches suck? (1)

aaarrrgggh (9205) | more than 3 years ago | (#32204136)

The problem usually isn't the transfer switch itself, but how it works with everything else. Transfer switches usually only really fail from contact damage.

Cascading failures are a bigger problem for most co-los, as they try to maximize infrastructure utilization to a fault.

Restoring power can be quite difficult.

Re:Transfer switches suck? (2, Interesting)

Technonotice_Dom (686940) | more than 3 years ago | (#32204358)

I don't get why you wouldn't have dual-redundant power supplies on all devices (routers, switches, servers), .... [snip]

Seems like a design flaw here and/or someone was just being cheap.

It would be the latter. AWS EC2 instances aren't marketed or intended to be individually highly available. They're designed to be cheap, and Amazon do say instances will fail. They provide a good number of data centres and specifically say that systems within the same zone may fail; different data centres are entirely independent. They also provide a number of extra services that can tolerate the loss of one data centre.

Anybody who believes they're getting highly available instances hasn't done a basic level of research on the platform they're using and deserves to be bitten by this. Anybody who does know the basics of the platform will know the risks and will be able to recover from a failure, possibly even seamlessly.

Re:Transfer switches suck? (2, Interesting)

Renraku (518261) | more than 3 years ago | (#32204362)

It's not really the DC power system that's the issue.

The people are the issue.

Example: you're the lead technician for a new data center. You request that backup power systems be written into the budget, and are granted your wish. You install the backup power systems and then ask to test them. Like a good manager, your boss asks you what that will involve. You say that it'll involve testing the components one by one, which he nods in agreement with. However, when you get to the 'throw the main breaker and see if it works' part and he realizes that this one test might make them less than 99.99999% reliable if it fails, he balks and won't approve the testing.

I can see where they're coming from here. They don't want downtime. They just aren't thinking far enough ahead: ten minutes of test downtime versus hours of unmitigated downtime. I abso-fucking-lutely guarantee you that the technicians will be blamed. Not management.

Oil's Well (2, Insightful)

Aeonite (263338) | more than 3 years ago | (#32203530)

It's a good thing that oil rigs are better managed than data centers. Who knows what might happen if one of them ever had a problem like this?

Re:Oil's Well (1)

omglolbah (731566) | more than 3 years ago | (#32203630)

Yep, but what is important to keep in mind is that if an oil rig has to shut down for a day due to a power issue, the oil will still be in the ground.
The company might lose money for having promised a certain supply (especially with gas!), but the resource is not lost.

In a datacenter, uptime is all there is. Value is lost.

The oil rigs in Norwegian waters are fairly secure when it comes to power faults. If the system cannot guarantee power, it goes into a shutdown sequence that sets everything to a "safe" position.
Shutdowns are actually not that rare. Minor shutdowns of parts of a platform or refinery are not very dramatic; it is just a case of getting the bugger up and running again in a safe way.

I'd give details but unfortunately I am under NDA :-p

Re:Oil's Well (1, Informative)

Anonymous Coward | more than 3 years ago | (#32203992)

I do believe there is a whooshing sound you missed. He is referring to the BP Gulf oil spill, although that was not caused by a power failure.

Re:Oil's Well (1)

DNS-and-BIND (461968) | more than 3 years ago | (#32204052)

Who knows what might happen if one of them ever had a problem like this?

*WHOOSH* He's not talking about the BP oil spill.

Re:Oil's Well (0)

Anonymous Coward | more than 3 years ago | (#32204100)

Uhh, Big time Whoosh!

Re:Oil's Well (0)

Anonymous Coward | more than 3 years ago | (#32203720)

If a car takes out an oil rig we've got bigger problems than just some spilled oil.

Oh noes (0)

Anonymous Coward | more than 3 years ago | (#32203580)

Is it just me, or is the placement of a lot of recent links aggravating? Shouldn't the link be on "Amazon cloud computing data center lost power" and not on the bit about a utility pole getting struck?

I think it should be, but I'm no Angus Mickleburger.

I'm confused (3, Funny)

OverlordQ (264228) | more than 3 years ago | (#32203676)

Why couldn't they just get power from the cloud?

Re:I'm confused (1)

martin-boundary (547041) | more than 3 years ago | (#32204190)

Why couldn't they just get power from the cloud?

The Indian who usually does the Rain and Lightning Dance was on vacation.

Re:I'm confused (1)

bpcomp (1811274) | more than 3 years ago | (#32204224)

Why couldn't they just get power from the cloud?

They didn't have the right security key attached to the kite.

Re:I'm confused (0)

Anonymous Coward | more than 3 years ago | (#32204262)

Because it isn't a stormcloud.

Redundancies, Redundancies (0)

Anonymous Coward | more than 3 years ago | (#32203740)

That is why datacenters should have (and my company does have) dual UPSes, dual transfer switches, and dual generators. They also should not load any circuit over 50%, to ensure that a cascading failure won't happen if power is lost on one side.
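The 50% rule is straight arithmetic for 2N power. If C is a circuit's rated capacity and L the total load shared by an A/B pair, each side normally carries L/2; keeping

\frac{L}{2} \le 0.5\,C \quad\Longrightarrow\quad L \le C

means the surviving side stays within its rating when the other side drops, instead of tripping and cascading.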

Re:Redundancies, Redundancies (4, Insightful)

mirix (1649853) | more than 3 years ago | (#32204008)

Redundancy costs money. If it costs more than downtime, you don't get it.

Totally Unexpected (1)

NicknamesAreStupid (1040118) | more than 3 years ago | (#32204088)

Who, while driving through a cloud, would ever expect to hit a utility pole? Clouds do not have utility poles. Now, tule fog has utility poles. That is not why they call it 'tule' (it's named for a grass, not a nickname for utility), but many a utility pole has been unduly undone because someone drove through the tule fog and into a utility pole.

If Amazon is going to put utility poles in its 'cloud', then they are really in a fog. Call it fog computing.