
Amazon EC2 Failure Post-Mortem

Soulskill posted more than 3 years ago | from the ted-tripped-over-a-power-cord dept.


CPE1704TKS tips news that Amazon has provided a post-mortem on why EC2 failed. Quoting: "At 12:47 AM PDT on April 21st, a network change was performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."
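A rough back-of-the-envelope sketch of the failure mode described above -- why shifting the primary EBS traffic onto the lower-capacity replication network saturates it. All numbers below are invented for illustration; Amazon does not publish link capacities.

    # Toy numbers, purely illustrative -- Amazon does not publish link capacities.
    PRIMARY_CAPACITY_GBPS = 40.0     # high-bandwidth primary EBS network
    SECONDARY_CAPACITY_GBPS = 4.0    # lower-capacity replication ("overflow") network
    primary_traffic_gbps = 25.0      # steady-state EBS traffic on the primary

    # Intended step: drain one redundant primary router and let the other
    # primary router carry the load (still well within capacity).
    intended_utilization = primary_traffic_gbps / PRIMARY_CAPACITY_GBPS

    # Actual step: the traffic was shifted onto the replication network instead.
    actual_utilization = primary_traffic_gbps / SECONDARY_CAPACITY_GBPS

    print(f"intended: {intended_utilization:.0%} of primary capacity")
    print(f"actual:   {actual_utilization:.0%} of secondary capacity -- overloaded")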


Oops (0)

Anonymous Coward | more than 3 years ago | (#35973574)

So basically they unleashed all the traffic on a poor little network that couldn't handle it. Somebody dun goofed...

Re:Oops (2)

Whalou (721698) | more than 3 years ago | (#35973944)

Kudos to Amazon for rapidly explaining, at length, what happened.

Unlike some other company... *cough* Sony *cough*

Re:Oops (2)

DrXym (126579) | more than 3 years ago | (#35974344)

Sony hasn't fixed their issue. Kind of hard to have a post mortem while the solution is still ongoing. There has been plenty of extrapolation and bullshit in the information vacuum surrounding the attack, though. So when things return to normality it would be in their interest to provide a decent technical overview of what happened, the safeguards that were there before, why they failed, and what steps have been taken since to improve things.

Re:Oops (1)

atisss (1661313) | more than 3 years ago | (#35974564)

Is Sony dead yet?

Re:Oops (1)

postbigbang (761081) | more than 3 years ago | (#35974466)

It was good that they were forthcoming, as competitors are both breathing down their necks, and also looking at their own infrastructure for possible race conditions that would crater post-failure storage isolation(s).

They also admitted -- but don't seem to get the message -- that their focus has been on developing novel customer solutions, NOT on keeping the core infrastructure bulletproof. Loose-and-fast rather than unrelenting QA will cause Amazon a lot of pain; it'll be hard to trust them until they can prove their infrastructure, multi-zone storage architecture, and clustered instances work together across a broad spectrum of failure modes.

In English: they took their eye off the ball because the sales department distracted the core QA function -- and it blew up, badly and expensively.

Re:Oops (1)

datapharmer (1099455) | more than 3 years ago | (#35974742)

They don't care; they are making too much money off of spammers and script kiddies to worry about reliability. Blocking their IP ranges reduced attempts on my servers by a significant percentage, and their abuse process amounts to asking the customer what happened.... it is pretty clear what happened: they were running a spam network for xyz erection pills; cut them off already. I have a list of about 7 hosting companies that, if they could be disconnected from their peers, would make internet spam and related sites plummet within a week. Amazon is on that list. Oh, can't beat the captcha? Pay turkers 5 cents to fill them out for you...

Re:Oops (1)

darth dickinson (169021) | more than 3 years ago | (#35974606)

Or Google...

I realise this is "News for Nerds"... (3, Funny)

Haedrian (1676506) | more than 3 years ago | (#35973576)

But can I get an understandable car analogy here?

Re:I realise this is "News for Nerds"... (4, Informative)

MagicM (85041) | more than 3 years ago | (#35973604)

Instead of closing off one lane of highway for construction, they closed off all lanes and forced highway traffic to go through town. The roads in town weren't able to handle all the cars. Massive back-ups ensued.

Re:I realise this is "News for Nerds"... (0)

Anonymous Coward | more than 3 years ago | (#35973650)

The bridge broke and all the cars fell into the river. None of them can be recovered.

Re:I realise this is "News for Nerds"... (1)

Anonymous Coward | more than 3 years ago | (#35973832)

From what I've learned from Pontifex and other bridge-building games, it's not really that hard to build bridges. Common sense goes a long way in seeing how much a bridge can handle. I don't know how it is with everyone else, but I've always somehow "seen" in my head exactly how physics function. This made it great for me when playing baseball (not the US version), because I could see directly when you should hit the ball, how hard, and where it's going to land. Same thing when I was catching the ball.

Re:I realise this is "News for Nerds"... (1)

A Big Gnu Thrush (12795) | more than 3 years ago | (#35974304)

There's so much wrong with this post, I don't even know where to begin, but what I really want to know is... ...there's a non-US version of baseball?!?

Re:I realise this is "News for Nerds"... (1)

datapharmer (1099455) | more than 3 years ago | (#35974786)

Thank you, I was wondering that myself. I know it is played in other places, but aren't the rules very similar?

Re:I realise this is "News for Nerds"... (1)

x*yy*x (2058140) | more than 3 years ago | (#35975732)

h ttp://en.wikipedia.org/wiki/Pesäpallo (slashcode breaks the link so copypaste)

Re:I realise this is "News for Nerds"... (0)

Anonymous Coward | more than 3 years ago | (#35974726)

I've always somehow "seen" in my head exactly how physics function.

Cool. Maybe you could clear up a small matter for the rest of us? It's called the "Theory of Everything", and certainly someone who sees exactly how "physics function" [sic] shouldn't have any problem.

Re:I realise this is "News for Nerds"... (1)

by (1706743) (1706744) | more than 2 years ago | (#35977624)

Indeed -- such visionary powers would've come in handy on some of my old quantum psets!

I suspect the comment was referring to simple classical mechanical systems. I do find it fascinating that on a windy day at the beach, I can throw a baseball well over a hundred feet and have the other person catch it without needing to move (and I'm not particularly coordinated). Granted, with such lax accuracy, the relevant calculations aren't too tricky, but I still find it neat that humans (and other animals) have such a good intuitive sense of (classical) mechanics.

Re:I realise this is "News for Nerds"... (0)

Anonymous Coward | more than 3 years ago | (#35974144)

So the intertubes *are* like a big truck...

Re:I realise this is "News for Nerds"... (1)

BigSlowTarget (325940) | more than 3 years ago | (#35974296)

Damn that's close. It's freaky how almost anything can be expressed as a car analogy.

Re:I realise this is "News for Nerds"... (1)

davidbrit2 (775091) | more than 3 years ago | (#35974474)

Yeah. It's kind of like how you can bolt just about any part onto just about any car if you think it through enough.

Re:I realise this is "News for Nerds"... (1)

steelfood (895457) | more than 3 years ago | (#35975192)

You mean they shut down the tubes and shit got clogged?

Re:I realise this is "News for Nerds"... (0)

Anonymous Coward | more than 3 years ago | (#35973610)

Yeah, take the highway, you aren't welcome in this town.

Re:I realise this is "News for Nerds"... (0)

Anonymous Coward | more than 3 years ago | (#35973612)

Cars on a 2 lane highway were directed onto the sidewalk.

Re:I realise this is "News for Nerds"... (4, Funny)

kingsqueak (18917) | more than 3 years ago | (#35973792)

Instead of the usual commuter rail line, we've had to do some maintenance causing us to provide a single Yugo as transport for the NY morning rush.

After packing 25 angry commuters into the Yugo we left a few hundred thousand stranded on the platform, ping-ponging between the parking lot and home, completely confused how they would get to work.

In addition to that, unfortunately the Yugo couldn't handle the added weight of the passengers and the leaf springs shattered all over the ground. So the 25 passengers we initially planned for were left trapped, to die, inside of the disabled Yugo. They all starved in the days it took us to realize the Yugo never left the station parking lot.

We are sorry for any inconvenience this may have caused and have upgraded to AAA Gold status to prevent any further future disruptions. This will ensure that at least 25 people will actually reach their destinations should this occur again, though they may need to ride on a flat-bed to get there.

Re:I realise this is "News for Nerds"... (1)

kriston (7886) | more than 3 years ago | (#35973882)

That analogy is just like Irwin Allen's movie "The Towering Inferno."

Re:I realise this is "News for Nerds"... (1)

Xserv (909355) | more than 3 years ago | (#35974072)

I think I just peed in my pants. My cubemates (whom we lovingly refer to as cellmates) poked their heads around the walls wondering what I thought was so funny. Bravo!

Voltron (1)

Kamiza Ikioi (893310) | more than 3 years ago | (#35974136)

But can I get an understandable car analogy here?

15 cars tried to transform into Voltron [wikia.com] but instead turned into Snarf [cracked.com] .

Re:I realise this is "News for Nerds"... (2)

yakovlev (210738) | more than 3 years ago | (#35974194)

Traffic was diverted from a major highway onto a 2-lane road. This caused the buses to run late.

Because the buses were running late, everyone decided to take their own car to work. This further increased the amount of traffic on the tiny road.

The cops figured out that everyone was on the wrong road, and diverted traffic onto another freeway. However, by this point everyone was already taking their cars, so diverting to the other freeway didn't completely fix the problem.

All this traffic indirectly caused minor traffic problems in neighboring cities, because all the traffic cops in those cities were covering the traffic nightmare in this city.

Eventually, the cops got everyone to stop getting on the roads, and piecemeal managed to get people where they were going, which eventually cleaned things up.

Re:I realise this is "News for Nerds"... (1)

atisss (1661313) | more than 3 years ago | (#35974618)

mod parent up

I crap on all of you (-1, Troll)

slasher234 (2080290) | more than 3 years ago | (#35973584)

And you suck goatse asshole

That doesnt explain anything (1)

Anonymous Coward | more than 3 years ago | (#35973592)

That only explains the loss in availability of the AWS service. It in no way explains why the data was destroyed and unrecoverable.

Re:That doesnt explain anything (1)

ruiner13 (527499) | more than 3 years ago | (#35973636)

It could be that in the process of isolating the problem, they rebooted servers that (due to network problems) may not have been able to fully replicate their local changes.

Re:That doesnt explain anything (0)

mysidia (191772) | more than 3 years ago | (#35973818)

It could be that in the process of isolating the problem, they rebooted servers that (due to network problems) may not have been able to fully replicate their local changes.

In other words.... someone executed an improper "problem isolation" procedure........

Re:That doesnt explain anything (2)

Darth_brooks (180756) | more than 3 years ago | (#35973904)

"At 12:30 PM PDT on April 24, we had finished the volumes that we could recover in this way and had recovered all but 1.04% of the affected volumes. At this point, the team began forensics on the remaining volumes which had suffered machine failure and for which we had not been able to take a snapshot. At 3:00 PM PDT, the team began restoring these. Ultimately, 0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state."

AOL's 19-hour outage (0)

kriston (7886) | more than 3 years ago | (#35973646)

Whom else is reminded of AOL's 19-hour outage in 1996? Routers misconfigured to send data to the wrong place, cascading into failure?

Re:AOL's 19-hour outage (-1)

Anonymous Coward | more than 3 years ago | (#35973736)

Whom else is reminded of AOL's 19-hour outage in 1996? Routers misconfigured to send data to the wrong place, cascading into failure?

When you try to use a word like "whom" because you think it makes you sound smart, and then you use it incorrectly, the effect is quite the opposite.

Re:AOL's 19-hour outage (-1)

Anonymous Coward | more than 3 years ago | (#35973752)

Whom else is reminded of AOL's 19-hour outage in 1996? Routers misconfigured to send data to the wrong place, cascading into failure?

When you try to use a word like "whom" because you think it makes you sound smart, and then you use it incorrectly, the effect is quite the opposite.

Whomsoever doest thou thinkest thou art? Thy pedantry is showing.

Re:AOL's 19-hour outage (0)

Anonymous Coward | more than 3 years ago | (#35974078)

Whom else is reminded of AOL's 19-hour outage in 1996? Routers misconfigured to send data to the wrong place, cascading into failure?

When you try to use a word like "whom" because you think it makes you sound smart, and then you use it incorrectly, the effect is quite the opposite.

Whomsoever doest thou thinkest thou art? Thy pedantry is showing.

Hey Kriston, is this really easier than admitting you fucked up?

Re:AOL's 19-hour outage (2)

Mister Fright (1559681) | more than 3 years ago | (#35973750)

No one. No one else remembers AOL.

Re:AOL's 19-hour outage (1)

datapharmer (1099455) | more than 3 years ago | (#35974836)

None that would admit it in public. ;-)

Isn't the point of a secondary network... (1)

gtvr (1702650) | more than 3 years ago | (#35973702)

... to be able to handle loads if the primary fails?

Re:Isn't the point of a secondary network... (1)

Anonymous Coward | more than 3 years ago | (#35973784)

I think the secondary network is used to deal with the little overage you get at peak times.

Like most of the time a T1 may be fine and can handle the bandwidth, but sometimes your backup ISDN comes up when the bandwidth is a little more than a T1 alone can hold.

Re:Isn't the point of a secondary network... (0)

Anonymous Coward | more than 3 years ago | (#35973788)

FTFA:

The secondary network, the replication network, is a lower capacity network used as a back-up network to allow EBS nodes to reliably communicate with other nodes in the EBS cluster and provide overflow capacity for data replication. This network is not designed to handle all traffic from the primary network but rather provide highly-reliable connectivity between EBS nodes inside of an EBS cluster.

Re:Isn't the point of a secondary network... (1)

ae1294 (1547521) | more than 3 years ago | (#35973796)

... to be able to handle loads if the primary fails?

No it's so marketing can make redundancy claims for 1/100 the cost of true redundancy.

Re:Isn't the point of a secondary network... (2, Informative)

mysidia (191772) | more than 3 years ago | (#35973844)

... to be able to handle loads if the primary fails?

No. That's the point of the redundant elements and backup of the primary network.

The secondary network they routed traffic to was designed for a different purpose, and never meant to receive traffic from the primary network.

Re:Isn't the point of a secondary network... (1)

natet (158905) | more than 3 years ago | (#35975994)

... to be able to handle loads if the primary fails?

No. That's the point of the redundant elements and backup of the primary network.

The secondary network they routed traffic to was designed for a different purpose, and never meant to receive traffic from the primary network.

For example, management, monitoring, and logging traffic.

Amazon issues 10-day service credit (4, Interesting)

kriston (7886) | more than 3 years ago | (#35973806)

Dear AWS Customer,

Starting at 12:47AM PDT on April 21st, there was a service disruption (for a period of a few hours up to a few days) for Amazon EC2 and Amazon RDS that primarily involved a subset of the Amazon Elastic Block Store ("EBS") volumes in a single Availability Zone within our US East Region. You can read our detailed summary of the event here:
http://aws.amazon.com/message/65648 [amazon.com]

We've identified that you had an attached EBS volume or a running RDS database instance in the affected Availability Zone at the time of the disruption. Regardless of whether your resources and application were impacted, we are going to provide a 10 day credit (for the period 4/18-4/27) equal to 100% of your usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone. This credit will be automatically applied to your April bill, and you don't need to do anything to receive it.
You can see your service credit by logging into your AWS Account Activity page after you receive your upcoming billing statement.

Last, but certainly not least, we want to apologize. We know how critical the services we provide are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services.

Sincerely,
The Amazon Web Services Team

This message was produced and distributed by Amazon Web Services, LLC, 410 Terry Avenue
North, Seattle, Washington 98109-5210
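For concreteness, a hypothetical illustration of how that credit works out; the dollar figures below are invented, not Amazon's.

    # Hypothetical numbers only: the credit is 100% of your EBS/EC2/RDS usage
    # in the affected Availability Zone for the 4/18-4/27 window.
    usage_in_affected_az = {
        "EC2 instances": 86.40,   # invented 10-day usage, in dollars
        "EBS volumes": 12.75,
        "RDS instances": 54.00,
    }
    credit = sum(usage_in_affected_az.values())
    print(f"credit automatically applied to the April bill: ${credit:.2f}")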

Re:Amazon issues 10-day service credit (0)

Anonymous Coward | more than 3 years ago | (#35974120)

Hmm. But if you were suffering an OUTAGE -- wouldn't the 100% usage be almost zero?

Re:Amazon issues 10-day service credit (1)

StikyPad (445176) | more than 3 years ago | (#35975402)

Nice. You know what Sony offered me for disclosing all of my information?

They sent me an e-mail which pointed out that, by law, I can get one free credit report per year, and they encouraged me to take advantage of that to look for any fraudulent activity.

Re:Amazon issues 10-day service credit (1)

kriston (7886) | more than 3 years ago | (#35976950)

I don't know what bothers me more: the outage itself or the alternative codes Amazon used for punctuation in that email that made my post look messed up only after I posted it.

Amazon EC2 outage analysis (1)

doperative (1958782) | more than 3 years ago | (#35973812)

"Last Thursday’s Amazon EC2 outage was the worst in cloud computing’s history .. I will try to summarize what happened, what worked and didn’t work, and what to learn from it. I’ll do my best to add signal to all the noise out there" link [rightscale.com]

The Cloud (1)

ae1294 (1547521) | more than 3 years ago | (#35973814)

So we now know that the promise of the cloud is a lie. How long before we get a new buzzword for turning over all of our data to the new Internet Barons because they know what is best?

Re:The Cloud (1)

cryfreedomlove (929828) | more than 3 years ago | (#35973996)

So we now know that the promise of the cloud is a lie. How long before we get a new buzzword for turning over all of our data to the new Internet Barons because they know what is best?

How does this event lead to the conclusion that 'the promise of the cloud is a lie'? Be specific.

Re:The Cloud (1)

sjames (1099) | more than 3 years ago | (#35976876)

The marketing promise, that is. We know it because according to the hype, the cloud means you are NEVER down and your data is ALWAYS safe.

I'm sure there will be a few "no true cloud" marketing fallacies running about though.

Re:The Cloud (1)

TrevorDoom (228025) | more than 3 years ago | (#35975264)

"The Cloud" has always been nothing more than marketing buzz. All "The Cloud" is are physical servers running a hypervisor and running your machine instances as VMs.
There's still people, switches, routers, firewalls, servers, and storage that are used to build "The Cloud."

This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.

Re:The Cloud (1)

ae1294 (1547521) | more than 3 years ago | (#35975762)

This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.

But profitable....

Re:The Cloud (1)

tlhIngan (30335) | more than 3 years ago | (#35976192)

"The Cloud" has always been nothing more than marketing buzz. All "The Cloud" is are physical servers running a hypervisor and running your machine instances as VMs.
There's still people, switches, routers, firewalls, servers, and storage that are used to build "The Cloud."

This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.

Actually, that's the whole point of doing things "in the cloud" versus just using a webhost or a colo facility. At the latter you provision your machine for how you think it'll be used and manage it as you would any other piece of IT equipment locally. If it dies, you go down and replace it. If you get hit by a link from /., your server gets slow. If it's the holidays, your server crashes, etc. You could overprovision your services by buying extra servers to handle the overflow, but then you're paying lots of extra money to handle the few instances when you get heavy traffic. Since you're leasing hardware and/or physical space, you're paying for that all the time. Depending on their size, provisioning extra services can take hours or days.

Whereas, had it been "in the cloud" at Amazon or something, if your website gets slow because someone discovered your product and posted it on /. or did the social networking thing, you can spin up a new instance immediately (for just a few more dollars) and rake in the cash. At the end of the week when traffic drops off, you destroy the new instance and pay just for the computation you used. Should the datacenter suffer a power or network outage, unless you have a spare in another datacenter, you're hosed.

And yes, the cloud should make things impervious to hardware failure, power outages, network outages, etc. After all, you've decoupled the servers from the datacenter itself. If a physical EC2 box dies, everything should be moved automagically - since you're only dealing with a VM not attached to any particular hardware, it shouldn't matter that your server now isn't the same one as yesterday. Network and power outages the same - your server should be floating amongst the datacenters that your provider has, since you're paying for CPU cycles, not for the server or physical space.

Alas, what happened here was Amazon wasn't decoupled enough.
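As a sketch of the "spin up an instance when you get slashdotted" elasticity described above, here is what scaling out and back in looks like with the modern boto3 SDK (which postdates this thread); the AMI ID and instance type are placeholders, not real resources.

    # Sketch only: scale out for a traffic spike, scale back in afterwards.
    # boto3 postdates this 2011 thread; all IDs below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Scale out when the Slashdot traffic hits.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical web-server AMI
        InstanceType="m1.small",           # placeholder instance type
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Scale back in once traffic drops off; billing for the instance stops.
    ec2.terminate_instances(InstanceIds=[instance_id])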

At least they admit it (4, Insightful)

jesseck (942036) | more than 3 years ago | (#35973838)

I commend Amazon for providing us with this information. Yes, bad things happened, and data is gone forever. Amazon knows what happened and why, and I'm sure they will implement controls to prevent this again. I doubt we'll hear as much from Sony, though.

Re:At least they admit it (4, Insightful)

david.emery (127135) | more than 3 years ago | (#35974030)

We all benefit from these kinds of disclosures; I remember Google posting post-mortem analyses of some of their failures. Even Microsoft provided information on their Sidekick meltdown. This does seem to be the 'typical' melange of human error and cascading consequences.

Someone once said, "You learn much more from failure than you do from success." If nothing else, it's the thesis of the classic Petroski book, "To Engineer is Human: The Role of Failure in Successful Design" http://www.amazon.com/Engineer-Human-Failure-Successful-Design/dp/0679734163 [amazon.com] (If you haven't read this, you should!!)

And I'm also reminded of a core principle from safety critical system design, that you cannot provide 100% safety. The best you can do is a combination of probabilistic analysis against known hazards. As a Boeing 777 safety engineer told me, "9 9's of safety, i.e. chance of failure 1/10 ^-9, applied over the expected flying hours of the 777 fleet, still means a 50-50 chance of an aircraft falling out of the sky." That kind of reasoning also applies to the current Japanese nuke plant failure...
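A quick sanity check of that "50-50 over the fleet" remark, assuming a catastrophic-failure rate of 10^-9 per flying hour (presumably what "1/10 ^-9" was meant to say) and an illustrative fleet exposure of about 7x10^8 flying hours; both figures are assumptions, not Boeing data.

    # Both inputs are assumptions for illustration, not Boeing figures.
    import math

    failure_rate_per_hour = 1e-9   # "nine nines": one catastrophic failure per 1e9 hours
    fleet_flying_hours = 7e8       # assumed cumulative exposure of the whole 777 fleet

    # Probability of at least one loss over the fleet's life (Poisson approximation).
    p_at_least_one = 1 - math.exp(-failure_rate_per_hour * fleet_flying_hours)
    print(f"P(at least one hull loss) ~= {p_at_least_one:.2f}")   # ~0.50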

Re:At least they admit it (1)

ccady (569355) | more than 3 years ago | (#35975186)

No, that kind of reasoning does not apply to nuke plant failures. There are not millions [straightdope.com] of nuke plants running each day. There are only 442 [euronuclear.org] nuke plants. If we cannot secure 442 plants from having disasters, then we need to do something else that does not cause disasters.

Re:At least they admit it (1)

david.emery (127135) | more than 3 years ago | (#35975292)

So how intense an earthquake, at what distance from the plant, and how high a tsunami should we plan for next time???

Re:At least they admit it (1)

david.emery (127135) | more than 3 years ago | (#35975306)

p.s. Wikipedia says there are 923 B777's out there, about 2x the number of nuke plants. http://en.wikipedia.org/wiki/Boeing_777 [wikipedia.org]

1% meltdown rate... (0)

Anonymous Coward | more than 3 years ago | (#35975842)

apparently we can't secure them from disasters.

5 reactors of the 442 have melted down. That is over 1% catastrophic failure.

Re:At least they admit it (1, Informative)

LWATCDR (28044) | more than 3 years ago | (#35976318)

You have not taken a statistics course, have you? You can have one airplane and have it fall out of the sky, or you can have 1,000,000 and never have one crash, and both systems could have 9 9's of safety. This is the risk of failure; it isn't destiny.
So to combat the FUD.
1. So far the death toll from the event is 18000. Death toll from radiation so far 0.
2. The nuclear plant didn't cause the disaster the earthquake and tsunami that followed it did.
3. People died in cars, buildings, on the street, and in a dam that also collapsed.
4. A lot of the lives were lost because of a failure of a sea wall.

So by your logic we really need to replace cars, buildings, streets, dams, and sea walls first since they all have caused so many deaths. Might I suggest a cave? Oh and no fire because that is also too risky. And keep away from those sharp stones as well.

Re:At least they admit it (0)

Anonymous Coward | more than 3 years ago | (#35975190)

I've worked in or alongside IT operations for more than a decade, and this is far more complete a post-mortem than I've ever seen produced internally or by a vendor.

This level of disclosure is precisely what ends up building trust.

I'm glad to see that they're committing to make it easier for customers to understand how to leverage the capabilities of AWS which contribute to increased availability. I can't help but think that some of the companies who use AWS do so "because it's cheap", and not "because it allows better use of capital". It's easier to justify salaries for architects when you're talking about a $2M capex project than when someone whipped out a credit card, put a prototype in the cloud, and it became the production system.

Re:At least they admit it (1)

Draknor (745036) | more than 3 years ago | (#35975962)

As a Boeing 777 safety engineer told me, "9 9's of safety, i.e. chance of failure 1/10 ^-9, applied over the expected flying hours of the 777 fleet, still means a 50-50 chance of an aircraft falling out of the sky."

This doesn't even make any sense -- what am I missing? A 50-50 chance of falling out of the sky? I'm assuming that's hyperbole, but I'm not grasping the concept here.

For what it's worth, the wiki article (linked in another post) indicates the 777 has been involved in 7 "incidents", although the only fatality was a ground crew worker who suffered fatal burns during a refueling fire.

Re:At least they admit it (0)

Anonymous Coward | more than 3 years ago | (#35976150)

He means there's something like a 50-50 chance that the plane will fall out of the sky (i.e. suffer some sort of emergency) at least once in its expected total lifetime. Of course it will make tens of thousands of successful flights before that happens.

Re:At least they admit it (0)

Anonymous Coward | more than 3 years ago | (#35976690)

Shouldn't that be 1/10^9 or 1*10^-9? 1/10^-9 seems like a pretty big chance of failure...

Re:At least they admit it (1)

tompaulco (629533) | more than 3 years ago | (#35974602)

I doubt we'll hear as much from Sony, though.
I have an account on Playstation Network and they have already sent me a long and thorough e-mail explaining what happened and what the implications are. And since the problem is ongoing, that makes them MORE proactive than Amazon in getting the word out.
Not to mention that Playstation Network is free and any uptime over 0% ought to be considered a bonus. Whereas you are paying for a certain level of service with cloud computing.

Re:At least they admit it (2)

afex (693734) | more than 3 years ago | (#35974974)

This has gone mildly offtopic, but as a PSN user I just wanted to chime in and say the following...

I can't STAND when people say "it's free, so it's ok if it goes down." When I purchased a PS3, the PSN was a FEATURE that I considered when I bought it. As such, it's not really "free"; it's more like it was wrapped into the MSRP. By your logic, they should be able to take away the entire network for GOOD and everyone should be completely happy. Is this true? Heck, let's start pulling out other features that you got for "free" as well. I mean geez, I heard that no one uses OtherOS, let's just...pull..that...oh shit.

Re:At least they admit it (1)

StuartHankins (1020819) | more than 3 years ago | (#35975744)

+1 Insightful. The Playstation would never have acquired the market share it has without PSN. People would have bought something else. It's a major part of the promotion of the product.

Re:At least they admit it (2)

the eric conspiracy (20178) | more than 3 years ago | (#35975680)

The issue is not uptime. It is the loss of sensitive data. If Sony is holding personal data they have an obligation to protect that data.

Re:At least they admit it (0)

Anonymous Coward | more than 3 years ago | (#35975970)

I commend Amazon for providing us with this information. Yes, bad things happened, and data is gone forever. Amazon knows what happened and why, and I'm sure they will implement controls to prevent this again. I doubt we'll hear as much from Sony, though.

While I support Amazon's decision to disclose - I'm not sure it was the best thing for survivability as a provider. I think its competitors are going to take information from this failure and scare less-technically-endowed decision makers at clients into using a less-capable provider. In my opinion - transparency and disclosure work very differently in the technical world vs. the business world.

Re:At least they admit it (0)

Anonymous Coward | more than 3 years ago | (#35976800)

I commend Amazon for providing us with this information. Yes, bad things happened, and data is gone forever. Amazon knows what happened and why, and I'm sure they will implement controls to prevent this again. I doubt we'll hear as much from Sony, though.

FTFA... They detail the controls they are implementing to prevent this from happening again.

Can we get this in non-Amazon speak (3, Interesting)

gad_zuki! (70830) | more than 3 years ago | (#35973858)

What is an EBS? Is it really just a Xen or VMWare disk image? Which data center corresponds with each availability zone? What are they using for storage? iSCSI targets on a SAN?

Re:Can we get this in non-Amazon speak (1)

pdbaby (609052) | more than 3 years ago | (#35974298)

While that would be nice to know I don't think it's relevant to a postmortem: they described the architectural elements which encountered the failures.

FYI, though, based on what they've said today and in the past: it seems that they are using regular servers with disks rather than a SAN, and I believe they use GNBD to connect the EBS server disk images to EC2 instances (rather than iSCSI).

Re:Can we get this in non-Amazon speak (1)

Synn (6288) | more than 3 years ago | (#35974882)

Amazon EC2 is Xen. The back end storage for that is ephemeral in that it goes away when you shut it down. So they introduced EBS, which is persistent storage. So you could have an EC2 server and mount EBS volumes on it, and those EBS volumes will exist even when the EC2 (Xen) server goes away. You can even mount them on different EC2 (Xen) servers (though not at the same time). Also, today you can have the EC2 server itself run on top of EBS if you want the data on it to stay around after a reboot.

Then there's also S3 which, while a separate product, ties into all the above as a "slower" storage medium where you keep your Xen images and (if you're smart) permanent backups. You can also use it for storage for your applications; there's an API to put and get data from it. S3 has an insane level of reliability, supports versioning, and can even be set up so that you can't delete from it without a hardware fob key device. So you could put customer data there and only the CEO of the corporation could delete it.

EBS is basically like iSCSI, but far more complex. There's a lot of proprietary stuff they're doing with it.

So you have:

EC2(Xen) servers which can be thought of as disposable.
S3 which is an absurdly reliable storage back end, but you can't really use it as a filesystem. It's not high IO.
EBS which is your high IO filesystem that's persistent and portable. It's something you'd use for a database filesystem, and their database product (RDS) uses this, since it's basically just MySQL on Xen on top of EBS.

In this case they had an issue with EBS in 1 of their north Virginia data centers. This affected 1 of 3 east coast availability zones but you can't really say which one since the zone names are randomized for each customer(to prevent everyone from using the same zone name).

Outside of the 3 east coast availability zones they also have zones in other regions that weren't affected.
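A minimal sketch of the EC2/EBS/S3 split described above, using the modern boto3 SDK (which postdates this thread). The instance ID, availability zone, bucket and key names are all placeholders.

    # Illustrative only: persistent EBS block storage attached to a disposable
    # EC2 instance, plus S3 put/get through the API. All IDs and names are
    # placeholders; boto3 postdates this thread.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    s3 = boto3.client("s3")

    # EBS: block storage that outlives the instance it is attached to.
    vol = ec2.create_volume(AvailabilityZone="us-east-1a", Size=10)   # 10 GiB
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0",   # hypothetical instance
                      Device="/dev/sdf")

    # S3: slower but very durable object storage -- an API, not a mounted filesystem.
    s3.put_object(Bucket="example-backups", Key="db/dump.sql.gz",
                  Body=b"...backup bytes...")
    backup = s3.get_object(Bucket="example-backups", Key="db/dump.sql.gz")["Body"].read()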

Re:Can we get this in non-Amazon speak (1)

StuartHankins (1020819) | more than 3 years ago | (#35975810)

Forgive my ignorance, but why not KVM instead of Xen? And assuming you have a complicated setup with bunches of scripts, mounts, etc, how do you image the entire thing? We have to schedule off-hour downtime to do a snapshot (everything except data) for our internal servers since a new install / config from scratch would take too long for recovery -- but that involves a lot of control that you may or may not have in a "cloud" situation.

Re:Can we get this in non-Amazon speak (1)

kriston (7886) | more than 3 years ago | (#35977082)

They use more than just Xen but they don't really publicize it. With paravirtualization they can use anything they want, but Xen seems to be the most prevalent. Some of my instances say "xen" and others say "paravirtual." Just because the kernel says "xen" or "paravirtual" does not necessarily mean that the hypervisor is Xen or something else.

Also, speaking towards migrating instances between "availability zones," I found out that I cannot use Windows Server EBS boot volumes on anything but the instance I created them with. You can do this with most of the Unix instances, though. I found this little oddity a bit maddening when I wanted to move a Windows Server instance from a t1.micro to t1.small.
No-can-do with boot volumes. Well, not officially, anyway.

The other thing I don't get is why I should be able to use my Amazon.com shopping account to log into AWS. It seems, well, silly, even with the two-factor authentication dongle, that the same account used to buy Kindle books and Fisher-Price toys is paying for my AWS usage.
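For what it's worth, an instance can ask the (instance-local) metadata service which instance type and availability zone it actually landed in -- handy given that zone names are per-account labels. The endpoint below is the real metadata address, but it only resolves from inside EC2, and newer instances may require the IMDSv2 token handshake, which this sketch omits.

    # Query the EC2 instance metadata service from inside an instance.
    # Only reachable from within EC2; newer instances may require IMDSv2
    # session tokens, omitted here for brevity.
    import urllib.request

    BASE = "http://169.254.169.254/latest/meta-data/"

    def meta(path):
        with urllib.request.urlopen(BASE + path, timeout=2) as resp:
            return resp.read().decode()

    print("instance type:     ", meta("instance-type"))
    print("availability zone: ", meta("placement/availability-zone"))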

Re:Can we get this in non-Amazon speak (1)

Synn (6288) | more than 2 years ago | (#35977686)

They used Xen because that's what was mature at the time. In many ways Xen is still more mature than KVM, though it won't be that way for long.

Amazon supplies you a bunch of tools for dealing with the images. They call the "stored" images AMIs. There's a huge list of public AMIs you can choose from, and anyone can create their own AMI off pretty much any Linux distribution.

You can snapshot a running instance using the ec2-ami-tools, which are installed in your running instances. Using those you can easily create your own AMIs from a running EC2 instance. So you'd create a base EC2 instance off a public AMI, customize it, and then "snapshot" that as a custom AMI you can re-use later.
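For EBS-backed instances there is also an API-level analogue of that "snapshot a customized instance into a reusable AMI" workflow. A hedged sketch with boto3 (which postdates the thread); the instance ID is a placeholder.

    # Sketch: register a customized, EBS-backed instance as a new AMI.
    # boto3 postdates the thread; the instance ID is a placeholder.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.create_image(
        InstanceId="i-0123456789abcdef0",   # hypothetical, already-customized instance
        Name="my-base-webserver-v1",
        Description="Custom AMI built from a tweaked public image",
    )
    print("new AMI:", resp["ImageId"])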

Re:Can we get this in non-Amazon speak (1)

StuartHankins (1020819) | more than 2 years ago | (#35977808)

Many thanks.

Re:Can we get this in non-Amazon speak (1)

Slashdot Parent (995749) | more than 2 years ago | (#35978310)

EBS which is your high IO filesystem

Damnit, now you owe me a new monitor.

Data loss? (1)

Alex Belits (437) | more than 3 years ago | (#35974238)

And HOW THE HELL does such a procedure cause data loss?!

Are those geniuses using the service transfer procedures that do not perform clean transaction handling and instead just send stuff to be copied expecting that it will sync soon enough?

Re:Data loss? (0)

Anonymous Coward | more than 3 years ago | (#35974716)

I think it has to do with the state of user's databases, which could be in-memory only and mid-processing. IIUC, storage is not local to individual machines and requires a consistent data link to remain connected to their filesystems. Virtualization is a complicating factor here, too.

Here is one possible example, off the top of my head (or OOMA). Data could be accepted from a user; its state changed in memory; but then it is not able to be written back out to disk due to this failure mode.

Just a WAG, I could be wrong.

Re:Data loss? (1)

Alex Belits (437) | more than 3 years ago | (#35976160)

This should never be possible if the transaction model is properly implemented -- data in memory would have to be stored and confirmed to be stored, or the transaction should be cleanly reverted before anything is moved.
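A minimal sketch of that "store it and confirm it, or revert cleanly" discipline -- purely illustrative, not a description of how EBS actually journals writes.

    # Minimal write-ahead sketch: a write is only acknowledged after it has
    # been forced to stable storage; on recovery the journal is replayed.
    # Purely illustrative -- not EBS internals.
    import os

    def durable_write(journal_path: str, record: bytes) -> bool:
        with open(journal_path, "ab") as f:
            f.write(record + b"\n")
            f.flush()
            os.fsync(f.fileno())   # don't acknowledge until it's on stable storage
        return True                # only now may the caller be told the write succeeded

    def recover(journal_path: str) -> list:
        # Replay whatever reached stable storage; anything never acknowledged
        # was never promised to the client and can be cleanly discarded.
        if not os.path.exists(journal_path):
            return []
        with open(journal_path, "rb") as f:
            return [line.rstrip(b"\n") for line in f if line.strip()]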

Other outage (1)

nns6561 (559085) | more than 3 years ago | (#35974244)

I'm trying to remember what the other outage was recently where the web service failed because they forgot to implement exponential backoff. Anyone remember?
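For reference, the kind of exponential backoff (with jitter) being referred to; the retried operation and the limits are placeholders.

    # Generic retry loop with exponential backoff and full jitter, so that
    # clients back off instead of hammering a recovering service in lock-step.
    # The operation and limits are placeholders.
    import random
    import time

    def call_with_backoff(op, max_retries=6, base_delay=0.5, max_delay=30.0):
        for attempt in range(max_retries):
            try:
                return op()
            except Exception:
                if attempt == max_retries - 1:
                    raise
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))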

I dont buy it (1)

OverlordQ (264228) | more than 3 years ago | (#35974636)

During the whole issue they never posted a cause, and it took them forever to even say "still investigating". Even if they have a bare-bones monitoring system up, it should have been readily apparent that traffic was flowing over the wrong network.

[..] because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving.

So they're basically saying that if the primary network has issues there's not really a point to the backup, because the backup network will make things explode just as much as having no backup.

pure and utter BS (1)

dieth (951868) | more than 3 years ago | (#35974730)

If this was the cause, why wasn't the change corrected immediately and the traffic routed to where it was originally intended? 3 days of downtime just doesn't happen when you fuck up a line in a config. If this were actually the case, the downtime would have been minimal.

Re:pure and utter BS (2)

bruce_the_loon (856617) | more than 3 years ago | (#35975484)

Go and read the entire notice, not just the pathetic snippet a bad submitter used. Makes more sense.

Also, this is a storage network, not an access network. Effectively it's like pulling the SAS cable out of the RAID card while the machines are running.

Re:pure and utter BS (1)

DigiShaman (671371) | more than 3 years ago | (#35977288)

OUCH! I wonder how many VHD files got corrupted that way.

Re:pure and utter BS (1)

Zondar (32904) | more than 3 years ago | (#35975546)

This was a cascade failure that affected multiple systems on multiple layers, with ramping race conditions that worsened over time. The engineer didn't hit the "Enter" key and suddenly the little green light turned red to tell him 1/3 of the grid was down.

Tsunami waves are not higher than 19 feet, (1)

tlk nnr (449342) | more than 3 years ago | (#35974862)

and the primary and secondary network will not fail simultaneously for a large number of nodes.

It's nice to see that everyone has the same problem:
There is no approach to identify wrong assumptions.

But what's the conclusion?
Should we stay away from huge systems, because the damage due to a wrong assumption in a huge system is huge?

Where is their testing lab? (0)

Anonymous Coward | more than 3 years ago | (#35974918)

Someone (or multiples thereof) at the top of the Amazon infrastructure management heap should be fired and/or executed! Making such a network change to a live system that can impact so many users and applications without first testing it in a fully functional test environment that reasonably mirrors the real environment, is reckless, incompetent, and unconscionable. If they had done so, the error in configuration would have (should have) been caught before it was applied to the live system. And I don't think they can just blame it on "stumble fingers" either!

Re:Where is their testing lab? (2)

TrevorDoom (228025) | more than 3 years ago | (#35975322)

Have you ever worked in a real environment?

There is ALWAYS a difference between test and production. No matter how many test cases and iterations of changes that you go through, there is always a non-zero percent chance that the change in production will behave differently.
This is why most companies require fall-back procedures for any production change in addition to testing.
It sounds like it may have taken them longer than some might be comfortable with to reach the point where they did roll back changes... but I'm sure that this change tested as okay in all of their test cases.

1965 (1)

Triv (181010) | more than 3 years ago | (#35975132)

Huh. Sounds like a 21st century version of the routing failure that caused the 1965 Northeast blackout, just with data instead of electricity.

http://en.wikipedia.org/wiki/Northeast_Blackout_of_1965 [wikipedia.org]

Won't touch the cloud now. (0)

Anonymous Coward | more than 3 years ago | (#35975146)

I was one of the businesses who suffered from this. I was in the process of migrating my mail server, which hosts multiple virtual domains for clients, to an EC2 instance. I had provided Amazon with the information needed to remove my elastic IP from specific RBLs and was moving forward gracefully with my configuration. I had a couple of clients moved to the new server when I first heard about the data loss. I was still able to access my instance last night, so I thought, "ok, I must not have been affected." I woke up this morning to the email:

Dear AWS Customer,

Starting at 12:47AM PDT on April 21st, there was a service disruption (for a period of a few hours up to a few days) for Amazon EC2 and Amazon RDS ..... and on and on ...

I logged into my Management Console to see all my work and all my customer data gone. You can imagine how happy my clients were when I told them of the news. The cloud can kiss my ass.

Re:Won't touch the cloud now. (1)

teknopurge (199509) | more than 3 years ago | (#35975198)

Cloud computing is a marketing architecture, not a technical architecture.


Cloud computing is a form of shared hosting, just with more encapsulation; clouds fall over the same way a server can fall over. It's hard to blame "The Cloud" when the reality is that the people who were suckered in by obtuse, non-specific marketing are the ones at fault. The argument can even be made that clouds are worse because instead of many discrete isolated servers you start sharing more single points of failure, which leads to IO bottlenecks, etc.

Data Loss (0)

Anonymous Coward | more than 3 years ago | (#35975196)

My read of the explanation is that 0.07% of physical machines happened to die during the incident (1 out of 1400) due to hardware failure. Since they couldn't re-mirror, there was a single copy of the data on those machines and data was lost. I'm actually pretty impressed with their response and design, based on the explanation.

Circadian rhythms (1)

sleep-doc (905583) | more than 3 years ago | (#35975852)

In thinking about why this happened, don't lose sight of the fact that the time they chose to make the configuration change was 00:47 local. Human performance on 3rd shift isn't what it is on day shift, and I would think it very likely the people managing this change had been up and working for a significant number of hours by that time. Would they have noticed something or done something differently at 10:00 local? Certainly making an upgrade at a time of lowest use sounds right, but it's not always as simple as that, and you have to respect the realities of circadian rhythms or suffer the consequences. If this were an air crash, we would now be interviewing survivors, coworkers and family to identify when each of the people involved in the event and the decisions made had slept during the days preceding it.