Microsoft's Azure Cloud Suffers Major Downtime

Unknown Lamer posted more than 2 years ago | from the higher-availability-through-cloud-computing dept.

Cloud 210

New submitter dcraid writes with a quote from El Reg: "Microsoft's cloudy platform, Windows Azure, is experiencing a major outage: at the time of writing, its service management system had been down for about seven hours worldwide. A customer described the problem to The Register as an 'admin nightmare' and said they couldn't understand how such an important system could go down. 'This should never happen,' said our source. 'The system should be redundant and outages should be confined to some data centres only.'" The Azure service dashboard has regular updates on the situation. According to their update feed the situation should have been resolved a few hours ago but has instead gotten worse: "We continue to work through the issues that are blocking the restoration of service management for some customers in North Central US, South Central US and North Europe sub-regions. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers." To be fair, other cloud providers have had similar issues before.

210 comments

But Remember - (5, Insightful)

Ralph Spoilsport (673134) | more than 2 years ago | (#39197901)

Your data's safe in the Cloud.

Until it isn't.

Re:But Remember - (5, Funny)

Anonymous Coward | more than 2 years ago | (#39197919)

It's very safe though - just so safe no one can get access to it! :)

Re:But Remember - (1)

tnk1 (899206) | more than 2 years ago | (#39197921)

Oh their data is safe. They just can't get to it or use it in any way. :)

Re:But Remember - (2)

geekoid (135745) | more than 2 years ago | (#39198505)

Yes, they can. It's service management that's down, not data.
Users can still access data.

Re:But Remember - (4, Funny)

tripleevenfall (1990004) | more than 2 years ago | (#39197925)

Nonsense, Microsoft is the name you can trust for security.

Re:But Remember - (2)

masternerdguy (2468142) | more than 2 years ago | (#39198027)

ActiveX best X

Re:But Remember - (5, Insightful)

masternerdguy (2468142) | more than 2 years ago | (#39197985)

Also remember the cloud is just the 21st century spin of the dummy terminal-mainframe model.

Re:But Remember - (4, Insightful)

Barsteward (969998) | more than 2 years ago | (#39198183)

Stop talking sense, it's no use here on /.

Re:But Remember - (0)

geekoid (135745) | more than 2 years ago | (#39198409)

Nope.
But you keep your overly simplistic view of the world... to yourself.

Re:But Remember - (4, Insightful)

icebraining (1313345) | more than 2 years ago | (#39198573)

Except those dumb terminals were, well, dumb, while nowadays the "terminals" are essentially the same as the "mainframe" but slower. So you can have hybrid configurations where a dedicated machine handles the base load and spins up remote resources on demand to handle peaks. If those resources are unavailable, the dedicated machine can still do the job, just with some performance degradation.

A good example would be a script on your laptop that started an EC2 instance running distcc to reduce your compilation time from hours to minutes. If the instance can't be loaded, you could still compile, it just takes more time.
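The degrade-gracefully idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: `run_with_fallback`, `remote` and `local` are hypothetical names, and the two callables stand in for "compile via a remote distcc instance" and "compile on the laptop"; nothing here is an actual EC2 or distcc API.

```python
from typing import Callable


def run_with_fallback(remote: Callable[[], int], local: Callable[[], int]) -> str:
    """Try the remote (burst) path first; fall back to the local path.

    Both callables are hypothetical stand-ins that return a process-style
    exit code (0 means success). Returns "remote" or "local" to say which
    path finished the job.
    """
    try:
        if remote() == 0:
            return "remote"
    except OSError:
        # Covers ConnectionError and TimeoutError: the remote capacity is
        # unreachable (network down, provider outage, ...).
        pass
    # Degraded but working: do the job on the dedicated machine.
    if local() != 0:
        raise RuntimeError("local path failed too")
    return "local"
```

The point of the pattern is that the remote resource is an accelerator rather than a dependency: losing it costs time, not availability.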

Re:But Remember - (-1)

Anonymous Coward | more than 2 years ago | (#39198699)

However many of our devices such as the Android phone I'm using right now become fancy bricks without being able to contact the mothership.

Re:But Remember - (1)

dave420 (699308) | more than 2 years ago | (#39198825)

Then that's a problem with your phone, and not the cloud. My Android works just fine when not connected to the internet. It's got a 1.2GHz dual-core processor, so it's not exactly dumb.

Re:But Remember - (3, Insightful)

Surt (22457) | more than 2 years ago | (#39198815)

If only even a single cloud service were actually built this way, it'd be great!

Re:But Remember - (4, Insightful)

dave420 (699308) | more than 2 years ago | (#39198769)

Except this time you can add as many mainframes as you want, dynamically. And access them over the internet. And serve content to millions of people over said internet. That wasn't possible with this clichéd "mainframes!!!!!1" nonsense. Yes, you are using a remote computer. That's the only similarity. The current terminals are far from dumb, and the server being connected to is vastly different to the mainframes of old.

Re:But Remember - (2)

V!NCENT (1105021) | more than 2 years ago | (#39198149)

Hey! After rain comes sunshine. Now they'll just have to wait for cloud formations again...

Re:But Remember - (3, Interesting)

poetmatt (793785) | more than 2 years ago | (#39198261)

When you rely on a 3rd party for cloud storage and that 3rd party has a basically nonexistent SLA for an under 30 day outage, it becomes your own fault for making a horrible business decision.

When you take a 3rd party cloud storage solution and implement it yourself for your enterprise, guess what? It works. And if there are issues, you know who's to blame.
https://spideroak.com/diy/ [spideroak.com] - this is but one example of many.

Re:But Remember - (0)

Anonymous Coward | more than 2 years ago | (#39198933)

Wow that is a terrible SLA:

This is intended as a cost effective long term bulk data archival service, so the SLA is geared with that sort of use in mind.

DIY includes a guarantee of 99.7 % uptime. This allows us to be offline for a couple hours each month.

Additionally, we don't consider SpiderOak DIY to be offline at all when it is merely read-only for less than 2 hours.

So they can have virtually unlimited "read only" downtime as long as they turn write back on every

Eggs? (4, Insightful)

OzPeter (195038) | more than 2 years ago | (#39197917)

Basket?

Or how about "Never outsource your core functionality"?

Re:Eggs? (1)

alphatel (1450715) | more than 2 years ago | (#39198179)

Never outsource your core functionality

Or more specifically, don't cloud your reasons for using it. Know what you are getting before you go there.

Re:Eggs? (4, Insightful)

Sir_Sri (199544) | more than 2 years ago | (#39198423)

Ah, so there's the question. How much would it cost for you to run a system with 'no' downtime? I'm at a university, some of our labs (not so much in comp sci but generally) have fairly specific requirements about say not losing power, because it would damage/destroy equipment or running experiments.

But IT is more than just power. In almost 4 years here every year we've had several days of downtime for our main undergraduate server (the one undergrads are supposed to use for various things, and that handles their logins and file storage), and several on the separate but arguably more important staff server, which is supposed to do the same thing, but that includes all of our grant applications.

Causes of our server outages (I'm not an IT guy, this is just what they've told us that I can remember):

Power failures. Yes, we have battery backups, but they're only good for so long, and since none of our equipment suffers permanent damage without power this isn't high priority.

Networking. We only have two redundant pipes. That, for home use for example, or for most businesses, is pretty good. One of our pipes goes to a host to the west, one to the east. I'm not specifically familiar with what failed that took our networking offline for 7 or 8 hours, but it affected both pipes.

Storage. Bad RAID controller on the main fileserver. This has a few cascading effects. If you don't realize it's garbling data, it ends up distributing that garble off to the backups or clones. When it crashes (which doesn't take that long after the controller starts getting messy) you may have several backups that need to be repaired. We can't do much to the file system while it's being repaired or rebuilt (which, afaik, you should be able to do on most professional grade setups, but for whatever reason our Linux guys can't get it to behave).

Added fun: when the system comes back up, if you tried to access your e-mail while the file system was garbled you probably still can't. And you get no error message about it. It just spits back nothing, as though you have no new mail. The system is 'up' but doesn't work, and you have to go into your directory and delete some files that most people have never heard of. It's not hard to do, but because you have no idea that there's a problem, the less technically inclined (or just ESL) people in a building full of computer scientists don't always fix it immediately. The net effect is that if the storage controller gets messed up, we're down for 3 or 4 days if not longer.

And that's just one university department. We have a relatively decent amount of money, and several full time staff for these things. But we probably can't match any cloud service's uptime, even with 7 or 8 hours of downtime regularly, not even close. It's not a trivial calculation; even a 50 or 60 employee outfit will probably have trouble matching Amazon or Azure uptime with a full time IT guy. There's probably a crossover point where you have enough employees to support big enterprise IT infrastructure and manpower, but only support it badly (there's not enough money for proper replication or whatever), and then eventually you get big enough that you just run everything in house anyway because there's definitely no cost advantage to hiring someone. For us, I think we have 5 or 6 IT staff; if we could toss 3 of them, + all of their equipment, you're looking at somewhere around 350, 400k/year to spend on a support contract. I'm guessing, but don't know, that you can get a cloud service for ~20 TB of reasonably reliable file and e-mail storage for less than 350k/year from these guys.

The big place I see people using cloud services right now (as a sort of flavour of the month) is as an augment for burst capacity needs. That's a whole other analysis.

Re:Eggs? (1)

phantomfive (622387) | more than 2 years ago | (#39198713)

Fortunately, there is a solution. You can have your own personal cloud [iomegacloud.com] . The best part, "access speed is just as fast as a local hard drive."

Re:Eggs? (1)

gweihir (88907) | more than 2 years ago | (#39198947)

Basket?

Or how about "Never outsource your core functionality"?

That would be a good engineering practice. A good business practice is to show your initiative by outsourcing to the cloud and then hope to be promoted away before anything bad happens. It really is time for managers to be liable for the mistakes they make in long-term decisions.

Re:Eggs? (1)

sjames (1099) | more than 2 years ago | (#39199021)

I can see clearly now that the rain is gone...

Gloat gloat gloat. (1)

SuricouRaven (1897204) | more than 2 years ago | (#39197933)

One of the worst things about the cloud is that it can go wrong when someone else screws up, so you get the blame for their mistakes.

Re:Gloat gloat gloat. (4, Insightful)

gral (697468) | more than 2 years ago | (#39197965)

The companies I deal with tend to say things like, we want to go with a company like this so we can get "Support". Which usually means, so we can blame them if something goes wrong.

Re:Gloat gloat gloat. (2)

WeatherServo9 (1393327) | more than 2 years ago | (#39198451)

This may depend on your specific company or situation. I get the impression our upper management likes the cloud so when things go wrong they can blame someone else (even if only partially). When we were doing things in house, it didn't matter who actually screwed up, ultimately management took the blame. With the cloud, they can now point fingers at someone else and hold up a contract stating this wouldn't happen. We're a company that's still just small enough that we are pretty much always understaffed and don't put enough money into hardware to have proper redundancies so things will go wrong eventually; since moving to the cloud, management can not only point blame elsewhere (it wasn't my people who caused the outage!) but can try (usually successfully) to get some discount or other compensation from the provider when downtime occurs.

In the end I've found the move to have pros and cons. The pros are that we simply never had the hardware infrastructure to provide the uptime requested of us (yet we were denied budget to build said infrastructure). In theory, our cloud providers can provide that uptime (or so our contract says). Development of our sites has been a nightmare though, the environment seems to lend itself to easily creating all sorts of spaghetti code (not sure yet if that is our relative unfamiliarity with the environment and/or lack of skill from the company we outsourced some of the work to, etc). Really I prefer keeping things in house for more control and flexibility, but I'm outnumbered with that opinion and that definitely isn't the way things are going (at least for us).

Cloud services not ready (1)

UnknowingFool (672806) | more than 2 years ago | (#39197957)

One of the selling points of using cloud services was that it would be more reliable than managing your own hardware/software. But to date, every single big player has suffered major downtime. I would be hesitant to believe the sales pitch.

Re:Cloud services not ready (3, Insightful)

characterZer0 (138196) | more than 2 years ago | (#39198025)

Cluster at the application level and have nodes at different providers. If your volume is too high for that, you are big enough to host your own stuff.
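A minimal sketch of what "nodes at different providers" can look like from the calling side, assuming each provider is wrapped in a callable. `first_healthy` and the provider names in the usage note are hypothetical, chosen for illustration; a real deployment would also need data replication between providers, which this does not address.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def first_healthy(providers: Iterable[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Call each provider in order and return the first successful result.

    `providers` is an ordered list of (name, callable) pairs. A provider
    counts as failed if its callable raises any exception.
    """
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # any failure means "try the next node"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Usage would be along the lines of `first_healthy([("provider_a", fetch_a), ("provider_b", fetch_b)])`, where `fetch_a` and `fetch_b` are whatever functions talk to each hosted node.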

Re:Cloud services not ready (3, Informative)

timeOday (582209) | more than 2 years ago | (#39198081)

I agree, I have nothing against the idea of cloud services, but they do need to work and reputations are based on events like this. After an outage this long, it takes a LOOONG time to earn your way back to five nines (which works out to 5.5 minutes of downtime per year).

Re:Cloud services not ready (3, Insightful)

vlm (69642) | more than 2 years ago | (#39198263)

After an outage this long, it takes a LOOONG time to earn your way back to five nines (which works out to 5.5 minutes of downtime per year).

Only 84 years per the article, and growing at a rate of a year every 5 minutes.

That's probably about how long it would take me to trust MS in an enterprise environment.
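For reference, the arithmetic behind these figures can be checked with a short sketch. It assumes a 365-day year and the roughly seven-hour outage quoted in the summary; the function names are made up for illustration. Computed this way, five nines allows about 5.26 minutes of downtime per year rather than 5.5, and a seven-hour outage uses up roughly 80 years of that budget (84 years is what you get if the budget is rounded down to 5 minutes a year).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap days ignored


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget for a given availability (e.g. 0.99999) over a period."""
    return (1.0 - availability) * period_minutes


def years_to_absorb(outage_minutes: float, availability: float) -> float:
    """How many years of downtime budget a single outage consumes."""
    return outage_minutes / allowed_downtime_minutes(availability)


five_nines = allowed_downtime_minutes(0.99999)
print(f"five nines: {five_nines:.2f} minutes/year")
print(f"7-hour outage: {years_to_absorb(7 * 60, 0.99999):.0f} years of budget")
```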

Re:Cloud services not ready (4, Funny)

leonardluen (211265) | more than 2 years ago | (#39198577)

it's a leap year, they can be down a full day and still claim they were up for 365 days this year!

Re:Cloud services not ready (2)

gstoddart (321705) | more than 2 years ago | (#39198305)

After an outage this long, it takes a LOOONG time to earn your way back to five nines (which works out to 5.5 minutes of downtime per year).

I'd be surprised if Microsoft (or anybody) is actually offering five nines for uptime.

The fine print often says "well, we don't actually promise anything, and any outage and loss is your problem".

Re:Cloud services not ready (4, Insightful)

hawguy (1600213) | more than 2 years ago | (#39198297)

One of the selling points of using cloud services was that it would be more reliable than managing your own hardware/software. But to date, every single big player has suffered major downtime. I would be hesitant to believe the sales pitch.

But still, for most companies that are good candidates for cloud offerings, even 8 hours of downtime once a year is probably better than they can guarantee using their own infrastructure. Companies in this range tend to not have redundant servers, offsite backups, disaster recovery sites, etc. Larger companies that can build redundant infrastructure (and staff it properly) are probably better off staying away from the cloud since they can guarantee any level of uptime and redundancy they want to pay for.

Of course, when a small company Admin spills a cup of coffee in the Exchange server and they are down for 5 days while building a replacement server, it doesn't make the news so you never hear about it...while when a large cloud provider has a 2 hour outage, it's all over the news.

Re:Cloud services not ready (0)

geekoid (135745) | more than 2 years ago | (#39198437)

That's not true at all. Who else had major downtime?

Re:Cloud services not ready (4, Informative)

UnknowingFool (672806) | more than 2 years ago | (#39198847)

You mean besides Amazon, SalesForce, VMWare, Google Gmail, Yahoo Mail, Apple iCloud. Seriously who hasn't had downtime?

Re:Cloud services not ready (1)

Surt (22457) | more than 2 years ago | (#39198875)

Amazon had major downtime.

Re:Cloud services not ready (1)

dave420 (699308) | more than 2 years ago | (#39198859)

And there's a very good chance your own hardware/software would also suffer downtime in the same period.

So merely days after announcing the G-Cloud... (2)

phonewebcam (446772) | more than 2 years ago | (#39197959)

...the British Government's Cloud service suffers the inevitable Microsoft kiss of death [v3.co.uk] .

Re:So merely days after announcing the G-Cloud... (4, Funny)

Chris Mattern (191822) | more than 2 years ago | (#39198071)

The hilarious part of this link is that the article detailing how screwed people are for depending on Microsoft's cloud services is stuffed with rollover ads for...Microsoft's cloud services!

Re:So merely days after announcing the G-Cloud... (1)

courteaudotbiz (1191083) | more than 2 years ago | (#39198267)

Yes, just tried that, this is too cool! Talk about bad ad placement! :-)

The article title is

Government's G-Cloud service knocked offline by Microsoft Azure cloud computing outage

And all around the page, an ad that says

Get in the cloud - Microsoft Office 365

Just like hearing a Subway commercial while in the restrooms of a McDonalds... Priceless!

Re:So merely days after announcing the G-Cloud... (0)

Anonymous Coward | more than 2 years ago | (#39198513)

since they have been down most of a day now, will they have to re-label it "Microsoft Office 364"?

Re:So merely days after announcing the G-Cloud... (2)

JourneymanMereel (191114) | more than 2 years ago | (#39198695)

No... it will still be up for 365 days this year... trouble is... it should have been up for 366.

Re:So merely days after announcing the G-Cloud... (1)

PhilHibbs (4537) | more than 2 years ago | (#39198635)

...And all around the page, an ad that says

Get in the cloud - Microsoft Office 365

Well, clearly they need to release Microsoft Office 366 that works on leap years.

Re:So merely days after announcing the G-Cloud... (1)

leonardluen (211265) | more than 2 years ago | (#39198671)

clearly judging by the name they only intended it to work 365 days a year

2/29/2012 (5, Interesting)

MacBrave (247640) | more than 2 years ago | (#39197977)

Leap year strikes again?

Re:2/29/2012 (1)

jtownatpunk.net (245670) | more than 2 years ago | (#39198119)

That was my first thought.

Re:2/29/2012 (1)

guygo (894298) | more than 2 years ago | (#39198285)

My thought, too. You'd think that after the Y2K madness coders would have learned to adopt more robust calendar implementations.

Re:2/29/2012 (1)

ColdWetDog (752185) | more than 2 years ago | (#39198393)

My thought, too. You'd think that after the Y2K madness coders would have learned to adopt more robust calendar implementations.

Yeah, like the Mayan Long Date!

Re:2/29/2012 (1)

Anonymous Coward | more than 2 years ago | (#39198365)

Yes, according to the details I've read so far it has to do with a certificate issue regarding today's date. I assume their management platform uses signed certificates to access/control their nodes/clusters of servers and apparently today isn't a valid date, so it's not allowing it.

Hilarious.

Re:2/29/2012 (1)

Talderas (1212466) | more than 2 years ago | (#39198679)

So who is it that isn't recognizing 2/29 as a valid date? The platform? The certificate?

Re:2/29/2012 (5, Informative)

the_other_chewey (1119125) | more than 2 years ago | (#39198637)

From the service dashboard:

"4:00 AM UTC We have identified the root cause of this incident. It has been traced back to a cert issue triggered on 2/29/2012 GMT."

So yeah, a leap day bug sounds probable.
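The dashboard entry does not say what the cert issue was, so the following is only a sketch of one common way leap-day bugs arise, not a description of what happened at Azure: computing "one year from today" by bumping the year field. The function names are hypothetical.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    """Bump only the year field. Raises ValueError for Feb 29, because
    the same day does not exist in the following (non-leap) year."""
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    """Same idea, but clamp Feb 29 to Feb 28 when the target year has no
    leap day. Clamping (rather than rolling to Mar 1) is a policy choice."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)
print(safe_one_year_later(leap_day))  # 2013-02-28
try:
    naive_one_year_later(leap_day)
except ValueError as exc:
    print("naive version failed:", exc)
```

Code like the naive version works on 1,460 days out of every 1,461, which is why this class of bug tends to survive testing.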

People never learn (0)

Anonymous Coward | more than 2 years ago | (#39197993)

Never trust Microsoft. For anything. They can't even manage water vapor, for crying out loud.

To quote the lady in the commercial... (4, Funny)

Pollux (102520) | more than 2 years ago | (#39198001)

Yay, cloud!

Re:To quote the lady in the commercial... (0)

Anonymous Coward | more than 2 years ago | (#39198521)

Yay, cloud!

Fixed that for you:

Yay, cloud! [mylittlefacewhen.com]

(Quoth a Microsoft admin with a cert expiring on a leap day: "I just don't know what went wrong!")

Now they're slashdotted, too... (4, Funny)

Sqr(twg) (2126054) | more than 2 years ago | (#39198069)

This is not helping, guys!

Re:Now they're slashdotted, too... (1)

Anonymous Coward | more than 2 years ago | (#39198139)

That's fine - they just have to go into the Azure service management and spin up new instances... oh wait...

Wait (1)

afidel (530433) | more than 2 years ago | (#39198073)

Wait, so Azure isn't down, just the admin functionality is? Who gives a crap. Man, I can't spin up a new VM for 8 hours, boo hoo. This isn't an admin nightmare; the VMs being down for 8 hours would absolutely be a nightmare, but the only admins this is a nightmare for are the poor guys working for MS trying to fix whatever the code monkeys screwed up =)

Re:Wait (1)

fuzzyfuzzyfungus (1223518) | more than 2 years ago | (#39198181)

Given that one of the major selling points of 'cloud' is the ability to swiftly spin up (and down) instances as you do or don't require them, that's a bigger deal than it might otherwise be.

If you are doing a BYO Server thing, or a conventional static-sized hosting package, and buying to fit largely static demand, you may never have touched the power button after you first shoved it in the rack and fired it up. However, if you are doing the cloud thing and not spinning stuff up and down pretty frequently, you are probably overpaying.

Re:Wait (2)

Sez Zero (586611) | more than 2 years ago | (#39198421)

We have a vendor that provides software distribution through Azure. It is completely down; no software and not even the web-based administration panel.

So it isn't just the ability to fire up new VMs, but (from my experience) seems to be a complete platform failure for some customers.

Re:Wait (2)

glassware (195317) | more than 2 years ago | (#39198589)

I concur with what others have said. There are numerous services, being provided by Azure, that are completely unreachable, and have been so for longer than seven hours.

last time (5, Informative)

phantomfive (622387) | more than 2 years ago | (#39198075)

Last time a Microsoft cloud product went down, users sustained real data loss. Of course, Microsoft claimed it couldn't happen with Azure [cnet.com] .

Re:last time (1)

Anonymous Coward | more than 2 years ago | (#39198361)

A customer described the problem to The Register as an 'admin nightmare' and said they couldn't understand how such an important system could go down.

This customer is new to the concept of Microsoft, aren't they?

Hell, they're new to the concept of the internet in general, most likely.

Re:last time (1)

wstrucke (876891) | more than 2 years ago | (#39198905)

The system that lost users' data was aptly named "Danger".

it's OK, just explain it in the blog (1)

alen (225700) | more than 2 years ago | (#39198083)

like google does when something goes wrong. just explain how you're going to change things and why it happened and it will all be OK

Re:it's OK, just explain it in the blog (0)

Anonymous Coward | more than 2 years ago | (#39198115)

This is Microsoft, are things ever OK?

Credibility (2, Interesting)

hism (561757) | more than 2 years ago | (#39198087)

At this point, the best way to keep their credibility from further deteriorating is to provide good reports on what is going on. E.g., not like PSN, more like Amazon [amazon.com] . Currently that Azure dashboard doesn't even load for me... has it been slashdotted or something?

As an aside: whenever a cloud system goes down, people come out to rag on the reliability of the cloud. While I'm also annoyed by the marketing guys throwing around "just put it in the cloud!!" as much as anyone else, and agree some applications make no sense living in the cloud, I'd also like to point out that for some people, doing the admin work in-house results in the same amount or more headaches.

Is real failover redundancy a pipedream? (1)

swb (14022) | more than 2 years ago | (#39198121)

It seems like even the biggest guys can't make it work reliably, and presumably given the high profile of these services, they're not afraid to throw money and smart people at these problems.

Re:Is real failover redundancy a pipedream? (1)

medcalf (68293) | more than 2 years ago | (#39198275)

Well, the real problem is that you can never eliminate human error. When combined with the difficulties and costs of maintaining a proper test environment (full duplicate of production, essentially), the odds of something going wrong are always going to be non-zero. Then when you add the interconnectivity that clouds require on top of that, the odds that that something that goes wrong will make everything go wrong all at once becomes non-zero as well. So failure modes for well-designed cloud services tend to be fewer, but more catastrophic, than for non-cloud environments.

The cloud runs Linux. (-1)

Anonymous Coward | more than 2 years ago | (#39198129)

This is not a troll. I run cloud servers that are 24/7/365 with uptimes of well over 1000 days. Microsoft servers, to the best of my knowledge, cannot do that. One day Microsoft will figure out they never could write a good server OS.

Feature Suggestion! (5, Funny)

fuzzyfuzzyfungus (1223518) | more than 2 years ago | (#39198133)

Since the image that "Azure" and "Cloud" conjures up is more "sky" than "cloud" it would be my suggestion that Microsoft simply register chickenlit.tl and set up an Azure service status monitor/report page there.

They could have an adorable cartoon chicken that, when the system is working normally, runs around scratching and pecking (speed dependent on load). When downtime occurs, it would begin squawking about how the sky is falling. What could make failure more endearing?

Just to add that Microsoft touch, they could do the entire thing as a Microsoft Agent ActiveX control [wikimedia.org] !

Upgrade to Win 8 Beta? (1)

Anonymous Coward | more than 2 years ago | (#39198135)

They thought it was ready.

To the cloud! (4, Funny)

Howard Beale (92386) | more than 2 years ago | (#39198155)

Well...maybe not right now...

BCoD (2)

tsmithnj (738472) | more than 2 years ago | (#39198199)

It's the Blue Cloud of Death!

Re:BCoD (2)

Sulphur (1548251) | more than 2 years ago | (#39198685)

It's the Azure Cloud of Death!

FTFY

This is why NASDAQ isn't using windows anymore... (0)

Anonymous Coward | more than 2 years ago | (#39198243)

London stock exchange also is using Linux...

The thing about clouds... (2)

ThisIsAnonymous (1146121) | more than 2 years ago | (#39198245)

When it rains, it pours...

down sides of centralization and remote admins (0)

Anonymous Coward | more than 2 years ago | (#39198269)

Downsides of centralization and remote admins.

Sometimes you are better off with local admins and systems.

What is better: all your sites (or a big chunk of them) down, or just one?

Local admins, or centralization with remote admins who do not know each site's local software setup?

basket (0)

Anonymous Coward | more than 2 years ago | (#39198281)

Yet another example of why the Cloud is not ready for production. No way I'm putting my eggs in there. Maybe development/testing, but never production.

Ah, the cloud... (4, Insightful)

ErichTheRed (39327) | more than 2 years ago | (#39198289)

It's funny how those of us who bring up issues of data security and service resiliency are dismissed as just trying to protect our jobs.

Like so many other things, the actual technical underpinnings of "the cloud" are great, and have been standard fare for years. Virtual machines + flexible networking are a godsend for systems guys tasked with getting capacity for a new project up and going yesterday. I love being able to build and rip down entire test environments just to try something out...that used to mean a rack of physical servers, switchgear, etc. tied up while it was being used. That's why everyone's slowly coming around to the "private/hybrid cloud" model, which is really just code for "VMs + network capacity + something to tie it all together + maybe some external hosting".

The problem is that "the cloud" is very badly misunderstood. As soon as a CIO sees "virtual, on-demand capacity without those pesky physical on-site machines and IT staff, for a fixed cost per compute-hour" everything else takes a back seat. Then, it's "why do we need IT staff on-site, everything's being taken care of in the cloud." Public clouds like Amazon or Azure are great for startups who can't really afford their own data centers, or even bigger businesses to offload some of the nonessential stuff. When you start looking at hosting everything though, the marketing hype of the cloud sometimes distracts people from realities that they have to contend with.

Also, I'm not saying that businesses who go the private cloud or traditional hosting/outsourcing route won't have downtime -- they will. However, having onsite staff and infrastructure means you can work those staff until they fix the problem, and you have control over them. Most sane outsourcing contracts have SLAs in them stating that the vendor will expend X amount of effort to fix your problems. Cloud provider agreements, unless specifically mentioned otherwise, are "as is, where is, best effort restoration with no warranty." OK, maybe some providers will give you an SLA, but all that does is buy you free service at a later date if they violate it...it doesn't bring your application back online. You still have no choice but to sit and wait around for the provider to fix whatever's wrong...just ask Amazon EC2 customers about what happened during their last outage...

Companies need to draw sane boundaries around hosted systems, and decide what is critical and what can be offloaded. Do I care about a set of development/test machines that get used once a month? Probably a lot less than the critical database/application servers that run my core business. Comfort level, cost per minute of downtime vs. cost of dedicated resources and other factors need to be carefully considered before jumping into the cloud with both feet.

Re:Ah, the cloud... (2)

geekoid (135745) | more than 2 years ago | (#39198693)

Just so you know, the data is still accessible in Azure, it's the management console that's down. That's still bad, but let's deal with the actual facts.

A) the cloud doesn't need to mean offsite. It often is, but the philosophy can be brought in house.
B) redundancy.

Companies should completely adopt the cloud philosophy, but keep onsite system redundancy, which is still cheaper and easier than current non-cloud solutions.

The desktops should just be cloud machines. Note, I don't say dumb terminals because there is some use for local data, just not application data. Dumb terminals rely on centralized storage and processing. Cloud computers do the majority of the processing.

I got to say, getting a new computer, and not needing to do a recovery, or build a system instance is pretty damn good.

Advice (4, Informative)

DickBreath (207180) | more than 2 years ago | (#39198301)

Use the MCSE mantra:
1. Perform virus scan.
2. If that doesn't work, find a different program that will display a reassuring green graphic.
3. If that doesn't work, reboot.
4. If that doesn't work, reformat, reinstall.
5. If that doesn't work, GOTO 1.

Microsoft wouldn't know anything about running a data center if it were chasing after them at full speed.

Google this: "Microsoft Sidekick / Danger"

http://techcrunch.com/2009/10/10/t-mobile-sidekick-disaster-microsofts-servers-crashed-and-they-dont-have-a-backup/ [techcrunch.com]

https://www.pcworld.com/article/173470/microsoft_redfaced_after_massive_sidekick_data_loss.html [pcworld.com]

http://www.appleinsider.com/articles/09/10/11/microsofts_danger_sidekick_data_loss_casts_dark_on_cloud_computing.html [appleinsider.com]

Is Azure free? (1)

nurb432 (527695) | more than 2 years ago | (#39198333)

if so, that's the breaks. If not, then there should be contractual SLAs and penalties involved.

Re:Is Azure free? (1)

courteaudotbiz (1191083) | more than 2 years ago | (#39198627)

...there should be contractual SLAs and penalties involved

Do you really think Microsoft would put a gun to their own head like that, assuming they learned from their past?

I think they provide the service "As-is and with best-effort service recovery". Read the fine print, I'm sure you'll find something like that.

Thunk (1)

koan (80826) | more than 2 years ago | (#39198355)

One more nail in the Cloud coffin.

Re:Thunk (2)

geekoid (135745) | more than 2 years ago | (#39198583)

Yes, just like flat tires are putting nails in the auto industry coffin.

Re:Thunk (2)

Chris Mattern (191822) | more than 2 years ago | (#39198967)

If you had "flat tires" that put thousands of cars out of service all at once, then flat tires *would* be putting nails in the auto industry coffin.

Not like Salesforce, yet (1)

mattr (78516) | more than 2 years ago | (#39198359)

I had an outage on Salesforce for 1 week and they did absolutely nothing regarding giving me any free account time or anything except "Sorry".
Their explanation was that a massive multiterabyte log file had to be processed, since the corruption they had extended to their backup.
Shouldn't ever happen.
This was last Autumn.
All boy scouts should take away this: Cloud promises are made to be broken.

Re:Not like Salesforce, yet (1)

gweihir (88907) | more than 2 years ago | (#39199053)

Given how boastful and grand these claims are, this really is not a surprise to anybody competent. Complex systems fail. They fail in complex ways. Redundancy helps in some ways, but makes things worse in others, by increasing complexity.

Also keep in mind that when outsourcing IT, the IT people suddenly have different business goals than you do. As long as they stay afloat, they do not really care whether you go under. In-house IT is different. They are sitting in the same boat. And any sane management will make sure they have all the benefits of being in this boat and so a huge motivation to keep it going. Unfortunately, many managers just see IT as a problem.

Great uptime! (5, Funny)

gmuslera (3436) | more than 2 years ago | (#39198375)

Put your servers in the Azure cloud to have an uptime of 9.999999999%

Re:Great uptime! (1)

courteaudotbiz (1191083) | more than 2 years ago | (#39198651)

You misplaced the "."... Oh wait...

Resiliency vs. Control (1)

medcalf (68293) | more than 2 years ago | (#39198413)

Clouds are, in a sense, all about using tight control to gain efficiency. Control requires centralization. But this introduces failure modes that are catastrophic: rather than degrading performance overall or seeing point failures, everything is perfect until everything is gone. Resiliency — the ability to survive failures and still function to some degree — requires decentralization both of infrastructure and of decision making power. So attempts to become more efficient, past a certain point, inevitably result in the destruction of the system.

This is not just an IT observation. The same thing happens with biodiversity (fewer species means greater risk that a key part of a food chain will collapse and take the entire chain with it), the economy (ever notice how failures are getting bigger as government steps in more to prevent failures?), and any other complex system. Once a system is too big for a single human mind — and specifically the one in charge of the system — to contain its complexity and understand its failure modes, failure becomes inevitable. The fewer people allowed to understand and make decisions about the system, the more catastrophic the failures when they occur. The more complex the system, the more likely it is for the failures to occur. Which is to say, any complex system is at increased risk of catastrophic failure as it grows in complexity and as it becomes more centralized. Combine the two, and you're just waiting for the disaster to happen.

Cloud ain't so bad (5, Insightful)

Martz (861209) | more than 2 years ago | (#39198431)

I wrote a comment on slashdot a while back which questioned the sensibleness of running services in the cloud. I used to be a sceptic.

Since then I've used Rackspace Cloud and found that it's actually a very good idea, for certain things.

The benefits of using a cloud system are scalability and no commitment. It's not about reliability or higher availability, but you do get a little win in those areas.

To give some examples, I was recently able to play around with mysql clustering. I followed a mysql clustering howto [reliablepenguin.com] and set up a mysql cluster with load balancers. Once I was finished geeking about, I saved the VMs to the file storage and deleted the cloud instances. Total cost: £/$2-3 maximum. I hadn't previously been able to do this; I would have had to rent a dedicated server which would serve websites, email etc. I couldn't really use the dedicated server to play with new technology in case it had a negative impact on the live systems. I did have a development box for a while, but it essentially doubled my costs without making any more money, just offering some protection.

Now I have staging/development instances in the cloud - and no commitments to them - I don't have to worry about a £250 monthly bill or sign a 12 month contract to get my own box. I can fire up some resources, use them, and throw them away when I'm done.

The upshot is that I can play around with other people's cool open source software without risk of buggering something up on my live box, and the costs are insignificant since I'm only renting it per hour. I can try something new; if it works, great - it might go/stay in production. If not, delete it and move on to the next cool thing.

If I need high availability, I would use Rackspace, Amazon, and Azure, and I'd ensure that I have a plan to deal with a major outage at any of the providers. Each has APIs, so in theory I could create new instances automagically and fail over between different cloud providers with a quick DNS change, while keeping costs low.
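As a rough idea of what the decision logic for that kind of failover could look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the provider names are just labels, and `is_healthy` is a hypothetical callback you would implement yourself (for example an HTTP probe). It does not call any real provider or DNS API; the actual instance creation and DNS change would happen wherever this function says to switch.

```python
from typing import Callable, Optional, Sequence


def plan_failover(current: str,
                  providers: Sequence[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Decide which provider DNS should point at next.

    Returns the name of the provider to switch to, or None if the current
    provider is still healthy or no healthy alternative exists.
    """
    if is_healthy(current):
        return None  # nothing to do, stay where we are
    for name in providers:  # providers are listed in order of preference
        if name != current and is_healthy(name):
            return name
    return None  # everything is down; there is nothing to switch to


# Example with a fake health table standing in for real probes.
status = {"rackspace": False, "amazon": True, "azure": True}
target = plan_failover("rackspace", ["rackspace", "amazon", "azure"], status.get)
print(target)
```

Keeping the decision separate from the DNS update makes it easy to test the logic without touching live systems. Note that DNS TTLs put a floor on how "quick" the change really is.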

To recap, the cloud isn't all about high availability - no matter what the marketing says. It's about scaling systems and running resources for small amounts of time, and is perfectly suited to services which have peak demand (ticket sales for example).

The Daily Show (1)

ISoldat53 (977164) | more than 2 years ago | (#39198459)

I wonder if this is what is causing the Daily Show to post a maintenance sign on login?

When Clouds go down... (1)

AB3A (192265) | more than 2 years ago | (#39198551)

It's called Fog.

Azure down.... (1)

non-plus (260549) | more than 2 years ago | (#39198579)

so, now that the Azure cloud is down and the news has hit Slashdot - the "service dashboard" has now been "slashdotted"

Network Error (tcp_error)

A communication error occurred: "Operation timed out"
The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time.

For assistance, please raise a ticket through the CSC Help Desk (E-mail: CSS_Internal_Help_Desk@csc.com), and provide the information on this page for Proxy: CSC-CHD-CDC-1

ya'll killin' me :-)

my money goes on.... (0)

Anonymous Coward | more than 2 years ago | (#39198611)

....a loop of death as cause for the outage! ^^

Leap year (0)

Anonymous Coward | more than 2 years ago | (#39198681)

29 February and unexpected downtime hummm
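Nothing in this thread establishes that the date is the cause, but as a sketch of one common way a leap day trips up date arithmetic, consider code that computes "one year from today" by bumping the year. The function names below are made up for illustration.

```python
from datetime import date


def naive_one_year_later(d):
    # Breaks on 29 February: the following year has no such day,
    # so date() raises ValueError.
    return date(d.year + 1, d.month, d.day)


def safe_one_year_later(d):
    # Fall back to 28 February when the target year is not a leap year.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, month=2, day=28)


leap_day = date(2012, 2, 29)
print(safe_one_year_later(leap_day))
try:
    naive_one_year_later(leap_day)
except ValueError as err:
    print("naive version failed:", err)
```

The naive version works on 365 days out of every 366, which is exactly why this kind of bug survives testing.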

Cloud Redundancy (0)

Anonymous Coward | more than 2 years ago | (#39198867)

I have no empathy for any company that relies solely on a single provider. It seems as though nothing is ever 100% reliable, and companies that rely on outside service providers need to understand that no external company will ever value the service as much as they do.

For a while now, I have contemplated the necessity for a data layer which provides replication and failover, between two ENTIRELY separate clouds (think Azure and AWS/EC2). I just keep waiting for someone to do the legwork of developing this (I'm distracted on other projects).
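A toy version of the idea, just to show the shape of it: write every record to all backends, and read from the first one that answers. This is a sketch in Python under heavy assumptions. The `ReplicatedStore` class is hypothetical, and the backends are plain dict-like objects standing in for two separate clouds; a real layer would also need conflict resolution, retries, and resynchronisation after an outage, none of which is handled here.

```python
class ReplicatedStore:
    """Write to every backend, read from the first one that responds."""

    def __init__(self, backends):
        # Each backend is any dict-like object (hypothetical stand-in
        # for a storage client of a separate cloud provider).
        self.backends = list(backends)

    def put(self, key, value):
        written = 0
        for backend in self.backends:
            try:
                backend[key] = value
                written += 1
            except Exception:
                continue  # this backend is down; keep going with the others
        if written == 0:
            raise RuntimeError("no backend accepted the write")
        return written  # number of replicas that took the write

    def get(self, key):
        for backend in self.backends:
            try:
                return backend[key]
            except Exception:
                continue  # unavailable or missing here; try the next replica
        raise KeyError(key)


# Two in-memory dicts pretending to be two independent clouds.
cloud_a, cloud_b = {}, {}
store = ReplicatedStore([cloud_a, cloud_b])
store.put("order-1", "paid")
print(store.get("order-1"))
```

The hard part is not this loop but deciding what to do when the replicas disagree, which is probably why nobody has done the legwork.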

"This should never happen" ... Stupid (2)

gweihir (88907) | more than 2 years ago | (#39198897)

People that believe the cloud is not at risk for downtimes are just stupid and deserve exactly what they get. The cloud not only has the normal risks any comparable infrastructure has, but also suffers from additional risks because of complex network connectivity, complex usage patterns and untried system administration patterns.

People that still think this now are not only stupid but unwilling to learn, as the Amazon outage last year clearly showed the risks. In addition, Amazon is very likely more competent than Microsoft at this by any sane metric.
