City's IT Infrastructure Brought To Its Knees By Data Center Outage

Soulskill posted more than 2 years ago | from the watch-out-for-that-first-explosion,-it's-a-doozy dept.

Cloud 102

An anonymous reader writes "On July 11th in Calgary, Canada, a fire and explosion was reported at the Shaw Communications headquarters. This took down a large swath of IT infrastructure, including Shaw's telephone and Internet services, local radio stations, emergency 911 services, provincial services such as Alberta Health Services computers, and Alberta Registries. One news site reports that 'The building was designed with network backups, but the explosion damaged those systems as well.' No doubt this has been a hard lesson on how NOT to host critical public services."

First post! (3, Informative)

Svartormr (692822) | more than 2 years ago | (#40643819)

I use Telus. >:)

Re:First post! (5, Funny)

clarkn0va (807617) | more than 2 years ago | (#40643867)

So Shaw customers get all their disappointment in one fell swoop, while you suffer subclinical abuse on an ongoing basis. Congrats.

Re:First post! (1)

Anonymous Coward | more than 2 years ago | (#40644059)

Thanks. I nearly got a hernia I laughed so hard at that. Seriously. You could say Telus are a bunch of cunts, but cunts are useful.

Re:First post! (1)

Anonymous Coward | more than 2 years ago | (#40644077)

The unofficial offsite backup (the trunk of a certain station wagon) shall henceforth use the Telus office parking lot.

Re:First post! (0)

Anonymous Coward | more than 2 years ago | (#40644619)

Both the companies are pretty abusive. It's a pretty locked up market.

Re:First post! (0)

Anonymous Coward | more than 2 years ago | (#40645143)

Hear, hear! Telus sucks. It's true.

I just can't say it enough. TELUS SUCKS!!! God, that feels good. TELUS SUCKS DONKEY BALLS!!! Ahhhhhh...

Re:First post! (1)

MrNickname (1918152) | more than 2 years ago | (#40644087)

I am in Calgary and use Shaw as my ISP. I did not have any internet downtime. Perhaps the damage was restricted to certain servers while other parts of their network were not affected?

Re:First post! (1)

snowraver1 (1052510) | more than 2 years ago | (#40644739)

Shaw Court is an IBM datacentre. Many companies lost critical servers. We had a couple dozen there that are still coming back up. This didn't just affect Shaw, but also big customers that pay big money for uptime.

Re:First post! (1)

gen0c1de (977481) | more than 2 years ago | (#40645955)

You don't pay big money to IBM for uptime; all IBM does now is resell other companies' services, take a share of the money, and act as a middleman.

Re:First post! (1)

BagOBones (574735) | more than 2 years ago | (#40646721)

This is true. We just finished doing evaluations, and IBM's quote included subbing out ALL the work to multiple sub-vendors; the only part with IBM's name on it was the quote itself.

Re:First post! (1)

Dr Caleb (121505) | more than 2 years ago | (#40646735)

Incorrect. I know many IBMers that have been restoring service to that datacentre for more than 50 hours, on 2 hours sleep. It sucks when you are doing it, but it's worth much geek cred in my book.

Re:First post! (0)

Anonymous Coward | more than 2 years ago | (#40646893)

It was Shaw's fuckup, too many eggs in one basket. The redundant electrical substations should not have been in the same building with the same fire protection system.

Re:First post! (0)

Anonymous Coward | more than 2 years ago | (#40644861)

Engineering department good
IT department bad

Re:First post! (1)

dargon (105684) | more than 2 years ago | (#40646493)

Shaw has 2 major locations in Calgary; the only people affected are those that use the downtown site.

Re:First post! (0)

Anonymous Coward | more than 2 years ago | (#40669381)

And downtown is not actually one of them...

is it a problem? (0, Troll)

Anonymous Coward | more than 2 years ago | (#40643849)

i don't really see the problem here. after all it's only canadians...

Or... (4, Insightful)

Transdimentia (840912) | more than 2 years ago | (#40643859)

... it just points out what should be obvious: no matter how many redundancies you build, you can never escape the (RMS) Titanic effect. So stop claiming stupidity.

Re:Or... (1)

g0es (614709) | more than 2 years ago | (#40644007)

Well, it seems that they had the redundant systems in the same building. When designing redundant systems, it's best to avoid common-mode failures whenever possible.

Re:Or... (2)

theshowmecanuck (703852) | more than 2 years ago | (#40644089)

The designers AND their managers AND their managers' managers should be made redundant.

Re:Or... (1)

Glendale2x (210533) | more than 2 years ago | (#40644449)

Irrelevant. The fire department became involved and that typically means you shut it all down if they say so (or they'll do it for you), even the redundant stuff that's still running. The only way around that is separate physical buildings.

in some buildings / data centers the fire system (1)

Joe_Dragon (2206452) | more than 2 years ago | (#40645077)

in some buildings / data centers the fire system can kill most of the power

Re:in some buildings / data centers the fire syste (1)

Chris Mattern (191822) | more than 2 years ago | (#40645141)

Which is a *good* thing. Fire and live electrical systems don't mix well.

Re:in some buildings / data centers the fire syste (0)

Anonymous Coward | more than 2 years ago | (#40646903)

In some buildings they use nitrogen gas to extinguish fire instead of water. Obviously this requires immediate evacuation of all people/animals but this is fairly easy in a standalone building dedicated datacenter.

You don't need to be at the mercy of the fire department killing your power system in all cases.

Re:Or... (0)

Anonymous Coward | more than 2 years ago | (#40644455)

The problem there is that as soon as you split your infrastructure across multiple data centers, you have to start worrying about latency and split-brain scenarios. In most cases, the possible failure modes for multiple data centers are both greater and more likely.
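To make the split-brain concern concrete, here is a minimal sketch (in Python, with made-up site names) of the quorum rule most multi-site clusters rely on: a site may only act as primary while it can reach a strict majority of the voting members, so two partitioned sites can never both go active. This illustrates the general technique, not any particular vendor's implementation.

```python
# Minimal quorum sketch: a node only acts as primary if it can reach a
# strict majority of voters, so isolated sites cannot both go active.
# Site names and the fake health check are illustrative only.

from typing import Iterable

def reachable(peer: str) -> bool:
    """Placeholder health check; a real system would use heartbeats or pings."""
    return peer in {"site-b", "witness"}   # pretend only these answered

def may_act_as_primary(self_name: str, voters: Iterable[str]) -> bool:
    voters = list(voters)
    votes = 1 + sum(reachable(p) for p in voters if p != self_name)  # count self
    return votes > len(voters) // 2        # strict majority required

if __name__ == "__main__":
    # Three voting members: two data centres plus a small third-site witness.
    members = ["site-a", "site-b", "witness"]
    print(may_act_as_primary("site-a", members))
```

A third-site witness is the usual way to keep the voter count odd without paying for a third full data centre.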

Re:Or... (1)

cusco (717999) | more than 2 years ago | (#40645341)

There's a local municipality whose IT department was very proud of its redundant fiber ring. Then a backhoe pointed out the fact that all of the fibers, prod and redundant, were all in the same conduit. Oops.

Re:Or... (1)

sortius_nod (1080919) | more than 2 years ago | (#40644705)

Uhh, it is stupidity. Having your DR in the same site as your production servers is monumentally stupid. Most companies I've worked at have rules that state a minimum of 5km distance between production & DR sites in case of catastrophic failure. The Department of Defence here in Australia has 500km between their two production sites & DR. Our biggest service provider has 1000km between production & DR.

The only time I have seen the same building used was for redundancy, and even then the two comms/server rooms were blast-proof bunkers (that was for a newspaper).

So yes, this is stupidity on a grand scale.

Re:Or... (0)

Anonymous Coward | more than 2 years ago | (#40645075)

Do you even know what the Titanic is? Is your DoD fortress built on the same overconfidence principle?

Re:Or... (1)

Anonymous Coward | more than 2 years ago | (#40645387)

Yeah, silly Australian Department of Defense with poor network redundancy planning -- if a meteor hits Australia and wipes it off the map there will be no backups available.

Re:Or... (1)

flyingfsck (986395) | more than 2 years ago | (#40646281)

Two buildings eh? Like those US companies who had their main and backup systems in the two World Trade Centre Towers in NY. It sure helped them a lot...

No Site Level Resiliency? (5, Insightful)

sociocapitalist (2471722) | more than 2 years ago | (#40643861)

Whoever designed this should be smacked in the head. You never have critical services relying on a single location. You should have redundancy at every level, including geographic (i.e., not in the same flood / fault / fire zone).

Re:No Site Level Resiliency? (-1)

Anonymous Coward | more than 2 years ago | (#40643939)

so you're saying they should be... fault tolerant?

Re:No Site Level Resiliency? (2, Informative)

Anonymous Coward | more than 2 years ago | (#40643991)

The issue is that IBM runs Alberta Health Services systems and other infrastructure from the Shaw building, in which IBM has its own datacenter. IBM had no proper backups in place for these services.

911, being the most critical, was not affected either; Shaw VoIP users just couldn't call 911 if their lines were down -- obviously (only ~20k people downtown were affected).

Re:No Site Level Resiliency? (1)

Mike Buddha (10734) | more than 2 years ago | (#40644831)

It's IBM's fault that their customers didn't have a DR plan?

Re:No Site Level Resiliency? (0)

Anonymous Coward | more than 2 years ago | (#40645427)

It's IBM's fault that their customers didn't have a DR plan?

Probably not--unless their customers were paying for it.

IBM (0)

Anonymous Coward | more than 2 years ago | (#40646373)

Didn't IBM also host stuff in one of the World Trade Center towers, and had the backups in the second tower?

Re:No Site Level Resiliency? (0)

Anonymous Coward | more than 2 years ago | (#40643993)

Smacked in the head? It should be a capital offense to be so stupid in setting up critical systems! Where's the .357 magnum?

Re:No Site Level Resiliency? (0)

Anonymous Coward | more than 2 years ago | (#40644071)

I guess not everyone has as much money as you think they should have.

Re:No Site Level Resiliency? (1)

jtnix (173853) | more than 2 years ago | (#40644093)

Add 'tornado zone' to that list.

If you host all your cloud services at Rackspace in Texas and a tornado happens to rip apart their datacenter, well, expect a few hours' or days' downtime. And you'd better have offsite backups of mission-critical data, or that's a long bet that is getting shorter every day.

Re:No Site Level Resiliency? (3, Insightful)

sumdumass (711423) | more than 2 years ago | (#40644435)

This is why I do not understand the rush to cloud space. The same types of outages that apply to locally hosted data apply to cloud providers. You still need the backups, disaster plans with the ability to access the servers, and such - much of the same stuff, if not more, than you would need if hosting it yourself. Is the cloud that much cheaper or something? Or is it more about marketing hype that talks PHBs and supervisors who want to sound cool into situations like this, where diligence is not necessarily a priority?

Re:No Site Level Resiliency? (1)

Eponymous Hero (2090636) | more than 2 years ago | (#40644523)

it's the warm body on the other end of the line who kisses your ass so well you can hardly yell at them for anything

Re:No Site Level Resiliency? (1)

englishstudent (1638477) | more than 2 years ago | (#40644717)

Totally agree.

Re:No Site Level Resiliency? (1, Insightful)

ahodgson (74077) | more than 2 years ago | (#40644771)

The cloud is not cheaper, unless you're doing things really wrong in the first place, like buying tier 1 servers or running windows.

It does provide economies of scale, can be somewhat cost-competitive with doing it yourself for at least some things, and you don't have to deal with hardware depreciation and the constant refresh cycle.

The big cloud providers also integrate a lot of services that would be a pain to build internally for small and mid-sized clients.

Hype explains the rest. PHBs are always looking for a silver bullet to make things "easy".

Oh, and developers and managers both think that moving to the cloud means they won't need sysadmins. Only to (eventually) find out that running stuff in the cloud needs sysadmins who not only know how to do everything themselves but can also then work around the cloud providers' idiosyncrasies to still build things that work.

Re:No Site Level Resiliency? (1)

sociocapitalist (2471722) | more than 2 years ago | (#40646959)

Arguably you could still use two different cloud providers after verifying (and continuing to verify over time) that the infrastructure (and connectivity to it) is actually redundant.
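A minimal sketch of what "verifying over time" might look like in practice: poll a health endpoint at each provider and decide where traffic should go. The URLs and endpoint names are invented for illustration; a real setup would drive DNS or a load balancer from this decision rather than just printing it.

```python
# Sketch of a cross-provider health poll with failover to a second cloud.
# Endpoints are placeholders, not real services.

import urllib.request

ENDPOINTS = {
    "primary":   "https://app.provider-one.example/health",
    "secondary": "https://app.provider-two.example/health",
}

def healthy(url: str, timeout: float = 3.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:          # covers URLError, HTTPError, timeouts
        return False

def choose_active() -> str:
    status = {name: healthy(url) for name, url in ENDPOINTS.items()}
    if status["primary"]:
        return "primary"
    if status["secondary"]:
        return "secondary"
    raise RuntimeError("both providers down -- so much for redundancy")

if __name__ == "__main__":
    print("serve traffic from:", choose_active())
```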

Re:No Site Level Resiliency? (1)

0100010001010011 (652467) | more than 2 years ago | (#40645109)

Texas and a tornado happens to rip apart their datacenter

I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.

Re:No Site Level Resiliency? (1)

drinkypoo (153816) | more than 2 years ago | (#40645631)

I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.

Mostly these days it's built in whatever they have available, because actually building anything costs too much. That's got to be responsible in large part for the rise of the shipping container as a data center... it's a temporary, soft-set structure. You only need a permit for the electrical connection, and maybe a pad.

Re:No Site Level Resiliency? (1)

Ol Olsoc (1175323) | more than 2 years ago | (#40645645)

I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.

The cost is impressive.

Re:No Site Level Resiliency? (0)

Anonymous Coward | more than 2 years ago | (#40644165)

What remains a question is... Is everybody safe? Geez, there's an explosion and everyone is complaining because there was no internet.

What pisses me off about the future is that people are starting to think that data backup is more valuable than human lives. Many universities in the US are contracting disaster recovery sites, so if they are bombed they can keep all the information; who cares if there are any students left alive. I hope that brings stuff into perspective.

Re:No Site Level Resiliency? (2)

Sir_Sri (199544) | more than 2 years ago | (#40644437)

Is everybody safe

That is, quite literally, someone else's problem. It sounds callous to say, but seriously, /. isn't a site for first responders, it's for IT and CS types. It's not like we're looking at one thing at the expense of another here: your data (and 911 access) should work, people shouldn't die in a fire, and your data shouldn't be hosed if it was housed there.

As to your point about universities. As tragic as it might be if someone died in a fire tomorrow at the university I graduated from 10 years ago, I still want them to be able to provide me transcripts and a copy of my degree if needed 10 years from now.

People around here are supposed to worry about preserving data, usually not at the expense of people's lives (although there is a market for that in government secrets). Worrying about how to put out a fire and treat burn victims is someone else's job.

Re:No Site Level Resiliency? (1)

Eponymous Hero (2090636) | more than 2 years ago | (#40644567)

it does not remain a question, you just didn't RTFA. why does no one care about the safety of people in the explosion?

No one was hurt when a blast in a 13th-floor electrical room on Wednesday brought down Alberta Health Services computers, put three radio stations off the air and affected some banking services.

because no one was hurt, you fucking chicken little. if you gave a shit at all about "the people" you would have RTFA to find out. go ahead and read it. i hope it brings stuff into perspective for you.

Re:No Site Level Resiliency? (3, Interesting)

foradoxium (2446368) | more than 2 years ago | (#40644675)

Imagine if the Library of Alexandria had backup copies of all those books, manuscripts and other treasures? How about Constantinople? I'm sure there were people who tried to protect that data who believed it was worth more than their lives. I hope that brings stuff into perspective.

Re:No Site Level Resiliency? (0)

Anonymous Coward | more than 2 years ago | (#40644299)

Dude, that costs money reducing profit and bonuses.

Re:No Site Level Resiliency? (1)

JustOK (667959) | more than 2 years ago | (#40644569)

it's boni

Re:No Site Level Resiliency? (0)

Anonymous Coward | more than 2 years ago | (#40662287)

The designer isn't necessarily at fault. On multiple occasions in my career I've run into management and/or owners who don't think the cost of site DR is justified by the risk.

Maybe the city/provinces should skip on redundancy (3, Interesting)

Anonymous Coward | more than 2 years ago | (#40643943)

The issue with the city/provincial critical services is that they didn't have geographical redundancy, due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but all the essential services got completely fucked up too.

Re:Maybe the city/provinces should skip on redunda (0)

Anonymous Coward | more than 2 years ago | (#40644277)

Shaw has other buildings in the city. They should have used those for redundancy.

Re:Maybe the city/provinces should skip on redunda (0)

Anonymous Coward | more than 2 years ago | (#40644891)

As already mentioned, health services was not hosted by Shaw.

Re:Maybe the city/provinces should skip on redunda (2)

snowraver1 (1052510) | more than 2 years ago | (#40645001)

The problem wasn't necessarily with Shaw. Shaw's problems were relatively minor. Internet and television services were affected over a small geographic area (downtown Calgary). Those affected by the Internet outage who also had Shaw Home Phone, couldn't use their phone as the network was down. If they called 911 on a cell phone or a land line, they would have received help.

The real problem was with the datacenter housed in the same building. 20,000 consumer-class Internet outages are nothing compared to 5,000 servers going down (an estimate based on almost nothing). The Fire Dept was involved, so power was going to get cut whether they liked it or not, but there were still other problems. There are reports that the backup generators didn't kick in (whether or not that would have avoided the outage, I don't know). I have received indications that if those backup generators had worked, service could have been restored slightly faster (I'm hearing this all third-hand, so salt is needed).

IBM runs the datacenter where the servers live. Who screwed up? IBM? Shaw? Someone else? I don't know. We'll have to wait and see.

Re:Maybe the city/provinces should skip on redunda (2)

snowraver1 (1052510) | more than 2 years ago | (#40645039)

Oh yeah, I also heard that IBM will be incurring HUGE fines from SLAs. I think I heard some obscene number like 1M/minute.

Re:Maybe the city/provinces should skip on redunda (0)

Anonymous Coward | more than 2 years ago | (#40645477)

were you sitting in a Stampede beer tent at the time?

Re:Maybe the city/provinces should skip on redunda (0)

Anonymous Coward | more than 2 years ago | (#40645733)

Never fear, IBM started flying tape backups [alberta.ca] to an alternate datacenter (datacentre?), probably in Ontario...

Re:Maybe the city/provinces should skip on redunda (1)

sociocapitalist (2471722) | more than 2 years ago | (#40644397)

The issue with the city/provincial critical services is that they didn't have geographical redundancy, due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but all the essential services got completely fucked up too.

Cost should not be an issue when we're talking about life or death critical services that are provided by some level of government. You spend what you have to spend to get the job done right, not more, not less. We're also not talking about a town with a population of 16 but a city with a population of 3,645,257 (in 2011). I am quite sure that they had the means to do this the right way and just chose not to.

Re:Maybe the city/provinces should skip on redunda (0)

Anonymous Coward | more than 2 years ago | (#40644691)

Calgary population is about 1,200,000. Alberta has had a long succession of conservative governments. Spending money on health care infrastructure is not that high on their list of priorities. They're all about oil, pickup trucks, big hats and small government. This kind of thing is the natural result.

Re:Maybe the city/provinces should skip on redunda (0)

Anonymous Coward | more than 2 years ago | (#40646731)

"Cost should not be an issue when we're talking about life or death critical services that are provided by some level of government."

Ahh, but here in the real world, IT IS. Why is it people think money is always no object? That all redundancy is free, and any lack of said redundancy means someone should be fired.

Go rule your little make believe world in some shitty flash based "Sim IT Manager" game or something, ok?

Re:Maybe the city/provinces should skip on redunda (0)

Anonymous Coward | more than 2 years ago | (#40646939)

Learn something please: http://en.wikipedia.org/wiki/List_of_the_100_largest_metropolitan_areas_in_Canada

Re:Maybe the city/provinces should skip on redunda (0)

Anonymous Coward | more than 2 years ago | (#40644641)

It'd be interesting to see Shaw's quarterly profits against the cost of making life and death services geographically redundant.
Why are our telecommunications companies allowed to operate with minimal to no real competition as private entities?

Shaw is an ISP (1)

Capt.DrumkenBum (1173011) | more than 2 years ago | (#40643985)

All these other services lost their internet access, that is all. While I am sure in a perfect world all these government services and companies would have had redundant internet connections, that is often prohibitively expensive.

Re:Shaw is an ISP (2)

tlhIngan (30335) | more than 2 years ago | (#40644055)

All these other services lost their internet access, that is all. While I am sure in a perfect world all these government services and companies would have had redundant internet connections, that is often prohibitively expensive.

Actually, Shaw's a media company - they do not only internet, but phones as well. Those went down as well (Shaw has business packages for phone service over cable, but downtown, I'd guess they also have fiber phone service too).

And it really isn't a screwup - they were doing something important: testing the backup generators. It's just that the generators blew up, which took out the other backup generators (dual redundant power!) and knocked out power to all the equipment by taking the utility power offline as well.

Re:Shaw is an ISP (0)

Anonymous Coward | more than 2 years ago | (#40646665)

The lesson is to never test anything. If it fails because you didn't test it, it's an unforeseen accident. If it fails because you did test it, it's your fault for testing it instead of trusting it to work.

Re:Shaw is an ISP (0)

Anonymous Coward | more than 2 years ago | (#40644113)

It was a Shaw owned datacenter, major services were hosted there through IBM. For the most part Shaw internet connections didn't have an issue.

Re:Shaw is an ISP (2)

mysidia (191772) | more than 2 years ago | (#40645789)

Redundant internet connections are no guarantee of no single points of failure.

"Redundant" connections can sometimes wind up on the same fiber somewhere upstream, unbeknownst to the subscriber.

Most telecommunications infrastructure in any area also has some very large aggregation points: telco central offices, each a single point of failure for the telecommunications services served by that office.

What good is working 911 service, if nobody can call in, because all their phones are rendered useless by a failure of the Class5 switch [wikipedia.org] that all the phones in the city are connected to?

This kind of equipment typically has redundancy built in to survive the failure of any one processing unit or card, and telco facilities may be constructed with steel-reinforced concrete walls and many protections against external events such as tornadoes... but when you consider catastrophic disaster scenarios where the problem originates inside, you are still faced with single points of failure.

It's not like the average household is willing to pay for two phone lines, each to a different exchange, and some kind of "automatic failure switching" mechanism to select the working telco exchange office.
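One cheap sanity check for the "redundant links on the same fibre upstream" problem mentioned above is to trace the route out of each uplink and flag hops that show up on both paths. This is only a sketch: it assumes a Linux host with traceroute installed and two uplink interfaces (the names eth0/eth1 and the target are placeholders), and it only sees layer-3 hops, so shared dark fibre or a shared duct will still go undetected.

```python
# Compare traceroute paths from two uplink interfaces and report shared hops.
# Assumes Linux + traceroute; interface names and target are illustrative.

import subprocess

def hops(target: str, iface: str) -> set:
    out = subprocess.run(
        ["traceroute", "-n", "-i", iface, target],
        capture_output=True, text=True, check=False,
    ).stdout
    addrs = set()
    for line in out.splitlines()[1:]:          # skip the "traceroute to ..." header
        fields = line.split()
        if len(fields) > 1 and fields[1] != "*":
            addrs.add(fields[1])               # second field is the hop's IP
    return addrs

if __name__ == "__main__":
    shared = hops("8.8.8.8", "eth0") & hops("8.8.8.8", "eth1")
    print("hops common to both uplinks:", shared or "none found")
```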

Fukushima Daiichi - Anyone? (2)

Paleolibertarian (930578) | more than 2 years ago | (#40644035)

Putting all of one's eggs in one basket - all your reactors on one backup generator, or all your data in one place - is the reason for these catastrophic failures. Back 30-odd years ago we had mainframes (I used to operate an IBM 360); then along came the internet and the distributed computing model, where the system didn't have all of its data in even one box in a company. There was a box on everyone's desktop. Now that's come full circle with the "Cloud" initiative, where all your data is housed in one place (a datacenter) again.

The reason the cloud was "invented" was to bring back the more profitable mainframe/dumb terminal business model.

Re:Fukushima Daiichi - Anyone? (0)

Anonymous Coward | more than 2 years ago | (#40648061)

The problem at the nuke plant was not the lack of multiple backup generators. The problem was that the tsunami wave flooded them and rendered them inoperable. Multiple flooded generators wouldn't have done diddly to help. The problem was the plant designers were ultimately driven by bean counters to not build a wall high enough to shield the generators from such a wave based on the perceived unlikeliness of such a high tsunami wave.

Re:Fukushima Daiichi - Anyone? (1)

Paleolibertarian (930578) | more than 2 years ago | (#40652409)

Actually the backup generators were located in the basement as per GE's original design. Engineers requested to locate the generators in a more secure (from tsunamis) location when the plants were built but were overruled by upper management.

However the location of the generators at Fukushima is irrelevant to my point that had the power grid been a distributed system with local batteries or what have you then all residents would not have lost power when the plant was flooded. This is not an argument about the design of the centralized power plant but an argument that a centralized power plant need not be used or even exist.

Limitations (2)

phorm (591458) | more than 2 years ago | (#40644043)

There are limitations to how high your HA can be depending on the volume of data you process and the infrastructure available.

In this case an entire building was knocked out by an exceptional circumstance. You can plan for that by having buildings in multiple sites, but as you get farther apart the connecting infrastructure gets more difficult. In this case Shaw is an ISP (one of the big boys in that part of the country), so you'd expect that access to fast connections should be their forte. One thing that 9/11 showed is that even huge skyscrapers - though unlikely - can be knocked out by a crazy set of circumstances (or just crazy people).

However, what happens if you're running through gigs of data on a constant basis? If you can't get a fibre connection between two sites, you might not be able to have a live redundant backup.

Now what if you connect to multiple outside entities? They'll need to have redundant connections to both your sites. You'll want two ISPs, in case one drops out, etc.

How about power? Both sites will need a big generator or something of the like, plus battery backup to hold things until the generator kicks in. Preferably they'd both be fairly far apart on the grid, so a single grid failure doesn't take out both sites.

Weather... they'd both better be outside of any major weather considerations (forest fires, floods, quakes, whatever).

I won't make excuses for Shaw (no I don't work for them, in fact I'm affected negatively by the outage), but for many companies 100% HA/redundancy isn't really possible.

Luckily for those using services, I believe that this was a case of connection/infrastructure loss rather than all the data, so I hope that Shaw is working their a**es off to get things back.
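To put the "gigs of data on a constant basis" concern above in numbers, here is a back-of-the-envelope check, with invented figures, of whether an inter-site link can keep a live replica current: compare the sustained data change rate against the usable fraction of the link.

```python
# Rough feasibility check for keeping a remote replica in sync over a WAN
# link. All numbers are illustrative, not taken from any real deployment.

def replication_headroom(change_rate_mbps: float, link_mbps: float,
                         usable_fraction: float = 0.7) -> float:
    """Mbit/s of spare capacity (negative means the replica falls behind)."""
    usable = link_mbps * usable_fraction   # leave room for other traffic and protocol overhead
    return usable - change_rate_mbps

if __name__ == "__main__":
    for change, link in [(200, 1000), (800, 1000)]:
        spare = replication_headroom(change, link)
        verdict = "keeps up" if spare >= 0 else "falls behind"
        print(f"change {change} Mbit/s over a {link} Mbit/s link: {verdict} ({spare:+.0f} Mbit/s)")
```

If the headroom is negative, the choices are a fatter link, asynchronous replication with accepted data loss, or shipping periodic backups instead of a live replica.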

Re:Limitations (1, Interesting)

theshowmecanuck (703852) | more than 2 years ago | (#40644475)

It's kind of funny for me to hear someone call Shaw one of the "big boys" after working on a number of Telecom projects in the USA. In the scheme of things Shaw would only rate being maybe a tier 3 player. Their maximum customer base is maybe 8 million potential people .... not households (and I'm presupposing they are in Saskatchewan and Manitoba now otherwise subtract a couple million). And they compete with Telus and Manitoba Tel and Sasktel (or whoever it is there). That's definitely tier 3 or smaller. Heck, Bell Canada and Rogers are considered tier 2 and their market is probably 15 million or more. FYI Sprint USA has 50 million plus accounts, AT&T has at least 300 million accounts.

I told a Telus (Shaw's major and bigger competitor) senior manager in a conversation one time that it was ridiculous that they made employees pay for their coffee and had the gall to close the cafeteria at 2:30. He said "do you know how many employees we have?! 20,000!" I said I had worked on campuses of companies that had more people than that... and they bought the coffee... I guess you don't think much of your employees. He didn't believe there were business campuses that big. People in Canada don't get that we're not that big in the scheme of things worldwide, and like it or not, that's why we have to work with our neighbor and stop insulting them with our Napoleon syndrome antics. Land area doesn't make up for relatively small population.

Captain Obvious (2)

roc97007 (608802) | more than 2 years ago | (#40644107)

> No doubt this has been a hard lesson on how NOT to host critical public services.

And no doubt the lesson was not learned.

Re:Such lessons are never learned in Alberta.. (0)

Anonymous Coward | more than 2 years ago | (#40644415)

..where everything must be privatized for profit.

Re:Such lessons are never learned in Alberta.. (1)

roc97007 (608802) | more than 2 years ago | (#40644959)

Yeah, if it was entirely government owned, it'd be rock solid and cheaper, too.

Not surprising (3, Informative)

Anonymous Coward | more than 2 years ago | (#40644135)

There are buildings all over the US that can have a similar effect but worse. In Seattle it would be the Westin Tower, get the two electrical vaults in that building and you'll pretty much take most phone service, internet service and various emergency agency services all over the state offline for a while.

What I now consider a classic example is the outage at Fisher Plaza. It not only took down credit card processors, Bing Travel, and a couple of other big online services; it also took out Verizon's FiOS service for western Washington.
http://www.datacenterknowledge.com/archives/2009/07/03/major-outage-at-seattle-data-center/
(apologies don't comment a lot and don't know how to properly link)

The big problem is that many services, no matter how redundant they may seem to be, nowadays have an upstream geographic single point of failure (a la my Westin Tower example).

Transformer fire? (1)

Glendale2x (210533) | more than 2 years ago | (#40644147)

Transformers sometimes fail catastrophically and without warning. Other than keeping transformers outside, such things simply fall under "shit happens". Then once the fire department gets involved you turn off all the power: your backup generators, your UPS, everything.

Re:Transformer fire? (0)

Anonymous Coward | more than 2 years ago | (#40644179)

I find keeping Transformers outside is bad for the chrome parts. Besides, do you really want Megatron where any kid could find him?

911 not down (2)

CaptainPuff (323270) | more than 2 years ago | (#40644227)

911 service was not down; only customers using Shaw as their phone service provider were unable to access it via Shaw's phone service. People were asked to use cell phones to call 911 as an alternative. It sounds like the city's emergency plan was activated and followed, prioritizing and assessing critical services and leaving the other non-essentials offline. Very likely the critical services are also the ones deemed to need redundancy (those ones probably have more than one ISP), while non-essential services don't.

Re:911 not down (0)

Anonymous Coward | more than 2 years ago | (#40644933)

only some downtown customers

Re:911 not down (1)

Svartormr (692822) | more than 2 years ago | (#40645841)

Well, I was standing in a hospital emergency department several hours after the initial service loss, watching the staff fall back on paper systems. And many commonly used services, like finding out what medications a patient was on by checking a shared database used by pharmacists, were unavailable. No single event like this outage should have degraded all these services to uselessness.

Not just Shaw's network was affected (1)

Anonymous Coward | more than 2 years ago | (#40644249)

Our primary internet connection was Bell and Shaw was our backup. To our surprise Bell's downtown network relies on Shaw's backbone and was ultimately affected by this monumental single point of failure.

To get back on the internet without having to fail-over to our DR site we came up with a crazy solution of hooking up a Rogers Rocket Hub. The damn thing worked without our ~85 employees and 3 remote users noticing a difference.

Over the next few weeks we will be canceling all of our Shaw services, signing up with Enmax for our primary, and bumping Bell down to our secondary.

Re:Not just Shaw's network was affected (1)

gen0c1de (977481) | more than 2 years ago | (#40646003)

Have fun with Enmax; their stability record is pretty awesome. Try getting any kind of service on a weekend: their NOC number on weekends goes to a pager, and they will call you back within the hour. I deal with them so often it isn't funny. And watch out, they contract out the last mile in many of their build-outs. They're likely using Telus and Shaw, and the best part is, they won't tell you.

Poof! (1)

Antipater (2053064) | more than 2 years ago | (#40644267)

Sounds like the datacenter heard about this "cloud" thing and decided to give it a try.

What really happened... (4, Interesting)

Anonymous Coward | more than 2 years ago | (#40644297)

Shaw had a generator overheat and literally blow up which damaged their other 2 generators and caused an electrical arc fire. This fire set off the sprinklers and in turn, the water shut down the backup systems.

Yes, it was stupid that Shaw housed all their critical systems, including backups, in one building but even more stupid was the fact that they used a water based sprinkler system in a bloody telecom room.

Also, Alberta has this wonderful thing called the Alberta SuperNet, which, if I recall, all health regions used to use before our government decided to spend hundreds of millions of dollars to merge everything together and spend even more money to use the Shaw network to connect everything. The SuperNet was specifically designed with government offices in mind, but nooo, why use something you have already paid for when you can spend more money and use something different.

Re:What really happened... (0)

Anonymous Coward | more than 2 years ago | (#40644453)

+1 Truth, from another AC who was there.

And SuperNet wasn't private-for-profit, so it must have been bad. This is Alberta after all.

Water + equipment = magic smoke escaping (0)

Anonymous Coward | more than 2 years ago | (#40644333)

The worst thing about this is that someone designed the fire suppression systems for the DC and electricals with water. Not halon, CO2, or foam... water. That's just past common sense; it's pure negligence or incompetence.

Re:Water + equipment = magic smoke escaping (1)

corychristison (951993) | more than 2 years ago | (#40644629)

Halon has been banned for quite some time now. The replacement, Halotron, was just recently restricted.

CO2 would make the most sense.

The problem, from what I understand, was the generator room. A CO2 system would be ideal in this situation; you're dealing with lubricants as well, and foam just makes an awful mess.

What I don't understand is how the sprinkler system was involved at all. When a sprinkler head bursts, it only flows at the affected area. It's not like the movies where if one pops the whole system goes off.

I don't know the building in question but I worked in the fire protection industry a short while.

It was so bad.. (5, Funny)

Megahard (1053072) | more than 2 years ago | (#40644339)

It caused a stampede.

don't jump to conclusions (0)

Anonymous Coward | more than 2 years ago | (#40644741)

There are way too many assumptions going on with this story. There's more than one company in the building, and not all issues reflect on Shaw as is mostly being reported. There's also an IBM datacentre located in the building, and that's where the Alberta Health Services stuff resides. There's also a lot of shared infrastructure, but when water is everywhere due to the transformer explosion and they cut power to the entire building... well, what does one expect.

As with anything, issues like this need to be learned from and not turned into a blame circus.

-Just someone who's had previous experience in the building in question

Systems on systems (0)

Anonymous Coward | more than 2 years ago | (#40645161)

There is a public building. Inside this building there is a level where no elevator can go, no stair can reach and it has no working network backup. This level is filled with flaws. These flaws lead to many catastrophes. Unpredictable catastrophes. But one flaw is special. One flaw leads to the source of all other problems.
  This building is protected by a very secure system. Every alarm triggers da'bomb for public services. But like all systems it has a weakness, the system is based on the rules, regulations and the budget of the building. One system built on another. If one fails, so must the other.

OK, you read the headlines, now some FACTS (1)

Anonymous Coward | more than 2 years ago | (#40645471)

'City's IT Infrastructure Brought To Its Knees By Data Center Outage'
Incorrect!! Certain key public and private infrastructure systems were (and still are) housed at the Shaw Court data centre, yes. But the 'City's IT Infrastructure' was certainly ANYTHING BUT 'brought to its knees'. Simply not true, inflated, and blown way out of proportion.

'This took down a large swath of IT infrastructure, including Shaw's telephone and Internet customers'
Grossly overstated. 'Large swath' I can accept as the impact was far reaching, sure. Only 30,000 downtown core subscribers were affected (and service was restored to them quite quickly). In a geographic location with MILLIONS of customers, this is hardly a 'large swath' of customers though - a bit of a stretch.

'local radio stations'
THREE radio stations: one country (I think most ppl were pretty happy Country 105 was off the air), and two talk radio. One of which was just a studio, and affected only one particular show. So, really - TWO radio stations.

'emergency 911 services'
Completely FALSE and over-hyped by media. The Shaw VOIP customers couldn't access 911 - yes. But as long as they had access to a cell phone, or lived outside the affected area, 911 was up the whole time. Some EMS systems were affected, though. However they have a backup analogue radio system should the digital system go down - so, nothing catastrophic here.

'provincial services such Alberta Health Services computers, and Alberta Registries'
Only SOME AHS computers were affected, in the Calgary area only. The rest of the province had no email or VPN until this morning. Big deal, we can live without email... just phone or fax someone. Alberta Registries was hit pretty hard (as in, completely offline) - but they are allowing a very generous grace period if anyone needs to renew a license or some such. Not a big deal.

'One news site reports that 'The building was designed with network backups, but the explosion damaged those systems as well''
Sigh. Again, way off base. Yes, there were 'backup' systems - comprised of a UPS system that suffered water damage. However, no servers in the data centre suffered any water damage. They were worried about condensation, but that turned out to be a non-issue.

'No doubt this has been a hard lesson on how NOT to host critical public services'
No, this is a lesson for the submitter, if he or she is really interested in clear reporting, to avoid the word 'not'. So, a cleaner sentence might be:
"No doubt, this is a hard lesson on how to host critical public services with clustering across sites, thereby avoiding a single point of failure." ...and that was the biggest mistake - a design flaw. The building design aside (13th floor = mechanical room. A transformer blew, triggering the sprinkler system. The backup generators engaged, but the battery room already suffered water damage and shorted out with the high load. When the fire marshal and building ppl arrived, they simply cut power to the building entirely, as water in the bus ducts - "wire trays" for non-construction types - was found), the REAL lesson here is what was already mentioned to AHS execs, Shaw managers and IBM Global - too many eggs in one basket. Instead of fire suppression via water; use halon. Instead of the 'backup systems' (what a JOKE) in the SAME BUILDING, configure clustered services across two or (better yet) more sites.

But - we're just geeks, not execs...what do we know?

It has to be said (0)

Phibz (254992) | more than 2 years ago | (#40645483)

I used to be a City until I took an arrow to the knee.

Single Points of Failure (1)

AB3A (192265) | more than 2 years ago | (#40645559)

People often walk around with some very bad assumptions about how resilient the Internet or a Cloud must be.

You may have a very good internet presence with lots of bandwidth, but it may be all housed in the same building where the same sprinkler system can bring it all down. You may think that ISPs can reroute lots of traffic to other places because it is possible. Yet, there are common failure modes there too.

Cloud computing is often hailed as a very resilient approach to infrastructure. Yet there is a disturbing tendency to concentrate all the servers in one big glass room. You may get the dynamic pay-per-clock-cycle performance, but it may all come back to one substation. A single fire in that substation could bring everything down.

This is the problem with SLA deals: You don't know what kind of planning they may use for such infrastructure. Remember, the Internet itself may be resilient, but your cloud and your ISP may not be.

Oblig xkcd (0)

Anonymous Coward | more than 2 years ago | (#40647045)

People often walk around with some very bad assumptions about how resilient the Internet or a Cloud must be.

The Cloud [xkcd.com]

this fire (0)

Anonymous Coward | more than 2 years ago | (#40648801)

My company relies on that data center to receive all of our EDI data from Union Pacific and Norfolk Southern. So that was all down for like 18 hours. It kinda sucked.
