
RIM Releases Reason for Blackberry Outage

Zonk posted more than 7 years ago | from the isn't-testing-a-requirement dept.

Handhelds 106

An anonymous reader writes "According to BBC News, RIM has announced that the cause of this week's network failure for the Blackberry wireless e-mail device was an insufficiently tested software upgrade. Blackberry said in a statement that the failure was trigged by 'the introduction of a new, non-critical system routine' designed to increase the system's e-mail holding space. The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options."


perhaps (5, Interesting)

geekoid (135745) | more than 7 years ago | (#18812141)

a routine that can take down the system is a tad more critical then you think?

Why Buy Apple? Here is Why! (-1, Offtopic)

Anonymous Coward | more than 7 years ago | (#18812185)


Why? iiii iiii iiii phhhhhhhhh oooooooooooooooooo nnnnnnnnnnnnnnn eeeeeee!

Re:perhaps (-1, Offtopic)

Anonymous Coward | more than 7 years ago | (#18812371)

Then I think about what???

Re:perhaps (1)

ChunkyLoverYYZ (1030246) | more than 7 years ago | (#18813095)

I get your point, but I think the context here is that the initial change wasn't raised with a severity of "Critical", which usually means something along the lines of core systems being unavailable. The impact would probably have been rated high, or at least a significant risk identified. Whatever the cause, more than one person or group dropped the ball. ITIL... so confusing. ;-)

Re:perhaps (0)

Anonymous Coward | more than 7 years ago | (#18813121)

You keep saying those words... I do not think they mean what you think they mean...

I'd hate to be their QA manager right now! (1)

soft_guy (534437) | more than 7 years ago | (#18812143)

I'd really hate to be the guy that signed off on the quality of this software update. And apparently they didn't adequately test their recovery system. Oh, well. I hope they learn from this and improve!

Re:I'd hate to be their QA manager right now! (5, Insightful)

Mr Pippin (659094) | more than 7 years ago | (#18812203)

More importantly, they apparently had either no backout plan or a very bad one.

It's quite likely the development group listed this as a risk, with a good backout plan, and upper management simply didn't want to pay for the cost of having a quick backout.

If that's the case, you can be pretty sure upper management WON'T take the blame.

Re:I'd hate to be their QA manager right now! (5, Insightful)

spells (203251) | more than 7 years ago | (#18812311)

You can tell this is a geek site. Bad software rollout, first post wants to blame the QA manager, second wants to blame "Upper Management." How about a little blame for the devs?

Re:I'd hate to be their QA manager right now! (5, Insightful)

lucabrasi999 (585141) | more than 7 years ago | (#18812433)

How about a little blame for the devs?

Blasphemer!

Re:I'd hate to be their QA manager right now! (5, Insightful)

bradkittenbrink (608877) | more than 7 years ago | (#18812471)

Clearly bugs originate with devs, the same way typos and spelling errors originate with authors. The occurrence of such errors is inevitable. The process as a whole is what is responsible for eliminating them. To the extent that the devs failed to contribute to that process then yes, they also deserve blame.

Re:I'd hate to be their QA manager right now! (3, Insightful)

bcat24 (914105) | more than 7 years ago | (#18812743)

I couldn't agree more. Yes, the developers should be responsible for their errors, but still, they're only human. Even the best dev makes a serious mistake from time to time. That's why it's essential to have good coders and good QA folks and good management for any project, especially one as large as the Blackberry network. Sometimes redundancy is a good thing.

Re:I'd hate to be their QA manager right now! (1)

giorgiofr (887762) | more than 7 years ago | (#18812487)

This is blasphemy! This is MADNESS!

Re:I'd hate to be their QA manager right now! (3, Funny)

david_g17 (976842) | more than 7 years ago | (#18812601)

Madness?!?!, No, this is SLASHDOTTTTTT!!!

~kicks guy into a bottomless pit~

Re:I'd hate to be their QA manager right now! (4, Insightful)

roman_mir (125474) | more than 7 years ago | (#18812499)

I am not sure if you are trying to be funny or insightful (probably you are aiming for a bit of both); however, while bugs in software are (inevitably) the developers' fault, releasing buggy software into a production system is always a management fault. There must be a process in place to catch bugs before release for mission-critical systems (isn't this one of them?). There must be a process in place for a quick rollback for such systems. There must be some form of backup. How about running both the new and old systems in parallel for a while, with the ability to switch back to the old one if the new one fails?

Whatever it is, production problems are due to bad process, which is what management is supposed to control. They are not even responsible for coming up with the technicalities of the process; they are responsible for making sure that there is a sufficient process (sufficient in the sense that all parties - devs, QA, BAs, the client - agree it is good enough), and for making sure that the process is followed.

Over a year ago now in Toronto, ON, Canada, the Royal Bank of Canada had a similar problem, but of course with a bank it is much more dangerous: it's a lot of money belonging to a lot of people. Heads rolled at the management level only.
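
To make the parallel-run suggestion above concrete, here is a minimal sketch (Python, with entirely hypothetical names, not anything RIM actually runs) of routing every message through the trusted old code path while mirroring it to the candidate, and only promoting the candidate after a long streak of matching results:

    # A minimal sketch of the parallel-run idea above: route live traffic to the
    # old routine, mirror it to the new one, and only promote the new routine once
    # its results have matched for long enough. All names here are hypothetical.

    import logging

    class ParallelRunRouter:
        def __init__(self, old_handler, new_handler, required_clean_runs=10_000):
            self.old = old_handler          # trusted, currently-live code path
            self.new = new_handler          # candidate upgrade under evaluation
            self.required = required_clean_runs
            self.clean_runs = 0

        def handle(self, message):
            result = self.old(message)      # the old path always serves the user
            try:
                candidate = self.new(message)
                if candidate == result:
                    self.clean_runs += 1    # evidence the new routine behaves
                else:
                    self.clean_runs = 0
                    logging.warning("new routine diverged on %r", message)
            except Exception:
                self.clean_runs = 0
                logging.exception("new routine failed; old path still serving")
            return result

        def ready_to_promote(self):
            # Only cut over once the candidate has matched the old path at length.
            return self.clean_runs >= self.required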

Re:I'd hate to be their QA manager right now! (0)

Anonymous Coward | more than 7 years ago | (#18812673)

You can always spread the blame around ;-)

Denis the SQL Menace
http://sqlservercode.blogspot.com/ [blogspot.com]

Re:I'd hate to be their QA manager right now! (4, Insightful)

jimicus (737525) | more than 7 years ago | (#18812713)

How about a little blame for the devs?

Because that's not how change should happen in large/business critical applications.

What should happen is that the update is thoroughly tested, a change control request is raised and at the next change control meeting the change request is discussed.

The change request should include at the very least a benefit analysis (what's the benefit in making this change), risk analysis (what could happen if it goes wrong) and a rollback plan (what we do if it goes wrong). None of these should necessarily be vastly complicated - but if the risk analysis is "our entire network falls apart horribly" and the rollback plan is "er... we haven't got one. Suppose we'll have to go back to backups. We have tested those, haven't we?" then the change request should be denied.

As much as anything else, this process forces the person who's going to be making the change to think about what they're going to be doing in a clear way and make sure they've got a plan B. It also serves as a means to notify the management that a change is going to be taking place, and that a risk is attached to it.

And if a change is made but hasn't been approved through that process, then it's a disciplinary issue.

Of course, it's entirely possible that such a process was in place and someone did put a change through without approval. In which case, I don't envy their next job interview.... "Why did you leave your last job?"
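
For illustration only, here is a rough sketch (in Python, with made-up field names rather than any real change-management tool) of the gate described above: a change request that lacks a benefit statement, a risk analysis, or a tested rollback plan gets denied before anything touches production.

    # A rough sketch of the change-control gate described above: a change request
    # must carry a benefit statement, a risk analysis and a *tested* rollback plan,
    # or the meeting rejects it. Field names are illustrative, not any real tool's.

    from dataclasses import dataclass

    @dataclass
    class ChangeRequest:
        summary: str
        benefit: str              # why make this change at all
        risk: str                 # what could happen if it goes wrong
        rollback_plan: str        # what we do if it goes wrong
        rollback_tested: bool     # "er... back to backups, probably" doesn't count

    def review(cr: ChangeRequest) -> bool:
        """Return True if the change may proceed, False if it is denied."""
        missing = [name for name, value in (
            ("benefit", cr.benefit),
            ("risk analysis", cr.risk),
            ("rollback plan", cr.rollback_plan),
        ) if not value.strip()]
        if missing:
            print(f"DENIED {cr.summary!r}: missing {', '.join(missing)}")
            return False
        if not cr.rollback_tested:
            print(f"DENIED {cr.summary!r}: rollback plan has never been exercised")
            return False
        print(f"APPROVED {cr.summary!r}")
        return True

The point is not the code; it's that the gate forces someone to write the rollback plan down and prove it has been exercised before the change meeting says yes.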

Re:I'd hate to be their QA manager right now! (1)

ePhil_One (634771) | more than 7 years ago | (#18813927)

The change request should include at the very least a benefit analysis (what's the benefit in making this change), risk analysis (what could happen if it goes wrong) and a rollback plan (what we do if it goes wrong).

And what if there was? What if, gasp, this software upgrade had an "unexpected" impact? Risk analysis almost certainly would not have listed "worldwide operations will grind to a halt, cats and dogs start sleeping together, all the molecules in your person fly apart in exciting ways", and the "unexpected impact" would not have been accounted for in the rollback plan.

I've yet to work for a company that had the resources to exactly replicate the production environment in QA; you look at the risk/reward, study your budget, and do the best you can. I saw a gaming website once that showed that, statistically, someone rolls doubles six times in a row about every five minutes. Freaky stuff happens, and it happens far more often than you think.

Now for the million-dollar question: what real damage has been done by this outage? Millions of folks were unable to instantly read their email while away from their desks, something they were always unable to do prior to Blackberry. It's a small bruise to their rep; users' Blackberries go offline all the time (god I hate supporting those things), this was just the first time it affected them ALL.

Re:I'd hate to be their QA manager right now! (0)

Anonymous Coward | more than 7 years ago | (#18813105)

Amen brother... I've been a dev in the past and you all know that devs don't provide good backout plans. They list the risk, and if management wants a rollback plan then they provide one, but devs in general are not nearly pessimistic enough to think about a rollback... I mean, when you're as arrogant as we are, why even consider that your code might not work?

Re:I'd hate to be their QA manager right now! (2, Insightful)

soft_guy (534437) | more than 7 years ago | (#18813193)

I am a dev and my motto is "all software engineers are liars and idiots" and I include myself in this. If you want to know how something is supposed to work in theory, ask the dev. If you want to know the actual behavior, ask QA.

Re:I'd hate to be their QA manager right now! (3, Insightful)

mutterc (828335) | more than 7 years ago | (#18814499)

How many people here have checked in buggy code that neither management nor QA knew was buggy? (crickets)

How many people here have been on projects where management shoved the code out the door despite major bugs that they knew about? (thunderous applause)

How many people here have tried to get time on The Schedule to do something The Right Way, only to be told by management to do it half-assed, because that's all there's time/resources for? (applause, hooting)

There you go.

Re:I'd hate to be their QA manager right now! (1)

triso (67491) | more than 7 years ago | (#18817019)

...How about a little blame for the devs?
Stone the heretic.

Re:I'd hate to be their QA manager right now! (1)

thePowerOfGrayskull (905905) | more than 7 years ago | (#18815735)

More importantly, they apparently had no or a very bad backout plan.

It's quite likely the development group listed this as a risk, with a good backout plan, and upper management simply didn't want to pay for the cost of having a quick backout.

If that's the case, you can be pretty sure upper management WON'T take the blame.
I don't know what shops you've worked in, but the devs in most places I've worked never have a backout plan unless management forces them to -- the prevailing attitude is that the software is tested, so what could possibly go wrong?

A QA manager has any say on how much testing? (1, Funny)

Anonymous Coward | more than 7 years ago | (#18812463)

I think not. You realize this is 2007, yes? Ask the marketing department how much testing you get.

Re:I'd hate to be their QA manager right now! (2, Insightful)

SABME (524360) | more than 7 years ago | (#18814267)

As a QA guy, I can't tell you how many times I've been told, on a Monday, "Do whatever is required to make sure this software is stable, as long as you release it on Friday."

We're lucky we can get through a single pass of functionality testing; forget about load/stress/performance/long-term stability. We're lucky we have a test environment composed of hardware retired from production, because it was deemed insufficient to meet the needs of the production environment.

True story: I was supposed to be testing a product that interfaced with an IP videoconferencing bridge. Except we had no such bridge in our environment, and no budget to purchase one. No one in management thought this was absurd until I took a cardboard box and wrote "Video Bridge" on it, along with little holes labeled eth0, eth1, DS1, etc. (much like the famous P-p-p-powerbook). I complained to the VP of Engineering that our tests were blocked because I couldn't get the video bridge to come up on our lab network. When I showed him the "box," he got the point. :-).

In my experience, customers are more interested in getting new features ASAP than they are in reliability, which is why so many organizations put a premium on rolling out new features quickly. When was the last time anyone worked on a release with no new features outside of performance and stability improvements?

Re:I'd hate to be their QA manager right now! (2, Funny)

scottv67 (731709) | more than 7 years ago | (#18814607)

I complained to the VP of Engineering that our tests were blocked because I couldn't get the video bridge to come up on our lab network.

Did you try setting CardboardEthernet0/0 to "100/full" instead of "auto/auto"? :^)

What really happened... (5, Funny)

Mockylock (1087585) | more than 7 years ago | (#18812179)

This is all just technical jargon for, "I tripped over the power cord. MY BAD."

Re:What really happened... (1)

JustNilt (984644) | more than 7 years ago | (#18812755)

This is all just technical jargon for, "I tripped over the power cord. MY BAD."

I think it's more along the lines of "Unplugged the coffee maker; please feel free to restart the server now."

Re:What really happened... (1)

SwordsmanLuke (1083699) | more than 7 years ago | (#18814487)

That actually happened at a medium-sized web host I did tech support for! A network admin had been at the NOC all night installing patches on our servers and on his way out tripped over the cords for the main routing system. As a result, every one of our websites (serving some 20,000 customers worldwide) was offline for about two hours before someone discovered what he'd done.

Re:What really happened... (1)

Mockylock (1087585) | more than 7 years ago | (#18814613)

Awesome. Better him than me! Sounds like something I'd have happen to me.

Re:What really happened... (1)

Dragonslicer (991472) | more than 7 years ago | (#18816265)

How tired do you have to be to not notice that you tripped over a bunch of cables?

Re:What really happened... (0)

Anonymous Coward | more than 7 years ago | (#18814819)

Damn, so that's what happened!
I thought it was just aliens trying to attack Earth.

Non-critical? (5, Funny)

Anonymous Coward | more than 7 years ago | (#18812191)

This is obviously some new definition of the word "non-critical" with which I was previously unfamiliar.

bkd

mod parent up (0)

Anonymous Coward | more than 7 years ago | (#18812303)

This is funnier than the other comments about it being critical.

Re:mod parent up (1)

Rukie (930506) | more than 7 years ago | (#18812405)

I'd hate to know how the detox from the crackberry felt! More importantly, something non-critical CAN cause a critical problem. Look at Windows.

Re:Non-critical? (1)

Red Flayer (890720) | more than 7 years ago | (#18812491)

It seems also to be some new definition for the word "upgrade" with which we're not familiar.

Re:Non-critical? (1)

contrapunctus (907549) | more than 7 years ago | (#18812737)

Inigo Montoya: You keep using that word. I do not think it means what you think it means.

Buying time (5, Funny)

faloi (738831) | more than 7 years ago | (#18812197)

The irony is that the SEC couldn't do any more investigating during the outage because they had no email access!

Re:Buying time (0)

Anonymous Coward | more than 7 years ago | (#18817167)

Not to mention that the SEC's reach stops at the Niagara River, and RIM is somewhere to the northwest of it...

However, I'm sure the SEC has already whined to Ottawa and demanded action,
as if that will do anything to help the situation. I'm sure RIM is in just
the right mood to have a bunch of Men In Black come in and snoop through all
the financial records while they're trying to do an autopsy on their software.

Short answer (1)

gEvil (beta) (945888) | more than 7 years ago | (#18812221)

Their tubes were clogged and the plumber wasn't responding. Damn Canadian plumbers...

Re:Short answer (1)

David_W (35680) | more than 7 years ago | (#18813047)

Their tubes were clogged and the plumber wasn't responding.

That's probably because dispatch couldn't reach him on his Blackberry.

testing departments (1)

pytheron (443963) | more than 7 years ago | (#18812241)

So, an outage affecting a core part of the business was caused by a 'non-critical' upgrade. Someone needs to redefine what non-critical actually means. As far as my experience goes (mostly in mission-critical datacentres), most of the testing was actually done by the engineers installing and fixing on the fly. Engineers are more likely to look in the right places to find a bug, due to hands-on real-life experience.

Re:testing departments (4, Informative)

Red Flayer (890720) | more than 7 years ago | (#18812427)

Someone needs to redefine what non-critical actually is.
A non-critical upgrade is one that isn't critical to perform.

Increasing storage capacity (when current capacity is not close to exhaustion)? Non-critical.

Fixing the shut-down system that resulted from the upgrade? Critical.

Watching the sales reps in my office apoplectically try to figure out how to get in touch with their clients? Priceless.

Relevance? (1)

EveryNickIsTaken (1054794) | more than 7 years ago | (#18812263)

The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options.
And this is relevant how? Do you expect the SEC to fine them for downtime?

Re:Relevance? (1)

Curmudgeonlyoldbloke (850482) | more than 7 years ago | (#18812581)

Because the journo concerned had some space to fill, probably?

Financial Relevance (1)

Comboman (895500) | more than 7 years ago | (#18814111)

The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options.
And this is relevant how? Do you expect the SEC to fine them for downtime?
Both would likely have a negative impact on stock prices (and you can add ongoing patent troubles and competition from iPhone to the list as well).

Re:Relevance? (1)

mutterc (828335) | more than 7 years ago | (#18814395)

And this is relevant how?

Conspiracy theories. I think I'm anti-business and cynical enough to see it:

RIM sending a message to the SEC: "Enough of the government and business is dependent on us that, if you take us down, you both make a big hit to the economy, and piss off your own bosses, who probably use our product."

Ah ha! (4, Funny)

Grashnak (1003791) | more than 7 years ago | (#18812273)

So that is where the missing 5 million White House emails went! Sneaky Canadians!

Re:Ah ha! (1)

PinkPanther (42194) | more than 7 years ago | (#18812687)

So that is where the missing 5 million White House emails went! Sneaky Canadians!

Damn! We thought these were emails for whitehouse.com, eh?

We have blackberries and Bes (1)

grasshoppa (657393) | more than 7 years ago | (#18812323)

And let me tell you, I have no problem believing they have buggy software.

on the plus side... (1)

apodyopsis (1048476) | more than 7 years ago | (#18812387)

...they just became famous as a lesson in what not to do

all publicity is good publicity, right?

as the other poster said: boy, I would hate to be their QA at this time.

Re:on the plus side... (1)

SoapBox17 (1020345) | more than 7 years ago | (#18812639)

Lucky for them, they obviously don't have QA at all.

Re:on the plus side... (1)

glitch23 (557124) | more than 7 years ago | (#18813937)

as the other poster said:- boy I would hate to be their QA at this time.

Maybe you give them too much credit in assuming they even have a QA department.

Is this really so bad? (4, Insightful)

TheBishop613 (454798) | more than 7 years ago | (#18812509)

Am I the only one who thinks they actually survived this pretty well? I mean sure, the goal is to try to make sure that the system never goes down and is up 24/7, but sometimes shit happens in large systems. It seems to me that getting everything back to normal within 12 hours is pretty reasonable. Did they have an instant fix? Well no, of course not, but they got the system back to a working state relatively quickly and hopefully didn't lose data.


Yeah, they've got areas to tighten up their QA and patch processes, but on the whole they got it all back up and running faster than most enterprises get their email functioning after a worm.

Re:Is this really so bad? (2, Funny)

NoseyNick (19946) | more than 7 years ago | (#18812951)

"BlackBerry goes down, it's headline news. Exchange goes down, it but be Friday"

Yes it is. They've put themselves in a critical... (4, Insightful)

WoTG (610710) | more than 7 years ago | (#18813153)

RIM is not a regular company. They have specifically created a centralized system where the email of millions of people depends on the uptime of their two (?!?!) data centres. Delivering email is literally their business, and uptime is a critical part of that. IMHO, even half an hour of system-wide downtime is pushing RIM's luck.

Several hours of email downtime is "OKish" if you are talking about a medium-sized company that only has a handful of servers and a few IT guys. This is not the same thing at all.

Prior to this, I never realized that the RIM system was THIS centralized. It's kind of concerning really. And I don't quite understand why so many US gov't users are allowed to route their email through a NOC in Canada (disclosure: I'm Canadian).

Re:Yes it is. They've put themselves in a critical (1)

afidel (530433) | more than 7 years ago | (#18814125)

What gets me is all the media talk about emergency responders not being able to be contacted. It's not like their Blackberries burst into flames because the message-passing servers were down; they still had SMS and phone capability. Hell, because we aren't certain our email relays or BES servers won't be the down system, our alerting system automatically switches from email to SMS for the second round of notifications. I guess RIM isn't the only one who could use a little process improvement!
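
A minimal sketch of that fallback, assuming placeholder send_email/send_sms hooks rather than any particular gateway: the first round of alerts goes out over email, and if nobody acknowledges within a timeout, the second round deliberately switches to SMS in case email itself is the broken system.

    # Sketch of an email-first alert with an SMS second round, as described above.
    # send_email / send_sms / ack_received are placeholders for whatever your
    # relay, SMS gateway, and on-call tooling actually provide.

    import time

    def send_email(contact, text): ...   # placeholder: your mail relay
    def send_sms(contact, text): ...     # placeholder: carrier or SMS gateway

    def alert(contacts, text, ack_received, wait_seconds=300):
        """Email first; if nobody acknowledges, assume email may be down and SMS."""
        for c in contacts:
            send_email(c, text)
        time.sleep(wait_seconds)
        if not ack_received():
            # Second round deliberately avoids the channel that might be the outage.
            for c in contacts:
                send_sms(c, text)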

Re:Yes it is. They've put themselves in a critical (1)

Tack (4642) | more than 7 years ago | (#18814467)

They have specifically created a centralized system where the email for millions of people depend on the uptime of their two (?!?!) data centres.

Your information isn't quite right here. RIM has more than two data centers in more than two locations in more than two continents.

And I don't quite understand why so many US gov't users are allowed to route their email through a NOC in Canada (disclosure: I'm Canadian).

Governments tend to be (justifiably) paranoid customers. I'm sure it's safe to assume that each government does a fair amount of investigation before deciding it's safe to use BlackBerry for official use. And even then I expect it's only permitted for certain classification levels -- probably low ones.

Re:Yes it is. They've put themselves in a critical (1)

644bd346996 (1012333) | more than 7 years ago | (#18814707)

Maybe the feds check out that kind of thing. I can testify that at least one county government (with several thousand employees and about a million citizens) has such a bad IT department that they would be hard pressed to figure out that RIM is Canadian. All employees within about 5 hops of the county manager on the org chart have Blackberries for official use.

Re:Yes it is. They've put themselves in a critical (1)

WoTG (610710) | more than 7 years ago | (#18816419)

I don't really know how many data centres RIM has. Two doesn't really sound right to me, but that's what has been frequently quoted in the media lately. Maybe people, including myself, are mixing up data centres and NOCs in the RIM world.

E.g. http://news.zdnet.com/2100-1035_22-6177829.html [zdnet.com]

Re:Yes it is. They've put themselves in a critical (1)

Mr. X (17716) | more than 7 years ago | (#18815063)

It's my understanding that corporate BlackBerrys use encryption while the messages are in transit. I'm not sure if the central RIM server ever gets a chance to see the cleartext message.

Re:Yes it is. They've put themselves in a critical (1)

WoTG (610710) | more than 7 years ago | (#18816369)

Encrypted transit makes sense. However, that still leaves a fairly important point of failure significantly outside of US control. I hate to say it, but if RIM use continues to grow in the government, eventually those NOCs become strategic targets.

I can understand smaller countries having to accept that as a part of life, but let's face it, America has a few bucks to toss around. I'm quite surprised that the government hasn't forced RIM to put a NOC on American ground.

Re:Is this really so bad? (1)

Bearhouse (1034238) | more than 7 years ago | (#18813559)

Sorry, don't agree. I have a Pearl, and love it. But (here in Europe) it's expensive, and many similarly-priced phones offer more features, such as GPS... But I bought a Pearl because I just 'wanted it to work'. Sure enough, 10 minutes after unpacking it, I was receiving my first mail. But if RIM's USP is in doubt, well, why not put up with the hassle of configuring a competitive device? So yes - it's really very bad...

Re:Is this really so bad? (1)

ACMENEWSLLC (940904) | more than 7 years ago | (#18814451)

I purchased an AT&T (old AT&T) cell phone right when they did their system migration. As I recall, they were down for about 3 weeks. On-hold times averaged 8 hours for first level.

Name me one piece of software that is as complex as this which has no bugs in it.

T-Mobile recently did an upgrade which took many months. There were bugs in this system too, but they worked in quite the opposite direction. Did you hear about those?

RIM's biggest failure (4, Interesting)

toupsie (88295) | more than 7 years ago | (#18812517)

Mistakes in QA do happen and everyone can do more testing, but RIM's biggest failure during the outage was not their QA but their PR. How many BES admins wasted an hour or two trying to figure out why their servers were not delivering properly to their users' handhelds? If there had been a statement on their website or a message on their support line, a lot of wasted time would have been averted. If it were not for a few of the independent Blackberry forums, I would not have known there was a nationwide outage during my troubleshooting.

Re:RIM's biggest failure (2, Funny)

nettdata (88196) | more than 7 years ago | (#18813081)

Yeah... they should have just sent out an email to all the BlackBerries saying email would be disrupted for a while....

Re:RIM's biggest failure (0)

Anonymous Coward | more than 7 years ago | (#18814533)

I spent 10 minutes. Realized all of my US servers were down, none of my Asia or Europe servers were and sent a message to all of the users (yes, e-mail...). Then I went home and sent SMS updates to my boss. "Still down" "yup, still down" "it might be coming up.. nope, still down"

Pop quiz! (2, Insightful)

8127972 (73495) | more than 7 years ago | (#18812523)

Which is worse:

A) The fact one piece of software took down their environment.
B) Their failover plan didn't work.
C) All of the above.
D) None of the above.

Personally, I vote for "B". Face it, s**t happens. But when you plan for s**t happening and the plan doesn't work, that's a VERY bad thing.

Re:Pop quiz! (1)

savanik (1090193) | more than 7 years ago | (#18812591)

You mean they had a failover plan? I didn't see any evidence of it from where I was standing.

what failover plan .. (1)

rs232 (849320) | more than 7 years ago | (#18813363)

B) Their failover plan didn't work.

What failover plan? And that's assuming what they said really happened...


HAAAA HAAAA (0)

Anonymous Coward | more than 7 years ago | (#18812551)

Nobody should be allowed to charge for anything or make any money ever!

Information wants to be free!

Serves them right!

Screws fall out (1)

Weaselmancer (533834) | more than 7 years ago | (#18812611)

It's an imperfect world. Now, show Dick some respect!

Testing of Complex Systems (4, Insightful)

Fritz T. Coyote (1087965) | more than 7 years ago | (#18812659)

I love the (Friday) morning quarterbacks who will now proceed to beat up RIM for a system outage after a 'non critical' upgrade.

And a bunch of suits will want the heads of the technicians responsible.

I feel for them, I really do.

A few years ago I put in a minor maintenance change that made headlines for my employer.

This is a natural result of the budgetary constraints we have to live with in the real world. Testing and certification are expensive, and the more complex the environment, the more expensive they get. It is difficult to justify a full-blown certification test for minor, routine maintenance, unless you are talking about health and safety systems. So a worst-case event occurred: RIM suffers some corporate embarrassment, some low-level techs will get yelled at and possibly fired, and a bunch of people had to suffer crackberry withdrawal.

Nobody died. No planes crashed. No reactors melted down.

RIM will work up some new and improved testing standards, and tighten the screws on system maintenance so much that productivity will suffer, they may even spend a bunch of money on the equipment needed to do full-production-parallel certification testing. And then in a year or so cut the budget to upgrade the certification environment as 'needless expense', and come up with work-arounds to reduce the time it takes to get trivial changes and bugfixes rolled out.

I wish them luck. Especially to the poor sods who did the implementation.

At least when I did my 'headline-making-minor-maintenance' it only made the local papers for a couple of days.

Re:Testing of Complex Systems (2, Insightful)

slashbob22 (918040) | more than 7 years ago | (#18813317)

Nobody died. No planes crashed. No reactors melted down.
You are safe on the planes crashing and on the meltdowns. I didn't hear of any such incidents.

However, I will argue that the outage may have contributed to deaths. There are many hospitals which use Blackberries instead of pagers (2-way comms), so paging a surgeon or doctor or other staff to an emergency may not have worked well. I am sure there are other examples of critical applications (which should or should not use Blackberries) that may have been affected. The obvious thing is that I cannot provide stats, because they certainly aren't available - but saying that nobody died would be a gross overstatement.

On a lighter side, other casualties may have been caused by crackberry withdrawal: people walking into walls because they aren't used to walking without reading their Blackberry, people jumping out of buildings because they can't get their latest stock quote, etc.

Re:Testing of Complex Systems (1)

jon_joy_1999 (946738) | more than 7 years ago | (#18814655)

However, I will argue that the outage may have contributed to deaths. There are many hospitals which use Blackberries instead of pagers (2-way comms), so paging a surgeon or doctor or other staff to an emergency may not work well.
You can't be serious. I mean, come on. They (hospitals, surgeons, doctors, other staff) don't have phones, or public address systems? That sounds like malpractice suits waiting to happen.

"yes, your honor, we called and called and called our doctor to schedule an appointment to have our daughter's ingrown toenail removed, but he didn't respond. when she got gangrene we had to take her to the hospital emergency room, and there, we had to wait 12 hours before a surgeon arrived, during that time our daughter died"

"your honor, we couldn't reach any doctors or surgeons, or any medical staff due to the fact that our blackberry service was not responding properly, despite numerous tech support calls that did nothing to fix the problem. we can't be at fault for our blackberry service failing and causing the death of this young girl"

you can't rely on a single system for critical functions (which is why aircraft have backup power systems and triple redundant flight control systems and warning systems)

the REAL reason.... (3, Funny)

markana (152984) | more than 7 years ago | (#18812697)

>...the failure was trigged by 'the introduction of a new, non-critical system routine' designed to increase the system's e-mail holding space.
    :
>The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options.

Hmmm... so when they wiped the incriminating e-mails from the system (which would certainly create more space), they took the rest of the system down (which prevented anyone else from grabbing copies).

I'm reading WAY too many conspiracy novels these days :-)

(Not that I think this actually happened - but it makes for a great plotline).

Re:the REAL reason.... (0)

Anonymous Coward | more than 7 years ago | (#18816081)

Truth is stranger than fiction. "THE FACTS BEHIND THE WHITE HOUSE EMAIL SCANDALS" [citizensforethics.org] states,

"Even though DOJ sent White House a preservation request for records related to CIA leak investigation in September 2003, RNC continued to purge all emails every 30 days until August 2004"

Here's the complete fact sheet as of today.

There are two separate email scandals:

- Top White House officials use of RNC email accounts and RNC destruction of those emails
- Five million EOP emails missing from White House (EOP) server from period 3/03 to 10/05

1. RNC Email Scandal:

- Top White House officials, including Karl Rove, used RNC and other outside email accounts to conduct White House business

- Those officials took no steps to ensure that the emails were preserved, as the Presidential Records Act requires

- Emails show that officials were aware that if they used outside email accounts, their email messages would not be preserved

- Even though DOJ sent White House a preservation request for records related to CIA leak investigation in September 2003, RNC continued to purge all emails every 30 days until August 2004

2. White House Email Scandal:

- In late 2001 or early 2002, Bush administration discontinued automatic email archiving/preservation system put in place by Clinton administration (ARMS)

- Bush administration failed to put another system in place that would appropriately and effectively save email records in a records management system

- Instead, Bush administration extracts email messages from the EOP server and stores them in files on a file server

- There are no effective internal controls on this system to ensure complete set of messages; messages can be modified or deleted

- In October 2005, White House discovered emails were missing from this system, briefing White House Counsel (Harriet Miers) on the problem as well as Special Counsel Patrick Fitzgerald's staff

- EOP's Office of Administration (OA) did independent analysis to determine the extent of the missing email problem and found hundreds of days of email missing between March 2003 and October 2005, for a rough total estimate of five million missing emails

- White House Counsel was briefed on this and given plan of action to recover missing emails

- White House never implemented plan to recover missing emails (even in face of preservation order from DOJ)

- White House has still not put effective email archiving system in place, even though it knows current system is not effective and has led to at least five million missing emails

Bush administration is still not telling the truth:

- Dana Perino has said the problem with the EOP server occurred when the White House switched from Lotus Notes to Microsoft Outlook; this is untrue; emails are missing for a 2½-year period starting in March 2003 and ending in October 2005

- Dana Perino has said there was no intentional loss of any document, but by October 2005 the White House knew the system wasn't working and knowingly and willfully refused to implement a plan to recover the five million emails missing from the EOP server, instead leaving in place a system that does not work

- Dana Perino has said the system was set up to comply with the Presidential Records Act by automatically preserving EOP emails, but the White House is using a system that doesn't effectively preserve email, doesn't comply with archiving standards (see 36 C.F.R. Part 1234, guidance for preserving email under the Federal Records Act) and doesn't work (e.g. five million missing emails)

More details (3, Informative)

kbahey (102895) | more than 7 years ago | (#18812735)

I live in Waterloo, and have friends and acquaintances who work at RIM. Talking to one of them who got called that night, he says that it started with a vendor issue, and then RIM's software did not react well to that issue.

Of course he would not elaborate more on what it is.

This Computer World article [computerworld.com] has more detail.

The outage lasted about 12 hours overnight Tuesday for BlackBerry users mainly in North America, RIM and users reported.

RIM said a fail-over system designed to stop the impact of such a problem did not work as expected, either. The company apologized to its 8 million users. RIM added that security and capacity issues were not the cause of the outage.

"RIM has determined that the incident was triggered by the introduction of a new, noncritical system routine that was designed to provide better optimization of the system's cache," RIM officials said in a statement.

"The system routine was expected to be nonimpacting with respect to the real-time operation of the BlackBerry infrastructure, but the pretesting of the system routine proved to be insufficient," the statement said.

The new system routine "produced an unexpected impact and triggered a compounding series of interaction errors between the system's operational database and cache," according to the statement. "After isolating the resulting database problem and unsuccessfully attempting to correct it, RIM began its fail-over process to a backup system."

RIM described the backup system inadequacies this way: "Although the backup system and fail-over process had been repeatedly and successfully tested previously, the fail-over process did not fully perform to RIM's expectations in this situation and therefore caused further delay in restoring service and processing the resulting message queue."


I don't believe it .. (1)

rs232 (849320) | more than 7 years ago | (#18813659)

it started with a vendor issue, and then RIM's software did not react well to that issue.

Given the nature of the technology, I find the explanation of a 'fail-over' system failing to kick in a tad disingenuous. It's not like a generator kicking in when the mains electricity stops. And what kind of design decisions led to an upgrade triggering outages across the whole of North America?

I would have thought they had multiple nodes at multiple locations with no single point of failure. Or at least three redundant and independent systems: a main system, a backup system and a system for testing upgrades. Or is it that, like most commercial companies, they designed the cheapest system possible?

Tell me it's not like the UK's DoH system, where power cuts in Kent led to system outages [silicon.com] in the north of England. It takes real genius to design a distributed database that borks because of a power cut. sarcasm [answers.com].


Re:I don't believe it .. (1)

SuiteSisterMary (123932) | more than 7 years ago | (#18814637)

I would have thought they had multiple nodes at multiple locations with no single point of failure. Or at least three redundant and independent systems, a main system, a backup system and a system for testing upgrades. Or is it like most commercial companies they designed the cheapest system possible.

Paid for how? Increased service rates? We see things like this all the time. When it's cheque-signing time, they talk up and down about how they understand that they're leaving redundancy or uptime on the table, that they are choosing to trade cost for said redundancy or uptime, and that when it's down, they can wail and gnash their teeth and pull at their hair and talk about how their 'business depends on it!' all they want, and it won't matter, because they haven't paid for it. And the answer is invariably 'Sure, sure, we understand, no problem!'

And what happens the next time something goes down? A wailing and gnashing of teeth and pulling of hair and anguished screams.

Now, in this case, I can also see the failover system being specced for normal system load, but not for the huge backlog of messages. Or maybe even for the backlog, but they weren't expecting that every blackberry user would instantly start sending messages to each person in their contact list saying 'did u get this? Rply if u did!'

PR to IT translation results (2, Funny)

192939495969798999 (58312) | more than 7 years ago | (#18812787)

"insufficiently tested software upgrade" => "untested software upgrade" => "some superstar at RIM changed the CRASH_NETWORK constant from 0 to 1."

Fp tadco... (-1, Offtopic)

Anonymous Coward | more than 7 years ago | (#18812813)

I ever did. It of i7s core

Routine? (1)

NineSprings (1060260) | more than 7 years ago | (#18812835)

"In other news, the wikipedia.org web site screeched to a halt as /. readers rushed to lookup the meaning of the term 'routine' applied in the context of software systems. The RIM public relations department could not be reached for a clarification as to why such an anachronism was used in their announcement."

Chandler: "Quick, we must telegraph President Coolidge!"

Wireless e-mail is a utility? (0)

Anonymous Coward | more than 7 years ago | (#18812841)

I own a Blackberry and could not send/receive e-mail from approximately 8:00 pm to 10:00 am. Considering that I was asleep or showering for 8 hours, that is 6 hours of personal impact. Although I think the outage is unacceptable and shows the fragility of the system, I am surprised at the size of the reaction, even considering the "Crackberry" effect. I guess that wireless e-mail is now seen as a utility like cell phones, land lines and electricity.

Although an explanation would have been better yesterday, as an IT person, I can understand the process: Tuesday night: Panic! Get the system back up ASAP. Wednesday: Investigate exactly what went wrong, monitor systems with extreme diligence, hold your breath. Thursday: Meet with marketing folks to come up with a statement. Thursday night: release statement.

Does anyone have a link to the actual statement that RIM made? 5 minutes of googling could only find articles that quoted the statement.

Those wacky physicists! (0)

Anonymous Coward | more than 7 years ago | (#18812871)

I heard it was a practical joke gone bad at The Perimeter Institute [perimeterinstitute.ca]. Apparently, Lee Smolin was preparing synthetic black holes for the string theorists' offices, and one of them escaped and headed for RIM's control centre.

Non-critical (1)

DeadboltX (751907) | more than 7 years ago | (#18812999)

Their use of the term "non critical" is most likely referring to the nature of the patch. It was an "optional" patch that did not fix any "critical vulnerabilities" or anything like that.

It is quite obvious they were not referring to the criticality of the system which was affected.

It's not as simple as a defective patch (1)

calculatino (588372) | more than 7 years ago | (#18813001)

Many things went wrong at once:
- defective patch
- automated and manual testing missed the defect
- defective patch rolled out to a huge portion of the user base at once
- rollback failed/ineffective

On the flip side, some things went right:
- no data loss; messages delayed instead of lost
- BBs continued to function properly while the server was offline
- phone functions unaffected
- firms running their own BES servers unaffected
- lastly, even with this outage, they're still offering lots of 9's of availability.
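
As a rough back-of-envelope check on that last point (not RIM's actual figures): a single 12-hour outage, by itself, caps a year's availability at about 99.86%, which is already short of "three nines".

    # Back-of-envelope check on the "lots of 9's" claim: one 12-hour outage in a
    # year by itself limits availability to roughly 99.86%, i.e. short of
    # "three nines" (99.9% allows about 8.8 hours of downtime per year).

    hours_per_year = 365 * 24            # 8760
    outage_hours = 12
    availability = 1 - outage_hours / hours_per_year
    print(f"{availability:.4%}")                                   # ~99.8630%
    print(f"three nines budget: {hours_per_year * 0.001:.1f} h/yr")  # ~8.8 h/yr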

Re:It's not as simple as a defective patch (0)

Anonymous Coward | more than 7 years ago | (#18813133)

People running their own BES were affected. A BES server talks to RIM's NOC to get messages out to the handhelds. With RIM's NOC down, no messages were flowing.

Re:It's not as simple as a defective patch (1)

BAKup (40339) | more than 7 years ago | (#18814757)

Sorry, you are completely wrong about the companies with a BES server being unaffected. All that the BES server does is interface between the company's mail system and the servers at RIM to send emails over to the users' Blackberries. Our company was without Blackberry email for over 6 hours. Good thing I was able to say it was a global issue with Blackberries, and my bosses went, "ok, thanks for keeping me informed about it."

Foolish, foolish VMware (0)

Yeechang Lee (3429) | more than 7 years ago | (#18813061)

Don't VMware's admins know to turn Automatic Updates off in the copy of Windows ME that the Blackberry backend runs on?

Ship Dates (1)

cowass (872106) | more than 7 years ago | (#18813135)

The problem nowadays with the product life cycle is ship dates. I have seen time and time again where something was shipped based on a date. This all comes from a product/project cycle that is based on software that is shipped to customers. When you develop a service/product that runs on the internet, developers need to have the mindset of "develop to run, not to ship". This is what I preach every day as an operations manager for a large online e-commerce site.

what ever happened to no single point of failure . (1)

rs232 (849320) | more than 7 years ago | (#18813163)

Whatever happened to no single point of failure? And since when do you update a live system? Has no one learned anything in the past decade?

Reminds me of when a mobile phone company upgraded over the weekend and everyone discovered you could make long-distance phone calls for free.

Re:what ever happened to no single point of failur (1)

FrameRotBlues (1082971) | more than 7 years ago | (#18815357)

Uh, I think the RIM BES runs constantly, 100% of the time. They'd have to update a live system if they were going to update anything at all. Otherwise, depending on the frequency of updates, I'd imagine BB users would be pissed if they had flaky BBs for an hour every week. And the "centralization" aspect is what a lot of people are wondering about.

Re:what ever happened to no single point of failur (1)

lgw (121541) | more than 7 years ago | (#18816019)

So you update your fully redundant production system, then test that production system a bit just to make sure, then flip live traffic over to it. If everything looks good you upgrade the system that was previously live; otherwise you flip back.

This isn't even hard (unless, of course, you really have learned nothing in the past 10 years and don't have a fully redundant production system hot at all times).
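
A bare-bones sketch of that procedure, with hypothetical function names standing in for whatever tooling a shop actually has: upgrade the standby stack, check it, flip, and flip straight back if the checks fail under real traffic.

    # Rough sketch of the upgrade procedure described above, with hypothetical
    # callables (apply_upgrade, health_ok, route_traffic_to) supplied by the caller.

    def upgrade_with_flip(live, standby, apply_upgrade, health_ok, route_traffic_to):
        apply_upgrade(standby)                 # never touch the stack serving users
        if not health_ok(standby):
            print("standby failed pre-flight checks; live system untouched")
            return live
        route_traffic_to(standby)              # the flip
        if not health_ok(standby):
            route_traffic_to(live)             # the flip back
            print("new stack misbehaved under real traffic; reverted")
            return live
        # Standby is now live; the old stack becomes the next upgrade target.
        return standby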

Sounds more like... (1)

guruevi (827432) | more than 7 years ago | (#18813255)

...somebody forgot the ~ in rm -rf ~/

Adding storage space to a single system shouldn't be a problem, since you take your system down for that anyway (or put it in spare mode or so), even if it's a hotplug-always-on-superfast-resizing-raid-with-automatic-failover-and-d2d2t2brain system. That it takes the whole network down is a problem.

Problem while shredding Karl Rove's e-mails? (0)

Anonymous Coward | more than 7 years ago | (#18813309)

Maybe someone could've told them that erasing (shredding) files and unused disk space can grind the system to a halt.

If anyone said something like this five years ago, I'd accuse them of being a tin-hat wearing paranoid fool. But times have changed.

There are too many things, such as the unprecedented use of signing statements, abuse of the Patriot Act, death of investigative journalism (replaced by partisan pundits disguised as reporters), Valerie Plame's outing, and unchecked kleptocracy going on that turns trusting people into cynics.

When hearing about lost e-mails on TV, I think "e-mails get lost all the time" but when I read detailed reports, the facts clearly show that it could not have been an accident. For example, this report shows facts about the lost e-mails that should be unacceptable to Democrats, Republicans, and independents alike:

WITHOUT A TRACE: THE MISSING WHITE HOUSE EMAILS AND THE VIOLATIONS OF THE PRESIDENTIAL RECORDS ACT

http://www.citizensforethics.org/node/27607 [citizensforethics.org]

If you read the report and know that people in Washington use Blackberries, how could you not wonder if the recent outage was caused by attempts to destroy evidence?

Congress would have to be completely blind if they don't immediately contact RIM and have them confirm under oath that no evidence was destroyed.

Given that Karl Rove is known to be a Blackberry user, Congress would have to be incompetent to ignore this incident. Please give them a clue by contacting your representative by e-mail or fax or phone!

RIM won't be a fun place to work in anymore (1)

cyberianpan (975767) | more than 7 years ago | (#18813627)

I pity the RIM staff now. I work at a client that has had two "bad headlines" incidents. Understandably they are now highly risk-averse - up to 12 sign-offs required for minor changes; documentation-to-code ratio is >> 20:1; 5 chiefs to 1 Indian on many projects...

The public is ignorant as to what causes IT problems - even if RIM upgrade their QA process to "better than normal", no one will forgive them if lightning strikes twice. Thus RIM are likely to bring in extraordinarily restrictive processes. If I were a creative developer or solutions architect at RIM, I'd be looking for a new job.

Been there - it was my first maintenance callout (1)

Two99Point80 (542678) | more than 7 years ago | (#18813759)

I was made a maintenance programmer fresh out of training, and before long I got my first callout (rather dated jargon follows): one of our routine nightly batch jobs ABENDed with an S0C7 (data error) while opening (not even reading yet) a file, and in I went at about 2 AM. But what bad customer data could there be in a COBOL OPEN statement...?

Turned out one of our contract software guys had made a simple change to the file retention period - so trivial, he said, there was no need to test it. He was rather chagrined the next day.

Yeah, this was a long time ago - 1973 or so. But some cherished principles hold up pretty well, such as: Test the damn "trivial" changes!
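
In the spirit of that moral, here is what a "test the trivial change" check might look like today (a sketch in Python rather than 1973 COBOL, with an invented helper name): even a one-field retention change gets a pre-deploy smoke test that exercises the exact step that blew up at 2 AM, opening the file.

    # Illustrative only: a tiny pre-deploy smoke test for a "trivial" change to a
    # file retention setting. The nightly job died on OPEN, so the cheapest useful
    # test is simply to open the file. The helper name is invented for this sketch.

    def smoke_test_retention_change(path, new_retention_days):
        assert isinstance(new_retention_days, int) and new_retention_days > 0, \
            "retention period must be a positive whole number of days"
        with open(path, "rb") as f:   # the step that ABENDed: opening the file
            f.read(1)
        print(f"ok: {path} opens cleanly with retention={new_retention_days}")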

Re:Been there - it was my first maintenance callou (0)

Anonymous Coward | more than 7 years ago | (#18814393)

so trivial, he said, there was no need to test it.
Rule #1: There is nothing so simple that you can't fsck it up!

living proof that QA matters... (2, Insightful)

Ralph Spoilsport (673134) | more than 7 years ago | (#18813771)

If the product had been properly tested (and face it - outside of medical and military applications, how much of ANYTHING is properly tested?) they'd have found, reported, and fixed the bug weeks earlier.

You can't expect programmers to do perfect work, even with unit testing and all the other basic amenities of software development. It requires QA, and that is something sorely lacking in contemporary software products. From the smallest OS X widget to MS Vista, testing matters.

RS
