
Hospital Brought Down by Networking Glitch

michael posted more than 11 years ago | from the risks-digest dept.


hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the proposed long-term solution is apparently to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"


Well! Woopsy! (1, Interesting)

uberred (584819) | more than 11 years ago | (#4766983)

This is almost too good... could someone have hacked in to their network and deliberately taken it down?

Re:Well! Woopsy! (5, Funny)

Iamthefallen (523816) | more than 11 years ago | (#4767029)

Yes, I believe we should rush to conclusions and blame it on foreign terrorists since there is nothing suggesting terrorism, and that just proves that they're extremely sneaky.

You may now begin to panic in an orderly fashion, thank you.

Re:Well! Woopsy! (4, Interesting)

hey! (33014) | more than 11 years ago | (#4767070)

I don't think that deliberate malicious action is a very likely cause. The article wasn't for technical folk, so it's anyone's guess; mine is that the network grew gradually to the point where it couldn't be restarted. You can always add a few nodes to a large network, but it isn't necessarily possible to start such a network from a dead stop. Probably a handful of well placed routers would have prevented this.

However, a network like this could be life-critical, and there probably should be contingencies for a variety of circumstances, including deliberate subversion.

Irresponsibility at its finest (-1, Offtopic)

ekrout (139379) | more than 11 years ago | (#4766989)

I read that in Soviet Russia, hospitals brought down networks.

-1 leading questions (n/t) (0)

Karamchand (607798) | more than 11 years ago | (#4766990)

i said n/t

Problem was with an application, (5, Insightful)

Anonymous Coward | more than 11 years ago | (#4766996)

according to the coverage in the printed 11/25/02 Network World magazine I read yesterday. My immediate reaction was that this person who brought down the net using his research tool should not have been using a production network.

Large campus networks hosting extremely critical live applications may need to be subdivided by more than a switch, yes.

Re:Problem was with an application, (5, Insightful)

cryptowhore (556837) | more than 11 years ago | (#4767050)

Agreed. I work for a bank and we have several environments to work in, including multiple UAT, SIT, and performance testing environments. Poor infrastructure management.

Re:Problem was with an application, (-1, Flamebait)

Anonymous Coward | more than 11 years ago | (#4767121)

Get lost, moron! Banks are not critical. No matter how important it is for you to screw people's wallet.

Re:Problem was with an application, (5, Interesting)

sugrshack (519761) | more than 11 years ago | (#4767067)

that's a good initial assumption, however my experience with similar issues tells me that you can't pin all of this on one person.

Yes, this person should have been using an ad hoc database (assuming one is set up), however access to various things like this tends to get tied up due to "odd" management practices.

realistically a backup network sounds good, however there are other ways around this... it could have been prevented with correct administration of the network itself; for instance, in Sybase systems, there are procedures set up to handle bottlenecks like this. (of course, I could be talking out of my a$$, as I'm one of those people without real access anyway... far from root... more like a leaf).

Re:Problem was with an application, (4, Insightful)

Anonymous Coward | more than 11 years ago | (#4767108)

So a researcher with a workstation isn't allowed to use the network to do his job? No, this stems from incompetence on the part of the network engineering team.

I don't buy it (5, Insightful)

hey! (33014) | more than 11 years ago | (#4767168)

The same explanation was floated in the Globe, but I don't buy it.

People doing debugging tend to fasten onto an early hypothesis and work with it until it is proven definitively false. Even if jobs aren't on the line, people often hold onto their first explanation too hard. When jobs are on the line, nobody wants to say the assumptions they were working under for days were wrong, and some people will start looking for scapegoats.

The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation, it would suggest to me bad design. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me the design was faulty.

One thing I would agree with you on is that the hospital probably needs a separate network for life-critical information.

FUck it's hot (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4767001)

GODDAMN IT'S HOT!!!

and i have to stare this friggin' microsoft windows gui all day

an identical network (0)

Anonymous Coward | more than 11 years ago | (#4767003)

Having an identical network would almost be like RAIDing several hard drives to have the data backed up (RAID 0+1, I think). It would almost guarantee a connection unless of course they both go down. But how likely is that? :)

scapegoat

This is what you call... (2, Funny)

Anonymous Coward | more than 11 years ago | (#4767005)

... "an old boys' network"

That's why I hate automatic routing (-1, Troll)

Hairy_Potter (219096) | more than 11 years ago | (#4767007)

too often you'll run into strange glitches like this.

It takes me a bit more time to manually edit all my routing tables on my Ciscos and Suns, but I feel the homemade touch makes up for the extra time.

Re:That's why I hate automatic routing (3, Insightful)

parc (25467) | more than 11 years ago | (#4767081)

And your change in routing policy is going to affect spanning tree how?

How do you handle mobile users? What about dialup static IP addresses from multiple RAS devices?
Hand-editing of routing tables works only in the most simple of networks.

Re:That's why I hate automatic routing (0)

Anonymous Coward | more than 11 years ago | (#4767150)

Actually we have a horribly complex network (Australian national network, multiple extranets to governmental offices, a national dialup service, plus DialConnect global roaming hook-up). It's 95% static routed.

Management is a key issue, with tools to aid deployment next. Static routing in large networks is not impossible; sometimes you have to set limits and miss out on some "cool" features.

Probably having old school network engineers is a big part of this setup. They don't like giving up control to automated systems.

Re:That's why I hate automatic routing (5, Interesting)

Swannie (221489) | more than 11 years ago | (#4767115)

Routing has nothing to do with this; spanning tree is a layer two function, and is responsible for allowing multiple links and redundancy between switches in a network. A properly set-up network running properly set-up spanning tree works wonderfully. Unfortunately, many, many people play with things they don't understand (on a production network, no less).

This whole situation arises from poor training and poor design. Having several friends who work in hospitals, I know that they typically don't offer a lot of money for IT/network jobs, and this is what happens when underpaid (read: inexperienced) people are allowed to run such a network.

Done ranting now. Can you tell I was laid off a while ago and am now stuck in a contract with a network designed by a bunch of inexperienced people? :)


Swannie

No. (5, Interesting)

Clue4All (580842) | more than 11 years ago | (#4767009)

do you think the answer to having a massive and unreliable network is to build a second identical network?

No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.

Re:No. (0)

Anonymous Coward | more than 11 years ago | (#4767061)

They don't break on their own either.

Re:No. (0)

Anonymous Coward | more than 11 years ago | (#4767116)

Shaddup.

Re:No. (2)

passion (84900) | more than 11 years ago | (#4767135)

Good idea; the problem is that most institutions don't do enough regression testing to see if *absolutely everything* is working. Oh sure, my cat's webpage with the 3-D rotating chrome logo still loads, but what about the machine that goes ping keeping Mr. Johnson alive just down the hall?

Re:No. (5, Informative)

Anonymous Coward | more than 11 years ago | (#4767153)

As an employee at BIDMC (the Beth Israel Deaconess Medical Center), I can tell you that they did not just install a parallel network. The first network was completely redesigned to be more stable, and once it proved its stability, a second redundant network was put in place to ensure that if the network ever became unstable again for any reason, there was a backup known to work immediately, instead of having to wait to fix the original again. Most of the housestaff at BIDMC were already familiar with the paper system, as the transition to paperless had only occurred over the last two years and in stages. The real problem was obtaining lab and test results, as these have been on computer for years.

Re:No. (5, Insightful)

barberio (42711) | more than 11 years ago | (#4767163)

The problem here is that it will take days, maybe weeks to do this. Hospitals want the data flowing *Now*.

So the answer is - Yes. In a situation where 100% uptime is demanded, the only solution is redundant systems.

Of course it can help (2, Insightful)

Anonymous Coward | more than 11 years ago | (#4767010)

Yes, a second, fully redundant network would be "good" from a stance of giving better fail-over potential.

But will anyone know when one network fails? If not, then how will they fix it? If they don't fix it, then doesn't that mean that they really only have one network?

Which puts them right back to where they were.

Of course, if they put a redundant network in, then fix their problems to try to prevent this issue happening in future, then they'll be in much better shape the next time their network gets flushed with the medical waste.

Re:Of course it can help (1)

dprior (221280) | more than 11 years ago | (#4767092)

I'm pretty sure they'll realize their network is down when they are forced to start running around looking for paper forms again. That might clue 'em in.

Well... (1)

REBloomfield (550182) | more than 11 years ago | (#4767011)

If the first one's bust, how's a second going to help? :) I must admit that redundancy is a wonderful thing for servers, power supplies, etc., but for infrastructure? Having identical copies of routers kicking around is extremely useful, but cost effectiveness comes into play. If you can afford it, I can't argue with the logic.

Networked hospitals (0)

Anonymous Coward | more than 11 years ago | (#4767018)

Hmm, a second parallel system. Would this include parallel wiring closets? I suspect that the cost involved (I once worked on a project team that was merely replacing wiring at a hospital, and it took 6 months) would have them continue to use existing wiring runs. You have now created a single point of failure for *both* networks.

For those who think that a hospital wouldn't cut corners in that way, think again. I know what we had to do with our project, and I for one will never let anyone I know stay at that hospital. If they were willing to cut there, where else will they cut?

Anon Coward

friggin windoze users (0, Flamebait)

kraksmoka (561333) | more than 11 years ago | (#4767019)

that's what u get when u sign onto monopolyware. fact is, with all the fancy toys that docs use like MRI and tomography, i haven't met one that knows anything about a computer. in fact they were probably glad their stuff crashed. in fact, it was probably a setup to get the old system back! lousy docs :(

Re:friggin windoze users (2)

danheskett (178529) | more than 11 years ago | (#4767102)

that docs use like MRI and tomography, i haven't met one that knows anything about a computer
I'll let my doctor worry about curing what's wrong with my brain rather than dealing with high-order complex networking issues, thank you very much.

Major American Bank Outage (5, Informative)

MS_leases_my_soul (562160) | more than 11 years ago | (#4767020)

A Bank in America [;)] had an outage back in 1998 where all their StrataCom switches went down for similar reasons. The Gateway/Network Engineering group had been saying for a couple of years that we needed more redundancy, but senior executives just saw the expense and not the liability ... until every single StrataCom went down.

We had to rebuild the entire network ... it took a week. All non-critical traffic had to be cut off as we pushed everything through the backup T1s and ISDN lines. It cost the bank MILLIONS of dollars.

Suddenly, that backup network was real cheap. They are now quite proud to tout their redundancy.

MONEY (1)

Botchka (589180) | more than 11 years ago | (#4767021)

I would look at making the original network more reliable and what the hell, if the hospital has money to burn, redundancy is a good thing. I didn't read the article. Was this caused by some knucklehead that was testing in a production environment?

The Israeli Way (-1, Flamebait)

joelwest (38708) | more than 11 years ago | (#4767022)

This is the way of the israeli -- if it is broken, paint it. israel is very much a machar - read manyana - society. Things will be done tomorrow and tomorrow never comes.

Re:The Israeli Way (0)

Anonymous Coward | more than 11 years ago | (#4767047)

This is an American hospital in Boston. Geez, if you are going to bash Israel, at least do it with something credible...

Re:The Israeli Way (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4767058)

except when it comes to palestina

Re:The Israeli Way (0)

Anonymous Coward | more than 11 years ago | (#4767089)

How complete of a moron are you?

No wait... you've already answered that in your post:

A total moron!

This hospital is not in Israel, it's in Boston, Massachusetts. Try reading the article before wasting everyone's time with your idiocy.

Re:The Israeli Way (0)

Anonymous Coward | more than 11 years ago | (#4767098)

oh, put a sock in it already.. the juvenile racism and hatred gets old after a while.

Bobby Lost An Eye (0)

Anonymous Coward | more than 11 years ago | (#4767026)

It's all fun and games until Bobby loses an eye because his doctor couldn't read his forwards.

-Eezy Bordone

Leading question (4, Insightful)

Junks Jerzey (54586) | more than 11 years ago | (#4767028)

do you think the answer to having a massive and unreliable network is to build a second identical network?

Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?

Re:Leading question (1, Offtopic)

danheskett (178529) | more than 11 years ago | (#4767071)

I agree. It's really lame to have these snippy, loaded, last-sentence, puerile attachments to the actual story. It has really been noticeable lately. It is even worse in my opinion when the jab is put in the form of a question.

The editors could make /. a bit nicer and more enticing to read without allowing submitters to instantly set the tone in such a slanted way.

Spanning tree (2, Interesting)

skinfitz (564041) | more than 11 years ago | (#4767030)

do you think the answer to having a massive and unreliable network is to build a second identical network?"

I think the answer is to disable spanning tree.

We had a similar problem here (large academic installation, hundreds of workstations, several sites) with things (before my time, I hasten to add) being one Big Flat Network (shudder) using primarily IPX and Novell. Needless to say this was not good. I've since redesigned things using IP and multiple VLANs, however there is still the odd legacy system that needs access to the old net.

My solution was to tap the protocols running in the flat network and to put these into VLANs that can be safely propagated around the layer 3 switched network and presented wherever we wish. The entire "flat" network is tapped into a VLAN, and the IP services running on it are routed into the rest of the network. If there are problems with either network and it gets that bad, we just pull the routes linking the two together.

Re:Spanning tree (2)

zyglow (585790) | more than 11 years ago | (#4767091)

Adding on to the VLAN idea, I'd also change the routing protocol to OSPF. They would be squandering a lot of money to run two networks side by side.

Re:Spanning tree (5, Interesting)

GLX (514482) | more than 11 years ago | (#4767118)

This would imply that either:

A) A campus could afford to do Layer 3 at every closet switch

or

B) Live without Layer 2 redundancy back to the Layer 3 core.

I'm sure in a healthcare environment, neither is an option. The first is too expensive (unless you buy cheap, and hence unreliable, equipment) and the second is too risky.

Spanning tree didn't cause the problem here. Mismanagement of spanning tree sounds like it caused the problem.

Spanning tree is our friend, when used properly.

Flat networks. (2)

zerofoo (262795) | more than 11 years ago | (#4767122)

Do your VLANS share the same physical cable? If so, how are they connected? Do you use a one-armed router?

-ted

Re:Spanning tree (3, Insightful)

TheMidget (512188) | more than 11 years ago | (#4767126)

I think the answer is to disable spanning tree.

On a network as complex and messy as theirs? That's basically the situation where you need spanning tree, or else it just crumbles to dust once they do produce a loop...

Re:Spanning tree (3, Insightful)

AKnightCowboy (608632) | more than 11 years ago | (#4767149)

I think the answer is to disable spanning tree.

Are you talking about a different spanning tree protocol than the one I think you're talking about? Spanning tree is a very good thing to run to stop loops exactly like this. More than likely one of the hospital network techs misconfigured something and ended up disabling it (portfast on two access points linked into another switch accidentally, or a rogue switch?).

Are you crazy? (2, Insightful)

AriesGeek (593959) | more than 11 years ago | (#4767160)

Disable STP? And create, or at least take the risk of creating bridging loops? That will bring the network right back down to its knees!

No, disabling STP is NOT an option. Learning how to use STP properly is the option.

Hospital Systems (4, Informative)

charnov (183495) | more than 11 years ago | (#4767033)

I also used to work at a teaching hospital (Wishard, for Indiana University) and I learned more there about networking and systems support than in years of college. I remember one day we found a still-used piece of thick-net (you know... old firehose). It was connecting the ambulance office's systems to the rest of the hospital. The rest of the hospital ran on DEC VAX clusters and terminals. To be fair, they have gotten much better (I don't work there anymore either), but this wasn't the first hospital network I had seen that truly terrified me, and it hasn't been the last.

Re:Hospital Systems (1)

charnov (183495) | more than 11 years ago | (#4767065)

Oh yeah...Hey Joe...good luck out there...heh.

A second (unreliable) network? (4, Insightful)

shrinkwrap (160744) | more than 11 years ago | (#4767037)

Or as was said in the movie "Contact" -

"Why buy one when you can buy two at twice the price?"

Disaster recovery (4, Interesting)

laughing_badger (628416) | more than 11 years ago | (#4767041)

do you think the answer to having a massive and unreliable network is to build a second identical network?

No. They did everything right. Falling back to paper and runners is the best they could do to safeguard patients' lives. An 'identical' network would be susceptible to the same failure modes as the primary.

That said, hopefully it wasn't really six years since they had run a disaster exercise where they pretended that the computers were unavailable...

Um.. (4, Insightful)

acehole (174372) | more than 11 years ago | (#4767049)

In six years they never thought to have a backup/redundant system in place in case of a failure like this?

Even the best networks will come unglued sooner or later. It's surprising to see that most businesses' networks need prime operating conditions to function properly.

Using Windows I bet (-1, Troll)

PhysicsGenius (565228) | more than 11 years ago | (#4767051)

I've found that Winblows 2000 has a lot of problems in the networking subsystem. My network is constantly going down and the only Wincrap machines are on the secretaries desk! Those Winjunk programmers must sure be terrible to be able to take down a hospital like this. Somebody should hold them responsible, I call boycott!!!

How many domain controllers? (2)

Hairy_Potter (219096) | more than 11 years ago | (#4767101)

If you're just using a Primary Domain Controller, that could be your problem. I'd recommend adding a backup PDC, as well as a Tertiary Domain Controller, and add an X.25 backup network layer to give you hot-swappability and real-time rollover capabilities.

2nd network (4, Insightful)

Rubbersoul (199583) | more than 11 years ago | (#4767053)

Yes, I think having a 2nd network for a vital system is a good idea. This sort of thing is done all the time for things like fiber rings, where you have a working path and a protect path. If the primary working path goes down (cut, maintenance, whatever), then you roll to the protect path. Yes, it is a bit more expensive, but in a case like this maybe it is needed.

the sad part (1, Offtopic)

tps12 (105590) | more than 11 years ago | (#4767057)

This event has a lesson for us. Of course, I expect the Slashdot response to be something along the lines of "they should have used Linux," but the true fact is that all technology, even Linux, is unreliable. Rather than dicking around with which OS can provide the best network, we should accept that none of them provide the robustness necessary for things like hospitals and fire departments, and what we really need is to reduce our dependency on technology altogether. If the hospital had been paper-based, this tragedy would not have occurred.

and now.. (0)

Anonymous Coward | more than 11 years ago | (#4767062)

And now their server gets slashdotted; administrators run around trying to work out what to do, rebooting NT boxes. Well, the article is on the Boston Globe, so their server is okay.

Politics (1, Insightful)

Anonymous Coward | more than 11 years ago | (#4767069)

I work at a med school / hospital, and in my experience some of the greatest issues are political ones. The school is not for profit, but the hospital is privately owned. The outcome? The school gets fleeced - imagine paying over $50 a month for a port charge! The hospital should have enough money from that to build an adequate network... but that assumes that the focus is in the correct place. All too often the focus is on politics (in a place full of PhDs and MDs, the whole driving force is political power and reputation) instead of technology. The network suffers while the senior officers buy new handmade mahogany desks, that sort of thing.


Doesn't really matter. If you had to deal with Med Students as we do, you'd die before you went to the doctor. Trust me.

Reliability is inverse to the number of components (1)

ChimChim (54048) | more than 11 years ago | (#4767075)

Ok, so here's an SAT question for ya:

IF you have one train going from NY->LA that's likely to break down 10% of the time, and you get a second identical train going in the opposite direction, what's the probability that one of the trains will fail?

(number of trains) * (probability of failure)
= 2 * .10
= 20%

The more components in the system, the more likely it is that parts of the system will be down. This isn't to say that the extra redundancy isn't useful, but it doesn't give you more reliability... it decreases it. So additional management costs are incurred in making sure that enough redundancy is always available to compensate for parts of the system that are down, and in replacing bad components.

Re:Reliability is inverse to the number of compone (1)

Kajakske (59577) | more than 11 years ago | (#4767127)

Nice math, but the point here is that only 1 train has to arrive, thus in those 20% we can still safely travel.

Re:Reliability is inverse to the number of compone (4, Insightful)

Xugumad (39311) | more than 11 years ago | (#4767137)

However, the probability of both failing at the same time is:

0.1 * 0.1 = 1%

So as long as it can run on just one out of two, you get a ten-fold increase in stability.

Re:Reliability is inverse to the number of compone (2, Informative)

pknoll (215959) | more than 11 years ago | (#4767139)

Sure, but that's not the point of redundancy. The question you want to ask is: how likely is it that both redundant components will fail at the same time?

That's how mirrored RAID arrays work: you increase your chances of a disk failure by adding more disks to the system due to probability; but your chances of recovering the data in the event of a crash go up, since more than one disk failing at once is unlikely.

Re:Reliability is inverse to the number of compone (0)

Anonymous Coward | more than 11 years ago | (#4767145)

The probability that one train will fail is still 0.1. It is irrelevant how many trains there are; the probability that any given one will fail is 0.1 (assuming, of course, that the trains fail independently). The probability that both trains will fail simultaneously is 0.1 * 0.1 = 0.01.

Re:Reliability is inverse to the number of compone (2)

Alranor (472986) | more than 11 years ago | (#4767170)

I'm a little confused here:-

Prob train A fails = 0.1
Prob train B fails = 0.1

Prob train A doesn't = 0.9
Prob train B doesn't = 0.9

So Prob neither fail = 0.9 * 0.9 = 0.81

So prob at least one fails = 0.19 = 19%

One of us has got the maths wrong.
Can someone who's not trying to remember his stats courses from years back tell me if it's me :)

Re:Reliability is inverse to the number of compone (1)

Skater (41976) | more than 11 years ago | (#4767173)

It depends on how it's set up. I think of it in terms of parallel or serial wiring. Your example is serial, in that if one goes down they both go down, thereby decreasing reliability. If you ask the question a different way, such as "What is the possibility that both trains break down" (i.e., parallel--if one goes down it doesn't affect the other one), the probability is .10*.10, which is .01: more reliable.

--RJ

Re:Reliability is inverse to the number of compone (2)

dago (25724) | more than 11 years ago | (#4767178)

I don't know what the SAT is, but I think you made some mistakes.

If your 10% is the probability that one train will fail during the NY -> LA trip, then you've got the following probabilities:

0 trains fail = 0.9 * 0.9 = 0.81
1 train fails = 2 * 0.1 * 0.9 = 0.18
2 trains fail = 0.1 * 0.1 = 0.01

which means that the probability of having at least one train make it from NY -> LA is ... 99%, much better than the previous 90%.
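
For anyone who wants to check the arithmetic in this sub-thread, here is a minimal Python sketch. It assumes exactly two components (trains, networks) that fail independently with probability 0.1, as in the example above; nothing in it is specific to the hospital's setup. The chance that something fails rises to 19%, but the chance that both fail at once, which is what matters for redundancy, drops to 1%.

    # Sanity check of the failure arithmetic discussed above, assuming two
    # independent components that each fail with probability 0.1.
    p_fail = 0.1
    p_ok = 1.0 - p_fail

    p_both_fail = p_fail * p_fail              # 0.01 -> total outage
    p_neither_fails = p_ok * p_ok              # 0.81
    p_exactly_one_fails = 2 * p_fail * p_ok    # 0.18
    p_at_least_one_fails = 1.0 - p_neither_fails   # 0.19, not 0.20
    p_at_least_one_works = 1.0 - p_both_fail       # 0.99

    print(f"both fail:          {p_both_fail:.2%}")
    print(f"at least one fails: {p_at_least_one_fails:.2%}")
    print(f"at least one works: {p_at_least_one_works:.2%}")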

more info, less sensationalism (5, Informative)

bugpit (588944) | more than 11 years ago | (#4767076)

The Boston Globe article was a tad on the sensational side, and did a pretty poor job of describing the technical nature of the problem. This article [nwfusion.com] does a somewhat better job, but is still pretty slim on the details. Almost makes it sound like someone running a grid application was the cause of the trouble.

Senior executives actually worked?!?! (0)

Anonymous Coward | more than 11 years ago | (#4767079)

Senior executives were reduced to errand runners

No shit? It must have been the first time in decades that these guys did any honest work.

Sure it was STP? (1)

53x19 (621459) | more than 11 years ago | (#4767080)

Spanning Tree is a pretty robust protocol. Problems usually arise when admins get impatient with convergence times and start messing with the timers... or enabling features like portfast, backbonefast and the like.

New Technology (0)

Anonymous Coward | more than 11 years ago | (#4767083)

This [216.136.200.194] is what they were testing.

Short answer? No. (2)

krinsh (94283) | more than 11 years ago | (#4767085)

Should there be a few replacement devices on hand for failures? Yes. Should there be backups of the IOS and configurations for all of the routers? Yes. Should this stuff be anal-retentively documented in triplicate by someone who knows how to write documentation that is detailed yet at the same time easy to understand? Yet another yes.

If it is so critical, it should be done right in the first place. If a physically damaged or otherwise down link is ESSENTIAL to the operation or is responsible for HUMAN LIFE, then there should be duplicate circuits in place throughout the campus to be used in the event of an emergency; just like certain organizations have special failover or dedicated circuits to other locations for emergencies.

Last, but absolutely certainly not least: the 'researcher', regardless of their position at the school, should be taken severely to task for this. You don't experiment on production equipment at all. If you need switching fabric, you get it physically separated from the rest of the network, or, if you really need outside access, you drop controls in place, like a firewall, to severely restrict your influence on other fabric areas.

done right in the first place (3, Interesting)

wiredog (43288) | more than 11 years ago | (#4767174)

You've never worked in the Real World, have you? It is very rare for a network to be put in place, with everything attached in its final location, and then never ever upgraded until the entire thing is replaced.

In the Real World, where you can't shut everything down at upgrade time, a PDP-11 connected to terminals was put in 25 years ago. The PDP-11 was replaced with a VAX, which ran in parallel with the PDP-11 while it was brought online. A few years later a couple of PC's (running DOS 3.0) were hooked up to each other via a Novell network, which was connected to the VAX. Ten years ago the VAX was replaced with a few servers, which ran in parallel with the VAX until they were trusted. Along the way various hubs, switches, and routers were installed. And upgraded as the need arose. The cables were upgraded, also as the need arose, and not all at once.

what the hospital really needs is.. (-1, Offtopic)

Anonymous Coward | more than 11 years ago | (#4767086)

a beowulf cluster.

What is spanning tree protocol? (google whoring) (5, Informative)

Anonymous Coward | more than 11 years ago | (#4767088)

Spanning-Tree Protocol is a link management protocol that provides path redundancy while preventing undesirable loops in the network. For an Ethernet network to function properly, only one active path can exist between two stations.

Multiple active paths between stations cause loops in the network. If a loop exists in the network topology, the potential exists for duplication of messages. When loops occur, some switches see stations appear on both sides of the switch. This condition confuses the forwarding algorithm and allows duplicate frames to be forwarded.

To provide path redundancy, Spanning-Tree Protocol defines a tree that spans all switches in an extended network. Spanning-Tree Protocol forces certain redundant data paths into a standby (blocked) state. If one network segment in the Spanning-Tree Protocol becomes unreachable, or if Spanning-Tree Protocol costs change, the spanning-tree algorithm reconfigures the spanning-tree topology and reestablishes the link by activating the standby path.

Spanning-Tree Protocol operation is transparent to end stations, which are unaware whether they are connected to a single LAN segment or a switched LAN of multiple segments.

see this page [cisco.com] for more info
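
To make the "tree that spans all switches" idea above concrete, here is a small Python sketch. It is only an illustration of the concept, not the real 802.1D algorithm (actual STP elects a root bridge via BPDUs and uses port costs and timers, none of which appear here), and the switch names are invented. It keeps a loop-free subset of links active, marks the redundant links as blocked, and activates a previously blocked link when an active one fails.

    # Toy illustration of the spanning-tree idea: given a switched LAN with
    # redundant links, keep only a loop-free subset forwarding and hold the
    # rest in a blocking state, recomputing when a link dies.
    links = [
        ("core1", "core2"), ("core1", "bldg-A"), ("core2", "bldg-A"),
        ("core1", "bldg-B"), ("core2", "bldg-B"), ("bldg-A", "bldg-B"),
    ]

    def spanning_tree(links, dead=frozenset()):
        """Return (forwarding, blocked) links, skipping any failed links."""
        parent = {}
        def find(x):                       # union-find root lookup
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x
        forwarding, blocked = [], []
        for a, b in links:
            if (a, b) in dead:
                continue
            ra, rb = find(a), find(b)
            if ra == rb:                   # link would create a loop -> block it
                blocked.append((a, b))
            else:                          # link extends the tree -> forward on it
                parent[ra] = rb
                forwarding.append((a, b))
        return forwarding, blocked

    fwd, blk = spanning_tree(links)
    print("forwarding:", fwd)
    print("blocked:   ", blk)

    # Kill an active link; a previously blocked link takes over on recomputation.
    fwd, blk = spanning_tree(links, dead=frozenset({("core1", "bldg-A")}))
    print("after failure, forwarding:", fwd)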

Why fly equipment from california?? (1)

Viol8 (599362) | more than 11 years ago | (#4767090)

Did a company as large as Cisco seriously have no appropriate troubleshooting equipment on the WHOLE of the east coast, or anywhere closer than California? What kind of Mickey Mouse support outfit are they running?

Re:Why fly equipment from california?? (2, Interesting)

marklyon (251926) | more than 11 years ago | (#4767134)

They have a huge hot lab in California where they have pre-configured switches, routers, etc. running and ready to go at a moment's notice. When my ISP went down, they sent (same day) three new racks of modems configured with our last known "good" configuration, so all we had to do was unplug, pull, connect.

It would be redundant to have one on each coast, because they were able to get our stuff to us the same day in rural Mississippi.

Re:Why fly equipment from california?? (2)

GLX (514482) | more than 11 years ago | (#4767138)

Because Cisco is very California-centric, and the fact is that when it comes to their switching and routing gear, there is very little "hardware" you can bring in to troubleshoot; it's little more than commodity software loaded onto a commodity PC.

The best thing they had was the input of (hopefully) knowledgeable Cisco engineers. God knows if they relied on Cisco TAC Level 1 support they'd still be down today.

Of course they need another network (5, Insightful)

virtual_mps (62997) | more than 11 years ago | (#4767096)

Why on earth would a researcher be plugged into the same network as time-sensitive patient information? Yes, it's expensive, but critical functions should be separated from non-critical functions. And the critical network needs to be fairly rigidly controlled (i.e., no researchers should "accidentally" plug into it). Note further information in http://www.nwfusion.com/news/2002/1125bethisrael.html [nwfusion.com]

Well the thing is... (1, Interesting)

Anonymous Coward | more than 11 years ago | (#4767097)

Having worked on several database systems, I've found that improper planning and maintenance are the main causes of large, unwieldy and ultimately unstable systems. In large organizations where IT is not a major business area, i.e. a hospital system, the existing database system has probably been augmented several times to increase functionality (and capacity) - probably by different parties as well. This multiple-patching approach results in instability as the database grows far beyond its original intended purpose. However, due to the vast stores of data, and the repeated tinkering with it by various parties, migration is a nightmare.

Rebuilding the system from the ground up poses several major hurdles. The first is the systematic migration of data while the original database is still running; for hospitals, this database is clearly mission-critical!

The other problem is mimicking the interface and relationships within the database, so as to reduce retraining. Retraining is a major problem when switching systems. All in all, it is a major undertaking to redo the database, and probably not viable, in either time or money, for the hospital.

Sadly, I have to contend that duplication of their system is the best short- to medium-term solution.

Isolate faults (0)

Anonymous Coward | more than 11 years ago | (#4767100)

The network can be designed (hierarchically) such that a network fault will isolate only a part of the network that can be locally fixed and does not affect the entire network. The important network servers should be redundant and can be made fault-tolerant by automatic switchovers during server faults. The main switches and routers can use loopback addressing to other network cards in case a network card on the switch or router goes down.

All Layer 2? (5, Informative)

CatHerder (20831) | more than 11 years ago | (#4767104)

If Spanning Tree is what brought them down, and it had campus wide effect, then they're running their production networks as one big flat layer 2 network. This is almost definitely the root of the problem. Modern network design would divide the campus (and often individual buildings) into multiple subnets, using routing to get between nets. That way if something like STP goes wrong in one spot, it doesn't affect the others.

Building a parallel identical net is almost definitely the wrong answer. Especially if it uses the same design and equipment!

Unfortunately, older networks often grow in a piecemeal way and end up like this, commonly having application-level stuff that requires the network to be flat. The job of a good network engineer (and diplomat) is to slowly have all the apps converted to being routable, and then subnet the net.
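
As a concrete, hypothetical illustration of the subnetting the parent describes, Python's standard ipaddress module can carve a campus block into per-building routed subnets; the addresses and building names below are made up, not taken from the article.

    # Hypothetical example of breaking one flat campus network into routed
    # subnets, so a layer-2 problem (e.g. an STP meltdown) stays inside one
    # building or VLAN instead of taking down the whole campus.
    import ipaddress

    campus = ipaddress.ip_network("10.20.0.0/16")      # the old flat network
    buildings = ["east-tower", "west-tower", "research", "clinics", "admin"]

    # Hand each building its own /24; traffic between them must cross a router,
    # which bounds broadcast domains and the blast radius of an STP failure.
    subnets = campus.subnets(new_prefix=24)
    plan = {name: next(subnets) for name in buildings}

    for name, net in plan.items():
        gateway = next(net.hosts())                    # first usable IP as gateway
        print(f"{name:12s} {net}  gateway {gateway}  ({net.num_addresses - 2} usable hosts)")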

Rebooting is your friend (1)

cperciva (102828) | more than 11 years ago | (#4767112)

As much as we all laugh at the Windows "close all your applications and reboot" way of "solving" problems, there is something to be said for rebooting systems: If all else fails, you can quickly restore the system to a known working state.

Ideally, rebooting a system should be unnecessary. But practically speaking, people make dumb mistakes -- like the bug which caused the telephone crash of 1990 -- and Bad Things can happen. Rebooting a system should be a last resort; but it should be a last resort which always works.

:( Hey I submitted this a week ago :O (0, Troll)

Flamesplash (469287) | more than 11 years ago | (#4767113)

So when I submitted this a week ago it gets rejected, but now that Mr. hey! submits it, it gets accepted. I see what's going on. Damn I need more punctuation in my handle.

OMG! (2, Funny)

jmo_jon (253460) | more than 11 years ago | (#4767119)

The crisis began on a Wednesday afternoon, Nov. 13, and lasted nearly four days.

Did that mean the doctors couldn't play Quake for four days!?

Misleading question? (0)

Anonymous Coward | more than 11 years ago | (#4767120)

I assume by "network" they just mean backbone. Obviously the backbone is what failed, otherwise it wouldn't have brought down the entire network. Obviously they need some redundancy there.

They did rather well really. (1)

91degrees (207121) | more than 11 years ago | (#4767125)

How many other organisations can run at all if their network dies? And if the execs really were running around as errand boys, that's just great. Nice to see the senior staff actually caring enough to help keep things going. Really, they need a procedure to deal with the network failing rather than a redundant network.

Complexity brings bugs (5, Interesting)

stevens (84346) | more than 11 years ago | (#4767132)

The network at my company is quickly becoming so complex that neither I nor the admins can troubleshoot it.

We have redundant everything -- firewalls, routers, load balancers, app servers, etc. The idea is to have half of everything offsite, so either the main site or the co-lo can go down, and we still rock.

But with all the zones and NATs and rules and routing oddities, the network is less reliable than before. It takes days for them to fix routing problems or firewall problems. Every little problem means we need three people troubleshooting it instead of one admin.

Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.

Fucking stupid (0)

Anonymous Coward | more than 11 years ago | (#4767136)

"Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions."

By law they have to have a disaster recovery plan; all US hospitals HAVE to. So they "scrambled" to the disaster recovery plan, made copies of the forms, and were up. Big deal.

Obviously not. (2)

buss_error (142273) | more than 11 years ago | (#4767144)

do you think the answer to having a massive and unreliable network is to build a second identical network?"

Obviously, if something fails due to design, then duplicating the design duplicates the problem. While this can be a useful troubleshooting tool, it makes somewhat less sense for production environments.

I would be willing to guess that the network was one giant collision domain, and that the trouble springs from that. But it is just a guess.

STP (2)

netwiz (33291) | more than 11 years ago | (#4767146)

isn't that hard to troubleshoot. You look at the device ID that most recently made a Topology Change Notification, and then start looking at the hardware diagnostics for that system. If they're showing clean, reboot the switch. If, while the device is rebooting, the network stabilizes, you've found the problem. When the system finishes its boot, check the hardware diagnostics again (Ciscos only run H/W diags at POST, and a reset is the only way to re-run them); odds are that you'll see there's a failed component.

A previous poster nailed it too, simply back out the changes you made (obviously the problem you were fixing is of a lower magnitude than a total outage), and things should start working again.

It seems obvious (1)

Woogiemonger (628172) | more than 11 years ago | (#4767148)

Even if a network is engineered perfectly, someone could maliciously or accidentally physically harm it and cause downtime. Having a second, perhaps lower-end, backup network when you have people's lives at stake (missing prescription information could quickly cause a fatality)... it's a necessity, especially for a hospital with such a good reputation. Plus, the telecom industry giants such as Cisco are just DYING for more business, so this could also help the economy :)

My best hospital glitch (5, Informative)

eaddict (148006) | more than 11 years ago | (#4767154)

was a human error. We were a smallish hospital (270 beds). I was the new IS Manager. I was looking for power outlets in the computer room for all the new equipment I had ordered. Well, there were a lot of dead plugs. Also, I was told to stop, since electricity-based things like that were left up to the union guys. No big deal. I called them and asked them to locate and label the outlets under the raised floor. While I was sitting at my desk later that day, the power went off for a sec, then on... I got up and looked toward the data center. The lights AND the equipment went off, then on. I ran in to find the union guys flipping switches on the UPS (on/off). They had stuck a light bulb w/plug in each of the open outlets and were flicking the power on and off to see which bulb was affected. They were on the equipment side of the UPS! All of our servers, network gear, and such took hard downs that day! Ahhh!!! Who needs technology to make things not work! This was the same union that wrote me up for moving a cube wall to get at an outlet. Moving furniture was a union duty!

Lawsuit (2)

Gary Franczyk (7387) | more than 11 years ago | (#4767155)

There will probably be many lawsuits after this.

The line of thinking will be something like this:

How many people died or will die, or get improper treatment because of this networking glitch? If the hospital is as large as described, certainly a number of persons were given inadequate healthcare while they were there.

Some may have a good case.

Cisco implementation of Spanning Tree sucks (4, Interesting)

xaoslaad (590527) | more than 11 years ago | (#4767158)

I am not up to speed on spanning tree, but speaking with a coworker after reading this article, it is my understanding that Cisco equipment runs a new instance of spanning tree each time a new VLAN is created. As you can imagine, in such a large campus environment there can be many tens if not hundreds of VLANs. In a short time you turn your network into a spanning tree nightmare. I'd much rather use some nice Extreme Networks (or Foundry or whatever) Layer 3 switching equipment at the core and turn off spanning tree. Use tagged VLANs from the closets to the core and voila, no need for spanning tree... Use Cisco edge devices for WAN links. Building out a second rat's nest out of the same equipment seems foolish.

I'm not even sure how much Layer 3 switching equipment Cisco has; not much at all, from my asking around in the past. It may not be possible to turn around and re-engineer it with the existing equipment, but I think I would much rather throw out the vendor and re-engineer the entire thing correctly before putting in a second shabby network.

I speak from having assisted on something like this in a very small campus environment (1,500 nodes maybe), and we basically tore out a disgusting mess of a LAN and implemented a fully switched, beautifully laid out network with redundant links to all closets, an 8 GB trunk between two buildings, etc., in the space of one weekend. Obviously there was tons of planning involved, cabling run in preparation, and so on, but what a fantastic move it was.

Sure there were hiccups Monday morning, but everything was perfectly fine by the end of the week.

Two wrongs don't make a right.

Spanning Tree Protocol problems (1)

Qzukk (229616) | more than 11 years ago | (#4767164)

It's just too complex for people to understand.

One of the first things we learned when we got to this part of our networking class was that spanning trees for more than a few nodes are damn near impossible for a human to figure out. We learned how to diagnose the problem if it occurred; we even studied ethernet frame dumps to watch the spanning tree build itself. But if you weren't there to watch the tree get built, there's no way at all to guess what exactly went wrong with it. You just pull all the bridges and routers, reset them all, and start over.

This was probably caused by a combination of bad hardware, and some nut plugging two branches of the network together that were already connected somehow. The hardware should have recognized this as a loop and cut it, but for some reason it didn't.

Well, hopefully they won't repeat the same loop in their backup network.

The real problem (4, Insightful)

Enry (630) | more than 11 years ago | (#4767165)

There was no central organization that handled the networking for the associated hospitals, so more networks just got bolted on until it couldn't handle the load.

So what are the lessons?

1) Make sure your solution scales, and be ready in case it doesn't.
2) Make sure some overall organization can control how networks get connected.

Business Continuity Plans? FDA, 21CFR11? (0)

Anonymous Coward | more than 11 years ago | (#4767166)

Any medical system used for patient data is fair game for an FDA audit: these are *massive* in scope, and they should issue you with a 483 if you're found not to be 21CFR11 compliant.
To be compliant you should have massive amounts of validation documents covering everything from how to build *the whole system from scratch* in the event of an error, to your business continuity plan, your disaster recovery plan etc etc etc.
Your initial User Requirement Spec document when the system was implemented should have included details of failsafes and redundancy and been built in from the word go.


You would be on very shaky legal ground if you ran a system that was not FDA compliant like this.

What *really happened here*?

"the last drop that made the network overflow" (1)

dagg (153577) | more than 11 years ago | (#4767169)

... just one more wafer :-) ...

Overall, as long as patient care wasn't diminished (the degree of diminishment is debatable), it is probably good that things like this occasionally happen. It's a great way to test non-technical systems that usually only get tested in a widespread disaster.

--

Fix it the first way that works. (3, Insightful)

tomblackwell (6196) | more than 11 years ago | (#4767175)

If you have something that's broken, and you need its functionality soon, and don't have a fucking clue as to what's wrong with it, you might want to replace it.

It may not be the right way to do it, but they're running a hospital, and might not have the time to let their network people puzzle it out.

Network Utilization Analysis not run yet (2, Interesting)

chopkins1 (321043) | more than 11 years ago | (#4767177)

In the article, it also states that they had just approved a contractor to do a network analysis: "on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time." If the article summary gives the correct information, I'll bet that large parts of their network were overburdened and hadn't been upgraded in years.

They were probably running at around 30-35% capacity and most networks get REAL funny at around that point. The following comment is rather telling: "The large volume of data the researcher was uploading happened to be the last drop that made the network overflow."

Another telling comment about the situation was: "network function was fading in and out".