VMware Causes Second Outage While Recovering From First

Soulskill posted more than 3 years ago | from the third-time's-a-charm dept.

Cloud 215

jbrodkin writes "VMware's new Cloud Foundry service was online for just two weeks when it suffered its first outage, caused by a power failure. Things got really interesting the next day, when a VMware employee accidentally caused a second, more serious outage while a VMware team was writing up a plan of action to recover from future power loss incidents. An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry.' Clearly, human error is still a major factor in cloud networks."

215 comments

This is very bad design (5, Interesting)

FunkyRider (1128099) | more than 3 years ago | (#36006070)

[[An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry]] Really? Pressing a single key and bam! All gone? Is that the best they can do?

Re:This is very bad design (5, Interesting)

drosboro (1046516) | more than 3 years ago | (#36006108)

I didn't get the sense from reading the linked analysis that it was necessarily a single key-press. It reads like this:

This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.

My sense is that "touched the keyboard" doesn't literally mean "touched a single key on the keyboard", but actually means "ignored the hands-off-the-keyboard part of the exercise, and executed some commands".

But who knows, I could be wrong... I'm sure hoping I'm not!

Re:This is very bad design (4, Insightful)

nurb432 (527695) | more than 3 years ago | (#36006404)

I am sure that is what happened. I don't know of any single keystroke that would take down an entire data center. ( aside from that big red button on the wall over there.. )

Re:This is very bad design (2)

Daniel_Staal (609844) | more than 3 years ago | (#36006544)

'Enter' should do it, in most cases...

(Assuming, of course, that the (in)correct command has been typed at the command line already.)

Re:This is very bad design (3, Informative)

X0563511 (793323) | more than 3 years ago | (#36006576)

... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.

Re:This is very bad design (1)

Vrtigo1 (1303147) | more than 3 years ago | (#36006776)

+1. I got in the habit of using the control key to wake sleeping PCs a long time ago. Nowadays you'd hope that a sleeping PC would wake to a login screen, but I'm continuously amazed that I still see guys in IT shops that don't bother with locking their workstations...

Re:This is very bad design (2)

Archangel Michael (180766) | more than 3 years ago | (#36006992)

When an unlocked and unmanned workstation is found in our Dept, the SOP is to place a RICKROLL somewhere in the system. Bonus points for being creative. I have one that is still waiting to go off, because the guy never reboots his computer. He'll never know who did it, or when.

Re:This is very bad design (1)

NFN_NLN (633283) | more than 3 years ago | (#36006818)

... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.

So I should stop typing this into random terminals and then leaving?

> nohup "history -c; passwd -l root; rm -rf /" &

Re:This is very bad design (1)

42forty-two42 (532340) | more than 3 years ago | (#36006888)

On a serial link, just use the right arrow key. Or possibly ESC (although you'll have to deal with clearing the ESC chord afterward if it happened to be in vi or something)

Re:This is very bad design (1)

X0563511 (793323) | more than 3 years ago | (#36006952)

Not a bad idea. I think cleaning up the vi example is a good compromise - you wanted a prompt after all, not necessarily someone's leavings.

Re:This is very bad design (1)

shutdown -p now (807394) | more than 3 years ago | (#36006904)

"Updates are available for your computer; would you like to reboot it to install them?" ~

Re:This is very bad design (1)

interiot (50685) | more than 3 years ago | (#36006608)

If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.

Re:This is very bad design (2)

c6gunner (950153) | more than 3 years ago | (#36006860)

If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.

Screw that. I'd remove the sign. And replace it with one that says "FREE MOUNTAIN-DEW!".

Re:This is very bad design (1)

BitZtream (692029) | more than 3 years ago | (#36007008)

The enter key being pressed after doing something silly like typing up an example command line for a half written script that will automate some large process to simply copy and paste into another document.

While the reality of it is the reason they said 'hands off' was to avoid just such an accident, an engineer actually executing the test plan before it was actually ready to do its job, by accident. And it happened.

It's really one of those moments where the poor guy is just the most perfect example of why management said 'hands off'. Has to be a shitty feeling to be in, I'm sure they'll be giving him shit for years.

Re:This is very bad design (3, Funny)

DigitalJanitor (21725) | more than 3 years ago | (#36006740)

Sounds like they could benefit from a virtual environment to test things out in.

Re:This is very bad design (1)

verbatim (18390) | more than 3 years ago | (#36006112)

Finally, MovieOS being used in a production environment. Pretty soon, the cops will be using Visual Basic to hunt down suspects.

It's a euphemism (1)

symbolset (646467) | more than 3 years ago | (#36006138)

Just like "paper only" is a metaphor for the electronic document version, which is what was happening. In this case it means the engineer engaged in active management of the network instead of brainstorming ideas with the group. Presumably he intended to just investigate.

Re:This is very bad design (1)

FunkyRider (1128099) | more than 3 years ago | (#36006266)

Hate to reply myself but I've figured it out! First they typed this: "#> rm -rf / & shutdown" out of boredom and amusement to the master control console, then someone hit the magical key - 'Enter', oops

Re:This is very bad design (1)

sumdumass (711423) | more than 3 years ago | (#36006464)

It wasn't out of boredom. He went into a chat room and asked for advice. The guy talking the most gave him that information after asking if he was running windows and he replied I think so.

Big Red Button (0)

Anonymous Coward | more than 3 years ago | (#36006842)

What, did he hit the giant red blinking "Fuck Everything Sideways" button? Seems like that might be a design flaw they should look into.

'An inadvertent press of a key on a keyboard' (0)

Anonymous Coward | more than 3 years ago | (#36006088)

Any programming error can be traced back to one or two of those.

Re:'An inadvertent press of a key on a keyboard' (5, Funny)

verbatim (18390) | more than 3 years ago | (#36006124)

This pretty much describes my entire career.

Game Over (3, Insightful)

ae1294 (1547521) | more than 3 years ago | (#36006096)

The cloud is a lie. Would the next marketing buzzword please come on down!

Re:Game Over (2)

Samantha Wright (1324923) | more than 3 years ago | (#36006302)

Completely disagree. The solution is clear: eliminate all potential sources of human error.

Re:Game Over (1)

Anonymous Coward | more than 3 years ago | (#36006320)

Cue skynet

Re:Game Over (1)

Sene (1794986) | more than 3 years ago | (#36006386)

And call the solution Skynet?

Re:Game Over (0)

Anonymous Coward | more than 3 years ago | (#36006444)

To do that you would need to remove the human factor. I agree that removing all sources of human error is a good idea, but maybe there should just be another verification of shutdown asking if the operator would really, REALLY like to shut down all sources of power. And if that still fails, maybe we need self-managing systems that can determine dependencies, decide for themselves whether a task really needs to be done, and weigh the consequences of their actions. Something like probabilistic decision making with thresholds.

This is not impossible as I have read that there are now automation systems that can control a majority of systems in an enterprise environment that are programmable themselves.
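A minimal sketch of the "really, REALLY" verification idea, assuming a hypothetical tool where the operator has to retype the name of the thing being shut down, so a stray Enter or a reflexive "y" is not enough. The function names and the "edge-routers" target are made up for illustration; nothing here is from VMware's tooling.

```python
def confirm_destructive(target: str, typed: str) -> bool:
    """Return True only if the operator retyped the exact target name."""
    expected = target.strip()
    # An empty target can never be confirmed, and "y" or a bare Enter never matches.
    return bool(expected) and typed.strip() == expected


def run_guarded(action: str, target: str, typed: str, execute) -> str:
    """Call execute() only when the confirmation matches; report what happened."""
    if not confirm_destructive(target, typed):
        return f"REFUSED: {action} on {target!r} not confirmed"
    execute()
    return f"DONE: {action} on {target!r}"


if __name__ == "__main__":
    # Hypothetical interactive use: the operator must type the target name back.
    typed = input("Type 'edge-routers' to confirm shutdown: ")
    print(run_guarded("shutdown", "edge-routers", typed, lambda: None))
```

This only raises the bar against accidents. It does nothing about an operator who deliberately types the name into the wrong window.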

Re:Game Over (2, Funny)

Anonymous Coward | more than 3 years ago | (#36006448)

Has anyone mentioned Skynet yet?

Re:Game Over (0)

Anonymous Coward | more than 3 years ago | (#36006458)

Completely disagree. The solution is clear: eliminate all potential sources of human error.

And Skynet was born.

Re:Game Over (0)

Anonymous Coward | more than 3 years ago | (#36006476)

You're just holding it wrong.

Re:Game Over (1)

jd (1658) | more than 3 years ago | (#36006928)

How is this [wolfram.com] a cloud?

A cloud in need, is a cloud indeed (1)

Anonymous Coward | more than 3 years ago | (#36007074)

No, no, it is indeed a cloud: Thin, wispy and ephemeral.

Slashdot summary non sensationalist (3, Interesting)

rsborg (111459) | more than 3 years ago | (#36006128)

Amazingly the Cloudfoundry blog itself had a much more dramatic telling:

"... At 8am this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed.

Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

(emphasis mine).

I'd hate to be that ops guy.

VMware shows its PR colors. (4, Insightful)

shuz (706678) | more than 3 years ago | (#36006220)

VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual, but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change, or simply that an unscheduled change was made to our infrastructure that caused X. This also outlines a major issue with "cloud" technologies. They are only as redundant and stable as the individuals managing them. Also, there is always the opportunity for a single point of failure in any system; you just need to go up the support tree high enough. For most companies this is the data center itself, as offsite DR can get expensive quickly. For VMware it can be the Virtual Center, a misconfigured vRouter or even a vSwitch. Finally, putting all your eggs into one basket can increase efficiency and save money. It can also raise your risk profile. An engineer may have caused this outage, but I would find it hard to believe that replacing the engineer would make the "risk" go away.

Re:VMware shows its PR colors. (2)

HFShadow (530449) | more than 3 years ago | (#36006664)

Agreed. They seem to treat it as some magical instance where touching the keyboard breaks things, as though this was written by someone's grandmother.

How did one engineer touching a keyboard when he shouldn't, take everything down? I don't think I could do this at work unless I was really trying hard. This is a really shitty response, especially compared to the writeup that amazon put out.

The Answer is obvious (1)

SuperKendall (25149) | more than 3 years ago | (#36006884)

How did one engineer touching a keyboard when he shouldn't, take everything down?

He touched the keyboard in its Special Place.

Not to worry though, they called in Chris Hanson to help with network ops in the future, we'll not be seeing a repeat.

Re:VMware shows its PR colors. (1)

Chuck Chunder (21021) | more than 3 years ago | (#36006804)

A better PR response

In what sense? I know that I appreciate frank disclosures of problems from our providers rather than obfuscating the issue (if nothing else it might highlight a similar problem in our procedures).

Re:VMware shows its PR colors. (1)

Vrtigo1 (1303147) | more than 3 years ago | (#36006874)

I find your comment regarding offsite DR a bit off base. For small shops, I would agree that maintaining two data centers would be expensive, but for most places that have any kind of substantial investment in IT, it should be just an expense that is factored in from day one. For instance, the company I work for has an annual IT budget of about 2.5 million. We have three datacenters in addition to our computer room at HQ. Two of the data centers are for our public facing apps which are load balanced between them. We have a generator at HQ which can run us for about a week, but if TSHTF, we can move our apps to the remote datacenter. At HQ, I've put as much of the critical infrastructure as possible in VMs for portability and ease of management. HQ is backed up by the 3rd datacenter, where we put a single God box consisting of four 6 core CPUs and 96 GB of RAM. This is sufficient to run all of our critical apps on the single server until we can get our HQ equipment back up and running, or we have time to order and install new equipment elsewhere. The storage from HQ is continuously replicated to the offsite D/R facility, so in the event of a disaster, all I have to do is power up the VMs there, change the outside hostname of our HQ VPN endpoint to point to the D/R firewall and tell people to disconnect and reconnect to VPN. This setup cost us about 90k in capital expenditures including equipment, software and implementation and costs about 10k a year to run. Call it 150k for the D/R site and the generator at HQ, and I'd say that's a relatively minor cost in the grand scheme of things.

Re:VMware shows its PR colors. (5, Insightful)

ToasterMonkey (467067) | more than 3 years ago | (#36007002)

VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.

"Transparency is bad" +4 Insightful

What the... ?

Re:VMware shows its PR colors. (3, Informative)

drooling-dog (189103) | more than 3 years ago | (#36007026)

To me it sounds like someone (non-technical) high up in the chain wanted to focus blame on an inadvertent act by one of the engineers. Inadvertent, of course, so no one needs to get fired and file a lawsuit, and an engineer so that no one in upper management appears culpable. The downside is that they dramatically underscore the fragility of their cloud, thereby undermining its acceptance in the market. Not a good tradeoff, if that's the case.

Re:Slashdot summary non sensationalist (0)

Anonymous Coward | more than 3 years ago | (#36006234)

Yup, the good old "touched the keyboard" outage. When will they learn?!?

Re:Slashdot summary non sensationalist (2)

fuzzyfuzzyfungus (1223518) | more than 3 years ago | (#36006240)

"And that is the story we tell the new hires. If they ask why the employee health plan covers cyanide..."

Re:Slashdot summary non sensationalist (2)

Icegryphon (715550) | more than 3 years ago | (#36006324)


Keyboards, how do they work?
This does not bode well for VMware.
As much as I love their production,
I did chuckle at this major failure.

Re:Slashdot summary non sensationalist (1)

mirix (1649853) | more than 3 years ago | (#36006390)

It's easy to be all high and mighty when your Selectric isn't even capable of being connected to anything mission critical.

Actually - how did you manage to get it to post on /.?

Re:Slashdot summary non sensationalist (5, Funny)

Icegryphon (715550) | more than 3 years ago | (#36006420)

Don't go knocking my typewriter
It's Electric, and has wonderful BNC connector
for network access. IBM, you did good.

Better Verse [Re:Slashdot summary non sensationt] (0)

Anonymous Coward | more than 3 years ago | (#36006460)


Keyboards, how do they work?
This does not bode well for VMware.
As much as I do so love their production,
I did chuckle at this major failure.

No, you need to change that first line if you're going to post in rhyme:


Ah, keyboards: how do they function?
This bodes not so well for VMware.
As much as I do love their production,
I chuckled a bit at this major failure.

I'd work on that slant rhyme a bit, but then, what do I know? I'm an anonymous coward.

Re:Slashdot summary non sensationalist (-1)

Anonymous Coward | more than 3 years ago | (#36006470)


Keyboards, how do they work?
This does not bode well for VMware.
As much as I love their production,
I did chuckle at this major failure.


Burma Shave

Re:Slashdot summary non sensationalist (1)

BigGerman (541312) | more than 3 years ago | (#36006484)

Stopping engineers from touching keyboards is an important part of maintaining one's cloud infrastructure. From experience.

Re:Slashdot summary non sensationalist (1)

Virtucon (127420) | more than 3 years ago | (#36006840)

The infrastructure design is not resilient, and it seems late in the game to "develop a playbook" after you've gone live. Their credibility in building a fault tolerant platform is also questionable. While VMware is at the core of a lot of data centers, there are other players that bring things to the table to build out the other pieces that make high availability and reliability a reality; I don't think they understand how all of this fits together. Reading that this was a "paper only", all-hands-on-deck exercise also suggests that there's turmoil within their walls. Why? Nobody would knowingly take down infrastructure. Sure, it was a mistake, but again it demonstrates the fragile nature of their design. I can shut down load balancers, storage processors, cluster nodes and power, but it takes a heck of a lot more effort than a few keystrokes to take all of it out. A "full outage" of the network infrastructure by one guy? What was he doing? Was Change Management in play here? Was this person fucking around?

I'm sorry folks. Go back to the drawing board and design this correctly.

Re:Slashdot summary non sensationalist (0)

Anonymous Coward | more than 3 years ago | (#36006946)

See, the Playbook is such a disaster it brings down the entire Cloud (TM). Damn you RIM!

UR DOING IT WRONG! (2)

celest (100606) | more than 3 years ago | (#36006142)

You would think someone as big as VMware would have figured out, by now, that if "An inadvertent press of a key on a keyboard" can lead to "a full outage of the network infrastructure [including] all load balancers, routers, and firewalls [resulting] in a complete external loss of connectivity to [their Cloud service]" that they are DOING IT WRONG!

In other news, VMware announces they're releasing a new voting machine: http://xkcd.com/463/ [xkcd.com]

Re:UR DOING IT WRONG! (1)

Xtravar (725372) | more than 3 years ago | (#36006366)

I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context.
Like, did they touch it and press a key?
Did they touch it for an extended period, typing "killall cloud"?
Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?

Re:UR DOING IT WRONG! (0)

Anonymous Coward | more than 3 years ago | (#36006482)

This story sounds like the "my dog ate my homework" lie, so I expect no details.

Re:UR DOING IT WRONG! (1)

RoFLKOPTr (1294290) | more than 3 years ago | (#36006506)

I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context. Like, did they touch it and press a key? Did they touch it for an extended period, typing "killall cloud"? Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?

The keyboard they touched wasn't a keyboard in the conventional sense. It was a small 3"x3" yellow/black striped board with one large circular red key on it. Somebody touched that key even though the sign said "DON'T PUSH THIS." A harmless prank.

Re:UR DOING IT WRONG! (1)

LoRdTAW (99712) | more than 3 years ago | (#36006652)

It was probably inappropriately touched in a no-no place.

Re:UR DOING IT WRONG! (0)

Anonymous Coward | more than 3 years ago | (#36006704)

I use Ctrl-Z in shell windows a lot. If I hit it without realizing my VMware Workstation session has focus, VM suspended, no warning. Not hard to imagine something similar happened.

Re:UR DOING IT WRONG! (4, Funny)

Jeremi (14640) | more than 3 years ago | (#36006848)

I would like more elaboration on what "touched the keyboard" means.

It was an extreme case of static discharge. The engineer is lucky to be alive -- when doing cloud computing, thunderstorms are a huge hazard.

Re:UR DOING IT WRONG! (3, Informative)

larry bagina (561269) | more than 3 years ago | (#36007034)

Remember how your uncle used to touch you in your naughty place? It was like that.

This had to happen (0)

Anonymous Coward | more than 3 years ago | (#36006146)

All the VMware employees have their heads in the clouds!

Not the RED button!!! (2)

geekmux (1040042) | more than 3 years ago | (#36006160)

"...An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry."

OK, seriously, who the hell has that much shit tied to a single key on a keyboard?

I've heard of macros for the lazy, but damn...

Someone... (0)

Anonymous Coward | more than 3 years ago | (#36006182)

...forgot to press Ctrl+Alt.

Engineering Errors (4, Interesting)

Bruha (412869) | more than 3 years ago | (#36006188)

You can not really stop stupid people. However many companies cripple their networks through so called "Security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors "Serial Access" to get back into these devices. Now you're losing customers on top of losing money, especially when it comes to compute clouds where you're literally billing by the hour. Even more so for long distance providers, cellular companies, and VOIP communications providers.

I am curious how the press of one key managed to wipe out the cloud, the load balancers, and the routers at the same time. Either they're using some program to manage their switching network which is the only key thing that could take it all out, or the idiot had the command queued up.

More likely some idiot introduced a Cisco switch into their VTP domain, and it had a higher revision number queued up and overwrote their entire LAN environment. That is simply fixed by requiring a password, so you can really nail an idiot that does it, and secondly by biting the admin bullet and running VTP transparent mode.

There's no one command that's going to bring it all down, it's going to be a series of actions that result from a lack of proper network management, and lack of proper tested redundancy. Redundancy does not exist in the same physical facility, redundancy exists in a separate facility nowhere associated with anything that runs the backed up facility. Pull the plug on data center A, your customers should not notice a thing is amiss. If you can do that, then you have proper redundancy.

I believe the other problem is that we're working on a 30+ year old protocol stack, and it's starting to show its limitations. TCP/IP is great, but there need to be some better upper layer changes that allow client replication to work as well. So if the App loses its connection to server A, it seamlessly uses server B without so much as a hiccup. Something like keyed content where you can accept replies from two different sources, but the app can use the data as it comes in from each, much like bittorrent, but on a real time level. It requires twice the resources to handle an app, but if redundancy is king this type of system would be king and prevent some of the large outages we have seen in the past.
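A rough sketch of the "lose server A, carry on with server B" part of that idea, done in the application rather than the protocol stack. To be clear about what it is not: this is plain sequential failover, not the keyed, bittorrent-style scheme that merges replies from two sources. The names (`fetch_with_failover`, `server-a`, `server-b`) are hypothetical, and the `fetch` callable stands in for whatever real network call the app makes.

```python
from typing import Callable, Sequence


class AllServersFailed(Exception):
    """Raised when no server in the list could answer the request."""


def fetch_with_failover(servers: Sequence[str], fetch: Callable[[str], str]) -> str:
    """Try each server in order and return the first successful reply."""
    errors = []
    for server in servers:
        try:
            return fetch(server)
        except ConnectionError as exc:
            # Remember why this server failed, then move on to the next one.
            errors.append(f"{server}: {exc}")
    raise AllServersFailed("; ".join(errors))


def fake_fetch(server: str) -> str:
    """Stand-in for a real request: pretend server-a has lost connectivity."""
    if server == "server-a":
        raise ConnectionError("link down")
    return f"reply from {server}"


if __name__ == "__main__":
    print(fetch_with_failover(["server-a", "server-b"], fake_fetch))
```

The caller never sees that server A was down. It also only helps if server B lives somewhere the same outage cannot reach, which is the poster's point about redundancy in a separate facility.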

Re:Engineering Errors (1)

Niac (2101) | more than 3 years ago | (#36006248)

Bring Down Business [y/N]?> y <Enter>

Re:Engineering Errors (1)

Anonymous Coward | more than 3 years ago | (#36006558)

That's two keys. Bzzt. Wrong.

Re:Engineering Errors (1)

Anonymous Coward | more than 3 years ago | (#36006292)

I have never seen a decent size datacenter that actually uses VTP. VTP is sometimes used in campus networking, where things tend to move often so dynamically assigning VLANs to trunks is useful, but even there it usually gets turned off because admins are scared of it. More likely, in my opinion, they were developing new configs to mitigate the problem they most recently experienced, and someone deployed the change to the production network instead of the test network. There's a catch to deploying a test network: you have to make the systems very similar for it to be effective, and you have to make changing the test network and then deploying those same changes to the production network quick and easy, or it won't actually be used. In a crisis especially, you want to test your changes before you make the problem worse, but don't want to delay the solution any more than you need to.

Re:Engineering Errors (1)

dissy (172727) | more than 3 years ago | (#36006378)

Perhaps most of their infrastructure is virtual, and the button he pressed was the host's power key, shutting down all the guests at once.

Re:Engineering Errors (0)

Anonymous Coward | more than 3 years ago | (#36006452)

I've heard of this thing where you can create an arbitrary sequence of commands or instructions, which may then, let me phrase this correctly, which may then, be executed by issuing a single so-called 'command'. I think it's called a scrip, or an app maybe.. they're kinda like batch files. I think that's how you do more with just a single key-press.

Re:Engineering Errors (1)

lucifuge31337 (529072) | more than 3 years ago | (#36006516)

More likely some idiot introduced a cisco switch into their VTP domain and it had a higher revision number queued up and it overwrote their entire LAN environment.

How does that even happen in a properly managed environment? In fact, even in an improperly managed one? I'd have to try hard to make that happen......I mean...really. Bring up an identically configured VTP master, change it enough times to get a higher rev number, put it on the same LAN and......without external inputs (dropping links to the real VTP master) pretty much nothing ought to happen (other than syslog screaming) unless you're using some really crusty old IOS/CatOS.

Re:Engineering Errors (1)

mulaz (1538147) | more than 3 years ago | (#36006648)

Easy!

Have a scaled-down copy of the production network in a lab, with all the same settings (like VTP domain etc.), test weird things (like it's normally done in a lab environment), and get the rev. number up high.

Then some piece of production equipment fails, (let's say a switch), and why not take one (basically the same one) from the lab? The lab can wait for the replacement, production usually can not. Then plug the switch to the production network, and puff, there go the vlans!

Re:Engineering Errors (1)

lucifuge31337 (529072) | more than 3 years ago | (#36006678)

So....what I said. Except you have it in your lab environment. And you don't realize it's your VTP master. And you don't bother to put your production config on your replacement box before putting it in production....... Yeah. Not buying it as a likely scenario. This required multiple steps, and a fundamental lack of understanding of key functions of networking equipment in a datacenter setting (namely not knowing what your VTP master is) and a lack of any sort of sane procedures (putting a piece of equipment into production without so much as verifying a config). It's a plausible, but unlikely series of events that would require the input of someone who was not capable of building or maintaining the network in the first place.

Re:Engineering Errors (0)

Anonymous Coward | more than 3 years ago | (#36006726)

Plenty of people run their VTP domains as all servers...since they are too lazy to remember which is the server :)

Re:Engineering Errors (1)

lucifuge31337 (529072) | more than 3 years ago | (#36006828)

Plenty of people run their VTP domains as all servers...since they are too lazy to remember which is the server :)

And to my point, that's amateur hour stuff. Not what one would expect in a professional data center.

Also, that would not cause this proposed issue, as if they were all servers, none of them would take data as a VTP client. It would be like not running VTP at all.

Re:Engineering Errors (1)

zbaron (649094) | more than 3 years ago | (#36006802)

Just so you know, even a VTP *client* with a higher revision number and a different table used to be able to / can wipe out a VTP domain by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scroll back buffer.
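For anyone following along, here is a toy model of the revision-number behavior being argued about. It is heavily simplified: it assumes VTP v1/v2-style rules where a server or client in the same domain accepts any advertisement with a higher configuration revision, and it ignores passwords, pruning and VTPv3 entirely. The class and the switch names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    name: str
    domain: str
    mode: str                      # "server", "client", or "transparent"
    revision: int = 0
    vlans: dict = field(default_factory=dict)   # VLAN id -> VLAN name

    def advertise(self) -> dict:
        """Summarize this switch's VLAN database, as it would announce it."""
        return {"domain": self.domain, "revision": self.revision, "vlans": dict(self.vlans)}

    def receive(self, advert: dict) -> bool:
        """Apply an advertisement; return True if the local database was replaced."""
        if self.mode == "transparent":
            return False           # transparent switches keep their own database
        if advert["domain"] != self.domain:
            return False           # different VTP domain, ignored
        if advert["revision"] <= self.revision:
            return False           # not newer, ignored
        # Higher revision wins, whether the sender was a server or a client.
        self.vlans = dict(advert["vlans"])
        self.revision = advert["revision"]
        return True


if __name__ == "__main__":
    prod = Switch("core", "corp", "server", revision=12, vlans={10: "web", 20: "db"})
    lab = Switch("lab-spare", "corp", "client", revision=57, vlans={99: "scratch"})
    print(prod.receive(lab.advertise()), prod.vlans)
```

In this model the production VLANs vanish as soon as the lab spare is heard, while a transparent-mode switch is untouched.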

Re:Engineering Errors (1)

lucifuge31337 (529072) | more than 3 years ago | (#36007032)

Just so you know, even a VTP *client* with a higher revision number and a different table used to be able to / can wipe out a VTP domain by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scroll back buffer.

See my previous post about "crusty old IOS/CatOS".

Also, who the hell runs the same VTP name and auth key in production and the lab? That is BEGGING for problems.

Maybe I've just been doing this the right way for too long. I find it difficult to believe that there are networks of any scale that have any duration of uptime that aren't following very, very simple procedures to ensure uptime and/or are operating with such a complete lack of knowledge of the basic plumbing that makes them work. Also, who doesn't have automated config backups of infrastructure equipment?

I guess this boils down to the fact that I'm not an armchair network admin. I've been doing this a long time, and I know how it works. Someone doing something this stupid would be like watching someone put a car in gear and then crawl under it to me. It's not something you should have to TELL someone not to do. It's something that SHOULDN'T HAPPEN when one or more well agreed upon basic procedures are being followed. If the person you are asking to do that kind of work needs to be told these things, you have failed as a manager, and likely as an organization. If your network(s) set the stage for this type of thing to be a possibility (sharing VTP info between production and lab, hoping someone won't ever accidentally bridge the two) you again have failed as a manager or organization. The most basic of widely accepted best practices would put multiple barriers in the way of this type of thing happening, requiring a cascading series of procedural failures for it to actually happen.

In summary.....Nope, still not buying this as a reasonable explanation.

I disagree. (1)

khasim (1285) | more than 3 years ago | (#36006806)

However many companies cripple their networks through so called "Security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors "Serial Access" to get back into these devices.

The problem with such "security" is that the easier you make it for your admins to connect ... the easier you make it for the bad guys to connect.

The answer is to run training exercises for the various scenarios so that everyone knows what to do and where to go in such situations.

The problem with that is that people are lazy. Security is not difficult. But NOT doing it will always be easier (and yield immediate rewards) in the short term.

TCP/IP is great, but there need to be some better upper-layer changes that allow client replication to work as well. So if the app loses its connection to server A, it seamlessly uses server B without so much as a hiccup.

Sounds good. But the system also has to be designed to take advantage of the technology that is available today. Too often the systems are based around the single machine running a single application with full administrative rights model. And the technological advances have just made it possible to fool the app into thinking it is on one machine while it runs on multiple machines (badly).

The CLOUD is VAPOR-WARE (1)

Purist (716624) | more than 3 years ago | (#36006226)

Next.

Re:The CLOUD is VAPOR-WARE (1)

Dyinobal (1427207) | more than 3 years ago | (#36006468)

I'd think that was obvious, clouds are made out of vapor by definition.

Re:The CLOUD is VAPOR-WARE (1)

Purist (716624) | more than 3 years ago | (#36006706)

Thanks for validating my joke...was it too dry?

Since I'm being an awful person today... (2)

fuzzyfuzzyfungus (1223518) | more than 3 years ago | (#36006228)

I, for one, would like to suggest that the Cloud Foundry is really foundering...

PEBKAC (1)

MrQuacker (1938262) | more than 3 years ago | (#36006330)

And that is why we need skynet.

Don't let it happen again (5, Funny)

stumblingblock (409645) | more than 3 years ago | (#36006334)

They just have to remove that key from the keyboard. You know, the one that massively crashes the entire system. Poor judgement to have that key there.

I don't trust "The Cloud" (1)

Beelzebud (1361137) | more than 3 years ago | (#36006340)

When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that. The Cloud is great for sharing photos or game saves, but I don't see a future where we all do our computing "in the cloud".

Re:I don't trust "The Cloud" (1)

Jeremi (14640) | more than 3 years ago | (#36006920)

When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that.

You know what beats a local hard drive? Two local hard drives, so that if one of them dies, you can still retrieve your data on the other one. And you know what beats two local hard drives? N hard drives in different locations, so that even after Evil Otto nukes your office and your branch office, you can still retrieve a backup copy of your data from another zip code.
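The redundancy argument above can be put in rough numbers. This is only a back-of-the-envelope sketch: the function name `loss_probability` and the 5% figure are made up for illustration, and it assumes copies fail independently, which is exactly what separate locations are meant to approximate.

```python
def loss_probability(p_single: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent failures."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return p_single ** copies


# Hypothetical 5% chance of losing any one copy over some period.
for n in (1, 2, 3):
    print(n, loss_probability(0.05, n))
```

Two drives in the same office are not independent (the same fire takes both), so the real-world benefit comes from the "different locations" part rather than from the raw drive count.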

I wonder if/when any cloud services will offer the option of letting you automatically keep a copy of your cloud data on your home computer's local drive? That seems like it would be a good feature to have.

Re:I don't trust "The Cloud" (1)

jd (1658) | more than 3 years ago | (#36006958)

Hard drives are easy to beat. Core memory has an estimated lifespan 20-30x that of a hard drive, is impervious to EMP and won't crash if bumped.

Human error will always... (0)

Super Dave Osbourne (688888) | more than 3 years ago | (#36006350)

be an issue. The question is how poorly the infrastructure must be designed and implemented to allow one moron with one keystroke to cause such havoc. Apparently it is very weak and susceptible.

Reminds Me of A Bad Challenger Joke... (0)

Anonymous Coward | more than 3 years ago | (#36006384)

What were Christa McAuliffe's last words?

"What's this button for ..."

Re:Reminds Me of A Bad Challenger Joke... (-1)

Anonymous Coward | more than 3 years ago | (#36006634)

What were Osama bin Laden's last words?

"Who's there?"

Re:Reminds Me of A Bad Challenger Joke... (0)

Sulphur (1548251) | more than 3 years ago | (#36006906)

What were Osama bin Laden's last words?

"Shoot?"

... Our New Dark cloud Overlords (0)

Anonymous Coward | more than 3 years ago | (#36006388)

And, by the way, that was a really perfect and fully credible explanation, kind Sirs. Yes, indeed! Totally, perfectly, unassailably perfect. It makes perfect sense. Happens all the time. (Ohboy!) But then, this is the age of credulity, after all.

Cloud depends too much on internet (1)

bmservice (2102022) | more than 3 years ago | (#36006514)

I don't think we have entered an era in which the internet is available everywhere, all the time, and without the internet the cloud is nothing.

Cloudy Vision of My Future... (1)

BoRegardless (721219) | more than 3 years ago | (#36006528)

If I think I can trust a cloud to support my data.

Press any key to destroy everything... (0)

Anonymous Coward | more than 3 years ago | (#36006638)

Never let a cartoon super villain design your network infrastructure.

Human Factor (0)

Anonymous Coward | more than 3 years ago | (#36006640)

I was working for the world's largest SMS & MMS hosted provider powering up a few extra servers for provisioning when the entire server room went dark. The Engineering Manager had ordered a 100 Amp circuit breaker but had never replaced the 60 Amp breaker because he kept forgetting to schedule it. When the lights went out it took 3 hours from midnight 'till 3am to get everything back up and running. The 100 Amp breaker was sitting inches from where it was supposed to go - right there on top of the circuit breaker box.

Three months later the same thing happened again - with the "redundant" server row.

You didn't hear this from me.

Press Any Key to Continue! (0)

Anonymous Coward | more than 3 years ago | (#36006714)

Proceed to bang head on table.

The technology is almost there (1)

Exceptica (2022320) | more than 3 years ago | (#36006762)

I am only considering VMware products again if they fire the idiot who wrote the blog post and cane him in the public square. Come on VMware, we are hoping for some retribution here.

Now, for the technical part, I'm only considering cloudy products again if they replace keyboards and human engineers with unicorns fluent in Lisp who can rainbow-activate and maintain the flockolent interfuzzys to the cervically index, to protect my data. I'm just not using any ol' cloud. No sir.

"press any key to crash the cloud" (0)

Anonymous Coward | more than 3 years ago | (#36006824)

nice option

Cloud lol. (1)

unity100 (970058) | more than 3 years ago | (#36006940)

I can't see why it is so hard to realize that if you end up tying everything into one major big structure and put everything in it, then regardless of how much redundancy you designed, it will eventually flop grandly.

if not downtime, it will be security. if not that, it's something else. the idea is, you are creating one HUGE environment which contains everything. it's inevitable that some issue eventually affects all the participants in that environment, those being the clients.

let's admit it - huge monolithic clouds are a bad idea. there should be a certain limit on a cloud's size, and beyond that the customers should be placed in another discrete cloud unit.

Oh, so that's what the pause/break key does (0)

Anonymous Coward | more than 3 years ago | (#36007046)

^^^
