
Can Maintenance Make Data Centers Less Reliable?

samzenpus posted more than 2 years ago | from the if-it-isn't-broken dept.

Hardware 185

miller60 writes "Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts. 'The most common threat to reliability is excessive maintenance,' said Steve Fairfax of 'science risk' consultant MTechnology. 'We get the perception that lots of testing improves component reliability. It does not.' In some cases, poorly documented maintenance can lead to conflicts with automated systems, he warned. Other speakers at the recent 7x24 Exchange conference urged data center operators to focus on understanding their own facilities, and then evaluating which maintenance programs are essential, including offerings from equipment vendors."


In between maybe? (5, Insightful)

anarcat (306985) | more than 2 years ago | (#38182956)

Maybe there's a sweet spot between "no testing at all" and "replacing everything every three months"? In my experience, there is a lot of work to do in most places to make sure that proper testing is done, or at least that emergency procedures are known and people are well trained in them. Very often documentation is lacking and the onsite support staff have no clue where that circuit breaker is. That is the most common scenario in my experience, not overzealous maintenance.

Re:In between maybe? (5, Interesting)

Elbereth (58257) | more than 2 years ago | (#38183084)

I suppose that I'd agree. Back in the early 90s, I inherited from a friend a fear of rebooting, turning off, or performing maintenance on a computer. Half the time he opened the case, the computer would become unbootable or never turn back on. Luckily, as a talented engineer, he could usually fix whatever the problem was, but it was a huge pain in the ass. Of course, back then, commodity computer hardware was hugely unreliable, with vast gaps in quality between price ranges, and we were working with pretty cheap stuff. Still, to this day, I dread the thought of turning off a computer that has been working reliably. You never know when some piece of crap component is nearing the end of its life, and the stress of a power cycle could be what pushes it over the edge into oblivion (or highly unreliable behavior).

I used to be fond of constantly messing with everything, fixing it until it broke, but his influence moderated that impulse in me, to the point where I usually freak out when anyone suggests unnecessarily rebooting a computer. Sure, there's something to be said for preventative maintenance, and I'd rather be caught with an unbootable PC during regularly scheduled maintenance than suddenly experience catastrophic failure at random, but there's also something to be said for just leaving the shit alone and not messing with it. Every time you touch that computer, there's a slight chance that you'll accidentally delete a critical file or directory, pull out a cable, or knock loose a power connector. The fewer times you come into contact with the thing, the better. If I could build a force field around every PC, I probably would.

Re:In between maybe? (3, Informative)

sphealey (2855) | more than 2 years ago | (#38183212)

===
Back in the early 90s, I inherited from a friend a fear of rebooting, turning off, or performing maintenance on a computer. Half the time he opened the case, the computer would become unbootable or never turn back on.
===

Neither you nor your friend are alone in thinking that:

AD-A066579, RELIABILITY-CENTERED MAINTENANCE, Nowlan & Heap, (DEC 1978) [this used to be available for download from the US Dept of Commerce web site; now appears to be behind a US government paywall (!)]

A more recent summary:

http://reliabilityweb.com/index.php/articles/maintenance_management_a_new_paradigm/ [reliabilityweb.com]

sPh

soft vs hard reboot (1)

Joe_Dragon (2206452) | more than 2 years ago | (#38183330)

Sometimes the software / OS needs at least a soft reboot from time to time to clean up stuck software and remove memory leaks.

Now some stuff like firmware updates may need a hard reboot.

As for power cycling, sometimes you need to do it to get back into a crashed system.

Re:soft vs hard reboot (1)

Pharmboy (216950) | more than 2 years ago | (#38183410)

Sometimes the software / OS needs at least a soft reboot from time to time to clean up stuck software and remove memory leaks.

What operating system are you using, Windows 98? The worst case I've had with Linux (CentOS 5.4) is NFS locking up and taking one of the CPUs for a ride, jumping the load to around 8 to 9. Even then, a little patience killed the process after a while. There have been times when it was FASTER to hard boot (but the risks suck), but most modern applications and operating systems ON THE SERVER don't usually leak memory in this day and age. I'm guessing most people restart processes via CRON or the MS equivalent on a regular basis anyway.

Obviously you reboot for many firmware updates or kernel updates, but almost never do modestly maintained servers just "crash" without there being a hardware failure.
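
A minimal sketch of that kind of cron-driven restart, in Python; psutil is assumed to be installed, and the service name, memory threshold, and use of systemd are made-up examples rather than anything from the comment above:

    #!/usr/bin/env python3
    # Hypothetical cron job: restart a leaky daemon once its total RSS
    # crosses a threshold. Run it hourly from crontab, for example.
    import subprocess
    import psutil  # third-party; pip install psutil

    SERVICE = "leakyd"           # made-up service name
    RSS_LIMIT = 512 * 1024 ** 2  # 512 MiB, arbitrary example threshold

    def total_rss(name):
        """Sum resident memory of all processes with the given name."""
        total = 0
        for proc in psutil.process_iter(["name", "memory_info"]):
            if proc.info["name"] == name and proc.info["memory_info"]:
                total += proc.info["memory_info"].rss
        return total

    if total_rss(SERVICE) > RSS_LIMIT:
        # Assumes systemd; swap in whatever init system is actually in use.
        subprocess.run(["systemctl", "restart", SERVICE], check=True)

Running it from cron means the restart happens inside a known window rather than as a surprise, though whether that counts as a fix or as papering over a leak is another question.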

well some windows updates still need reboots less (1)

Joe_Dragon (2206452) | more than 2 years ago | (#38183472)

Well, some Windows updates still need reboots; it's less than it was in the past, but still more than with Linux.

Also, a lot of non-OS software updates / installers at least say they need a reboot.

Re:soft vs hard reboot (3, Insightful)

Bigbutt (65939) | more than 2 years ago | (#38183714)

You must not deal with any Oracle database servers. They leak like a sieve.

[John]

Re:soft vs hard reboot (2)

hedwards (940851) | more than 2 years ago | (#38183856)

NFS locking up is ultimately a part of the spec. It was originally a stateless filesystem that operated over UDP. Unless you're using a more recent revision of the protocol and have it configured as such, you're going to have issues with it locking up regularly.

Re:soft vs hard reboot (0)

Anonymous Coward | more than 2 years ago | (#38184030)

You've never had a Linux system with a bunch of zombies stuck in IOWait? Not even SIGKILL will cause them to die.

Re:soft vs hard reboot (2)

PCM2 (4486) | more than 2 years ago | (#38184272)

Desktop PCs and servers seem to have largely overcome the need to reboot regularly, but other segments of the industry seem to be moving backwards. My Android handset actually says in the manual that you should power cycle it regularly. With a firmware upgrade, it even started giving me a warning from time to time, telling me I had not power cycled the phone in X amount of time and that I should do so now or risk instability. (Am I crazy for assuming that a phone OS is a markedly less complex environment than a Linux server? And here I thought Android applications ran in a fully memory-managed, garbage-collected environment.)

Re:In between maybe? (4, Interesting)

mspohr (589790) | more than 2 years ago | (#38183564)

Do you know why satellites last so long in a hostile environment?... because nobody touches them.

"If it's not broken, don't fix it."

Re:In between maybe? (4, Insightful)

CyprusBlue113 (1294000) | more than 2 years ago | (#38183742)

Do you know why satellites last so long in a hostile environment?... because nobody touches them.

"If it's not broken, don't fix it."

Actually, I'm pretty sure it's the millions that are spent engineering each individual one so that it specifically can survive many years in said hostile environment.

If we spent anywhere near that much time and money on proper engineering, everyday crap would be pretty damn reliable too, just not nearly as cost effective.

Re:In between maybe? (1)

Libertarian001 (453712) | more than 2 years ago | (#38183840)

Actually, the components would be pretty close to as cost effective. The executives just wouldn't get their uber-fat rewards.

Re:In between maybe? (2)

dave562 (969951) | more than 2 years ago | (#38183648)

I am still that way with firmware upgrades. I think it probably has something to do with our generation. In the 90s, computer hardware was touchy and expensive to replace. If you're like me, you probably grew up blowing into Nintendo game cartridges when they did not work. But back to firmware: I only upgrade it when necessary. Over the last fifteen years I have seen too many firmware upgrades bork hardware that was working just fine. With security patches I do them monthly, but not firmware. And never Cisco IOS. Once the config is good, leave it be!

The threat today is automated updates (1)

syousef (465911) | more than 2 years ago | (#38183986)

One bad automated update can leave your system hosed, or cause obscure reliability problems that may not show up for a while; the worst ones again leave you with little option but to rebuild the system.

So I turn off auto update on everything I can, and manually update periodically. I consider the security risk smaller this way. I get it stable, and let it run that way for a few months at least. Then I apply security fixes and so on.

Re:In between maybe? (1)

Lorens (597774) | more than 2 years ago | (#38184330)

I've seen for myself that hard disks that run for a long time (years) have problems starting up again after a power off. I've long supposed that it had to do with some bearings wearing out or oil getting used up. RAID is of course the correct answer to that, but even if I have to offline a service for some reason, I've gotten into the habit of not powering off the second side of a HA pair until the first one is safely back up.

Re:In between maybe? (4, Insightful)

mabhatter654 (561290) | more than 2 years ago | (#38184530)

if that's the case, you don't have CONTROL over your equipment.

That was acceptable for Windows 95 but not even for desktop PCs anymore, let alone server equipment. My opinion is that your equipment isn't stable UNTIL you can turn it off and on again reliably. And yes... that is an ENORMOUS amount of work.

If you can't reliably replace individual pieces then you don't have control for maintenance... sure, you can stick your head in the sand and just not touch anything... but that's just piling up all the things you didn't take time to figure out until some critical time later.

Re:In between maybe? (1)

datavirtue (1104259) | more than 2 years ago | (#38184542)

Buy good stuff, document, have online test systems, and keep replacement hardware on hand. No maintenance required; why mess with stuff if it is working? If you break a critical system in the midst of maintenance you have to either lie about it, or fess up and explain to management that you were dinking with it.

Maintenance and prevention are not always the same (3, Interesting)

sandytaru (1158959) | more than 2 years ago | (#38182982)

I believe the article is referring to major hardware replacements, stress testing, etc. But there is other preventative or even detective work that needs to be done in data centers large and small that has nothing to do with equipment. You can't just blithely assume that things are always going to work as they are supposed to work. One time, we discovered that the camera server for one of our clients had stopped recording for no good reason, and upon closer inspection found that the hard drive had failed and we had no alert system in place since it wasn't a "real" server but just a heavy-duty XP machine. After that blunder, I was asked to check on all the camera servers once a week and make sure I could actually open up and view recordings from days past. This is a preventative action, but not really a maintenance one.
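
That weekly eyeball check can itself be automated and wired into monitoring. A rough Python sketch; the recordings path and the 24-hour freshness threshold are assumptions for illustration, not details of the actual setup:

    #!/usr/bin/env python3
    # Hypothetical check: alert if the camera server has produced no new
    # recording files recently. Run it from cron or a monitoring agent.
    import os
    import sys
    import time

    RECORDING_DIR = "/srv/camera/recordings"  # placeholder path
    MAX_AGE_SECONDS = 24 * 3600               # assumed "too stale" threshold

    def newest_mtime(path):
        """Return the most recent modification time of any file under path."""
        newest = 0.0
        for root, _dirs, files in os.walk(path):
            for name in files:
                newest = max(newest, os.path.getmtime(os.path.join(root, name)))
        return newest

    if time.time() - newest_mtime(RECORDING_DIR) > MAX_AGE_SECONDS:
        print("ALERT: no new camera recordings in the last 24 hours", file=sys.stderr)
        sys.exit(1)  # non-zero exit so cron mails the output / monitoring flags it

As the reply below points out, this is detection rather than prevention, but it catches the failure before a user does.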

Re:Maintenance and prevention are not always the s (2)

belrick (31159) | more than 2 years ago | (#38183010)

After that blunder, I was asked to check on all the camera servers once a week and make sure I could actually open up and view recordings from days past. This is a preventative action, but not really a maintenance one.

No, it's not preventative. It does nothing to prevent the problem. It detects the problem earlier (before, say, a business user does). That's monitoring. It's proactive, not reactive - perhaps that's what you mean?

Re:Maintenance and prevention are not always the s (2)

sphealey (2855) | more than 2 years ago | (#38183056)

===
No, it's not preventative. It does nothing to prevent the problem. It detects the problem earlier (before, say, a business user does). That's monitoring. It's proactive, not reactive - perhaps that's what you mean?
===

It is deeply unclear whether what is traditionally termed "preventative maintenance" (intrusive work involving disassembling, eyeballing, software probing, etc.) actually improves reliability over condition-monitoring tests followed by break-fix work as described by the parent post. More PM, more procedures, more teardowns, and so forth are the standard prescription for improving reliability, but there are metric tons of evidence that the universe just doesn't work that way.

sPh

In the Army we had "PMCS". (1)

khasim (1285) | more than 2 years ago | (#38184206)

https://en.wikipedia.org/wiki/Preventive_Maintenance_Checks_and_Services [wikipedia.org]

Or to use a more common example, think about changing the oil in your car every (time interval) or (distance interval). Will it stop failures? Maybe. Maybe not.

On the other hand, every time you "work" on a system you introduce entropy.

As long as you remove more entropy than you introduce, you should have a more reliable system (than if you hadn't worked on it at all). But that gets into the training/knowledge of the person performing the PM.

Re:Maintenance and prevention are not always the s (3, Interesting)

bussdriver (620565) | more than 2 years ago | (#38183174)

Planned obsolescence has been promoted in all aspects of life since WW2, and now it is hard to imagine the world without it. That line of thinking has been creeping into everything, even into areas where it doesn't seem to apply.

Does this play a factor in the perception of preventative maintenance, or in its frequent application? I think it probably does, in at least a couple of ways, don't you?

Re:Maintenance and prevention are not always the s (1)

Anonymous Coward | more than 2 years ago | (#38183864)

It's planned to survive for however long it's estimated to be used by the consumer. That's because the one determining factor is technological advancement. Progress is a moving target. It's reflected in the market place. Once we hit a brick wall in progress, the focus will return to reliability and planned long-life. Similar to how my old 1950s Sunbeam toaster will be passed on from generation to generation like a family heirloom.

Re:Maintenance and prevention are not always the s (1)

Smallpond (221300) | more than 2 years ago | (#38183190)

Internal monitoring of components is a lot better now than it used to be. We used to go around and check all of the supply fans once a month because they were the highest-failure-rate component on the desktops we were using, and there was no indication of trouble until the machine started crashing from overheating.
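
A small illustration of that kind of in-band monitoring, as a Python sketch; it assumes a Linux host that exposes fan sensors and the third-party psutil package (5.1+), and the 500 RPM floor is an arbitrary example threshold:

    #!/usr/bin/env python3
    # Hypothetical fan check: flag any fan spinning suspiciously slowly so a
    # human looks at it before the machine starts overheating.
    import sys
    import psutil  # third-party; pip install psutil

    MIN_RPM = 500  # assumed "probably failing" threshold

    failing = []
    for chip, fans in psutil.sensors_fans().items():  # Linux-only in psutil
        for fan in fans:
            if fan.current < MIN_RPM:
                failing.append("%s/%s: %d RPM" % (chip, fan.label or "fan", fan.current))

    if failing:
        print("Fan warning: " + ", ".join(failing), file=sys.stderr)
        sys.exit(1)  # let cron or the monitoring agent pick this up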

Security updates (5, Informative)

bjb_admin (1204494) | more than 2 years ago | (#38182990)

Sometimes I get the feeling that security updates can in most cases cause more problems than the issues themselves.

I can think of many occasions that a security update has broken a server/router/etc. Obviously the lack of a security update can lead to a bigger headache in the future. But the typical user doesn't understand and has the attitude "IT broke the server again".

If a virus or hacker causes an issue the attitude is "I hope they fix that soon. I hate viruses/hackers" (obviously this is a huge generalization).

Re:Security updates (0)

Anonymous Coward | more than 2 years ago | (#38183270)

Incompetent admins.

Re:Security updates (1)

kasperd (592156) | more than 2 years ago | (#38183938)

Sometimes I get the feeling that security updates can in most cases cause more problems than the issues themselves.

Some vendors will push security updates and other updates through the same channel, sometimes without even telling the user whether a particular update fixes a security problem. Occasionally the update will even be installed without the user's consent.

If it were just a matter of installing updates that fix known security problems, with no other changes made to the software, I think cases of an update causing a problem would be much rarer.

Reliability Centered Maintenance (4, Interesting)

sphealey (2855) | more than 2 years ago | (#38182994)

===
"Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts.'
===

It isn't just human error: the very act of performing intrusive tasks under the theory of "preventative maintenance" can greatly reduce the reliability of systems built of reasonably reliable components. This was studied extensively by the US airlines, the US FAA, and later the USAF in the 1970s when the concept of reliability centered maintenance was developed for turbine engines and eventually full airliners. Look up the classic report by Nowlan & Heap. Very much counter-intuitive if one has been trained to believe in the classics of "preventative teardowns" and fully known failure probability distribution functions, but it matches up well with what experienced field mechanics have been saying since the days of pyramid construction.

sPh

Of course, today there is a huge "RCM" consulting industry, 7-step programs, etc that bears little resemblance to the original research and theories; don't confuse that with the core work.

Re:Reliability Centered Maintenance (1)

sphealey (2855) | more than 2 years ago | (#38183008)

===
n the 1970s when
===

Sorry; that should be "1960s" not 70s.

Re:Reliability Centered Maintenance (0)

Anonymous Coward | more than 2 years ago | (#38183422)

FYI Slashdot has quote tags.

Maintenance took down Chernobyl (3, Informative)

ExtremeSupreme (2480708) | more than 2 years ago | (#38183000)

That being said, it was because their procedures were shit, not because they were doing maintenance.

Re:Maintenance took down Chernobyl (-1)

Anonymous Coward | more than 2 years ago | (#38183144)

It's not that their procedures were shit; it's that a high-level party functionary decided to "put another medal on his chest" and ordered a dangerous experiment with safety systems turned off. And BTW, you're full of shit by saying what you said in your post. Several firefighters died trying to put out the fire. See you in hell with your ass burning on the hot frying pan.

Re:Maintenance took down Chernobyl (5, Informative)

crankyspice (63953) | more than 2 years ago | (#38183156)

That being said, it was because their procedures were shit, not because they were doing maintenance.

Actually, no, the Chernobyl disaster was sparked by a 'live' test of a new, untested mechanism for powering reactor cooling systems in the event of a disaster that brought down the power grid. http://en.wikipedia.org/wiki/Chernobyl_disaster#The_attempted_experiment [wikipedia.org] (And even that test was delayed several hours, into a shift of workers who weren't properly prepared to conduct it.)

Re:Maintenance took down Chernobyl (0)

Anonymous Coward | more than 2 years ago | (#38184172)

That is precisely the point. Maintenance procedures by their very nature are performed much less often than normal operations so the likelihood of failure either due to human error or erroneous procedures is much higher.

Re:Maintenance took down Chernobyl (1)

scamper_22 (1073470) | more than 2 years ago | (#38183680)

Unfortunately, that is where real world data and experience comes into play.

GIVEN that their procedures were shit, maintenance actually made things worse and thus caused Chernobyl.

Now theoretically, you just need better procedures to make maintenance be a net positive. However, that doesn't change the advice that you shouldn't do such maintenance... GIVEN that your procedures are bad.

Given that humans are error prone and the IT procedures are shit, it is probably good advice to not do the maintenance.

Re:Maintenance took down Chernobyl (2)

vlm (69642) | more than 2 years ago | (#38184220)

GIVEN that their procedures were shit, maintenance actually made things worse and thus caused Chernobyl.

I'm guessing you were going for the sarcasm points, but for those who don't know about nuke eng as much as I (and presumably scamper) do: they had perfectly good procedures for experiment engineering evaluation that they mysteriously chose not to follow, and there was no maintenance involvement at all. It's the opposite of what he was claiming.

The quickie one-liner of what happened is that an RBMK has an extremely sensitive control loop by the very nature of what it means to be an RBMK, and the engineers who know exactly what happens when you suddenly slam the gain of a control loop like that up to 11 were intentionally cut out of the loop; no one officially knows why. The negative oscillations to zero were not terribly impressive, but everyone noticed the final positive swing to 40 GW or so.

The ironic part is they were trying to improve safety by figuring out a ultra short term blackstart capability for the safety systems. It would actually have worked pretty well on a PWR design, which is probably what gave them that peculiar idea. One of the dead guys probably successfully did that "all the time" on his old PWR...

Re:Maintenance took down Chernobyl (1)

vlm (69642) | more than 2 years ago | (#38184268)

I thought about it a bit, and it would have worked on a PWR, but it would have worked REALLY WELL on a BWR, if you can survive the pressure fluctuations, which it probably could have. Of course if they're not running the reactivity control loop past engineering, they're not going to run the hydrodynamic control loop past engineering, so they might have industriously found a way to blow themselves up that way too.

PWRs are dead stable, not terribly sexy, quite heavy and bulky, and have more moving parts. If the coolant is boiling in the core, you're doin' it wrong in a PWR.

BWRs are way less stable (but seem rock solid compared to a RBMK), used to be the new sexiness before pebble bed reactors and all that, smaller and lighter, and have fewer moving parts. If the coolant is boiling in the core, you're doin' it right, in a BWR.

The point is if you survive the surge, the BWR is happy, actually thrilled, to have some core boiling, whereas the PWR might get royally pissed off if the pressurizer can't pressurize.

where's the car analogy? (4, Funny)

Anonymous Coward | more than 2 years ago | (#38183040)

The guy at the garage always recommends I do an $80 transmission flush.

More MBA Consultant BS... (3, Interesting)

Anonymous Coward | more than 2 years ago | (#38183072)

Seriously...I sometimes think the average IQ is dropping on a daily basis (and, yes, I get the irony)...Both with what I read, and my own experiences working in IT, I become more and more convinced that society will eventually collapse under the weight of bad advice from consultants (and, no, I don't own a fallout shelter)...and I spend more and more time thinking about ways that I can profit off of the stupidity of leadership.

Re:More MBA Consultant BS... (1)

ColdWetDog (752185) | more than 2 years ago | (#38183764)

Seriously...I sometimes think the average IQ is dropping on a daily basis (and, yes, I get the irony)...Both with what I read, and my own experiences working in IT, I become more and more convinced that society will eventually collapse under the weight of bad advice from consultants (and, no, I don't own a fallout shelter)...and I spend more and more time thinking about ways that I can profit off of the stupidity of leadership.

Read it [google.com] and grab your tin foil and duct tape. You're gonna need lots.

Another lesson relearned (2)

jimbrooking (1909170) | more than 2 years ago | (#38183102)

In days of old, running "big iron" from Control Data and Cray, the worst days of system instability were those following "preventive maintenance". Plus ça change...

Can faulty logic make data centers less reliable? (5, Insightful)

DragonHawk (21256) | more than 2 years ago | (#38183110)

From TFS:

"... poorly documented maintenance can lead to conflicts with automated systems ..."

That doesn't mean maintenance makes datacenters less reliable. It means cluelessness makes datacenters less reliable.

Sheesh.

Re:Can faulty logic make data centers less reliabl (3, Insightful)

HalAtWork (926717) | more than 2 years ago | (#38183260)

Exactly.

vigorous maintenance
excessive maintenance
poorly documented maintenance

Those are all qualified as out of the ordinary. Anything in excess (on either side of the scale, whether it is too much or not enough) is a problem. Of course maintenance must be performed, but I guess some data centers have a strange idea of best practices, or they do not follow them.

Re:Can faulty logic make data centers less reliabl (4, Insightful)

FaxeTheCat (1394763) | more than 2 years ago | (#38183338)

Precisely my thought.

Maintenance, like anything else you do in a datacenter or wherever you work, must be done correctly. If maintenance reduces the reliability of the maintained entity, then by definition it was not correctly performed.

Doing something correctly requires knowledge, planning and training. Just like everything else.

Maintenance-induced failure. (5, Insightful)

Animats (122034) | more than 2 years ago | (#38183120)

There's something to be said for this. Back when Tandem was the gold standard of uptime (they ran 10 years between crashes, and had a plan to get to 50), they reported that about half of failures were maintenance-induced. That's also military experience.

The future of data centers may be "no user serviceable parts inside". The unit of replacement may be the shipping container. When 10% or so of units have failed, the entire container is replaced. Inktomi ran that way at one time.

You need the ability to cut power off of units remotely, very good inlet air filters to prevent dust buildup, and power supplies which meet all UL requirements for not catching fire when they fail. Once you have that, why should a homogeneous cluster ever need to be entered during its life?
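
The "cut power remotely" piece is typically done out-of-band through each unit's BMC. A minimal sketch using ipmitool from Python; the hostnames and credentials are placeholders, and it assumes ipmitool is installed and every node's BMC is reachable:

    #!/usr/bin/env python3
    # Hypothetical remote power control for a "no one enters the container"
    # operating model: failed units are simply powered off out-of-band.
    import subprocess

    def set_power(bmc_host, user, password, state):
        """state is one of ipmitool's standard chassis power verbs: on, off, cycle."""
        subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", bmc_host,
             "-U", user, "-P", password, "chassis", "power", state],
            check=True,
        )

    # Example: retire a failed node without anyone touching the hardware.
    # set_power("bmc-rack12-node07.example.net", "admin", "changeme", "off")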

Re:Maintenance-induced failure. (5, Insightful)

DarthBart (640519) | more than 2 years ago | (#38183274)

There's also been a shift in the mentality about how well computers should operate. It went from not tolerating any kind of downtime to the Windows mentality of crashing and "that's just how computers are".

Re:Maintenance-induced failure. (2)

brusk (135896) | more than 2 years ago | (#38183382)

I think that predates Windows. Crashes of various kinds were frequent on Apple IIs, Commodores, etc. You just got used to various reboot/retry routines.

Re:Maintenance-induced failure. (0)

Anonymous Coward | more than 2 years ago | (#38184398)

I think that predates Windows.

Not true. I used many computers and programmable calculators before and during that time, and they were all very reliable compared to Windows. Unlike Microsoft, the other vendors actually cared about the quality of their products and about giving value in exchange for your cash. Microsoft was always in it just for the money. Their first products (Microsoft Z80/CP/M Fortran and Assembler, Basic) were okay, but it was swiftly downhill after that, and they didn't care whether transactions were win-win or win-lose.

Crashes of various kinds were frequent on Apple IIs, Commodores, etc.

Not even close. There were orders of magnitude more crashes under Windows, and this was true for decades afterwards.

Microsoft and Windows are almost entirely responsible for raising a generation of people who erroneously think computers crashing daily is somehow "normal", and that continues to this day.

You just got used to various reboot/retry routines.

No, it was all much the same; you just got an OK-to-reboot or enter-debugger popup.

Re:Maintenance-induced failure. (1)

FaxeTheCat (1394763) | more than 2 years ago | (#38183362)

The future of data centers may be "no user serviceable parts inside". The unit of replacement may be the shipping container. When 10% or so of units have failed, the entire container is replaced. Inktomi ran that way at one time.

I saw a while back (probably a year or two ago) that this is the way Microsoft will run (runs?) their Azure systems.

By the time 10% of the units are not working, it may be time to upgrade to the latest technology anyway. If you exclude disks, then I am certain you could run such a container for more than 10 years that way.

Re:Maintenance-induced failure. (2)

jklovanc (1603149) | more than 2 years ago | (#38184016)

Possibly because ten out of 100 units have failed simply because a $200 hard drive failed in each one? Does that mean the whole $100,000 cluster needs to be replaced? Spending $100,000 instead of $2,000 is not a great decision.

In other news ... (1)

xs650 (741277) | more than 2 years ago | (#38183132)

Any maintenance done wrong or in excess can be more damaging to a system than no maintenance.

I agree... (1)

Anonymous Coward | more than 2 years ago | (#38183138)

I have done data center hardware maintenance, IMAC, troubleshooting, repairs, etc. for the past 15 years. From my experience, the biggest problem has been with sysadmins who do not know their hardware, do not follow procedure, are quick to point fingers at hardware as the root problem, and have an overall belief that firmware upgrades are the "Holy Grail" that prevents and fixes all problems.

With most systems redundant in power, storage, etc., everything does not always need to be power cycled. Power cycling causes fans and other components to fail when coming back up... yes, they were probably failing anyway, but if a repair can be done on the fly, why power cycle all the time?

It really gets fun when someone does not have their system alerts set up properly, takes down a server with updates to the OS/app side, discovers the problems, and then complains about losing thousands of dollars in downtime. If the system is that critical, why not design a failover system? It would be cheaper in the long run...

Also, most hardware service outfits have several spares available for 24/7 repairs, but when a sysadmin or business unit decides to down several systems in a weekend that have been running for months or years, and upon power-up there are several drive failures, fan failures, PPM failures, power supply failures, cache battery failures, etc., most vendors including OEMs cannot have all the parts available at a moment's notice, since we stock spares according to failure rates and part pricing. A proactive heads-up project with a list of hardware that is going to be worked on would be nice. We could stand by with parts available for repairs and even on-site "eyes and hands" if the systems crap out and need on-site assistance.

There are several flaws in data center management, from the single file-and-print server in a small office to large data centers with huge support groups like banks and government centers. The same problems exist on all levels... some are just better prepared for the "worst", while others think we are magicians/mind readers...

As "Bones" would have said if he were in IT..."Dammit Jim, I am a technician....not a magician!"

The key to achieving high uptime ... (2)

zensonic (82242) | more than 2 years ago | (#38183148)

... is actually quite simple: You keep your hands off the systems. Period.

In detail, you plan, install and _test_ your setup before it enters production. You make sure that you can survive whatever you throw at it with respect to errors and incidents. You then figure out how much downtime you are allowed to have according to the SLA. You then divide this number into equal-sized maintenance windows together with the customer. And then you adhere to those windows! No manager should ever be allowed to demand downtime out of band. Period. In between, you basically minimize your involvement with the systems and plan your activities for the next scheduled maintenance window.

And you of course only deploy stable, tried and tested versions of software and operating systems. And even though your OS supports online capacity expansion on the fly, you really shouldn't use the capability unless you absolutely have to. Instead you plan ahead in your capacity management procedure and add capacity in the maintenance windows. And you do not test and rehearse failures! That only introduces risk... besides, you have already tested and documented them. And as you haven't changed the configuration, there is no need to test again.

So in essence: common sense will easily yield 99.9%. Careful planning and execution will yield 99.99%. The really hard part is 99.999%... /zensonic
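
For a sense of scale, the SLA arithmetic described above is easy to put in code. A quick sketch, with four maintenance windows per year as an arbitrary example:

    #!/usr/bin/env python3
    # Turn an SLA percentage into a yearly downtime budget and split it into
    # equal maintenance windows (the window count is just an example value).
    def downtime_budget(sla_percent, windows_per_year=4):
        minutes_per_year = 365.25 * 24 * 60
        allowed = minutes_per_year * (1 - sla_percent / 100.0)
        return allowed, allowed / windows_per_year

    for sla in (99.9, 99.99, 99.999):
        total, per_window = downtime_budget(sla)
        print("%.3f%%: %6.1f min/year, %5.1f min per window" % (sla, total, per_window))

That works out to roughly 526, 53, and 5 minutes per year respectively, which is why the last nine is the hard one: at 99.999% there is barely time for a single reboot, let alone a planned window.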

Re:The key to achieving high uptime ... (4, Insightful)

Smallpond (221300) | more than 2 years ago | (#38183222)

Which means for every online server you need an offline test machine -- and a way to simulate the operating environment in order to test. Not many companies have the skill or cash to do that.

That's where planning comes in. (1)

khasim (1285) | more than 2 years ago | (#38184364)

You only need a server for each item that is different. So if you standardize on hardware / OS then you only need 1 server to test hardware drivers and OS updates and so forth.

Beyond that, you really should have a test database system and a test app system. You never want to deploy updates into a production environment without going through a test system first (which is NOT the same as a development environment).

Virtual systems can help a lot with the server requirements. But you still need to understand the hardware / virtual / OS / app differences and plan accordingly.

Re:The key to achieving high uptime ... (2)

darth dickinson (169021) | more than 2 years ago | (#38183604)

And I would have my own personal unicorn that craps Skittles on demand. Also, I could eat candy and poop diamonds.

Meanwhile, here in the real world... systems experience unexpected failures that will require them to be patched/rebooted/etc at the most inconvenient of times.

Missing the point... (1)

nitehawk214 (222219) | more than 2 years ago | (#38183154)

Obviously not doing maintenance is much worse than the risk incurred by doing maintenance. However, in the two years my company has relied on our datacenter, the only three major outages have been due to non-routine maintenance. One was during a power upgrade: the datacenter supposedly has redundant power company connections, but the plug was somehow pulled during the upgrade anyhow. Another was during network maintenance, when our dual redundant internet connections turned out not to be so redundant once half the system went offline. And the last was during AC upgrades that caused our entire server cluster to overheat and shut down.

So the story here isn't "maintenance causes unplanned downtime, therefore doing less maintenance causes less downtime". It's more like "unusual maintenance is more likely to get messed up". One would think you would be more careful when performing unusual procedures, but that is still when things get screwed up most often.

Far from obvious (1)

sphealey (2855) | more than 2 years ago | (#38183224)

===
Obviously not doing maintenance is much worse than the risk incurred by doing maintenance.
===

That's far from obvious, actually, and is demonstrably wrong for many types of systems and installations.

sPh

Re:Missing the point... (1)

dave562 (969951) | more than 2 years ago | (#38183700)

It is time to switch data centers. We're with AT&T (and AT&T is far from the best) and they do quarterly power tests without a single problem. They've done core infrastructure router upgrades with zero down time. All in all, I'm very happy with the service. Any competent co-lo provider should be able to handle the issues you've had without any hiccups.

Transfer switch ratings (4, Interesting)

vlm (69642) | more than 2 years ago | (#38183158)

Check your transfer switch ratings. I guarantee it will be spec'd much lower than you think. The electricians think it'll only be switched a couple of times in its life. The diesel service provider thinks you're running it twice a week. Whoops. If you run it once a week, it'll only survive a couple of years, then you'll get a facility-wide multi-hour outage. I've personally seen it over and over again over the past two decades. The best part is "we have a procedure", so it'll only be run during maintenance hours, and the desk jockeys 200 miles away will run it rain or shine, so it's guaranteed that the transfer switch destroys itself at 2 am during a blizzard and it'll take half a day to repair.

Very few xfer switches are more reliable than commercial utility power. Installing a UPS actually lowers reliability in almost all professional situations.

My favorite power outage was caused by a gas leak a couple blocks away, where the utility co shut down the AC and then threatened to take an axe to the gen/UPS if not also shut off. This was not in the official written report, just word of mouth.

Useless article with no data. (4, Interesting)

Vellmont (569020) | more than 2 years ago | (#38183196)

I read through the entire article, and saw zero data to support his assertion. I'm sure he has the data, but the article didn't reference a single piece of it. Without any data to support the theory all we have is a fluff opinion piece. Shame on Data Center Knowledge for writing an article about a scientific investigation, and not presenting a single piece of scientific evidence.

This is well known from Formula One (5, Interesting)

igb (28052) | more than 2 years ago | (#38183228)

Some years ago, the F1 rules were changed so that cars were in parc ferme conditions, with strict limits on what can be done to them, from the start of qualifying on Saturday lunchtime until the race finishes on Sunday afternoon.

The purpose was partly to stop qualifying being its own arms race, with cars in completely different specification than for the race, and partly to reduce costs and the number of travelling staff. At the same time, "T Cars" --- a third car, available as a spare --- were banned, so that if a driver destroys a car in practice the team either have to rebuild it or not race. They're allowed to travel with a spare monocoque, but it cannot be built-up and it does not get pit space.

There were endless howlings from the teams, claiming that without a complete strip-down after qualifying, with a large crew working overnight to check everything on the car, reliability would go through the floor and races would finish with only a handful of stragglers fighting a durability battle (our US viewers may find this ironic in light of a certain US Grand Prix, of course).

The same argument was advanced, mutatis mutandis, over limitations on engines and gearboxes, limitations on the number of gear clusters available, limitations on certain forms of telemetry and a wide variety of "the cars can't just be left to run themselves, you know" interventions.

In fact, reliability is now far greater than ten years ago. It's not uncommon for there to be no mechanical retirements, certainly not from the longer-standing teams, and the days of engines imploding on the track are long gone. A front-running driver will probably only have one, if even that, mechanical DNF per season. The teams deliver a functioning car when the pit lane opens at 1pm Saturday, and that car then runs twenty or thirty laps in qualifying and sixty or seventy in the race, a total of perhaps 250 miles, without much maintenance work beyond tyres, fluids and batteries (section 34.1 on page 18 of the sporting regulations [fia.com] ).

So again, we see that "preventative maintenance" turns out to really be "provocative maintenance", and leaving working machines alone is the best medicine for them.

Re:This is well known from Formula One (4, Insightful)

scattol (577179) | more than 2 years ago | (#38183716)

Those cars, to be competitive, were engineered to fall apart on the other side of the finish line. Without maintenance they would have failed. They are now engineered to last a few races instead of just one. Odds are they are slightly slower in one form or another, but since it's a level playing field, it doesn't matter.

Re:This is well known from Formula One (1)

vlm (69642) | more than 2 years ago | (#38184088)

Also don't forget driving style is modified. "F it, you guys are going to tear the thing down anyway so I'm gonna lean it out to sneak in a couple more laps per tank" is replaced with babying the car.

Re:This is well known from Formula One (0)

Anonymous Coward | more than 2 years ago | (#38184054)

Doesn't hold for the consumer market however. In F1, you have lots of R&D being pumped into performance and reliability. In the general consumer market, items are sold mainly on price. So in a general sense, you get what you pay for. For example, server hardware is more reliable than your $300 desktop special.

Re:This is well known from Formula One (1)

jklovanc (1603149) | more than 2 years ago | (#38184060)

So they removed the maintenance between qualifying and race. That does not mean that the race team does not do "maintenance" between races. Yes there is "less maintenance" but not "no maintenance".

If it's working (1)

tmpsantos (1161451) | more than 2 years ago | (#38183238)

Sounds like the "if it's working, don't touch it" maintenance policy.

If it ain't broke, don't fix it. (0)

Anonymous Coward | more than 2 years ago | (#38183682)

Age old proverb, true today as it was a century ago.

"The most reliable machine... (3, Funny)

John Hasler (414242) | more than 2 years ago | (#38183242)

...is the one farthest from the nearest engineer."

Consider The Pioneer and Voyager spacecraft and the Mars landers.

Remember Chernobyl (-1)

Anonymous Coward | more than 2 years ago | (#38183296)

Enough said

battery maintenance / changing out battery is nee (1)

Joe_Dragon (2206452) | more than 2 years ago | (#38183314)

Battery maintenance / changing out old batteries is needed, as a dying battery can fail to work when needed or, at worst, can explode.

Lesson learnt in Aviation (1)

Anonymous Coward | more than 2 years ago | (#38183318)

Commercial aviation learnt this lesson a long time ago, after analysis showed that maintenance-induced failures exceeded the potential failures avoided.

Now most parts for airframes and engines are on IRAN schedules - Inspect and Repair/Replace as needed. Only a few filters and things are replaced on a fixed schedule.

Here's the car analogy ... (1)

Anonymous Coward | more than 2 years ago | (#38183368)

One of the Asian automotive manufacturers, Toyota, Honda, or Nissan, sent a crew around their assembly plants to retorque any loose nut/bolt/screw they found. They saw a dramatic reduction in plant downtime. Ben Franklin's "a stitch in time" comes to mind. Computers will fail eventually. If the system is so unstable that "maintenance" tips it over the edge, then you were about to have a severe incident anyway (and probably without the team available to fix it, like the maintenance crew that is still right there). Make prudent backups before any major maintenance, plan for some random glitch getting things back online, and you're covered. It's hard to record the "things gone right" from maintenance... but easy to record the number of problems after maintenance.

Classic article on this problem: ValuJet 592 (1)

rbrander (73222) | more than 2 years ago | (#38183370)

http://www.theatlantic.com/magazine/archive/1998/03/the-lessons-of-valujet-592/6534/ [theatlantic.com]

William Langewiesche, Atlantic Monthly, "The Lessons of ValuJet 592". The flight was basically done in because it was transporting safety equipment itself, which was vulnerable to a hard-to-predict failure. The more complex we make air travel, with its multiple checks and layers of protection, the more opportunities for failure. Adding another check to avoid 592, as they did, creates yet another opportunity.

It is, as they say, a Hard Problem. Yet, still: the US recently celebrated 10 whole years without a major airliner loss, despite a phenomenal amount of air travel. Things are getting better. Hard != Insoluble.

Laziness makes the datacenter unreliable (0)

Anonymous Coward | more than 2 years ago | (#38183378)

Let's face it... if 'we' administrators REALLY did our jobs and REALLY knew our stuff, we wouldn't have to patch nearly as often as we do, because our systems would be hardened and properly protected, leaving many patches unneeded unless the services our systems were providing required them.

And in those cases, our documentation would allow us to REALLY test exactly what we needed to and script the installs specifically for the systems that needed the patching.

Unfortunately, we DON'T REALLY know our systems the way we need to in order to PROPERLY secure them and maintain them. I see all too frequently the practice of 'getting it working' versus 'proper implementation' because we don't have the time or resources.

THIS unfortunately is what leads us to the regular patch cycles that we have, because when we don't really know and understand our systems and defend them independently by proper hardening, we must rely on patching everything, because in the end, everything is exposed. (Mind you, that's Windows, Linux, mobile, etc. etc.)

BUT, I don't want to leave it all to us administrators. Let's face it, VENDORS and MANUFACTURERS have a hand in this as well. Nine times out of ten they don't even know their products well enough to tell us administrators what to do, or how to properly implement their solutions. They are so quick to market, their products stink. A proper and secure implementation of a product that is not written properly might as well be a waste of time. They are LAZY too. Their products don't work as advertised, and they are so focused on sales they don't even care that they aren't installed properly. When I have to work with top-level support to get a security product to function as advertised because it "isn't normally installed that way" (even though that IS the way it SHOULD be installed to be properly secured), then there is a problem with the product, PERIOD.

So, to my brethren in IT, let's make a statement to really and truly understand our environments and NOT let our product manufacturers or in-house developers BE LAZY and make our systems unreliable.

Its All In the Process ... (2)

TheGreatDonkey (779189) | more than 2 years ago | (#38183390)

Having been involved in technical ops at both large and small companies for many years, I have seen DR exercises and designs that run the gamut. The key thing I have found to the success of any organization, exercise, or philosophy is the underlying process that drives execution. The larger the team/org, the more change points, which in turn leads to more variables between tests. This creates complexity, as a test that ran fine a few months ago may not run the same today. However, ensuring that change does not outrun the process of understanding and applying that change to the greater design is key to making each test improve upon the last.

For example, when working for one of the big 401k providers, the first DR exercise simulated the data center being completely leveled and relocated both technical services and the ~300 on-site employees to another location. Long story short, the first run of this was scheduled for 2 days, and while it worked, we identified dozens of issues. We scheduled the next test 6 months later and addressed what we believed were all of the issues; on the next test, we ran into perhaps ~10 issues. The next test we scheduled 3 months ahead and ran into ~2 issues. All the while, things continue to change and innovation occurs, and change control ensures that new things are factored into the ongoing DR process/exercise. For a small telecom I worked for, the same type of testing was accomplished with a ~2-3 week turnaround time (smaller team, fewer change points, more dynamic response), but with the same underlying principles.

Documentation of such things is critical, and employee turnover is often one of the greatest risk points. Having a diversified staff with overlapping knowledge should minimize the latter risk to some degree, and if implemented fully, the risk should be diminished.

So how does all this tie back into maintenance? Well, if any system runs long enough, there will be opportunity for failure. By preparing for when such failure occurs, one can offer a measured window of downtime (if any) and some degree of predictability (i.e. "I test once a quarter"). The counter to this can certainly be overzealous maintenance, so there is a point to being reasonable. For example, what many of us go through with our cars: the dealer wants us to come in every 3k miles for an oil change, whereas realistically most mfrs' schedules and my own experience dictate that ~5k (if not longer depending on circumstance) is much more cost effective. Either way, this provides some degree of confidence that it should prolong engine life.

Re:Its All In the Process ... (1)

vlm (69642) | more than 2 years ago | (#38184146)

For example, what many of us go through with our cars: the dealer wants us to come in every 3k miles for an oil change, whereas realistically most mfrs' schedules and my own experience dictate that ~5k (if not longer depending on circumstance) is much more cost effective.

LOL, old Saturn cars were famous for a valve issue where the engine suddenly starts to burn about a quart of oil per 1k miles after about 125k miles of service... engine capacity is 4 quarts of oil... "most" owners don't even know what a dipstick is, much less how to read it... lots of Saturn engines dead with an empty oil pan...

You'd be surprised how often the manufacturers actually know what they're doing with stuff like that.

Something I never understood about that whole mentality: pay the bank $40,000 in payments to buy a $30,000 car, then destroy the car by trying to save $30. I can see never doing maintenance on a $500 beater, but...

it's a zoo! (0)

Anonymous Coward | more than 2 years ago | (#38183432)

tell that to my spiders. they are trying to "maintain" their webs in my servers for some reasons (hint:very bad)!

It depends (1)

Glendale2x (210533) | more than 2 years ago | (#38183478)

If "maintenance" means doing a forklift upgrade of all the computer and networking equipment every year or two then of course your reliability is going to suck, especially the human error factor with all of that new, unfamiliar equipment.

On the other side of things if someone thinks that never changing the oil in the generator is going to make it more reliable then they're in for a surprise. When I think about datacenter "maintenance" I think: changing the CRAC air filters, cleaning any outdoor coils, changing the oil on the generator, loading the generator, replacing old lead acid batteries, checking building integrity, making sure birds aren't nesting anywhere stupid, and so forth. Physical plant won't last forever.

The quality of the people matters a lot (3, Insightful)

petes_PoV (912422) | more than 2 years ago | (#38183622)

Although everyone makes mistakes, some people make hundreds of times more errors than others. Whether that's due to inherent lack of ability, poor training, lack of oversight, laziness, time pressure or just a slapdash attitude varies with each person. One place I was involved with (as an external consultant) made over 12,000 changes to their production systems every year. It turned out that well over half of those were backing out earlier changes, correcting mistakes/bugs from earlier "fixes", or other activities that should not have happened and could have been prevented (a lot of which resulted in downtime, and far too much of it unscheduled or emergency downtime).

Re:The quality of the people matters a lot (1)

gweihir (88907) | more than 2 years ago | (#38183740)

Was just writing my posting while you did yours. I could not agree more. Additional aspects are:
- Engineers and managers who try to justify their existence by performing a lot of maintenance
- Incompetence due to bad training, arrogance and inexperience

Example: I recently pulled an Ethernet cable with a broken connector out of a mission-critical server (it was not in production; we were reviewing cabling correctness). Turns out that some brain-dead person did the cabling with old, used cables. One minute of downtime there would likely cost more than the whole set of about 200 cables. Sometimes you really start to doubt that humans have intelligence.

Re:The quality of the people matters a lot (1)

jklovanc (1603149) | more than 2 years ago | (#38184124)

This sounds very much like a phenomenon I call "al dente programming": throw code at a problem until something sticks. There is very little thought given to the consequences of actions, and an assumption that any issues will be fixed later. If people would just slow down a bit, do some research, and think about the action, maybe they would start fewer fires that need to be put out. It is circular logic: I don't have time to think about something because it is a fire, but not thinking about something creates fires in the future, so I am back in the same place.

Busy work (0)

Anonymous Coward | more than 2 years ago | (#38183650)

Talk like that will drive the economy into a (deeper?) recession.

Depends on the people (1)

gweihir (88907) | more than 2 years ago | (#38183674)

If you have rushed, underqualified people do the maintenance, then sure, it decreases reliability. If you have careful, non-rushed and competent people doing it, I doubt very much that the same is true. These people tend to be a bit more expensive, but cutting cost in the wrong places is a traditional occupation of managers in IT.

All I want... (1)

ibsteve2u (1184603) | more than 2 years ago | (#38183748)

All I want are systems with interchangeable fan/air inlet filters on the outside of the case that do not require a tool to remove and replace - let alone a power cycle. Is that so much to ask?

It's funny...I have cases where that sort of filter exists for a bottom-mounted power supply, but the case's own fans? Have to take 'em apart to properly clean the filters. And please don't say "Just lug a vacuum cleaner around." - they rarely do a good job if they actually are "luggable"; they (in a rare phrase) don't suck hard enough.

It is my experience that dirt and heat are the two greatest enemies of any electronic device, and will be as long as superconductivity without cooling is infeasible.

When I was a computer tech... (1)

Archtech (159117) | more than 2 years ago | (#38183756)

... a very long time ago, we had a saying about this.

"It's called preventive maintenance because it prevents the computer from working".

The Real Problem (0)

Anonymous Coward | more than 2 years ago | (#38183760)

Is that Documentation requires time, and thus money - something Management just can't be bothered to allocate resources for.
This article misses the Real Problem by miles.
FAIL.

Then there is incremental Dis-synergy (0)

Anonymous Coward | more than 2 years ago | (#38184384)

I work for a global pharma company with several key data centers (DCs) around the world; they are interconnected with each other and with hundreds of client sites. The problem is that this interconnection infrastructure and the internal infrastructure at each site are in a constant state of flux as new requirements arrive (technical, business, regulatory, etc.) and old ones go away. Thus the apps and infrastructure and their interdependencies are always several steps ahead of the documentation, despite rigorous requirements (not only per good IT practice, but also per medicinal regulatory requirements).

There have been "unintended consequences" every time there has been such maintenance, from upgrading a few switches to full DC power recycles to check UPS batteries/generators and other parts and pieces: "oh, the US network to the rest of the globe goes through THOSE switches now? ... They use WHICH LDAP server?" The complexity is beyond even the most ambitious and determined efforts of a lot of very smart people working their tails off trying to make it all go right using all the IT buzzword-compliant techniques (ISO 9000, ITIL, LEAN, Six Sigma, etc.). That is exacerbated by personnel turnover (or just plain too much laying off per the bottom-liners) throwing new support people into unfamiliar circumstances, as someone mentioned above.

We have to do this stuff, but we have not figured out how to keep up with all the moving parts as they keep changing...

YMMV

Sounds like my desktops (1)

coldsalmon (946941) | more than 2 years ago | (#38184574)

My super-cool Linux box at home, which I work on all the time, is much less reliable than my work desktop running XP, which I never touch except to do my job. The most reliable, of course, is the headless Linux server that sits under my desk at home and never gets touched. In fact, I have this separate server precisely because I know that I will mess up my desktop by trying to fix/maintain it all the time, and I intend never to touch the server unless something goes wrong.
