
The Decline and Fall of System Administration

CmdrTaco posted more than 3 years ago | from the hail-the-fallen-heroes dept.

Unix

snydeq writes "Deep End's Paul Venezia questions whether server virtualization technologies are contributing to the decline of real server administration skills, as more and more sysadmins argue in favor of re-imaging as a solution to Unix server woes. 'This has always been the (many times undeserved) joke about clueless Windows admins: They have a small arsenal of possible fixes, and once they've exhausted the supply, they punt and rebuild the server from scratch rather than dig deeper. On the Unix side of the house, that concept has been met with derision since the dawn of time, but as Linux has moved into the mainstream — and the number of marginal Linux admins has grown — those ideas are suddenly somehow rational.'"


Hypervisor (1, Informative)

betterunixthanunix (980855) | more than 3 years ago | (#35356074)

Someone still has to maintain the machines that are actually running the VMs.

Re:Hypervisor (2)

Larryish (1215510) | more than 3 years ago | (#35356094)

WHOOSH!

Re:Hypervisor (2)

Baki (72515) | more than 3 years ago | (#35356312)

With bare-metal virtualization, there is not that much to maintain, and there is pointy-clicky software to do that. No real admin skills required.

Re:Hypervisor (4, Interesting)

Anonymous Coward | more than 3 years ago | (#35356434)

Because pointing and clicking inherently takes more skill than using a CLI, right? Never mind that most CLI commands will readily assist you with syntax if you format incorrectly, whereas documentation for a GUI, if it exists at all, is often useless...

Re:Hypervisor (1)

digitalchinky (650880) | more than 3 years ago | (#35356634)

Given your low UID I find your comment rather bewildering. Setting up a server so that it does exactly what you want is a complex task - add in a good bit of security and you're so far away from the mouse that it is utterly absurd to make this claim.

Someone still has to make the images that the point and click types use. That requires real sys-admin work.

Re:Hypervisor (1)

Anonymous Coward | more than 3 years ago | (#35356326)

It's VMs all the way down.

Re:Hypervisor (1)

bberens (965711) | more than 3 years ago | (#35356542)

This is the natural progression of technology across all industries. We'll be migrating to needing a very small number of highly skilled people and a lot of "sysadmin" drones who mostly do point and click things.

Clone my car! (2)

hart (51418) | more than 3 years ago | (#35356098)

TFA concludes with "But if all it takes is a few clicks of a mouse in vSphere's Windows-based client to pop out a cloned server instance (ostensibly built by someone who knew what they were doing), then what does it matter? It's all very convenient and cool, right? Wrong. If you don't understand the underpinnings, you're missing the point. Anyone can drive the car, but if it doesn't start for some reason, you're helpless. That's a problem if you're paid to know how to fix the car." While I agree in principle, the analogy here is off. If the car doesn't start in this case, I can just throw it away and clone a working one.

Re:Clone my car! (5, Insightful)

shawb (16347) | more than 3 years ago | (#35356210)

The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.

Re:Clone my car! (5, Insightful)

Ephemeriis (315124) | more than 3 years ago | (#35356470)

The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.

Exactly.

If the machine is in production it needs to be working. You don't have time to dig around and find the root cause. You need it to work. Now. If you've got a virtualized environment it is trivial to bring up a new VM, throw an image at it, and migrate the data.

Then you take your old, malfunctioning VM into a development environment and dig for the root cause, so that you don't see the same problem crop up on your new production machine.
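On a KVM host, that whole dance is a handful of commands. A rough sketch using libvirt's virt-clone/virsh, assuming a known-good template named web-template and a sick guest named web01 (names invented; storage details and the data migration itself elided):

    # Clone a fresh production VM from the known-good template
    virt-clone --original web-template --name web01-new --auto-clone

    # Bring the replacement up
    virsh start web01-new

    # Keep the sick guest for the post-mortem: export its definition
    # and register it on the dev host instead of deleting it
    virsh dumpxml web01 > /tmp/web01.xml
    scp /tmp/web01.xml devhost:/tmp/
    ssh devhost 'virsh define /tmp/web01.xml'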

Re:Clone my car! (0)

Anonymous Coward | more than 3 years ago | (#35356492)

I agree. With a VM, you can take a snapshot to investigate later AND reboot to get things running in a pristine state immediately.
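With virsh, that amounts to a couple of commands (domain and snapshot names are placeholders):

    # Preserve the sick state for later investigation
    virsh snapshot-create-as web01 pre-reimage "state before re-image"

    # Roll back to a known-good snapshot and carry on
    virsh snapshot-revert web01 known-good
    virsh start web01   # only needed if the snapshot was taken while off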

Re:Clone my car! (0)

commodore6502 (1981532) | more than 3 years ago | (#35356264)

>>>If you don't understand the underpinnings, you're missing the point.

Where does a person go to learn those underpinnings and become a Unix or Linux server expert?

Re:Clone my car! (2)

tsm_sf (545316) | more than 3 years ago | (#35356444)

Well, pre-Xbox attention spans, it was digging through man pages. I don't know how you're supposed to find that kind of focus now, when everything in your house either blinks, beeps, or vibrates. Good luck.

Re:Clone my car! (4, Insightful)

Isca (550291) | more than 3 years ago | (#35356498)

That's assuming your new tool that's vitally important actually has a man page. Very little is documented as well as it was 10 years ago.

Re:Clone my car! (3, Insightful)

bsDaemon (87307) | more than 3 years ago | (#35356464)

Traditionally? College. Way back when, long before I was born, system admins tended to be graduate students in computer science or other department staff, and those in industry did it in college first. System administration itself wasn't taught, but that's not the point. The point is several technologies grew up together and are generally described in terms of one another: Unix, C, TCP/IP, etc. -- You don't really get what's going on with one without the others in most cases.

C, of course, is the foundational building block. Unix is the cathedral and TCP/IP is the road that connects each building together. Most of the so-called system admins I've seen in the past have been "web developers" who have been put in over their head and forced to deal with things they don't fully understand. I learned C and Unix concurrently, starting by teaching myself in jr. and high school. Try explaining an mbuf to some kid who only knows PHP some time -- it's painful.

The lack of fundamental understanding which would enable them to be competent admins is the same lack of fundamental understanding which keeps them from writing secure code, debugging network issues, etc. But, because there is a large influx of semi-skilled people who think that the fact they installed Ubuntu on their PC at home makes them a server admin, employers are less willing to offer up the salaries necessary to attract competent admins, and frankly the salaries need to be even higher to make dealing with idiots less of a hassle.

I'm so glad I'm not in web hosting anymore I can't possibly overstate it.

Re:Clone my car! (1)

Anonymous Coward | more than 3 years ago | (#35356294)

I would also say that the guy cloning the server is NOT a sysadmin. While it's a job that traditionally would have fallen to the sysadmin, advances in technology allow low-level techs to now handle that kind of job. However, that does not mean you don't need a sysadmin; he's very likely the guy who knew what he was doing and built the server.

It also depends quite a bit on whether you can reimage or just build a new server. For a webserver or something like that where there are dozens or even hundreds of identical servers all with the exact same config, it doesn't make sense to troubleshoot each problematic server. You can spend hours tracking down the problem or minutes reimaging, and even if you find the problem it may take hours longer to fix it. You still need someone who knows what they're doing though, for several scenarios. Most commonly:
a) Server A is faulty, you reimage. The next day server B shows the same symptoms. You reimage, but ultimately need to spend some real time finding the root cause. If the next day Server A's problems resurface, or server X starts showing the same symptoms, then your reimaging monkeys damn well better have a good admin to call before the problem gets out of hand.
b) Not all servers are clusters. Citrix farms, web farms, db farms, yea, you take one down, reimage and move on. Core application servers you generally don't have that luxury, and without your core app servers working, your massive farms that feed data into and out of them aren't worth shit.

Re:Clone my car! (1)

camperdave (969942) | more than 3 years ago | (#35356656)

The problem with cloning is that if there is a flaw in the master, then there is a flaw in every single clone. We see it all the time in cars. A faulty part leads to a recall. Sure, the circumstances under which the flaw causes problems may never happen to you. On the other hand, if your processes and procedures cause you to hit the flaw, then replacing the server instance isn't going to help. I mean, imagine how Star Wars would have turned out if Jango Fett had been a good marksman.

Sad but smart (4, Interesting)

Anrego (830717) | more than 3 years ago | (#35356102)

I’m not a system admin but I don’t see how this is a bad approach.

I see value in finding out what the problem is and why it happened.. if you just blindly re-image then the problem might pop up again at a less opportune time.

But if you know what the problem is... and you have an image of the server in a working state, or a documented procedure on how to set up the server in its intended configuration, then why would anyone waste time trying to repair it?

I think you have this kind of problem in most jobs. New approaches that make more sense but require less skill (and imply less e-pene) are always hated by people who have already learnt how to do it “the hard way”.

I see this as a programmer all the time and have been a victim of it. I’ve seen a huge chunk of my chosen industry migrate from meat and potato problem solving to gluing libraries together and sprinkling in business logic.

I’ve been fortunate to land in a job where there’s still a lot of “from the ground up” work, but these jobs are getting scarcer as even the components that everyone uses are made from other components. And executable UML (or something of its ilk) is probably going to be the next thing to cut the legs off us.

Re:Sad but smart (1)

darjen (879890) | more than 3 years ago | (#35356208)

I see value in finding out what the problem is and why it happened.. if you just blindly re-image then the problem might pop up again at a less opportune time.

That's why you have backup servers. Sometimes it simply isn't worth the time or effort to dig deeper. Re-imaging is completely rational from a business perspective.

Re:Sad but smart (1)

y86 (111726) | more than 3 years ago | (#35356578)

Re-imaging is completely rational from a business perspective.

So is treating cancer vs curing it.

Doesn't always work (2)

Baki (72515) | more than 3 years ago | (#35356350)

Sometimes a server is gradually degrading due to some issue. During that time, things are being modified. If you learn that the problem started a few months ago, you can't just re-image to an old state and lose everything that has changed since then.

Of course, making app servers as stateless as possible helps against this problem. That's one of the reasons my company enforces keeping data on physically separate DB servers, and keeping (virtualized) app server instances as dedicated to a single app as possible.

Re:Sad but smart (4, Interesting)

TheRaven64 (641858) | more than 3 years ago | (#35356384)

Add to that - no one (outside of the IT department) cares what the problem is, they care about the downtime. If you have some redundancy, stuff can fail periodically without the users noticing. An 'admin' capable of keeping it running can be someone paid to do something else who has responsibility for clicking the button every few months if required. An admin who can actually address the problem will cost, what, $60,000/year minimum (including associated costs, not just salary)? Is having ten minutes of downtime every few months costing your business $60,000/year? If not, then it's not worth the cost of doing it properly. It may be for a bigger company, but for a small business that would eat most of their profits. This is the advantage of a Windows or Mac server, with its pointy-clicky interface: it may be less reliable, and more expensive, but the cost saving from not needing to employ anyone who actually understands what's going on outweighs it. Especially if you buy a support contract, where the vendor will send someone competent out for the couple of times a year when something goes seriously wrong.

Re:Sad but smart (3, Insightful)

causality (777677) | more than 3 years ago | (#35356442)

I’m not a system admin but I don’t see how this is a bad approach.

I see value in finding out what the problem is and why it happened.. if you just blindly re-image then the problem might pop up again at a less opportune time.

But if you know what the problem is... and you have an image of the server in a working state, or a documented procedure on how to set up the server in its intended configuration, then why would anyone waste time trying to repair it?

I think the issue here is that the need for a business to get a production system back up and operational with as little downtime as possible can sometimes conflict with the principles that most effectively assure sound system administration.

Unix/Linux systems don't just break for no reason, particularly servers with enterprise hardware. The idea that a system just breaks for no apparent reason and a reboot, reset, or re-image is going to actually fix the cause and somehow prevent a future recurrence is alien to this realm. That's a mentality that comes from running Windows (esp. previous incarnations) on commodity hardware.

Something on that "known working" image is faulty or capable of breaking. Otherwise, normal use would not have led to a state of system breakage.

The ideal course of action would be to do whatever is necessary to get the system back online, which may include re-imaging, and then discover what is wrong with the "known working" image that eventually broke. That could be greatly assisted, of course, by saving the data (at least the logs) from the known-faulty system prior to re-imaging.

Cost and primary business (1)

xzvf (924443) | more than 3 years ago | (#35356112)

An expensive part of most IT budgets is people costs. Unfortunately, if your primary business is not IT, it is also the easiest one to cut.

Re:Cost and primary business (1)

trickyD1ck (1313117) | more than 3 years ago | (#35356674)

Unfortunately, if your primary business is not IT, it is also the easiest one to cut.

Fortunately, if your primary business is not IT, it is also the easiest one to cut.

FTFY

From personal experience (5, Insightful)

Xacid (560407) | more than 3 years ago | (#35356114)

"they punt and rebuild the server from scratch rather than dig deeper."

From personal experience this is normally due to management jumping down our throats to simply "get it done" which unfortunately runs counter to our inquisitive desires to actually solve the problem.

I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.

Re:From personal experience (0)

Anonymous Coward | more than 3 years ago | (#35356162)

Exactly. My cousin recently got a virus on his laptop and needed it removed ASAP so he could travel with it - and he got it doing the same thing we had told him multiple times not to do anymore. So, rather than spend an entire night getting it working with work in the morning, it was nuked with the recovery disk.

This served two purposes - it was running again within the desired timespan, and since he lost all his files, it will hopefully keep him from repeating his mistakes in the future.

I might just be overly optimistic here, though, on the second part.

captcha: stress

Re:From personal experience (1)

RocketRay (13092) | more than 3 years ago | (#35356168)

Reminds me of our Windoze guy in my previous job. I had run into some kind of problem with XP at home and spent a good amount of time banging my head against it. Finally I told him what was going on and asked what he recommended. "Reinstall XP" was the answer. I thought he was joking; nope, he was serious. :P

Re:From personal experience (4, Insightful)

Darth_brooks (180756) | more than 3 years ago | (#35356484)

....and his was the right answer. With XP, you're almost certainly talking about a client machine. Why bother dicking with it? It's a hundred dollar OS on a four hundred dollar piece of hardware. Wipe, reload, move on to big boy problems. Even if you're talking about a problem that ends up affecting a number of users, and it happens to be a client side problem, you're farther ahead to nuke and reload.

In my last position I was the only end user support guy for 150 to 200 people. If I sat around and fucked with every little nuance of XP and its associated ills, I'd have ended up even farther behind than I was when I left. I wrote up a quick backup script that grabbed anything the user didn't (against company policy) store on the network drive, grabbed their local e-mail (Notes), then nuked the machine and reloaded. I could take a user who was dead in the water and have them back up and running in 15-20 minutes. If they had a lot of data to restore, maybe 35-45. Spending an hour 'troubleshooting' was a waste of company time, and my time.
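That kind of pre-nuke grab script is short; a sketch of the shape of it (the original wasn't posted, so every path here is invented):

    #!/bin/sh
    # Grab a user's stray local files and local mail before re-imaging.
    # Usage: backup-user.sh <user> <host>
    USER=$1
    HOST=$2
    DEST=/backups/$HOST-$USER

    mkdir -p "$DEST"
    # Copy from the client's mounted share; the profile and Notes mail
    # paths are examples only
    rsync -a "/mnt/$HOST/Documents and Settings/$USER/" "$DEST/profile/"
    rsync -a "/mnt/$HOST/notes/data/" "$DEST/notes/"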

Re:From personal experience (5, Insightful)

causality (777677) | more than 3 years ago | (#35356666)

Reminds me of our Windoze guy in my previous job. I had run into some kind of problem with XP at home and spent a good amount of time banging my head against it. Finally I told him what was going on and asked what he recommended. "Reinstall XP" was the answer. I thought he was joking, nope he was serious. :P

Not only was he completely serious, he probably can't understand why you might have thought he was joking.

The idea that it's a black box and you shouldn't expect to understand how or why something happened is definitely one of the more subtle costs of Microsoft systems. It lends credibility to the (false) notion, so common among average users, that you're either a completely unskilled newbie or a serious expert who can discern the inner workings of the mysterious black box. It discourages middle ground for intermediate skill levels, the kind of thing that would otherwise occur naturally as users gain experience over time.

Most of all, it supports the falsehood that it's unreasonable to expect the most basic competence from non-experts.

Re:From personal experience (1, Interesting)

VolciMaster (821873) | more than 3 years ago | (#35356190)

"they punt and rebuild the server from scratch rather than dig deeper."

From personal experience this is normally due to management jumping down our throats to simply "get it done" which unfortunately runs counter to our inquisitive desires to actually solve the problem.

I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.

Having witnessed this type of behavior across myriad companies and industries, I can say the rebuild/clone/redeploy approach is used NOT because of "pressure to get more bang for their bucks" - it's that this approach is inherently easier than deep-diving, perhaps for days, to find The Answer(tm). In an environment of thousands of servers (or even dozens), deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.

Re:From personal experience (0)

Anonymous Coward | more than 3 years ago | (#35356318)

it's that it is inherently easier to do this approach than to deep-dive perhaps for days to find The Answer(tm).

"42". HA! Got it first!

Re:From personal experience (4, Insightful)

Nerdfest (867930) | more than 3 years ago | (#35356334)

As I've said below, there is a benefit ... you can actually investigate and fix the problem rather than the symptom. The bonus with VMs though is that you can frequently do both. You can create a copy of the VM to dig into, and create a new fresh instance for production to get them working again.

Re:From personal experience (3, Insightful)

laffer1 (701823) | more than 3 years ago | (#35356416)

There is a benefit, because restoring the VM means downtime, even if only a few minutes. What if the software running in the VM is old and someone has been attacking it? Restoring will result in the same problem a few days or hours later.

If there is a bug in a specific kernel version that's not playing nice with the VM, it will cause stability problems again.

Redeploying and finding the problem is the only real answer. In the long run, it may save work.

Re:From personal experience (3, Insightful)

jcoy42 (412359) | more than 3 years ago | (#35356494)

deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.

There can be a benefit. I generally try to get the system working first, then figure out what went wrong. And sometimes it takes a few days of poking at it to figure it out, but when a problem like that comes up again, I'm ready for it.

That's the benefit of an experienced system administrator. Anyone can just make it work again, but someone who has been doing that for a few years is going to be used to writing scripts that hunt for said issues and either correct the problem on the fly or send a notification with some details about where to look first.

I've seen the "make it work and move on" approach result in systems that become increasingly unstable because no one ever tracks down the root problem.
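Those hunt-and-fix scripts are rarely exotic. A sketch of the shape they take when run from cron (thresholds, paths, and the alert address are all invented):

    #!/bin/sh
    # Fix what is safe to fix automatically; page a human for the rest.
    ADMIN=root@example.com

    # Known issue: rotated logs fill /var and wedge the app -- correct
    # it on the fly by pruning anything older than a week
    USE=$(df -P /var | awk 'NR==2 {print $5}' | tr -d '%')
    if [ "$USE" -gt 90 ]; then
        find /var/log/myapp -name '*.log.*' -mtime +7 -delete
    fi

    # Unknown issue: processes stuck in uninterruptible sleep -- send
    # a notification with details about where to look first
    STUCK=$(ps -eo stat= | grep -c '^D')
    if [ "$STUCK" -gt 10 ]; then
        ps -eo pid,stat,wchan,args | mail -s "stuck procs on $(hostname)" "$ADMIN"
    fi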

Re:From personal experience (1)

J4 (449) | more than 3 years ago | (#35356496)

You win a fine cigar!

Re:From personal experience (0)

Ephemeriis (315124) | more than 3 years ago | (#35356504)

While it is interesting intellectually, there is no other benefit.

Well, it'd be nice to find the root cause so that you don't see the problem pop up again on your new server image...

But, yeah. If you can get it up and running with a re-image in a matter of minutes/hours instead of digging around for hours/days... Don't waste the time.

Re:From personal experience (1)

Xacid (560407) | more than 3 years ago | (#35356522)

Valid point, but it does have its merits if it's a recurring problem. A wise manager will know when to call for deeper inspection.

And to be fair - I'm fine with reimaging a system to fix a problem if it's not recurring as the downtime typically isn't worth it.

I am not on Unix (1)

Shivetya (243324) | more than 3 years ago | (#35356540)

but know the teams that implement/admin them and I am constantly amazed.

Amazed that all I read here and elsewhere points to incredibly resilient systems, yet I have never been anywhere that didn't have scheduled downtime at minimum quarterly, and every major outage relied on a reload. So which is it? They make fun of the Windows guys and just hope the Windows guys don't look at their statistics (and no, I am not on Windows either - think IBM Z and i).

My serious question is: is there a certain system size at which reloads are a valid solution? When does it stop being a valid solution? When you get to enterprise-level systems, what are your options then?

I read all these articles, but the one thing that's never clear is whether these are large systems or just small servers (small being PC-class hardware).

Re:From personal experience (4, Insightful)

Tom (822) | more than 3 years ago | (#35356614)

In an environment of thousands of servers (or even dozens), deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.

Except, of course, finding what the heck was wrong in the first place and fixing it, preventing future outages.

Sometimes, rebuilding is faster than fixing, and in some contexts, it makes sense. Even then, the original machine should still be examined and the "root cause" (if you need a management buzzword) identified. At the very least, a reasonable amount of time should be given towards the attempt. It's true that it is pointless to dig around for days and days - but that is not a reason to not at least start looking, as it might turn out you only need a few hours. And more often than not, finding the real problem tells you something that helps you
a) fix other bugs,
b) avoid the same problem on the next server,
c) avoid a repeat performance,
d) realize that what you thought was a random server crash was really a break-in / hardware failure / systematic problem, and that other, additional steps need to be taken.
All of the above have happened before, you would by far not be the first.

A proper incident management process does allocate resources towards follow-up examination. The right thing to do is not suppress it with generic blabla about wasted time, but to set the proper amount of resources for your organisation. Maybe it's half an hour and no money, so some sysadmin can check the logs and do a quick check-up. Maybe it's a full-out forensics analysis. That depends on your needs, your resources, your environment and context.

Re:From personal experience (3, Insightful)

Anonymous Coward | more than 3 years ago | (#35356260)

To a small degree, you are correct. The bigger problem is that in *nix pretty much all the tools you need are available to you, but in the Windows world everything costs money. So often the solution comes down to either spending money to fix it or spending time to rebuild it. Since management thinks computers are simple push-button things, "just reboot" becomes the go-to solution.

Re:From personal experience (0)

Anonymous Coward | more than 3 years ago | (#35356436)

Very true - and please remember that the computers/network/I.T. department is there to help the company do whatever it does. I.T. is not an end in itself.

Knowing what happened/broke is essential so that we can prevent it from happening again, but "real server admin skills" include the ability to minimize downtime (*cough* re-image *cough*) ...

Re:From personal experience (1)

foobsr (693224) | more than 3 years ago | (#35356682)

I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.

Both? Maybe interaction?

CC.

Time is money (0)

Anonymous Coward | more than 3 years ago | (#35356116)

If the cost of re-imaging a machine in a production environment is less than the cost of digging deeper, guess which one I'm going to do?

Re:Time is money (1)

Nerdfest (867930) | more than 3 years ago | (#35356186)

If it's something that happens repeatedly, it's nice to dig in, find the cause and fix it. The nice part is that with VMs, you can create a copy of the problem environment and have the best of both worlds.

Re:Time is money (1)

d3ac0n (715594) | more than 3 years ago | (#35356564)

If it's something that happens more than once in a small enough time period, then of course one would immediately dig deeper. However, if it's a one-off problem or a repeated but reasonably rare issue then either restore from backup, or nuke the server and rebuild.

Most of the admins I know (myself included) will still dig on the issue afterward, even if we've had to restore or rebuild from image. But the first responsibility is to get the system back up and running, not to spend hours on bug hunts.

Gee, ya think? (0, Flamebait)

edremy (36408) | more than 3 years ago | (#35356126)

Let's see. When I have a security or performance issue I can

A: Pay a bearded guy in suspenders for hours while he incants various arcane phrases like "sudo" and "grep" and hope that he actually manages to clean up the problem at the end, or

B: Press a button and have a factory fresh install in seconds.

Assuming that you have a decent build done first (pay the bearded guy big for that), why on earth would *anyone* pick A? It's hardly just Unix - we're a Windows shop and we're heavily virtualized because it makes sense from so many different angles: security, load balance/failover, ease of setup, etc.

Re:Gee, ya think? (5, Insightful)

rhsanborn (773855) | more than 3 years ago | (#35356170)

There are a lot of cases where pressing the button means that the problem will go away... for a few weeks. It will work right until you hit the same conditions that caused the problem in the first place. Suddenly, you're using the refresh to cover up either a poor implementation or a standing bug, and it isn't going to go away until you call that guy in suspenders.

Re:Gee, ya think? (1)

prefect42 (141309) | more than 3 years ago | (#35356556)

So you make a cluster of these things, where regular failures are normal but tolerated. Then when the cluster starts acting weird, you make a cluster of clusters...

Re:Gee, ya think? (1)

pnuema (523776) | more than 3 years ago | (#35356562)

If the problem only comes up every few weeks, press the damn button again. I see similar mentalities in my little corner of IT - testing automation. Most test automaters want to write test scripts that are robust and will be re-usable from build to build. My experience is that the amount of effort required to make scripts robust enough to last is exponentially greater than just doing the job over again, quick and dirty. I am looked down upon by the serious scripters - but I have three times the productivity. I am not looked down on by management. :)

A lot of times, the instinct to do a job "right" - to the best of your abilities - actually runs counter to the needs of the business. In that case, you are not actually doing the job "right". Doing the job "right" means getting the goal accomplished with the least effort possible over the medium term. If you end up rebuilding the server 5 times over the course of the year, at 2 hours per pop, you have spent less time than if you spent two days to fix the problem permanently.

Re:Gee, ya think? (1)

immovable_object (569797) | more than 3 years ago | (#35356176)

And a few hours after you've re-imaged the OS, the problem returns. Now what?

While re-imaging may fix some errors, it doesn't fix all, and rarely fixes performance issues.

Re:Gee, ya think? (0)

Anonymous Coward | more than 3 years ago | (#35356224)

The only problem with paying the bearded guy to build the decent system from the start is that the bearded guy has no incentive to actually build it properly from the beginning... even though you are paying him... because he doesn't have to actually maintain the thing. Just a thought... I've seen this many times in IT (and had to clean it up).

Re:Gee, ya think? (0)

Anonymous Coward | more than 3 years ago | (#35356266)

Problem is, if you apply that factory install without doing some root cause analysis and fixing the problem, you will at some point have the performance issue again - and you HAVE the (core) security problem RIGHT NOW.

That said, there are lots of things that can go wrong with a box that might never happen again and are well understood. Suppose you had a power failure for a really long time over a weekend and the generator ran out of fuel; the box went down hard. You need to figure out a refueling procedure and notification system, for sure. You might look for a few moments at why the box won't boot and see if a fix is trivial, but then you know exactly why this happened: the FS got hosed when the machine lost power.

In that situation you could spend hours repairing the file system and selectively replacing corrupt streams, or you could dump your image back and restore the data from that known good (tested) backup you have, all in a few moments. What possible justification would you have for not re-imaging?

So, like anything, the key here is understanding and using your brain. A re-image is not a hammer and every problem is not a nail, but sometimes it is.

Re:Gee, ya think? (0)

Anonymous Coward | more than 3 years ago | (#35356324)

Despite your five-digit UID, I don't believe you frequent Slashdot much, or you would not have missed Don't reboot UNIX boxes [infoworld.com].

Mr. Beard is going to ensure your server remains online and connected without disrupting the workflow at your company. He's going to unload modules, install patches, reload them. He's going to do a few hours of real work, but for the rest of the month (or year, if he's worth his salt) Beardo is going to be warming a seat and making sure everything is running perfectly, because he knows his stuff. He earned it, unlike some guy who got the job from his frat brother because he "knows Linux" and reinstalls everything if they get "shpchp 0000:00:01.0: Cannot reserve MMIO region" upon booting up for the first time.
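The unload-patch-reload cycle he's describing looks roughly like this (module and package names are invented; exact commands vary by distro):

    # Patch a driver without a reboot: unload, update, reload
    modprobe -r e1000e            # drop the NIC driver (brief outage)
    yum -y update kmod-e1000e     # pull in the patched module package
    modprobe e1000e               # load the fixed driver
    dmesg | tail                  # sanity-check the reload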

Re:Gee, ya think? (1)

tsm_sf (545316) | more than 3 years ago | (#35356528)

Let's see. When I have a security or performance issue I can

A: Pay a bearded guy in suspenders for hours while he incants various arcane phrases like "sudo" and "grep" and hope that he actually manages to clean up the problem at the end, or

B: Press a button and have a factory fresh install in seconds.

Assuming that you have a decent build done first (pay the bearded guy big for that), why on earth would *anyone* pick A? It's hardly just Unix - we're a Windows shop and we're heavily virtualized because it makes sense from so many different angles: security, load balance/failover, ease of setup, etc.

Well part of it too is that nobody gives a fuck if your Etsy-inspired e-store is working well. The beards are there for critical systems.

Not a decline, but a reflection of the new normal (2)

zerofoo (262795) | more than 3 years ago | (#35356134)

As hosted services become more and more popular, sysadmins have less interest in spending the time to diagnose and solve a problem - this goes for Windows, Mac OS and Linux/Unix. When a fix is needed RIGHT NOW - the quickest way back up sometimes is a re-image.

When I was a small business IT consultant, I asked clients if they wanted to spend $125 per hour for me to diagnose and fix their system - with the understanding that it could take many hours to research and solve the problem - or if they wanted to spend ONE hour re-imaging the system to a known good point.

Almost everyone chose the "fix it now in under an hour" solution.

-ted

Blah blah blah (1)

L4t3r4lu5 (1216702) | more than 3 years ago | (#35356136)

Yet another story about how the old way was better.

What's better is whatever keeps your employer's company making money for the most time. If re-imaging the server every weekend gives them 100% uptime during the week, do it. If you can inject patches into the app during runtime, bully for you, but I can't, so I'm going with "re-image to working state and roll forward." If that costs my employer less than you cost your employer, I know who's all of a sudden more employable!

Might want to shave off those neckbeards, folks.

Re:Blah blah blah (2)

Mr. Shotgun (832121) | more than 3 years ago | (#35356398)

Till the problem occurs in the middle of the work week and you still don't know what the actual problem is. Then you're looking at an hour of downtime during business hours while you re-image yet again, with your boss asking what the hell the problem is and what you were doing on the weekend if you weren't solving the actual problem.

Covering up a problem is not the same as solving it.

Expediency wins! (1)

Infernal Device (865066) | more than 3 years ago | (#35356140)

Seriously, which way gets the job done faster?

Being a sysadmin is not about you and the system and your marvelous detecting and repair skills, it's *always and only* about your users. If VM technology improves the speed of recovery so the users can get back to what they were doing (probably messing up your carefully architected system), then so be it.

Re:Expediency wins! (0)

Anonymous Coward | more than 3 years ago | (#35356602)

You cannot "carefully architect a system" unless you can investigate, learn and constantly improve (often this may even mean doing work in your own time!). Users should not be able to "mess this up", and if they can, you have done something wrong.

The reimage/reboot approach is for admin grunts - the sort that adds new users, fixes printing problems, carries out upgrades, and installs stuff - your average Mumbai shoe shiner has these skills.

Overall it is about attitude and approach to problems - the operating system has little to do with it.

The decline of language skills? (1)

Krakadoom (1407635) | more than 3 years ago | (#35356152)

I just thought it was amusing to post a headline with decline and fall both in the same sentence, when they are clearly the same thing in this instance. Should it actually have been "the rise and fall of ..." or "the decline of ..."?

Re:The decline of language skills? (0)

Anonymous Coward | more than 3 years ago | (#35356354)

No, "The Decline and Fall" do not mean the same thing: a decline has a progressive aspect, while a fall has a perfect aspect. Also, the title is an allusion to Gibbon's *Decline and Fall of the Roman Empire*, and frankly Gibbon's English was a HELL of a lot better than yours.

Re:The decline of language skills? (1)

satch89450 (186046) | more than 3 years ago | (#35356358)

There was a book by Will Cuppy (1894-1949) titled The Decline and Fall of Practically Everybody (1950; http://www.amazon.com/Decline-Fall-Practically-Everybody-Nonpareil/dp/0879235144 [amazon.com] ) that was an absolutely funny take on history. Will Cuppy's style was to write very straightforward articles, but pepper them liberally with very funny footnotes. I remember seeing a paperback version of this as a kid, and got hooked.

Actually, the phrase "decline and fall" describes the shape of a drop-off not unlike the shallow slope leading to a cliff. Perfectly good English.

Re:The decline of language skills? (1)

VolciMaster (821873) | more than 3 years ago | (#35356362)

I just thought it was amusing to post a headline with decline and fall both in the same sentence, when they are clearly the same thing in this instance. Should it actually have been "the rise and fall of ..." or "the decline of ..."?

No - you can "decline" and not "fall", so the headline is fine.

I can't tell you how many times I have heard this. (5, Interesting)

Noryungi (70322) | more than 3 years ago | (#35356158)

Many times, what I hear as "solutions" are simply variations on the theme: "Why can't we reboot the server?" or "Why can't we reinstall the server from scratch?".

And my answer usually was: "Listen, I don't care how many times you do this on a Windows machine, but this is UNIX - I'll only reboot this machine if I absolutely need to. In the meantime, watch and learn as I kill the offending processes. Oh, and re-installing the machine means 24h of downtime".

These days, I help run a (very) large application, which runs on top of a (very) large "enterprise" SQL database for a (very) large company. The only problem is: the enterprise application does not manage the database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it. Logical solution (and the one recommended by the sysadmins): upgrade the application to version X, which is supposed to have much better database management.

What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.

Did I mention the application is considered mission-critical and runs 24x7? And that downtime can cost said (nameless) company more than six figures?

And, since you asked, yes, I am looking for another job. (Clueless admins and pointy-haired bosses: a match made in...)
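(For what it's worth, the zombie-watch script the DBAs were asked for is about five lines of shell - a sketch, with the threshold and alert address made up:)

    #!/bin/sh
    # Warn the sysadmins when the zombie count climbs toward the level
    # that crashes the database server.
    LIMIT=50
    ZOMBIES=$(ps -eo stat= | grep -c '^Z')
    if [ "$ZOMBIES" -ge "$LIMIT" ]; then
        echo "$ZOMBIES zombies on $(hostname) - crash expected soon" |
            mail -s "DB zombie warning" sysadmins@example.com
    fi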

Re:I can't tell you how many times I have heard th (1)

Junta (36770) | more than 3 years ago | (#35356286)

Oh, and re-installing the machine means 24h of downtime

I am with you except here. If re-installing a machine incurs 24h of downtime, you do not have a suitable contingency plan. Most environments I deal with are 15-20 minutes from offline to production on reinstall at the long end.

Re:I can't tell you how many times I have heard th (3, Insightful)

powerlord (28156) | more than 3 years ago | (#35356612)

Oh, and re-installing the machine means 24h of downtime

I am with you except here. If re-installing a machine incurs 24h of downtime, you do not have a suitable contingency plan. Most environments I deal with are 15-20 minutes from offline to production on reinstall at the long end.

I agree that if the system is as critical as they say, they should have better failover in place; however, in a lot of companies very little importance is placed on live failover systems. More than likely he's including lots more than the OS/application build in that 24-hour timeframe.

Probably database reload/recovery time, or file system initialization (inadequate RAID controller-to-disk design?).

Re:I can't tell you how many times I have heard th (0)

Anonymous Coward | more than 3 years ago | (#35356330)

It costs them less to pay the DBA to write the script and inconvenience their users than it does to upgrade the system.

That all important profit margin is what gets in the way of things being done the right way.

captcha: income

Re:I can't tell you how many times I have heard th (1)

vlm (69642) | more than 3 years ago | (#35356510)

The only problem is: enterprise application does not manage database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it.

Did I mention the application ... runs 24x7

So which is it, it crashes "often" enough to be a problem, or it never crashes ever?

The obvious solution is to reload it every day at the least inconvenient time.

If they will not "permit" a controlled reboot, then work around it by running health testing scripts that just happen to knock it out, sort of a euthanasia approach.

The next "solution" is a (caching?) sql proxy server in the middle, no one will notice if the reboot is fast.

Is the upgrade suggested by the admins themselves, who have tested it under load on a test server so they know it'll work, or suggested by a vendor dazzled by the vision of fat commission checks? "It'll work great, sure, it'll work great, great at paying for my sports car, yeah it'll work great."
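The nightly-reload workaround is a one-line cron entry plus a bounce script (every name below is invented; real init scripts vary):

    # /etc/cron.d/bigapp-nightly -- controlled bounce at 03:30 daily,
    # well before the zombie build-up reaches the crash threshold
    30 3 * * *  root  /usr/local/sbin/bounce-bigapp

    # /usr/local/sbin/bounce-bigapp:
    #!/bin/sh
    /etc/init.d/bigapp stop
    /etc/init.d/bigdb restart    # clears the leaked zombie sessions
    /etc/init.d/bigapp start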

Re:I can't tell you how many times I have heard th (1)

Ephemeriis (315124) | more than 3 years ago | (#35356676)

Logical solution (and the one recommended by sysadmins): upgrade application to version X, which is supposed to have a much better database management.

What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.

The new version costs money. And, no matter how important everyone thinks this application is, they obviously don't think it's worth that price. They're willing to deal with a reboot rather than spend the money. I'd recommend the upgrade, too... But I don't write the checks. Nor do I really use the app. I just keep it running. And if you tell me you can live without the app for 10-15 minutes while the server reboots, and you'd rather save $X instead of buying the new version, that's what we're going to do.

Listen, I don't care how many times you do this on a Windows machine, but this is UNIX - I'll only reboot this machine if I absolutely need to. In the meantime, watch and learn as I kill the offending processes.

That's great when you can get away with it... But sometimes it just isn't worth the trouble. Even on a UNIX system.

Yeah, I hate rebooting to fix problems. Seems like a crude approach. Especially when you've got so many nice tools at your disposal on a UNIX system.

And, I guess, I'm kind of wondering why it needs to be rebooted in your situation. You've got a script monitoring zombie processes... And those processes can apparently be killed manually... So why not have that script kill the processes instead of just monitoring them? Or write a second script to fire off a batch of zombie kills?

But sometimes it just isn't worth the time/effort involved. You can spend a couple hours digging for the problem while your users are without their app... Spend a couple hours developing and testing your script while your users are without their app... Spend a few days patching code while your users are without their app... Or you can just reboot the thing and go on with your life.

Oh, and re-installing the machine means 24h of downtime

This seems wrong to me. Or, at least, completely unrelated to the subject of re-imaging in a virtualized environment.

It takes maybe 5 minutes to provision a new VM complete with OS and default config/apps/whatever.

If I had a system that was as essential as what you describe, I'd have a base image of it stored and ready to go. Just bring up the new image, migrate the data, and make it live. That's what we do with all of our truly essential systems. And we can be running off a new image within about 30 minutes if we're able to migrate data off the old system. If we have to go to tape it'll take longer.

If you actually incur 24 hours of downtime to re-image a server, what's your plan if that machine simply dies? What if it takes more than a simple re-image to get it back up and running?

Faster (0)

Anonymous Coward | more than 3 years ago | (#35356160)

Oftentimes, resolving the issue will take longer than re-imaging. This is the benefit of running virtualized infrastructure: quick build-up and tear-down.

The OS itself shouldn't matter, and I've been doing this since I was able to snapshot stuff. Oftentimes it will allow me to go back and work on the broken image while the new image is running, but honestly, from a management view, the admin is there to make stuff work - they don't care how he/she does it. They are interested in quick resolution.

I'll probably get flamed for this (1)

youn (1516637) | more than 3 years ago | (#35356166)

But what's wrong with having images of servers ready as a viable disaster recovery strategy?

Yes, I agree it is good to know the system inside out. Yes, I agree that a simple, minor server process configuration screwup is no reason to reimage the whole server... but sometimes it may be time-saving at a point where users need the servers immediately. Sometimes it might actually be more secure and stable to restore from an image that has been tested for months rather than making tons of changes under the hood... especially if it is a system that has not been documented, where the last changes were made years ago... by diagnosing the server under time constraints, it is possible to mess things up even more. It's not necessarily a pissing contest... "well, I can fix my server without re-imaging in this case."

Now, if the problem occurs regularly and reimaging just puts blinders on the problem... then yes, I agree imaging is wrong. Yes, it is a good thing to know what is happening and find the problem... and most problems don't necessarily require reimaging.

My point is, it is not necessarily a bad thing to restore a server from an image if you do things right... it may save time, be more secure and save tons in productivity/money.

Re:I'll probably get flamed for this (1)

betterunixthanunix (980855) | more than 3 years ago | (#35356316)

Nothing is wrong with it, as long as the following conditions are met:
  1. Spawning a fresh instance will happen quickly, faster than actually solving the problem (this is true of VMs, or situations where a backup system is available).
  2. The problem will not affect the fresh instance. What is the point of reimaging, if the problem was a faulty piece of hardware or some poorly designed software (e.g. software from decades ago that assumed an 8-bit counter was enough, on a day when you had to count higher than 255)?

Re-imaging != bad administration (2)

chrishillman (852550) | more than 3 years ago | (#35356172)

Sure it was cool, back in the day, to spend 72 hours working on "the server" because even rebooting was not an option. Back then I had 3 servers, 10 years later I had 15. I didn't have the time to get into why each little snowflake of a problem was happening, I knew reinstalling and upgrading components would be a more prudent use of time. If I can rebuild a server and restore a data backup in 4 hours or I can spend an infinite amount of time "fixing" the existing install, which option do you think my PHB would prefer? It is not bad administration, it is just different.

Re:Re-imaging != bad administration (0)

Anonymous Coward | more than 3 years ago | (#35356214)

It is just necessary, unfortunately; like you said, essentially, time is money. You're not usually fixing these servers for the vendors, to make them better; you're fixing them for a business that only needs them to work, fast, and as cost-effectively as possible.

It will just get worse (depending on your view) (1)

Rooked_One (591287) | more than 3 years ago | (#35356202)

As servers are virtualized and taking snapshots of them becomes so easy, why would you bother troubleshooting anything when you can just restore to a snap that is an hour old? Better that than the server being down and spending who knows how long trying to figure out what's wrong.

Obviously employers (if they wake up to this) will realize "Hey, I can pay a kid to restore snapshots" instead of "Hey... I need to hire this super expensive IT veteran."

Re:It will just get worse (depending on your view) (3, Interesting)

vlm (69642) | more than 3 years ago | (#35356588)

As servers are virtualized and taking snapshots of them becomes so easy, why would you bother troubleshooting anything when you can just restore to a snap that is an hour old?

The security exploit that cracked the old image in less than a second will crack the "identical" new image in less than a second. Or data sample #1213, which overflowed the buffer and crashed image A, will simply overflow and crash image B.

What it really brings up is a class distinction in sysadmins. There's the guy who actually fixes systems - patching security holes in system libraries to work around app bugs, redesigning firewall ACLs to avoid a new threat, doing scalability assessments before the overload crashes something - and there are the guys who fix individual things like motherboards and hard drives but don't administer systems: basically help desk people with a fancy sysadmin job title. Virtualization means the helpdesk board-swappers with the cool job titles are outta here, but the real sysadmins have little if anything to fear.

Nothing new (1)

bryan1945 (301828) | more than 3 years ago | (#35356212)

There are always people who are excellent, competent, and flat-out bad at their job. Unfortunately, the numbers of each group skew towards the lower end (well, not everyone is a genius). If this makes for an acceptable solution for the less-skilled, so be it. I hate to reward incompetence, but I hate down time even more. I want my servers running so my employees can do their work.

To be honest (4, Informative)

TheRealFixer (552803) | more than 3 years ago | (#35356222)

It sounds like this guy is just upset that technology has progressed to the point where we don't need to pay out the nose for some high-priced UNIX consultant to spend 3 days troubleshooting an issue that can be fixed in minutes or hours.

Just because you might learn more by spending days chasing down an issue instead of using your available tools to quickly redeploy the server and get the business back up and running doesn't make that the correct decision. If you really want to dig into the root cause, clone the broken VM off and research it after you get a fresh one deployed from template.

Not surprised at all (1)

Stenchwarrior (1335051) | more than 3 years ago | (#35356232)

It's funny how many admins out there can't even set permissions in *NIX. I was working with a guy who was very well-versed in the VM world. Several certs after his name, in fact. But when he had to actually set permissions on the .vmdk files on the ESX host from the command line, he was clueless. I explained to him the whole rwx bit scheme and how each numerical value changes the bits for those permissions, and it was a completely wasted effort. I guess Veeam will take care of all that from a GUI.

Still, seems like they would teach the basics.
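For reference, the numeric scheme is three bits per digit - r=4, w=2, x=1 - one digit each for owner, group, and other (file name invented):

    chmod 600 myvm.vmdk    # rw-------  owner read/write only
    chmod 640 myvm.vmdk    # rw-r-----  add group read (4 = r)
    ls -l myvm.vmdk        # verify the resulting mode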

Virtualization != marginalization of skills... (4, Interesting)

Shoeler (180797) | more than 3 years ago | (#35356276)

This seems to me to be a philosophical question. Indeed, if the uptime and more importantly availability is higher by the purported crash and burn (taking liberties with the slash and burn deforestation technique) method, who is to say it is less useful or less valid? Indeed, to espouse skills over delivering for the client seems to be missing the point. It seems to be standing on some pedagogical imperative that knowledge is somehow of more value in the workplace than delivery.

Now - having said that - don't get me wrong. I have seen entirely too many *nix sysadmins (full disclosure: I got an RHCE in 2003) who don't know where the network config files are because they only know the GUI, and are hired by a team of people who have never logged into a *nix box. However, I think the most egregious ill is not that it sets some moral and ethical imperative of fixing rather than reloading (or in this case, recovering from a VM image) a server, but the fact that it misses the point: there has been a dearth of qualified IT candidates since the dawn of our industry, and the fixes to this have to do not with how we fix a server but with how we hire and, more importantly, whom we hire. As with everything in IT, garbage in == garbage out.

Finally - I absolutely agree with the Infoworld argument. It assumes an unexpected failure within the server, not some external thing that needs to be diagnosed and fixed. If your app crashes because the SQL table isn't there on the SQL server you don't control, rebooting ain't going to do a hill of beans worth of good.

Re:Virtualization != marginalization of skills... (2)

visualight (468005) | more than 3 years ago | (#35356572)

The problem is that the new "crop" of developers don't have any real problems to solve. They've all been solved, and solved well. So now we're adding unnecessary abstraction layers that hide what's really going on.

People that spent 3 days figuring out how to burn a CD back in the 90's tend to know how everything works, but the "kids" coming up in recent years only know (and only care to know) the flashy point-and-click abstraction layers, and only program within "frameworks".

Years ago I used to talk to people about the Windows approach vs the Unix approach, but sadly the people currently working at Redhat and Novell are working hard to make a liar out of me.

Fun at scale. (1)

Hawke (1719) | more than 3 years ago | (#35356278)

You have 1000 servers. You need to upgrade them to RHEL 6. Do you put a DVD in each of 1000 DVD drives?

NO!

You use an image server. Kickstart. Cobbler. Figure out what the new image looks like, and then PXE-boot 1000 servers. That goes much faster. (To the sysadmin above: reimaging a server should take 25 minutes, most of which is spent surfing Slashdot, not an hour.)

So now, you've got a server that's misbehaving. One of 1000. Out of pure coincidence, honest, the one server you were manually futzing with last week, but that can't possibly be connected. Fixing that server yourself will cause more "configuration drift", and leave you with one server that's still different than the 999 other servers. And hey, that image server is still on your network. Just reimage the thing.

It's popular because it's the answer that scales. kthxbye.

Re:Fun at scale. (0)

Anonymous Coward | more than 3 years ago | (#35356392)

You should probably isolate that one server and figure out what's wrong. Then you can have one cake and keep 999. kthxbye.

Re:Fun at scale. (1)

buchanmilne (258619) | more than 3 years ago | (#35356638)

Maybe if RH bothered to ship rpm-4.6.x to RHEL5, you would only need to reboot once during an upgrade from RHEL5 to RHEL6 ...

Like you can on other distros, including other RPM-based distros.

If you used a VCS or a configuration automation tool (cfengine, puppet etc.), then you wouldn't need to re-image or re-install a server to get its config in line ...
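
To the parent's last point: once the desired state lives in a manifest, converging a drifted box is a one-liner rather than a re-install. A sketch, assuming the stock Puppet site manifest path:

    # preview the drift first, then enforce the desired state in place
    puppet apply --noop /etc/puppet/manifests/site.pp
    puppet apply /etc/puppet/manifests/site.pp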

Here we go again (1)

cpct0 (558171) | more than 3 years ago | (#35356282)

Is this the old geezers versus the new wet diapers yet again? (Trying to be equally evil to both sides ;) )

There are new technologies and we should embrace them. I am not a proponent of VMs, I don't like them in general, but I do see their uses, and they're very effective. Like in C++: you've got the STL, with very similar and nearly interchangeable std::vector, std::list, std::deque and so on (not even talking about boost or third parties here). You need to know when to apply each or else you'll get problems. Well, in the '10s, you have the same ridiculous number of technologies available to sysadmins, and you need to know when to apply each one. That's the new sysadmin job: not only knowing that you can code one in bash with grep, awk, echo, while read, pipes and rsync, but also knowing there is a package all neatly made for you, available at your fingertips with a simple apt-get (or yum).
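
And checking whether that neatly made package exists costs about ten seconds. A sketch; the search terms are just examples:

    # before hand-rolling a sync job out of bash, awk and rsync, look around:
    apt-cache search "incremental backup"    # Debian/Ubuntu
    yum search backup                        # RHEL/CentOS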

I keep my computer tidied up; I love to know what runs where. Even then, I do a "spring cleaning" once a year, reinstalling everything. And incredibly, my computer runs faster and more efficiently. Why? New /etc defaults, new parameters, new software, old clinging software, things that are nearly impossible to update. Same for the files. Seriously, today's computers hold hundreds of thousands of files, most of which have some arcane use we couldn't care less about but are necessary for some kind of weird reason. I'm a sysadmin, and I don't pretend to want to know all of these files.

I read the article, and yes, there are things that are changing, and seriously, I do respect the One person who can understand the Sendmail configuration files... oh, I'd even be impressed with the M4. :) And when there is a problem, I want to know why, because I love to learn. But then ... there are priorities, time constraints, servers need to be up, people need to work, and we have all these magnificent tools that will enable every computer to be segregated in its private little VM world (to return to that main article). So should we simply shrug, laugh and go back to The Ancient Ways? You can keep your "vi" editor; leave me my "vim", please. :)

Time is Money (1)

sheehaje (240093) | more than 3 years ago | (#35356296)

I used to scoff at reformatting and reinstalling, but today it's a simple calculation. Will the fix take longer than either reverting from a snapshot or cloning from a template? Many may cringe at that as a solution, but the bottom line is that time is money. It used to be that reinstalling or restoring from backup simply took too long, and it was better to fix the problem at the console if possible. Today, with automatic snapshots of virtual machines, SAN replication, and so on, that isn't so. I don't scoff at it anymore; it means we can spend more time being proactive rather than reactive.
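
The calculation gets even easier when the rollback is staged before the change goes in. A sketch using libvirt's virsh; the domain and snapshot names are made up:

    # snapshot before you start poking; revert if the fix runs longer than a restore would
    virsh snapshot-create-as web01 pre-change "before config surgery"
    virsh snapshot-revert web01 pre-change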

The fastest fix (0)

Anonymous Coward | more than 3 years ago | (#35356298)

I'm going to get flamed for this, but what the hell.

I've always thought that it is more important to get a server back up and operational as quickly as possible than it is to keep the server down until you find the problem. Now don't get me wrong, you still need to find the ultimate problem, or at least find out if the problem is repeatable, and then find the answer to it.

So I'm in favour of any method that helps me get the system back up and running, be it re-imaging or anything else.

Re:The fastest fix (1)

vlm (69642) | more than 3 years ago | (#35356632)

I'm going to get flamed for this, but what the hell.

I've always thought that it is more important to get a server back up and operational as quickly as possible than it is to keep the server down until you find the problem. Now don't get me wrong, you still need to find the ultimate problem, or at least find out if the problem is repeatable, and then find the answer to it.

So I'm in favour of any method that helps me get the system back up and running, be it re-imaging or anything else.

Doesn't apply to anything that outputs reports to management. If there's any chance you're giving them provably wrong data, that dude gets shut down till it's fixed.

Faster is nice, but... (1)

Junta (36770) | more than 3 years ago | (#35356310)

Sometimes a one-off mistake happens, and a reinstall makes sense. Many other times, the reason you had to reinstall is a more persistent problem (a program or script systematically messing things up, or an admin who just needs to not be doing admin work), and skipping root-cause analysis means you'll lose more time in the aggregate.

Save a buck (1)

Goboxer (1821502) | more than 3 years ago | (#35356378)

How outrageous that these people don't explore the complex and time-consuming issues! Don't they realize that the pursuit of knowledge is *way* more important than getting it done quickly? I mean, in a world where time equals money, they shouldn't look at it as tossing money into a hole; they should look at it as investing in their collection of potential Jeopardy trivia.

I know if I had a boss hovering over me, not understanding what was wrong, and just pressuring me to get it done, I would tell him to shove off so I could learn. Who cares that every minute I spend working on the issue is a minute I can't spend on other problems? Who cares that I could be replaced by a system admin who would get it done quickly? Knowledge, and what other system admins think of me, is what's important. After all, those pay the bills. /sarcasm

The real problem with the Clone approach (0)

Anonymous Coward | more than 3 years ago | (#35356402)

If you ask me, a major drawback is that fewer eyeballs are looking at the code -> fewer bug reports -> buggier software.

outsourcing (1)

roc97007 (608802) | more than 3 years ago | (#35356428)

I think part of this phenomenon might be due to outsourcing, which puts a layer of call-center personnel armed with loose-leaf binders of procedures between you and the one or two remaining competent sysadmins, who are then relegated to firefighting. In this world, there isn't time to diagnose problems because the expertise level and the admin-to-customer ratio are kept purposely low.

iSeries (0)

Anonymous Coward | more than 3 years ago | (#35356440)

Run an iSeries; they're like tanks. Slow and cumbersome, but they just don't stop.

endless cycle (4, Insightful)

roc97007 (608802) | more than 3 years ago | (#35356518)

I'm not sure I buy everything in TFA, but I have to admit that to a certain extent this phenomenon is real. I've noticed, however, a tendency to regenerate an instance and, when it doesn't work, regen it again, and again, and again, because the purposely overextended and/or undertrained admin doesn't have time to figure out that the problem is in his template, or is due to something external like a duplicate IP. Come to think of it, this type of endless cycle seems to be fairly common in the Windows world. I guess we've caught up.
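
The external causes are usually the cheapest to rule out, too. For the duplicate-IP case, a sketch using iputils arping from another box on the same segment (the interface and address are illustrative):

    # -D is duplicate address detection: a non-zero exit means something else answered for the IP
    arping -D -I eth0 -c 3 192.0.2.10 && echo "IP looks free" || echo "someone else has it"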

Sometimes the user has to diagnose the problem themselves, which is a win for the IT manager because the time doesn't come out of the IT budget.

I'm hoping that at some point these practices will be recognized for the false economy they are. But I'm not holding my breath.

Servers became a smaller piece of the puzzle (0)

Anonymous Coward | more than 3 years ago | (#35356546)

I don't see why reimaging/rebooting a VM instance is different from restarting a service that is misbehaving. Now "services" are VMs, that's all.

You were very happy as the sysadmin of a couple of big servers, and now you have to administer several dozen VMs. The skill set is slightly different; that doesn't mean we're "losing skills". Your Unix wizardry will come in handy anyway, and the base concepts of OS operation will still be there too.

Things change. Learn and deal with it.

anonymous (0)

Anonymous Coward | more than 3 years ago | (#35356596)

The decline and fall? Can you decline and fall at the same time?
Where was the incline and rise?

OK Slashdot - I get it... (1)

acoustix (123925) | more than 3 years ago | (#35356600)

...I'm a poor, lowly Windows admin who doesn't know my ass from a hole in the ground. ALL HAIL THE 1337 *NIX H4X0R5!

Seriously... how long is this Windows admin vs. *nix admin comparison going to last? I can't help it that there are apps that absolutely need to run in a Windows environment. The job needs to get done. If I could run my industry-specific software on Linux, I would; I would love to save my company money on licensing.

Now if you'll excuse me, I need to go back to flinging poo all over my server room walls.

Sounds like naive old schoolism (0)

Anonymous Coward | more than 3 years ago | (#35356664)

As a system administrator, I don't understand why any option that delivers the quickest and biggest win should be ruled out, even if it conflicts with tradition. Sometimes the problem is simply that the biggest, quickest, and easiest fixes are not the same fix. Knowing which one to choose, and when, is what makes a good system administrator.
