
Ask Slashdot: Unattended Maintenance Windows?

Soulskill posted about two weeks ago | from the wake-me-if-there's-fire dept.


grahamsaa writes: Like many others in IT, I sometimes have to do server maintenance at unfortunate times. 6AM is the norm for us, but in some cases we're expected to do it as early as 2AM, which isn't exactly optimal. I understand that critical services can't be taken down during business hours, and most of our products are used 24 hours a day, but for some things it seems like it would be possible to automate maintenance (and downtime).

I have a maintenance window at about 5AM tomorrow. It's fairly simple — upgrade CentOS, remove a package, install a package, reboot. Downtime shouldn't be more than 5 minutes. While I don't think it would be wise to automate this window, I think with sufficient testing we might be able to automate future maintenance windows so I or someone else can sleep in. Aside from the benefit of getting a bit more sleep, automating this kind of thing means that it can be written, reviewed and tested well in advance. Of course, if something goes horribly wrong having a live body keeping watch is probably helpful. That said, we do have people on call 24/7 and they could probably respond capably in an emergency. Have any of you tried to do something like this? What's your experience been like?
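
For illustration, here is a minimal sketch of the kind of scripted window described above, assuming a yum-based CentOS host; the package names, log path, and reboot delay are placeholders rather than anything from the submitter's environment. The idea is simply that every step is logged and any failure aborts the run before the reboot, so whoever is on call has something concrete to look at.

```python
#!/usr/bin/env python
# Minimal sketch of an unattended maintenance window on a yum-based CentOS box.
# Package names, log path, and the reboot delay are illustrative placeholders.
import logging
import subprocess
import sys

logging.basicConfig(filename="/var/log/maint-window.log",
                    format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

STEPS = [
    ["yum", "-y", "update"],             # upgrade the OS
    ["yum", "-y", "remove", "oldpkg"],   # hypothetical package to remove
    ["yum", "-y", "install", "newpkg"],  # hypothetical package to install
]

def run(cmd):
    """Run one command, log its output, and return its exit status."""
    logging.info("running: %s", " ".join(cmd))
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    logging.info(out.decode("utf-8", "replace").strip())
    return proc.returncode

def main():
    for cmd in STEPS:
        if run(cmd) != 0:
            # Leave the box up and un-rebooted so on-call can investigate.
            logging.error("step failed, aborting before reboot: %s", cmd)
            sys.exit(1)
    logging.info("all steps succeeded, rebooting in one minute")
    run(["shutdown", "-r", "+1"])

if __name__ == "__main__":
    main()
```

Scheduling something like this with cron or at is the easy part; the hard part, as the comments below make clear, is deciding what should happen when a step fails and nobody is watching.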



Puppet. (4, Informative)

Anonymous Coward | about two weeks ago | (#47432007)

Learn and use Puppet.

Re:Puppet. (-1, Flamebait)

Anonymous Coward | about two weeks ago | (#47432059)

Yes puppet solves all of your failed reboot issues...

Fucking puppet bullshit, written by ruby retards who can't figure out that sysadmins have been automating this shit for years, with shell scripts.

But no, these ruby-tards think they need to reinvent the world written in ruby, because 90% of them couldn't hack their way out of a paper bag.

Re:Puppet. (1)

Rhys (96510) | about two weeks ago | (#47432103)

That's a failure to test* your code-as-infrastructure, not a puppet failure.

*: Exempting a small subset of physical device issues, though even those can be ignored if you're talking about a VM, so that the physical hardware is never actually in a not-live state.

Re:Puppet. (1)

Anonymous Coward | about two weeks ago | (#47432159)

And a kernel update has never blown up a grub install on a VM...

Nope..never has happened.

Re:Puppet. (1)

Rhys (96510) | about two weeks ago | (#47433131)

So... you didn't test... and you have only yourself to blame?

Especially with VMs, it is so easy to snapshot and test things.

Re:Puppet. (1)

nabsltd (1313397) | about two weeks ago | (#47433387)

Especially with VMs, it is so easy to snapshot and test things.

How, exactly, do you snapshot and test the production VM before the maintenance window and guarantee you won't affect (and by "affect", I mean anything that changes behavior in any way that is not expected by the users) any services running on that VM?

If you meant "clone" instead of "snapshot", that doesn't help either, as the clone will have to have a different IP address, can't connect to the production database, etc.

We've had VMs that have become corrupt in very strange ways so that they would not reboot. The corruption didn't affect any running services, but existed for at least six weeks (we had to go back that far to get a backup that didn't have the issue). Testing a kernel patch that requires a reboot wouldn't have revealed this corruption, as the dev and staging servers didn't have the problem. Testing it on the production server would have revealed it, but we would have to do that during scheduled maintenance anyway....

Re:Puppet. (2)

viperidaenz (2515578) | about two weeks ago | (#47433667)

So it's someone else's fault your test environment doesn't match production?

Re:Puppet. (3, Interesting)

dnavid (2842431) | about two weeks ago | (#47435595)

So it's someone else's fault your test environment doesn't match production?

People often fail to try hard enough to make the test environment (assuming they even have one) match the production environment, but for some problems test never matches production, and essentially never can: some problems only reveal themselves under production *conditions*. For example, I recently spent a significant amount of time troubleshooting a kernel bug that only arose under a very specific (and still not fully characterized) set of disk loads. Test loads, including some several times higher than the production load, did not uncover the bug; the kernel faults started occurring at random about a week after the software patch went live.

You should try to keep test as close as possible to production so that testing on it has any validity at all, but you should never assume that testing on the test environment *guarantees* success on production. It's for that reason that, responding to the OP, I have never attempted to do any serious production upgrades in an automated and unattended fashion, and not while I'm alive will any such thing happen on any system I have authority over. As far as I'm concerned, if you decide to automate and go to sleep, make sure your resume is up to date before you do, because if you guess wrong you might not have a job when you wake up.

Even if you guess right, I might decide to fire you anyway if anyone working for me decided to do that without authorization.

Re: Puppet. (-1)

Anonymous Coward | about two weeks ago | (#47435745)

Pussy

Re:Puppet. (0)

Anonymous Coward | about two weeks ago | (#47440485)

Where do you work? I want to make sure I never apply there.

Re:Puppet. (3, Informative)

sjames (1099) | about two weeks ago | (#47434447)

How, exactly, do you snapshot and test the production VM before the maintenance window and guarantee you won't affect (and by "affect", I mean anything that changes behavior in any way that is not expected by the users) any services running on that VM?

Clone it. Upgrade the clone and make sure it works. If so, wipe the clone, snapshot the production VM, and upgrade it. If it fails, roll back. Make sure your infrastructure is set up so the clone CAN be properly tested. Yes, sometimes you will have to do that rollback, but with an adequate test setup, frequently you won't.
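
As a rough sketch of the snapshot-and-roll-back half of that advice, assuming a KVM/libvirt guest; the domain name, snapshot name, and in-guest command below are made up. The clone-and-test half is the same idea pointed at a copy of the VM rather than the production one.

```python
#!/usr/bin/env python
# Sketch: snapshot a libvirt/KVM guest, run an upgrade in it, revert on failure.
# The domain name, snapshot name, and in-guest command are illustrative only.
import subprocess
import sys

DOMAIN = "app01"            # hypothetical VM / libvirt domain name
SNAP = "pre-maintenance"

def sh(cmd):
    """Echo and run a command, returning its exit status."""
    print("+ " + " ".join(cmd))
    return subprocess.call(cmd)

# 1. Take a snapshot we can roll back to.
if sh(["virsh", "snapshot-create-as", DOMAIN, SNAP]) != 0:
    sys.exit("could not snapshot %s, refusing to continue" % DOMAIN)

# 2. Apply the change inside the guest (plain ssh here; could be Puppet, Ansible, ...).
rc = sh(["ssh", "root@" + DOMAIN, "yum -y update && shutdown -r +1"])

# 3. Revert if the change failed; otherwise keep the snapshot until post-reboot
#    checks pass, then delete it.
if rc != 0:
    sh(["virsh", "snapshot-revert", DOMAIN, SNAP])
    sys.exit("upgrade failed on %s, reverted to snapshot %s" % (DOMAIN, SNAP))
print("upgrade submitted; snapshot %s kept until the box checks out" % SNAP)
```

Whether snapshot-revert is safe for a given guest (internal vs. external snapshots, attached storage, and so on) is exactly the kind of thing to verify on the clone first.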

Re:Puppet. (1)

upuv (1201447) | about two weeks ago | (#47440603)

This pattern only works for single nodes.

If you have a complex infrastructure, you can't rely on this pattern alone.

Re:Puppet. (2)

sjames (1099) | about two weeks ago | (#47440701)

But the solution will be just a more complex variant on this theme. Consider also that you might have allowed complex to become Rube Goldberg.

Re:Puppet. (1)

nabsltd (1313397) | about two weeks ago | (#47442577)

Make sure your infrastructure is set up so the clone CAN be properly tested.

If the clone isn't on the same VLAN, accessing the same data-gathering hardware, there is no way the infrastructure can have a test match production. The data-gathering hardware I'm talking about costs $20-70K in supplies to complete a run, and the hardware itself is in the $250K range, so there is no way to duplicate it.

I'm not saying that we don't try (use data from an old run, etc.), but there is no way to truly duplicate everything, and sometimes you just have to live with that.

Re:Puppet. (1)

sjames (1099) | about two weeks ago | (#47442633)

If the clone isn't on the same VLAN, accessing the same data-gathering hardware, there is no way the infrastructure

Sounds like you should make sure the clone is on the same vlan and accessing the same data gathering hardware, doesn't it?

I'm not saying that we don't try (use data from an old run, etc.), but there is no way to truly duplicate everything, and sometimes you just have to live with that.

So, use the clone with that setup to verify as much as you can. Then use the snapshot to allow you to roll back if things don't work out in production.

If you still can't afford the risk, then you don't need a Maintenance window at all. You need to airgap the network and never update anything on the expensive side.

Re:Puppet. (-1)

Anonymous Coward | about two weeks ago | (#47436597)

You'r shit is fucked up and you;r retarted. Dood

Re:Puppet. (-1)

Anonymous Coward | about two weeks ago | (#47432727)

That's what happens when would-be "web developers" without formal education in computer science and system administration go on to solve automation problems. Yep, Rubytards.

Re:Puppet. (0)

Anonymous Coward | about two weeks ago | (#47432719)

No, Puppet is a nasty, nasty hack!

Learn how to make packages, and do configuration management with them!

Re:Puppet. (3, Interesting)

bwhaley (410361) | about two weeks ago | (#47432723)

Puppet is a great tool for automation but does not solve problems like patching and rebooting systems without downtime.

Re:Puppet. (2)

Lumpy (12016) | about two weeks ago | (#47433351)

Just having a proper IT infrastructure works even better.

Patch and reboot the secondary server at 11am. Everything checks out, so put it online and promote it to primary. All done. Now migrate the changes to the backup, pack up the laptop, and head home at 5pm... not a problem. Our SQL setup has three servers: we upgrade one and promote it, then upgrade #2; #3 stays at the previous revision until 5 days have passed so we have a rollback. Yes, data is synced across all three; worst case, if TWO servers were to explode, we would lose 15 minutes of data entry.

I NEVER do late-night or weekend maintenance anymore. Servers are dirt fricking cheap; there's no excuse not to have redundants always running and ready to drop in.

Re:Puppet. (1)

x0 (32926) | about two weeks ago | (#47434051)

I NEVER do late-night or weekend maintenance anymore. Servers are dirt fricking cheap; there's no excuse not to have redundants always running and ready to drop in.

Sure, hardware is cheap - software licenses, not so much. (And no, I don't have the option to use free/OSS replacements.) When it costs my company $25,000 per license, deploying a primary and two 'backup' servers is not really an option.


Re:Puppet. (2)

Architect_sasyr (938685) | about two weeks ago | (#47435579)

Talk to your supplier. I've never had an issue getting "spare" licensing for testing servers. Regularly audited for it, sure, but never any real trouble getting the licenses.

Re: Puppet. (0)

Anonymous Coward | about two weeks ago | (#47437481)

If you pay that much, get a spare license, as already suggested. Chances are also that your company is big enough that you aren't the only IT person available to do it, so rotate shifts. If you are too small, then you are not important enough to warrant downtime that far outside office hours. Just do it after 5pm and send everyone home at 5, like they should anyway.

Re:Puppet. (0)

Anonymous Coward | about two weeks ago | (#47435765)

What about environments where things are not quite as "proper" as you describe?

Case in point: A couple of years back, I worked for a company that provided IT services for other companies, in a county that might be considered "rural". I lived in the "giant" city of 80,000 people. Other than a couple of smaller cities, the rest of the county was mostly farms or less developed land, like mountainous areas.

Over 95% of the companies we served had a single switch. These were small networks. The vast majority (probably 90%) did not have any Wi-Fi whatsoever. (If they did have equipment that could support Wi-Fi, it was left disabled because there was just no desire for such a thing.) We frequently forced new customers (or else we would refuse to provide ongoing services) to make sure they had sufficient equipment to back up critical data. Getting companies to do things "properly", like getting a decent firewall, was a significant effort because these non-Fortune-500, non-Enterprise, non-computer companies just didn't see the need.
Deploying a new server to replace their 8-year-old server was a multi-month sales process once they decided to start agreeing to move in that direction.

Maybe that sounds absurd, so let me paint this picture. Our large clients had 30 to 40 computers. Over 90% of our clients were in a single single-story building, and many shared that building with another business. Smaller clients might have 6 computers, and some had only 3 computers total, two of which were workstations. The number of staff members in the building would often be less than five. For many of the businesses we served, the number of staff members working at a time would generally be *less than three*. So those places with three computers had more computers present than people. Why would you expect these tiny businesses to invest so much into a hardware budget that their inventory of servers is sufficient to "have redundants always running and ready to drop in"?

I compare to this other post [slashdot.org] where David_Hart talks about being able to "test on the exact same switches with the exact same firmware and configuration". Some of our clients had multiple switches, but no more than one of them would ever be unboxed; the only reason they had a second switch was as a stored spare part.

In my years of providing IT support, the only time I've ever come across a "router" was when I was at a college or university. (I'm not counting equipment like a firewall with the capability of supporting multiple subnets, which is fair for me to do since such capabilities were not being used.) There just isn't the need for such equipment to move traffic between subnets when the entire network is small enough to sit on one subnet.

Our customers were businesses providing useful services to society, such as cleaning people's teeth, or assembling fresh ice cream cones, or packaging animal poop that they would then ship to their customers who were paying to obtain such packages.

So many posts by industry professionals seem to just assume that substantial money, regularly flowing into IT, is the de-facto default experience. Maybe so, for those of you who keep living in a metropolis (and I do understand that over 50% of the population now lives in a city with over a million people), and particularly if you've always worked for large corporations.

The original post said that he was performing an upgrade of CentOS. Not RHEL. CentOS. Could it be that he is working in an environment with a smaller budget?

Re:Puppet. (1)

Anonymous Coward | about two weeks ago | (#47435785)

Yup, I'm the same Anonymous Coward as the previous post, but I decided to break things up a bit.

Please, help me understand your perspective here.

The question at the end of the parent post [slashdot.org] was rhetorical. I assume you don't know that much about the poster. The next questions are not. I'm not just trying to provoke or blow off steam. It's something I've genuinely wondered about for decades. Back in the 90s, I was in high school and I just figured I'd understand it better once I got a job in the tech industry. But now that I have some experience and have answered many other questions I had earlier in life, this still eludes me. This post just seemed like the perfect example to use to pose the question. I would sincerely appreciate any insight that could be shared about this.

smash's post [slashdot.org] indicates that people should be using "SAN snapshots. There's no real excuse." Sure, we'll get right on that: just as soon as we have a SAN. How about if the organization has never invested in a product called a SAN? That sounds, to me, like an excellent excuse not to be implementing SAN snapshots.

When you're posting to a public board like this, why do you just assume that such advice is feasible for readers to implement? Are you just completely blind to realizing just how impractical such an approach is for some of the less privileged environments? I know that sounds like flame-bait, but I've really wondered that, or assumed that is the case, from a lot of comments that I've read and heard.

Yeah, I expect that such procedures should be implemented by companies that can afford teams of full-time IT staff, because those companies are large ones with names I recognize from the national stock ticker. This is great advice for some people. But are you urbanites who serve those organizations just so disconnected from the experience of smaller outfits (commercial companies, or even lower-budget charities) that you've never conceived of how things are done in places where technology is a non-focus?

When I grew up, some students started vegetarian diets when they learned more details about the extent of animal cruelty. I've heard that today, some students start vegetarian diets when schools teach them that meat comes from animals, because they never knew where meat comes from. (Where the heck else do they think the ingredients of "chicken", or a "fish" sandwich, come from?) I do get the benefits of redundant equipment. But, even with that understanding, I wonder if the advice that everybody should be doing such things is a symptom of this same sort of isolation from how many other people actually operate.

Again, I re-iterate, I'm not trying to attack here. I'm trying to build a bridge, by increasing my understanding. Please, I'm humbly requesting, help me to get your point of view.

Puppet. (0)

Anonymous Coward | about two weeks ago | (#47436727)

The best approach would be to have a system that is highly available, like an Exchange DAG for example, so you can take one server offline without an outage.

Or Ansible, Chef, Rexify, SaltStack, .... (0)

Anonymous Coward | about two weeks ago | (#47437303)

Or Ansible, Chef, Rexify, SaltStack, ....

Sometimes any specific tool is worse than the root problem. In many locations, that applies to puppet. None of these tools are perfect, but they do support automation for UNIX-like OSes.

For Windows - I only have 1 suggestion. format c:

Re:Puppet. (1)

upuv (1201447) | about two weeks ago | (#47440597)

Puppet is not orchestration. This problem is an orchestration problem. A very simple one but still orchestration.

Puppet is declarative, which means it imposes no inherent order on events. Most people make use of some screwball dependency chain in Puppet to give the illusion of orchestration.

Use something like Ansible if you want to orchestrate a change.
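
To make the distinction concrete, orchestration just means explicit steps, in an explicit order, across an explicit set of hosts. The toy runner below shows the shape of it (the hosts, the commands, and the ssh-as-root access are all placeholders); a real tool like Ansible gives you the same thing plus inventory, idempotence, and error handling.

```python
#!/usr/bin/env python
# Toy illustration of orchestration: explicit steps, in an explicit order,
# across an explicit list of hosts, stopping at the first failure.
# Hosts, commands, and ssh-as-root are placeholders; Ansible does this properly.
import subprocess
import sys

HOSTS = ["web01", "web02"]        # placeholder inventory
PLAY = [
    "yum -y update",              # step 1 on each host
    "systemctl restart httpd",    # step 2: placeholder service restart
]

for host in HOSTS:                # one host at a time, i.e. a rolling change
    for step in PLAY:
        print("%s: %s" % (host, step))
        if subprocess.call(["ssh", "root@" + host, step]) != 0:
            sys.exit("step failed on %s, stopping the rollout" % host)
print("rollout completed on all hosts")
```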

And if it doesn't work? (5, Insightful)

Anonymous Coward | about two weeks ago | (#47432021)

Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.

And if it doesn't work? (2, Insightful)

Anonymous Coward | about two weeks ago | (#47432097)

Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.

This is the correct answer. I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.

Re:And if it doesn't work? (2, Insightful)

Anonymous Coward | about two weeks ago | (#47432855)

Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.

This is the correct answer. I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.

Monitoring and alerting can alleviate this, and access to the system through a VPN can provide a near-immediate response. It also helps if critical services can be made not to be single points of failure.
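
A post-maintenance check along those lines can be very small: probe the handful of TCP endpoints that matter and exit non-zero so the existing monitoring/alerting pages someone. The host/port pairs below are placeholders.

```python
#!/usr/bin/env python
# Minimal post-maintenance health check: probe critical TCP ports and exit
# non-zero if anything is unreachable, so monitoring/alerting can page on-call.
# The host/port pairs are placeholders.
import socket
import sys

CHECKS = [("db01", 5432), ("web01", 443), ("mail01", 25)]

failed = []
for host, port in CHECKS:
    try:
        socket.create_connection((host, port), timeout=5).close()
    except socket.error as exc:
        failed.append("%s:%d (%s)" % (host, port, exc))

if failed:
    print("FAILED: " + ", ".join(failed))
    sys.exit(1)
print("all checks passed")
```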

Re:And if it doesn't work? (1)

Neil Boekend (1854906) | about two weeks ago | (#47447255)

access to the system through VPN

Unless the error brings down your VPN server.

Re:And if it doesn't work? (1)

0123456 (636235) | about two weeks ago | (#47432915)

I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.

Yeah, but this way, you won't be the one who has to fix it :).

Of course, you might have to start looking at job ads the next day...

And if it doesn't work? (1)

Anonymous Coward | about two weeks ago | (#47433259)

Sad but true.
And it's almost 100% likely that the night you decide to try LSD for the first time will be the night that your automation script fails.

Re: And if it doesn't work? (2)

Redbehrend (3654433) | about two weeks ago | (#47433747)

I agree we get paid REALLY well, and it's part of the job. Either develop it yourself, pay a minion or find a new job. Lol

And if it doesn't work? (1)

Pogie (107471) | about two weeks ago | (#47434479)

To the original poster: It is entirely possible, but you're going to need to learn a lot about modern automation and configuration management tools appropriate to the types of maintenance you're looking to automate. You also need solid vision and alignment on how you're going to achieve this level of automation across multiple parts of your business -- Development, IT, the Business, everyone. They all have to buy in and commit, because all of those folks have the ability to fuck it all up if everyone isn't on the same page. You can't do it alone on the admin side. As a start, I would suggest learning about Continuous Integration/Continuous Delivery and about Agile and DevOps methodologies to get on the road to where you want to be.

To the rest of you:

The original comment ("Learn and use Puppet") is grossly oversimplified -- there is a lot more to it -- but with proper implementation of configuration management software (Chef, Puppet, Salt, etc), proper automated testing (think Jenkins, Teamcity, etc) and a real commitment in your organization to Continuous Integration and Delivery practices, you can easily do regular automated maintenance. Yes -- sometimes it will break and you'll have to clean it up. But properly and thoughtfully implemented in policy and practice, those times when it breaks will be the exception that proves the rule.

Forgive the argument from authority, but at our firm (International, thousands of primarily linux servers across 14 countries and 40+ datacenters, mostly bare-iron, some virtualization) we have regular daily and weekly automated maintenance. We handle all sorts of significant change -- driver updates, software upgrades, network switch configuration, even forklift OS upgrades involving the full re-imaging of a bare iron system combined with re-deployment of software (including things like databases and hadoop clusters) -- automatically and without human intervention on a regular basis. And by regular I mean daily.

The attitude that "Murphy always wins" or "something will fail and you will have failed by not being there to fix it immediately" is a relic of a time when the tools available to manage large scale infrastructure were inadequate or unavailable. Again, there are failures that will require manual intervention, but if you are doing your jobs well as developers, network admins, systems admins, 'devops' [NOTE: I strongly object to that term being used as a job title, but that's how folks have started using it] then you should be able to conduct automated hands-free production change at 2am on a Saturday and sleep like a baby knowing that when you check your upgrade report in the morning 99% of the time everything will have gone off without a hitch.

Frankly if you approach complex infrastructure management with that defeatist viewpoint of "things will always fail", you are doing yourself and your employer a disservice, and you are severely restricting your career prospects. My company is not in any way unique in our ability to automate and manage our infrastructure, and maintaining that type of outdated attitude is going to cause lots of doors to be slammed in your face. Do you really believe the Googles, Facebooks and Amazons of the world rely on having a human being white-knuckling every change in their infrastructure?

One additional note: If your infrastructure is designed such that you cannot push change without guaranteed downtime or the risk of downtime then you have failed to design your infrastructure properly.

Re:And if it doesn't work? (1)

dreamchaser (49529) | about two weeks ago | (#47433087)

Exactly, and when it comes to maintenance windows one should never forget Murphy. If something can go wrong it will, and being there with a console cable and a laptop or tablet to get into a problem device is a good thing.

Re:And if it doesn't work? (1)

Karl Cocknozzle (514413) | about two weeks ago | (#47433249)

Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.

He might just need a better boss--it sounds like this one expects the guy to stay up all night for maintenance, then come in at 9am sharp, as if he didn't just do a full day's work in the middle of the night.

Rather than automating, he should be lobbying for the right to sleep on maintenance days by shifting his work schedule so that his "maintenance time" IS his workday. "Off-hour work" doesn't mean "Work all day Monday, all night Monday night Tuesday morning, and all day Tuesday." Or, at least, it shouldn't.

Re:And if it doesn't work? (0)

Lumpy (12016) | about two weeks ago | (#47433363)

Just tell the boss, "fuck you, I worked all night." If he does not like that complain to the state labor board, they will go after him.

Why dont you people stand up for yourselves? Stop being door mats.

Re:And if it doesn't work? (1)

nine-times (778537) | about two weeks ago | (#47433903)

No offense, but that's not a very sensible response. Your job may require off-hours work, but that depends largely on the needs of the company you're supporting and on what you negotiate your job to be. Regardless, there's no reason why you shouldn't try to diminish the amount of off-hours work and make it as painless as possible.

For example, let's say I have to do server updates similar to what this guy is describing, and my maintenance window is 5am-9am. The updates consist of running a few commands to kick the updates off, waiting for everything to download and install, rebooting, then checking to make sure everything was successful. Because the updates are large and the internet is slow, it sometimes takes 3 hours to perform the updates, but only 10 minutes to roll things back.

It's an exaggerated scenario, but given that basic outline, why wouldn't I just script the update process, and roll in at 8:30 with plenty of time to confirm success and roll things back if needed? What, I should still come in at 5am just because an Anonymous Coward on the Internet decided it was "part of the job"?
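
Concretely, the "script it and check at 8:30" approach only requires the scheduled run to leave an unambiguous record behind. A minimal sketch, with the status file path and the update script both hypothetical:

```python
#!/usr/bin/env python
# Wrapper for a scheduled update: record start, finish, and outcome in a status
# file the admin reads on arrival. The status path and the update script are
# hypothetical; schedule this wrapper itself with cron or at.
import datetime
import subprocess

STATUS = "/var/run/maint-status.txt"
UPDATE = ["/usr/local/sbin/do-updates.sh"]   # hypothetical update script

def note(msg):
    with open(STATUS, "a") as fh:
        fh.write("%s %s\n" % (datetime.datetime.now().isoformat(), msg))

note("maintenance started")
rc = subprocess.call(UPDATE)
note("maintenance finished, exit status %d (0 = OK; anything else = roll back)" % rc)
```

Schedule the wrapper for the start of the window, and the first thing to do on arrival is read the status file.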

Re:And if it doesn't work? (1)

mjwalshe (1680392) | about two weeks ago | (#47434135)

employers don't want to pay for having professionals on call

Re:And if it doesn't work? (1)

sjames (1099) | about two weeks ago | (#47434573)

Sure, but there's no good reason not to minimize it.

just do your job. (0)

Anonymous Coward | about two weeks ago | (#47432027)

quit complaining.

Murphy says no. (5, Insightful)

wbr1 (2538558) | about two weeks ago | (#47432033)

You should always have a competent tech on hand for maintenance tasks. Period. If you do not, Murphy will bite you, and then, instead of having it back up by peak hours you are scrambling and looking dumb. In your current scenario, say the patch unexpectedly breaks another critical function of the server. It happens, if you have been in IT any time you have seen it happen. Bite the bullet and have a tech on hand to roll back the patch. Give them time off at another point, or pay them extra for night hours, but thems the breaks when dealing with critical services.

Re: Murphy says no. (0, Troll)

ModernGeek (601932) | about two weeks ago | (#47432081)

This guy probably is the tech but is wanting to spend more time with his family or something.

Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.

Re: Murphy says no. (4, Insightful)

CanHasDIY (1672858) | about two weeks ago | (#47432209)

This guy probably is the tech but is wanting to spend more time with his family or something.

Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.

OR, if you want to have a family life, don't take a job that requires you to do stuff that's not family-life-oriented.

That's the route I've taken - no on-call phone, no midnight maintenance, no work-80-hours-get-paid-for-40 bullshit. Pay doesn't seem that great, until you factor in the wage dilution of those guys working more hours than they get paid for. Turns out, hour-for-hour I make just as much as a lot of the managers around here, and don't have to deal with half the crap they do.

The rivers sure have been nice this year... and the barbecues, the lazy evenings relaxing on the porch, the weekends to myself... yea. I dig it.

Re: Murphy says no. (1)

hodet (620484) | about two weeks ago | (#47432375)

you've just described my life. amen brother.

Re: Murphy says no. (1)

LordLimecat (1103839) | about two weeks ago | (#47432551)

At least where I work, maintenance is a once-a-month thing; I'm led to believe this is normal by anecdotal evidence on the internet.

Your average work week ends up at around 42 hours if you factor that in; it's really not that onerous.

Re: Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432879)

Where I work, this is impossible. Large systems get updated quarterly. Most systems get updates (OS patches) monthly. Apps get updates roughly monthly, but there are enough apps that this generally spreads out to cover the entire calendar.

Re: Murphy says no. (1)

master_kaos (1027308) | about two weeks ago | (#47432559)

Yup, same here. While my yearly salary isn't great, I work 35-hour weeks with 4 weeks of vacation, 10 sick days, multiple breaks per day, and rarely ever any OT (and while we are salaried and don't get OT pay, we instead get time and a half off in lieu). Hour-for-hour I probably make more than a lot of managers as well. Would I like to make double what I am making? Sure, but I would NOT be willing to put in double the work.

Re: Murphy says no. (2)

CanHasDIY (1672858) | about two weeks ago | (#47433057)

Would I like to make double what I am making? Sure, but I would NOT be willing to put in double the work.

Not for these fuckers, anyway.

Were I to strike out on my own, I don't think I'd mind all the extra hours, but it's easy to see things differently when you're your own boss.

Re: Murphy says no. (2)

gbjbaanb (229885) | about two weeks ago | (#47432249)

so once a week you have to get up early and do some work.

big deal.

The benefit is that you get to go home early too, and that means you're there to pick up little Johnny from school instead of seeing him when you drag your sorry arse in from a full day of meetings and emails and stuff.

Frankly, I wouldn't want to do it every day, but I can't see how the occasional early start is anything but a good thing for family life.

Re: Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432513)

In a lot of places you don't get to go home early. I watch the devops team where I work: they have to stay up late for releases and still be in by 8, 9, or 10 AM depending on each person's shift. There is no extra time off. The team also has extremely high turnover. It's unfortunate, because I would love to work on that team for a bit; I think it would be so fascinating.

Re: Murphy says no. (1)

Vlad_the_Inhaler (32958) | about two weeks ago | (#47432651)

I have no idea if once a week is realistic, it sounds far too high. I have around 5-10 such windows a year, some are stuff I can do from home (with support from the guys on shift) and some entail me being physically there, so there have been none of the second kind this year.
Major outages of one of our production systems have been featured on national news and Slashdot before, although it takes an outage of several hours to cross that threshold. Our windows are at around 02:00 to 03:00, depending on which system is affected.

Murphy has really bitten us in the ass a few times:

  • Someone making an update (on a test system) which meant that the system did not come up properly after the next reboot which was days later. The symptoms made it look as though the test "window update" caused the problems. It was an accident but very annoying.
  • A weird error on one switchable hardware unit rendered it unusable on our main production system. That unit was one of 32 and the allocation system automatically only used it on other machines, the next reboot would have cleared the problem anyway. Someone decided to use *that* unit for a critical update and brought it up manually for that purpose. The update failed and our main system was down. I drove in at 03:30 and (I thought) fixed things by falling back. Shortly after I left again, one application stopped working and dragged the rest down with it. I went back in again and did the original update cleanly - over initial management objections - after which things were fine.

There have been others but they were even more arcane. The absolute worst cases we had were with virtually everyone there. They made the news, two of them made it to Slashdot. Different causes in each case.

Re: Murphy says no. (1)

nabsltd (1313397) | about two weeks ago | (#47433473)

so once a week you have to get up early and do some work.

I don't think that the "2am" listed in TFS is "getting up early". Instead, it's more like "staying up late".

For me, it's not really a problem, but I have had to do that kind of maintenance as a team, and some people are just useless if they've been up that long, even after a short nap. My current job gives us all day one Saturday a month for maintenance, so you can sleep like normal and get up when appropriate (one hour's worth of work, start at 2 in the afternoon if you want... 7 hours of work, better start before noon). A lot fewer mistakes seem to be made with this sort of schedule.

Re: Murphy says no. (5, Funny)

PvtVoid (1252388) | about two weeks ago | (#47432307)

This guy probably is the tech but is wanting to spend more time with his family or something.

Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.

Congratulations! You're management material!

Re: Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432485)

lol

Re: Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47441081)

They hate you for it more if you put in the work, though. Lose/lose.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432127)

As someone who has spent many an all-nighter on hardware and software patches and upgrades for critical telecom and network systems, allow me to introduce you to your co-tech, Mr. Murphy. He will always be by your side, helping you find the tiniest potential for failure in your plans. Do not leave anything to chance. Pre-test and automate to your heart's content, but be there watching and confirming and double-checking everything. It will fail at some point. The difference will be whether or not any users notice when they start rolling in. Your precious sleep lost is better than your job lost.

Re:Murphy says no. (2)

gewalker (57809) | about two weeks ago | (#47432867)

It's even more fun when the CEO stops by, in person, to see how long it is going to take to get things working again. Though it was not my fault either time I've been there actually fixing the problem, it certainly is attention-getting. Neither CEO was being a jerk; he just really needed to know what was going on without any b/s filters from intermediate management. Try imagining that visit if you had just been running an automated script to apply the patch.

So yeah, if it is important, you need to be there, and if drive time is a potential issue, you need to physically be there, not just monitoring the change from home.

Re:Murphy says no. (5, Interesting)

bwhaley (410361) | about two weeks ago | (#47432267)

The right answer to this is to have redundant systems so you can do the work during the day without impacting business operations.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432429)

Yep. Redundancy is the only way to fly with services that have any sort of business significance. Two of everything is good; sometimes three are better. It allows you to do two things: implement a change with something to fall back on if it screws up or has unanticipated side effects, and let the change happen during normal working hours, provided adequate capacity was provisioned. Unattended patching may be OK for your PC at home that just runs Facebook, but for any sort of business service Murphy is really in charge.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432431)

Thank you. Dare I say it: Windows Server in a cluster, or any number of other things, can allow this to be done during the day with no downtime.

Re:Murphy says no. (1)

mshieh (222547) | about two weeks ago | (#47432459)

Can't agree enough, regular downtime is the root of the problem.

Usually you still want to do it off-peak just in case you're caught with reduced capacity.

Re:Murphy says no. (1)

bwhaley (410361) | about two weeks ago | (#47432493)

Yup. Very dependent on the business, the application, the usage patterns, etc.

Re:Murphy says no. (1)

Zenin (266666) | about two weeks ago | (#47434093)

Or it's not at all dependent on those factors.

It's much more a matter of how much someone cares to put redundancy in place. Doing it right affects the entire stack: Code architecture, deployment tooling, infrastructure architecture and costing.

It's a large reason why PaaS is gaining momentum: all of this is assumed, and it ends up being easier to do it the right way (which includes all of the above) from the start than to do it any other way, given that most of the boilerplate is already built.

If you're building services that still require "regular maintenance windows" in 2014, you're doing it wrong.

Re:Murphy says no. (1)

bwhaley (410361) | about two weeks ago | (#47434133)

If you're building services that still require "regular maintenance windows" in 2014, you're doing it wrong.

This is a really nice sentiment but is in fact somewhat disconnected from reality.

In the web world, building zero downtime services that don't require maintenance is doable. In many enterprise IT environments with legacy or bloated software (hospitals, education, government) it's a non-starter. The staff do not have the skill, the applications don't have the support, and the political will within the organization is not there. Database migrations alone can be a major source of downtime, and that's largely true even for web services.

Re: Murphy says no. (1)

ranelen (2386) | about two weeks ago | (#47432475)

exactly. it doesn't really matter if you are there or not. eventually something is going to break in a new and interesting way that can't be fixed without a significant amount of work.

generally we try to have at least three systems for any production service so that we can still have redundancy while doing maintenance.

that said, I rarely come in for patching anymore. I just make sure I'm available in case something doesn't come up afterwards. (no binge drinking on patch nights!)

redundancy and proper monitoring make life much, much nicer.

Re: Murphy says no. (4, Insightful)

smash (1351) | about two weeks ago | (#47432595)

This is why you build a test environment. VLANs, virtualization, SAN snapshots. There's no real excuse. Articulate the risks that a lack of a test environment entails to the business, and ask them if they want you doing shit without being able to test whether it breaks things. Do some actual calculations on the cost of system failure, and explain ways in which it can be mitigated. Putting your head in the sand and just breaking shit in live... well, that's one way to do it, but I fucking guarantee you: it WILL bite you in the ass, hard, one day, whether it is automated or not. If you have a test environment, you can automate the shit out of your process, TEST it, and TEST a backout plan before going live.

Re: Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47433401)

This is why you build a test environment. VLANS, virtualization, SAN snapshots.

Exactly, couldn't agree more: using snapshots and clustering (active/active or active/passive) is the way to go.

And if you have a largish app, it's better to have a load-balancing cluster (reverse proxies) in front of it. That can be virtualized too, so there is no need to buy hardware first: start with VMs and migrate to a hardware LB cluster later. There are so many benefits for managing business-critical services, and not just HTTP/HTTPS-based ones; these days load balancers are used with many other protocols too.

With SSL offloading (terminating SSL/TLS at the load balancers) and a little support from the application running as a service behind them, ordinary upgrades and security patches are a breeze and go without any outages. Most major upgrades, and cutting over from the old app to the newer one (and back if needed), take only a few seconds. And there are plenty of options to scale the system up to handle tens of thousands of simultaneous users easily: just add more load-balancing appliances and backend servers. Many services can of course share one load-balancing cluster, i.e. be hosted behind a few of these boxen. You can even migrate services seamlessly to public/private/hybrid cloud services behind these boxen without any disruption to your clients.

Re: Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47435943)

Back in the days when I was doing that, I had a complete second system to work on and test with. It only cost $30,000,000. We did use it for all of the grunt work, compiles, backups, development testing, etc., so they got their money out of it. But if I wanted to use it all by myself, say for an hour at lunch, they gave it to me. For all this responsibility they paid me $8,000 a year.

Re:Murphy says no. (1)

Anonymous Coward | about two weeks ago | (#47432519)

Yeah, no shit. I love how everyone is all like "quit whining and babysit that shit" when babysitting isn't necessary if you have a redundant system in place to recover from failure edge cases. Get a steady-state redundant system, then maintain them on alternating weeks. If either one fails during automated maintenance, you can just switch over to the backup until you've had breakfast. Better yet, use Amazon EC2 for your infrastructure so you can spool up as many redundant systems as necessary.

Re:Murphy says no. (2, Insightful)

NotSanguine (1917456) | about two weeks ago | (#47432921)

...Better yet, use Amazon EC2 for your infrastructure so you can spool up as many redundant systems as necessary.

Exactly. Because if Amazon screws up, they won't blame you. That fantasy and a couple bucks will get you a Starbucks latte.

Using someone else's servers is always a bad idea for critical systems. Virtualization is definitely the way to go, but use your own hardware. Yes, that means you need to maintain that hardware, but that's a small (or not so small, in a large environment -- but worth it) price to pay because Murphy was an optimist.

Re:Murphy says no. (3, Insightful)

Zenin (266666) | about two weeks ago | (#47434227)

In general, don't do anything that isn't your core business. Or another way of saying it, Do What Only You Can Do.

If you are an insurance company, is building and maintaining hardware your business? No, not in the slightest. You have no more business maintaining computer hardware than you have maintaining printing presses to print your own claims forms.

Maintaining hardware and the rest of the infrastructure stack however, is the business of Amazon AWS, Windows Azure, etc. The "fantasy" you're referring to is the crazy idea that you, as some kind of God SysAdmin, can out-perform the world's top infrastructure providers at maintaining infrastructure. Even if you were the best SysAdmin alive on the planet, you can't scale very far.

Sure, any of those providers can (and do, frequently) fail. Still, they are better than you can ever hope to be, especially once you scale past a handful of servers. If you are concerned that they still fail, that's good, yet it's still a problem worst addressed by taking the hardware in house. A much better solution is to build your deployments to be cloud vendor agnostic: Be able to run on AWS or Azure (or both, and maybe a few other friends too) either all the time by default or at the flip of a (frequently tested) switch.

Even building in multi-cloud redundancy is far easier, cheaper, and more reliable than you could ever hope to build from scratch on your own. That's just the reality of modern computing.

There are reasons to build on premises still, but they are few and far between. Especially now that cloud providers are becoming PCI, SOX, and even HIPAA capable and certified.

Re:Murphy says no. (1)

NotSanguine (1917456) | about two weeks ago | (#47434559)

In general, don't do anything that isn't your core business. Or another way of saying it, Do What Only You Can Do.

If you are an insurance company, is building and maintaining hardware your business? No, not in the slightest. You have no more business maintaining computer hardware than you have maintaining printing presses to print your own claims forms.

Maintaining hardware and the rest of the infrastructure stack however, is the business of Amazon AWS, Windows Azure, etc. The "fantasy" you're referring to is the crazy idea that you, as some kind of God SysAdmin, can out-perform the world's top infrastructure providers at maintaining infrastructure. Even if you were the best SysAdmin alive on the planet, you can't scale very far.

Sure, any of those providers can (and do, frequently) fail. Still, they are better than you can ever hope to be, especially once you scale past a handful of servers. If you are concerned that they still fail, that's good, yet it's still a problem worst addressed by taking the hardware in house. A much better solution is to build your deployments to be cloud vendor agnostic: Be able to run on AWS or Azure (or both, and maybe a few other friends too) either all the time by default or at the flip of a (frequently tested) switch.

Even building in multi-cloud redundancy is far easier, cheaper, and more reliable than you could ever hope to build from scratch on your own. That's just the reality of modern computing.

There are reasons to build on premises still, but they are few and far between. Especially now that cloud providers are becoming PCI, SOX, and even HIPAA capable and certified.

Yes. AWS, Azure, etc. are focused on (and are actually pretty good at) providing compute services (whether that be PaaS or straight-up VMs). However, what they are not is contractually responsible for the safekeeping or integrity of your data.

There are definitely use cases for using "someone else's servers." Use them for external-facing resources like a web presence, customer portal, extranet services or even email. But when it comes to business critical systems and data, no one has a more compelling motive to secure and maintain them than an internal IT staff.

I imagine you'll disagree with me, which is fine. I would point out that despite the costs of implementing and maintaining a highly available internal virtualization environment, many of those costs are significantly offset by avoiding the usage and maintenance contracts, as well as the network connectivity, required to support internal access to "someone else's servers."

In the end, it's a matter of balancing the costs against the criticality and confidentiality of the data, IMHO.

Assuming it would require me to provide personal information, remind me not to do business with whatever company you work for. Then again, if you're a shill for a "cloud" (marketing-speak for "someone else's servers"), I understand. Either way, carry on.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47437223)

OP's example was updating a distro package. He would still have to do that on AWS...

Re:Murphy says no. (1)

dnavid (2842431) | about two weeks ago | (#47435633)

The right answer to this is to have redundant systems so you can do the work during the day without impacting business operations.

The right answer is that you build in as much redundancy as you can, but you still do the work in as careful a manner as possible, during downtime windows when necessary, so that you don't waste the redundancy you have. You will look like the world's biggest idiot if you spend a huge amount of money and design resources making sure you have two of everything for redundancy, and while you're cavalierly upgrading the B systems because you have redundancy, the A systems go down. Which they will, precisely when you bring down B for middle-of-the-day upgrades, because the god of maintenance hates you, always has hated you, and always will hate you.

If you can afford N+2 or N+3 redundancy *everywhere* then you shouldn't be asking anyone else for availability advice.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47435741)

Must be nice to have funding for redundancy and testing. "Why do you need so many servers, switches, etc., etc.? Won't one computer do? Oh, can you make my Office print reports for each property, all year, just by itself?"

An ancient PE2500 here needed a raid battery recondition, so after about 6 hours upgrading BIOS, firmware, drivers, OMSA, etc. (boot, reboot, and boot yet again from a Win2k3 partition I had thought prudent not to reformat), and finding that OMSA and Array Manager of different vintages don't play well together, finally got the recondition started. Now the CentOS primary MX and DNS container the machine is host for will only be offline another 8-10 hours till the battery is ready. On the bright side, it's probably going to turn out not to have been the battery at all, just a full 120 reset (pull plug) needed to clear a hung gate somewhere on the PERC 3D/i. Plus I got to give the FOG setup I installed on the network yesterday a good workout.

I have a budget request ready for new servers, network gear, and storage, among other things. It ain't cheap, but the lease terms are good, and I'm keeping my fingers crossed.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47448193)

Yeah, cause management loves to spend double on hardware. Which is cheaper? Redundant hardware, or making a guy whose salary is already sunk work extra? Don't you understand? Hardware is expensive. The stuff you do any monkey can do, right? Welcome to management.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432355)

I modded in this thread so I'm posting AC. +1 for "Score:6, Insightful"

Re:Murphy says no. (4, Informative)

David_Hart (1184661) | about two weeks ago | (#47432357)

Here is what I have done in the past with network gear:

1. Make sure that you have a test environment that is as close to your production environment as possible. In the case of network gear, I test on the exact same switches with the exact same firmware and configuration. For servers, VMWare is your friend....

2. Build your script, test, and document the process as many times as necessary to ensure that there are no gotchas. This is easier for network gear as there are fewer prompts and options.

3. Build a backup job into your script, schedule a backup with enough time to complete before your script runs, or make your script dependent on the backup job completing successfully (a sketch of such a gate follows this comment). A good backup is your friend. Make a local backup if you have the space.

4. Schedule your job.

5. Get up and check that the job completed successfully, either when the job is scheduled to be completed or before the first user is expected to start using the system. Leave enough time to perform a restore, if necessary.

As you can probably tell, doing this in an automated fashion would take more time and effort than baby sitting the process yourself. However, it is worth it if you can apply the same process to a bunch of systems (i.e. you have a bunch of UNIX boxes on the same version and you want to upgrade them all). In our environment we have a large number of switches, etc. that are all on the same version. Automation is pretty much the only option given our scope.
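
Point 3 above, making the change depend on a successful backup, is easy to enforce mechanically if the backup job leaves a marker behind. In this sketch the marker path, the 12-hour freshness limit, and the idea that the backup job touches the file on success are all assumptions.

```python
#!/usr/bin/env python
# Gate a scheduled change on a recent, successful backup. Assumes the backup
# job touches a marker file on success; the path and the 12-hour limit are
# made-up values.
import os
import sys
import time

MARKER = "/var/backups/last-good-backup"   # hypothetical marker file
MAX_AGE = 12 * 3600                        # seconds; refuse if older than 12h

if not os.path.exists(MARKER) or time.time() - os.path.getmtime(MARKER) > MAX_AGE:
    sys.exit("no recent backup marker found, refusing to start the change")

print("backup looks fresh, proceeding with the change")
# ... invoke the actual change script here ...
```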

Re:Murphy says no. (1)

Anonymous Coward | about two weeks ago | (#47432709)

A few things to add:

6: Have some sort of rollback procedure. For example, if someone is doing maintenance on a database server, either export the non-system volumes or back the whole thing up so it can be bare-metal restored [1]. For VMs, this is easy: snapshot the sucker, upgrade, test to make sure everything is hunky-dory, and perhaps move the snapshot off for an archive. For stuff in a SAN, it can be easy or difficult. Snap a 4TB+ LUN on a VNX, and there may be a chance that the drive controller might just pop its top (which is why you have MPIO). Have neither? Do a backup, both on the application/DB level and the system level. You have backups of production machines, right? If not, stop what you are doing and get one. A downed production machine with no backups can kill an entire company, and you don't want to be the guy whose head that mountain falls on.

7: Have a changelog somewhere (a small helper is sketched after this list). It doesn't have to be an exact history, but when changing a setting like virtual CPUs in a VM, RAM allocated, or VM priority, for the love of $DEITY, write it down and save it some place accessible. Even better, print it and stick it in a paper notebook so it is accessible in a lights-off scenario.

8: Let people know what you are doing. An admin decides to do a driver update. Little does he know that the application uses specialized low-level drivers on some hardware for performance reasons. Ka-blooey happens, and all the fingers point to the poor admin who just clicked a box in Windows Update. Having fingers pointed in your direction when downtime happens is not good for the career. It usually means that you get first in line to be booted.

9: Don't do too many things at once. That way, when something breaks, it isn't that difficult to find the culprit.

10: Make sure you have the hardware vendor, OS vendor, DB vendor, and app vendor's support IDs ready to go. This is production hardware, so it goes without saying it is on a support contract. Even better, make sure the contracts are current before the downtime window, so you don't have to wake up the bloke with the purse strings at 2:00 AM to renew before you can create a sev-1, all-is-down ticket.

11: If part of the upgrade is a power down, consider powering down the hardware completely. As in, after powering it down via the usual commands, walking to the box, and yanking the cords. Then let it sit for a minute or so, plug it back in and kick it off. Some pieces of hardware get into futzy states unless power cycled. I've encountered a disk array which would lose its management head access unless power cycled every 18 months.

12: Again, keep a log of events. If you have to get some support on the phone, a piece of paper or a Notepad document is better than your memory at 3:00 in the morning.

13: Don't do too much in a window. Better to schedule multiple outages than to try to do a major thing all at once. For example, if you need to upgrade the OS, application, DB, firmware, drive firmware, and DB updates... try small chunks, so you know they work, and if something does happen, it can be rolled back fairly quickly. Save yourself a margin of time.

14: It is assumed all of this has been tested in a test environment as similar to production as possible. If not, document that.

15: Did I mention documenting anything odd, big or small? A machine seeing a blip with a transitory RAM error should at least be noted. The more documentation, the better the cover for the derriere should something break and the blamestorming start.

16: For network gear, did you save the firmware configs on every network device? A lot of people forget to copy the running config to the startup config... and things break in a spectacular manner. Are the configs stashed somewhere safe and accessible... but still secure?

17: Have you tested restore capability? I have seen virtually every backup program (be it enterprise utilities or whatever stuff someone digs up for "free") out there perform backups with no errors, but come restore time, it misses data because of a glitch or a user error like some stupid wildcard exclude.

18: Repeating... do you have a way to back off the changes? Even if it means calling the hardware vendor for a new motherboard because the flash update scrozzled things, and the backup flash isn't coming up.

19: If tinkering with AD or some replicated application, is there a way to neutralize changes with that? A hosed AD schema propagates quite fast.

20: Did you really make sure the DB is quiesced, and all users are off the boxes? There is always going to be that person who wants to do work, even while you are trying to do a major update or maintenance, and may hose things up. Come blamestorm time, he won't be the one getting replaced.

[1]: You want bare metal. Yes, you really want bare metal. With AIX, Solaris, heck, even Windows, it is easy. With Linux, good luck: there are no useful tools for this unless you boot into Clonezilla. In any case, you don't want to be installing an OS, installing a backup client, fumbling for license keys, finding and mounting media, clicking restore and hoping you got the right drives, and hoping the new OS doesn't conflict with the OS going onto the drives. Trust me, sometimes even a simple update will hose things, so being able to roll back to the pre-hosed state and have the machine up before the window ends will be the difference between you, or an H-1B (who will never cause issues with downtime past maintenance windows), having your job in a week.
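To make the snapshot step in item 6 concrete for the VM case, here is a minimal Python sketch, assuming libvirt/KVM guests with virsh available; the guest name argument and the snapshot naming are illustrative only:

    #!/usr/bin/env python3
    """Minimal sketch: snapshot a libvirt guest before maintenance so the
    rollback path from item 6 exists before the first package is touched.
    Assumes virsh is installed; guest and snapshot names are illustrative."""
    import datetime
    import subprocess
    import sys

    def snapshot_guest(guest):
        """Create a named snapshot and return its name; raises on failure."""
        name = "pre-maint-" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        subprocess.run(
            ["virsh", "snapshot-create-as", guest, name,
             "--description", "automatic pre-maintenance snapshot"],
            check=True)   # abort the window early if the snapshot fails
        return name

    def rollback(guest, name):
        """Revert the guest to its pre-maintenance snapshot (the back-out path from item 18)."""
        subprocess.run(["virsh", "snapshot-revert", guest, name], check=True)

    if __name__ == "__main__":
        guest = sys.argv[1]
        snap = snapshot_guest(guest)
        print("snapshot %s created for %s; safe to start maintenance" % (snap, guest))

The same shape applies with vSphere or Hyper-V tooling; the point is that the rollback path is created and verified before anything gets changed.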

Re:Murphy says no. (1)

bmimatt (1021295) | about two weeks ago | (#47433105)

Yes.
Also, this is one of those scenarios where virtualization pays off. You can simply spin up a new set of boxes (ideally via puppet, chef, whatever) and cut over to them once the new cluster has been thoroughly tested and tested some more. A human eye watching/managing the cutover is still recommended, if not required.

Re:Murphy says no. (1)

Slashdot Parent (995749) | about a week ago | (#47449747)

Even our network upgrades we do in the middle of the workday. After all testing is done, we just take a whole datacenter down and upgrade it. Once it's back online, we do the next datacenter until it's all done.

There's really no reason in 2014 to do 1am maintenance.

Re:Murphy says no. (1)

Vellmont (569020) | about two weeks ago | (#47432433)


  say the patch unexpectedly breaks another critical function of the server. It happens; if you have been in IT any time, you have seen it happen

Yes, this happens all the time. And really it's a case for doing the upgrade when people are actually using the system. If the patch happens at 2am (chosen because nobody is using it at 2am), nobody is going to notice it until the morning: the morning when the guy who put in the patch is still trying to recover from having to work at 2am, at the very least groggy and not performing at his/her best.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47434583)

So upgrade a production system while thousands of people are using it because if there's a problem it'll be discovered quicker?

Re:Murphy says no. (1)

Vellmont (569020) | about two weeks ago | (#47445147)

I don't believe I mentioned the number of people, merely that upgrading when nobody was using the system creates another risk that you won't know about till much later.

People in IT seem to want the "perfect" solution, which doesn't exist, or at the very least a black/white kind of thinking. Everything is tradeoffs and it's important to understand what those tradeoffs are. I've also seen people seem to think all situations and organizations are the same. (Obviously very, very wrong).

But I will say this. In some cases the best solution might be to upgrade the system when people are still using it, so that it can be switched back quickly.

Re:Murphy says no. (1)

wisnoskij (1206448) | about two weeks ago | (#47432449)

This. No matter what you do, this maintenance and downtime is hundreds of times more likely to go wrong than normal running time. What is the point of even employing IT if they are not around for this window?

Re:Murphy says no. (1)

smash (1351) | about two weeks ago | (#47432541)

Yup. Although, that said, if you have a proper test environment, like say, a snap-clone of your live environment and an isolated test VLAN, you can do significant testing on copies of live systems and be pretty confident it will work. You can figure out your back-out plan, which may be as simple as rolling back to a snapshot (or possibly not).

Way too many environments have no test environment, but these days with the mass deployment of FAS/SAN and virtualization, you owe it to your team to get that shit set up.

Re:Murphy says no. (1)

Culture20 (968837) | about two weeks ago | (#47432565)

say the patch unexpectedly breaks another critical function of the server.

When this happens, it usually takes a lot longer to fix than it takes to drive in to work, because the way it breaks is unexpected. The proper method is to have an identical server upgraded with this automatic maintenance-window method the day before, while you're at work, or at least hours before the primary system, so that you can halt the automatic method remotely before it screws up the primary system. If the service isn't important enough, let your monitoring software wake you up if there's a failure, or ignore it until you get in at your normal time. Most of the time, having a regularly well-rested sysadmin is more important to a company than having "light-switch monitoring server three" running between 4AM and 8AM.
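The gating logic for that canary approach is small. A rough Python sketch, where the canary health URL, the hold file, and the maintenance script path are all placeholders rather than a real product:

    #!/usr/bin/env python3
    """Sketch of the "upgrade a twin first, then gate the real window on it"
    idea: the primary's automation refuses to run unless the canary that went
    through the same process earlier is known-good and no operator has put a
    hold in place. The health URL, hold file, and script path are placeholders."""
    import pathlib
    import subprocess
    import sys
    import urllib.request

    CANARY_HEALTH = "http://canary01.example.com/health"   # hypothetical twin
    HOLD_FILE = pathlib.Path("/etc/maint/HOLD")             # touch this to stop the window

    def canary_healthy():
        try:
            with urllib.request.urlopen(CANARY_HEALTH, timeout=10) as resp:
                return resp.status == 200
        except OSError:
            return False

    if __name__ == "__main__":
        if HOLD_FILE.exists():
            sys.exit("operator hold in place -- skipping automated window")
        if not canary_healthy():
            sys.exit("canary never recovered from its earlier run -- not touching the primary")
        # Placeholder for whatever the actual maintenance automation is.
        subprocess.run(["/usr/local/sbin/do-maintenance.sh"], check=True)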

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47434607)

You should always have a competent tech on hand for maintenance tasks. Period. If you do not, Murphy will bite you.

If you don't, you just might end up with a single frame of a German porn movie, with a huge black penis inserted into the shapely white ass of a lady, showing from the morning all the way to the late afternoon, ultimately leading to the demise of the station. So there. Don't do it.

Momas don't let your babies grow up to do IT. (1)

GrantRobertson (973370) | about two weeks ago | (#47437885)

This is exactly why I don't do IT any more. All the responsibility to keep things working, no authority to make users or departments not screw things up, none of the credit when things go smoothly, but all of the blame when anything goes wrong (no matter who caused it); every department's poorly planned extra computer expenses come out of your budget; all that unpaid overtime means you are barely making minimum wage; you are constantly reminded that your job is hanging in the balance; AND you are expected to keep taking expensive certification classes on your own dime just for the privilege of bending over for one more year.

Do upgrades during the day (1)

Slashdot Parent (995749) | about a week ago | (#47449713)

You should always have a competent tech on hand for maintenance tasks.

I agree with this, but who does maintenance at 1am anymore? What's the point of it? Users are worldwide, and 1am in the US is prime business hours in Asia, so why bother patching/upgrading in the middle of the night?

I haven't done a late-night maintenance in at least a decade. It's all about rolling upgrades. Any problems? Rollback. Need to upgrade infrastructure? Take the entire datacenter offline and serve from your other datacenters. Every single upgrade I've done for as long as I can remember has been at 10am, which is the earliest I can get my lazy-ass junior devs to stumble into the office.

OP needs a process upgrade.

windows (0)

Anonymous Coward | about two weeks ago | (#47432037)

Phew, thought you were going to ask about it on Windows. Linux, go for it!

Re:windows (0)

Anonymous Coward | about two weeks ago | (#47432257)

I agree. Linux is much better than Windows.

Re:windows (2)

smash (1351) | about two weeks ago | (#47432627)

OS choice is irrelevant. I've seen plenty of critical linux fuck ups in my day, and OS choice doesn't account for human error. And, being human, you WILL make human errors. You need a test environment and a backout plan. If you don't at least have a back-out plan and an estimate of how much the fuckup will cost BEFORE proceeding (and balancing that against the cost/risk of leaving it the fuck alone), you should not be carrying out the work.

Sure, that sounds like management speak, but seriously... cover your fucking ass. Because one day it will fuck up (whatever, the OS, this isn't just a Linux or Windows problem) and whilst the fuck up may not necessarily be your fault, the extended downtime because you have not tested and have no backout plan will be.

Re:windows (0)

Anonymous Coward | about two weeks ago | (#47436455)

My company does this with Windows. We use Shavlik Protect. It has been renamed and changed owners a few times but it's basically the same product for the last 8 years and it is not very expensive at all.

It is a fully automated patch deployment and reboot. We decide what patches we want to deploy to which servers (we create groups of servers and groups of patches), set up a schedule to scan them for patches and email the results the day before, and at 7pm on our maintenance window day it automatically pushes the patches that are missing. For most of our servers we have them automatically reboot after the patches are done installing and then do a rescan and email the results. For servers we do not want to reboot automatically (like domain controllers and our MSCS clusters), we reboot them manually through the patch console or through another script, one at a time. The process takes about 45 minutes to patch and reboot about 500 servers in 12 different countries.

The patches can be standard MS patches and service packs for the OS and various applications (SQL, Exchange, Forefront), plus a lot of popular software (Firefox, Adobe, Java, WinZip, MS Office, VMtools, FileZilla and many others), and you can create your own custom patches. If our maintenance window gets cancelled for some reason before the 7PM go time, we log in to the patch console and stop the scheduled task.

Shavlik does other things like antivirus, asset tracking and such, but we don't use it for that.
Aside from scheduled maintenance, we also use it for patching systems not in production. For example, if I have a dev machine with Server 2008 on it and I need to put SP2 on it, or Exchange 2010 SP3, instead of logging into the server and doing it manually, I push it through the patch management console right then.

Now, all that being said, we are not fully unattended at all; we still have quite a few people online during our entire downtime window doing various things, but the patch and reboot portion of that window is done very quickly.
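For anyone without a commercial tool, the general shape of that scan/patch/reboot/report loop looks roughly like the Python sketch below. This is emphatically not the Shavlik API; the host list, the SSH-based patching, and the mail relay are assumptions for illustration only:

    #!/usr/bin/env python3
    """Rough sketch of a scan/patch/reboot/report loop -- emphatically not the
    Shavlik API. The host list, SSH-based patching, and the mail relay are
    assumptions for illustration only."""
    import smtplib
    import socket
    import subprocess
    import time
    from email.message import EmailMessage

    HOSTS = ["app01.example.com", "app02.example.com"]   # hypothetical inventory

    def wait_for_ssh(host, timeout=900):
        """Poll TCP/22 until the host answers again after its reboot."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with socket.create_connection((host, 22), timeout=5):
                    return True
            except OSError:
                time.sleep(15)
        return False

    def patch_and_reboot(host):
        """Patch one host over SSH, reboot it, and report what happened."""
        patch = subprocess.run(["ssh", host, "yum -y update"], capture_output=True)
        if patch.returncode != 0:
            return "%s: patch FAILED (rc=%d)" % (host, patch.returncode)
        subprocess.run(["ssh", host, "shutdown -r now"])
        if wait_for_ssh(host):
            return "%s: patched and back online" % host
        return "%s: did not come back after reboot" % host

    def mail_report(lines):
        """Email the per-host results, like the scan report described above."""
        msg = EmailMessage()
        msg["Subject"] = "Patch window report"
        msg["From"] = "ops@example.com"              # placeholder addresses
        msg["To"] = "oncall@example.com"
        msg.set_content("\n".join(lines))
        with smtplib.SMTP("mail.example.com") as relay:   # hypothetical relay
            relay.send_message(msg)

    if __name__ == "__main__":
        mail_report([patch_and_reboot(h) for h in HOSTS])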

I've toyed with this concept.. (5, Interesting)

grasshoppa (657393) | about two weeks ago | (#47432041)

...and while I'm reasonably sure I can execute automated maintenance windows with little to no impact to business operations, I'm not certain. So I don't do it.

If there were more at stake, if the risk-vs-benefit balance were tipped more in my company's favor, I might test-implement it. But just to catch an extra hour or two of sleep? Not worth it; I want a warm body watching the process in case it goes sideways. 9 times out of 10, that warm body is me.

Re:I've toyed with this concept.. (3, Insightful)

mlts (1038732) | about two weeks ago | (#47432809)

Even on fairly simple things (yum updates from mirrors, AIX PTFs, Solaris patches, or Windows patches released from WSUS), I like babysitting the job.

There is a lot that can happen. A backup can fail, then the update can fail. Something relatively simple can go ka-boom. A kernel update doesn't "take" and the box falls back to the wrong kernel.

Even something as stupid as having a bootable CD in the drive and the server deciding it wants to run the OS from that rather than from the FCA or onboard drives. Being physically there so one can rectify that mistake is a lot easier when planned, as opposed to having to get up and drive to work at a moment's notice... and by that time, someone else has likely discovered it and is sending scathing e-mails to you, CC'd to 5 tiers of management.

Re:I've toyed with this concept.. (1)

pr0fessor (1940368) | about two weeks ago | (#47433251)

I always test in advance, have a roll back plan, only automate low risk maintenance, test the results remotely, and have a warm body on back up should the need arise. Saves a little sleep since I don't babysit the entire process just the result. I don't have physical access to most of the equipment since it's scattered across multiple data centers so I do most of my work remotely anyway.

Automated troubleshooting? (5, Insightful)

HBI (604924) | about two weeks ago | (#47432047)

Maintenance windows are at off-hours to accommodate real work happening. If every action were painless and produced the desired result, you could do it over lunch or something like that. But that's not the real world.

This begs the question of how the hell are you going to fix unexpected problems in an automated fashion? The answer is, you aren't. Therefore, you have to be up at 2am.

Re:Automated troubleshooting? (0)

Anonymous Coward | about two weeks ago | (#47432223)

Well, there might not be ways of fixing unexpected problems in an automated fashion, but there are certainly a lot of ways of catching unexpected problems and sending out an automated text/email to wake you up. If you carefully handle return codes and set timeouts in your scripts, as well as monitoring the machine from the outside, you should be able to sleep most of the time. Do I do it? No. My clients refuse. Besides, there are DBAs and app admins on a phone bridge waiting for me to hand over the updated server so they can start their DBs and apps and have the regression tests begin.
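The return-code-and-timeout part is simple enough to sketch in Python; the pager address, relay, and the commands themselves are placeholders, but the shape is: run a step with a hard deadline, and only wake someone if it hangs or exits non-zero.

    #!/usr/bin/env python3
    """Sketch of the "handle return codes and set timeouts" idea: run one
    maintenance step with a hard deadline and only wake someone if it hangs or
    fails. The pager address, relay, and commands are placeholders."""
    import smtplib
    import subprocess
    from email.message import EmailMessage

    PAGER = "oncall-pager@example.com"    # hypothetical email-to-SMS gateway

    def alert(subject, body):
        """Send a short message to whatever wakes the on-call person up."""
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = "maint@example.com"
        msg["To"] = PAGER
        msg.set_content(body)
        with smtplib.SMTP("localhost") as relay:
            relay.send_message(msg)

    def run_step(cmd, timeout_s):
        """Run one step; return True only if it finished cleanly in time."""
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=timeout_s)
        except subprocess.TimeoutExpired:
            alert("maintenance step hung", "%s exceeded %ds" % (cmd, timeout_s))
            return False
        if result.returncode != 0:
            alert("maintenance step failed",
                  "%s -> rc=%d\n%s" % (cmd, result.returncode, result.stderr))
            return False
        return True

    if __name__ == "__main__":
        # Each step runs only if the previous one exited 0 within its deadline.
        if run_step(["yum", "-y", "update", "somepackage"], 1800):
            run_step(["systemctl", "restart", "someservice"], 120)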

Automated troubleshooting? (0)

Anonymous Coward | about two weeks ago | (#47432225)

I'm seeing this increasingly often......misuse of the phrase "begs the question". Why don't you look it up?

Re:Automated troubleshooting? (2)

HBI (604924) | about two weeks ago | (#47432293)

How about looking up "pedantry"?

Re:Automated troubleshooting? (1)

gstoddart (321705) | about two weeks ago | (#47432381)

I'm seeing this increasingly often......misuse of the phrase "begs the question". Why don't you look it up?

There are now two distinct phrases in the English language:

There is the logical fallacy of begging the question.

Sometimes, an event happens which begs (for) the question of why nobody planned for it.

You might think you sound all clever and stuff, but you're wrong. They sound similar, but they aren't the same. The second one has been in common usage for decades now, and has nothing to do with the logical fallacy.

Raises the question (1)

tepples (727027) | about two weeks ago | (#47432681)

Sometimes, an event happens which begs (for) the question of why nobody planned for it.

This raises the question of why people don't just avoid the pedantic bickering by saying "raises the question".

Re:Raises the question (2)

gstoddart (321705) | about two weeks ago | (#47432749)

This raises the question of why people don't just avoid the pedantic bickering by saying "raises the question".

Because, generally speaking, pedants are tedious and annoying, and nobody else cares about the trivial minutia they like to get bogged down in because it's irrelevant to the topic at hand.

At least, that's what my wife tells me. ;-)

Re:Raises the question (1)

NotSanguine (1917456) | about two weeks ago | (#47432959)

This raises the question of why people don't just avoid the pedantic bickering by saying "raises the question".

Because, generally speaking, pedants are tedious and annoying, and no one else cares about the trivial minutiae in which pedants like to get bogged down. It's irrelevant to the topic at hand.

At least, that's what my wife tells me. ;-)

There. FTFY. Pedantry and grammar nazism all in one pretty package. You're welcome.

Automated troubleshooting? (0)

Anonymous Coward | about two weeks ago | (#47432241)

We do some automation for maintenance, but the end result has to be able to be tested thoroughly automatically. If the automated tests succeed, I stay asleep. If they fail, I get paged and wake up to deal with it. 90% of the time, it works and I get to sleep through the night. But we can only really do this for simple maintenance.
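A rough Python sketch of that pattern, where the health URL, database host, and paging webhook are stand-ins for whatever the environment actually uses; if every check passes, nothing fires and everyone stays asleep:

    #!/usr/bin/env python3
    """Sketch of "automated tests decide whether to wake me". The checks, URLs,
    ports, and the paging webhook are placeholders for whatever the environment
    actually runs; if everything passes, nobody gets paged."""
    import json
    import socket
    import urllib.request

    PAGE_WEBHOOK = "https://alerting.example.com/page"   # hypothetical endpoint

    def http_ok(url):
        """The service should answer 200 after the maintenance finishes."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status == 200
        except OSError:
            return False

    def port_open(host, port):
        """A dependent service should still be listening."""
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            return False

    def page(summary):
        """POST a summary to the (placeholder) paging webhook."""
        body = json.dumps({"summary": summary}).encode()
        req = urllib.request.Request(PAGE_WEBHOOK, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=10)

    if __name__ == "__main__":
        checks = {
            "web frontend": lambda: http_ok("https://app.example.com/health"),
            "database port": lambda: port_open("db01.example.com", 5432),
        }
        failures = [name for name, check in checks.items() if not check()]
        if failures:
            page("post-maintenance checks failed: " + ", ".join(failures))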

Re:Automated troubleshooting? (1)

mshieh (222547) | about two weeks ago | (#47432471)

If you have proper monitoring, you don't need to be up at 2am. You just need to be willing to answer the phone at 2am.

Re:Automated troubleshooting? (1)

HBI (604924) | about two weeks ago | (#47432675)

That is not true. If my job is important and my systems are important, I'm on site to make sure that change is successful.

When I was with IBM, our policy was to open up a conference call and have all the requisite support staff on the call until the change window closed. You paid through the nose for that kind of support, but our downtime was minimal and some customers needed that.

When I am working in theater on critical systems in wartime, I don't sit in my fucking hooch and use automated tools. My ass is in front of the boxes in question to respond instantly. The alternative is broken tactical systems meaning bad information being used to make decisions meaning dead people.

Your slack attitude doesn't cut it in the places I work.

Re:Automated troubleshooting? (0)

Anonymous Coward | about two weeks ago | (#47433393)

Luckily you are less than 0.01% of the IT world, so the chances of it being that critical are lower than me winning the Powerball lotto.

Also that jarhead attitude is why I refused to be a part of the military.

Re: Automated troubleshooting? (0)

Anonymous Coward | about two weeks ago | (#47433433)

"We need this exchange server up so we can kill more babies!"

Re:Automated troubleshooting? (0)

Anonymous Coward | about two weeks ago | (#47435655)

Considering that the organization you work for is in the business of killing people and recently started employing automation to do such, I'd be careful about getting too high on your horse there, Kevin.

CAPTCHA: squads ... how ironic is that?

Re:Automated troubleshooting? (1)

CAIMLAS (41445) | about two weeks ago | (#47433847)

Or, chances are (if you're the ONLY sysadmin on staff), other people could stand not working for a while at 8pm once every other week while you do your maintenance at a saner hour. If you're not big enough to have multiple sysadmins and/or multiple tiers of redundancy, chances are you aren't big enough to justify 365/24 uptime. Someone else can not work so you can get work done, to enable them to keep working.

They probably work too much, anyway. No need for that to make you work too much, too.

This is why n+2 and Vmware are so useful. (1)

Anonymous Coward | about two weeks ago | (#47432051)

If you have a high availability system with more than one backup node then daytime maintenance becomes very doable.

Attended automation (3, Interesting)

Anonymous Coward | about two weeks ago | (#47432053)

Attended automation is the way to go. You gain all the advantages of documentation, testing, etc. If the automation goes smoothly, you only have to watch it for 5 minutes. If it doesn't, then you can fix it immediately.

Schedule some days as offset days (1)

ModernGeek (601932) | about two weeks ago | (#47432057)

You just need to schedule some of your days as offset days. Work from 4pm to midnight some days so that you can get some work done when others aren't around. Some days require you to be around people; some days demand that you be alone.

Or you can just work 16hour days like the rest of us and wear it with a badge of honor.

If you are your own boss and do this, you can earn enough money to take random weeks off from work with little to no notice so that you can travel the world, and do some recruiting while doing it so that you can write the expenses off on the company.

Re:Schedule some days as offset days (1)

DarkOx (621550) | about two weeks ago | (#47432201)

Pretty much this. If your company is big enough, or derives enough revenue from IT systems that require routine off-hours maintenance, it should staff for that.

That is not to say they need to if it's just Patch Tuesdays, or the occasional major internal code deployment that happens a couple of times a year. For that, you as the admin should suck it up and roll out of bed early once in a while. Hopefully your bosses are nice and let you have some flextime for it; knock off at 3pm on Fridays those weeks or something.

If there is a regular maintenance window that is frequently used, say at least twice a week, then they need to make those the regularly scheduled working hours for some employee(s). Maybe some junior admin who can follow deployment instructions works 3a-10a Tuesdays and Wednesdays; but let's be fair to that person: they have a life outside of work and deserve a predictable schedule. They should still work those hours even if there is nothing going on that week, and just use the time to do whatever else they do: update documentation, test new software versions, inventory, etc.

     

Re:Schedule some days as offset days (0)

brausch (51013) | about two weeks ago | (#47436095)

Yes. This.

Re:Schedule some days as offset days (1)

CanHasDIY (1672858) | about two weeks ago | (#47432235)

Or you can just work 16hour days like the rest of us and wear it with a badge of honor.

IMO, there is no honor in working more hours than you're actually being paid to work. Not only are you hurting yourself, you're keeping someone else from being able to take that job.

If you've got 80 hours worth of work to do at your company, and one guy with a 40-hour-a-week contract, you need to hire another person, not convince the existing guy that he should be proud to be enslaved. Morally speaking.

Re: Schedule some days as offset days (0)

Anonymous Coward | about two weeks ago | (#47432291)

Yeah, but what if you're making tried as much as the guys working 40 hours?

Re: Schedule some days as offset days (1)

jtmach (958490) | about two weeks ago | (#47432649)

In that case, you'll probably make silly spelling mistakes.

Re: Schedule some days as offset days (1)

smash (1351) | about two weeks ago | (#47432657)

Work smarter, not harder. This is the difference between an IT muppet and someone who actually goes places in this industry.

Re: Schedule some days as offset days (0)

Anonymous Coward | about two weeks ago | (#47434239)

I agree... When I did this, we had infrastructure in place where we could get on the console of ANY server/system regardless of its behavior. We also had the capacity to interrupt the power so we could force a reset of ANY host regardless of what happened. We could do all this remotely, so we didn't need to have *anybody* on site unless we needed to physically change something (replace hardware or move cables). We also had a FULLY REDUNDANT configuration where there were always at least two systems for each function.

With all this in place, I could work maintenance windows from my living room. I could power a system off/on, interact with the BIOS settings, boot from an ISO image on my local machine, and re-initialize it from scratch. But that's what you have to do if the system is a minimum of 20+ hours of travel time away from you (Santiago, Chile) and you cannot afford local help.

Re:Schedule some days as offset days (1)

rikkards (98006) | about two weeks ago | (#47432529)

Not only that, but a company that lets someone do that is shooting itself in the foot. Sooner or later the 80-hour-a-week guy is going to leave, and good luck getting someone who is:
A: willing to do it coming in
B: not just taking the job until something better comes along.

It's not a badge of honor, just a rationalization of a crappy job.

Re:Schedule some days as offset days (1)

Lumpy (12016) | about two weeks ago | (#47433495)

Actually, that 80-hour-a-week guy will quit without notice, really fucking the company.

Hell, I took my remaining vacation time and called in sick for the last days I had in my sick bank while I worked at my new job. The day I was supposed to be back at my old job, I walked in at 8 am, dropped my badge and keys on my boss's desk and said, "I quit, hope you can hire the 3 guys you will need to replace me quickly." I had informed HR in writing that I was quitting that morning at 7am when they got there.

It screwed them over hard, really hard. I got a call from the VP begging me to come back with a 30% pay increase 2 hours after I walked out the door. I told him there isn't enough money in the world for me to work there anymore, and I wished them luck replacing me.

Funny thing, 6 months later the 2008 crash happened and they closed up for good.

Re:Schedule some days as offset days (0)

Anonymous Coward | about two weeks ago | (#47434733)

And then they sued you for not sticking to the contract you signed when you joined then, which said how many days notice you needed to give them.

Re:Schedule some days as offset days (1)

Lumpy (12016) | about two weeks ago | (#47442471)

What moron would sign a contract like that? Are you really that dumb?

Re:Schedule some days as offset days (0)

Anonymous Coward | about two weeks ago | (#47434167)

Or you can just work 16hour days like the rest of us and wear it with a badge of honor.

IMO, there is no honor in working more hours than you're actually being paid to work. Not only are you hurting yourself, you're keeping someone else from being able to take that job.

Not always. I worked a job where we did maintenance windows regularly, usually starting at 2am because that's when our customer would let us. I routinely worked from home between 3 and 5 hours on these nights which happened multiple times a week. However, I was still expected to maintain normal business hours in the office. After doing this for a year, we got a new executive (my boss' boss) and she started having little breakfast meetings with all her employees in little groups. Wouldn't you know it, my number came up the day after a maintenance window so I showed up for her breakfast meeting about 10 min late and with 3 hours sleep. Like all "good" managers, she asked if we had any thoughts about how we could make things better and I responded with a complaint about having to maintain normal office hours when I was working nights on maintenance windows. It was a mistake. I got canned about 2 months later. The layoff package was nice, but I should have kept my mouth shut.

In short, I wasn't preventing anybody from having a job, I was maintaining my job. Yes, it wasn't fair, but at the time it was all anybody could do. This same place sent me to Costa Rica to work 18-hour days for 20+ days at a time, without giving me even half a day off either IN country or when I got home. When I left, they didn't replace me directly and ended up losing the customers I serviced.

Re:Schedule some days as offset days (1)

Wolfrider (856) | about two weeks ago | (#47446591)

--I'll give you an AMEN on that!!

Re:Schedule some days as offset days (2)

QRDeNameland (873957) | about two weeks ago | (#47432319)

Or you can just work 16hour days like the rest of us and wear it with a badge of sucker.

FTFY

Re:Schedule some days as offset days (1)

Lumpy (12016) | about two weeks ago | (#47433405)

"Or you can just work 16hour days like the rest of us and wear it with a badge of honor."

Keep telling yourself that, someday you will believe it.

Want to know what another badge of honor is accepting a lower wage for those 16 hours. A real IT pro would take $16.00 an hour and sleep under their desk.

Depends on the Application layer / patch applied (1)

slacklinejoe (1784298) | about two weeks ago | (#47432065)

I do this for a lot of clients: Automatic Deployment Rules in Configuration Manager, scripts, cron jobs, etc. For test/dev, it absolutely makes sense, as I usually have a monitoring system that goes into Maintenance Mode during the updates. If things take too long or if services aren't restored post-update, the monitoring system gives me a shout that something needs remediating.

For production, it varies with the expected impact. If it's something I tested in pilot with zero issues and the application doesn't have an insane SLA, sure, I'll use an automatic deployment. When I'm working on hospital equipment, such as servers processing imaging or vitals monitoring for surgery, that gets nixed no matter what due to the liability concerns.

I usually suggest building up trust and experience by automating the less critical systems and phasing in more sensitive systems until you've both gained a lot of experience with it and have more management support to do so. When crap goes down, it's easier to say "this is a tested process we've been using for years" than "yeah, oops, new script, sorry that knocked down our ERP system"... resume-generating event right there. So I guess it depends; it's just another tool for the toolbox, and it's up to the carpenter to know when to pull it out.

Ansible (0)

Anonymous Coward | about two weeks ago | (#47432071)

I like ansible... a lot. Chef, salt, something else if that is your preference. In any event, yes, an automated deployment framework allows you to test the maintenance procedure out, throttle the number of servers that get managed at one time, and bail (and/or text you) if there is a problem.

Done right it can be run continuously so that you are always confident about the state of your servers and their maintenance procedures.
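The throttle-and-bail behaviour is what Ansible's serial and max_fail_percentage settings give you; stripped of the framework, it amounts to something like this Python sketch (host names, batch size, and the update command are placeholders):

    #!/usr/bin/env python3
    """Not Ansible itself -- just a sketch of the throttle-and-bail behaviour,
    so the idea is concrete. Host names, batch size, and the update command are
    placeholders; a real setup would drive this from an inventory."""
    import subprocess

    HOSTS = ["web%02d.example.com" % n for n in range(1, 11)]  # hypothetical fleet
    BATCH_SIZE = 2      # roughly "serial: 2" in Ansible terms
    MAX_FAILED = 1      # stop the rollout once this many hosts are broken

    def update(host):
        """Run the maintenance command on one host; True on success."""
        result = subprocess.run(["ssh", host, "yum -y update"], capture_output=True)
        return result.returncode == 0

    failed = []
    for i in range(0, len(HOSTS), BATCH_SIZE):
        for host in HOSTS[i:i + BATCH_SIZE]:
            if not update(host):
                failed.append(host)
        if len(failed) >= MAX_FAILED:
            # Remaining hosts stay untouched until a human has looked at it.
            print("halting rollout, failures on: %s" % ", ".join(failed))
            break
    else:
        print("all hosts updated")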

We do it all the time... (0)

Anonymous Coward | about two weeks ago | (#47432073)

We do it all the time... Schedule a snapshot, push patches, verify things are up, and if not throw an alarm... Using Shavlik on, horror of horrors, Windows...

I've rolled back 2 or 3 in the past 5 years, usually due to Microsoft's inability to consistently write a patch that doesn't break something, and once because the vendor turned out to be hypersensitive to .NET patch levels...

use some configuration management tools (0)

Anonymous Coward | about two weeks ago | (#47432075)

- cfengine
- puppet
- chef
- ansible
- salt

All should be able to do the work.

Offshore (4, Insightful)

pr0nbot (313417) | about two weeks ago | (#47432091)

Offshore your maintenance jobs to someone in the correct timezone!

Re:Offshore (0)

Anonymous Coward | about two weeks ago | (#47432177)

this is what my employer does

Re:Offshore (0)

Anonymous Coward | about two weeks ago | (#47432395)

This is a temporary solution if your business grows worldwide. It's 2014, time to start using virtualization/imaging and distributed services. You can keep everything running while you do maintenance on individual nodes. Some systems can even move a running virtual image to another physical server.

Re:Offshore (1)

smash (1351) | about two weeks ago | (#47432671)

Yup. The company I work for currently has only a 4-hour window per day where we don't have active users on the clock. And if we win a job in, say, South America (we're a mining company), that goes out the window entirely. vMotion, virtual networking, and virtual filers/writable snapshots are all beautiful things.

Re:Offshore (0)

Anonymous Coward | about two weeks ago | (#47432875)

Bingo. Test, test, test during business hours on your test/qa environment. Then hand it off to swing/night shifts to implement. Sure there's a chance that something will go wrong that they can't fix, but they can wake you up if they need to. Most of the time if you tested it well and they're reasonably competent, it'll go fine.

Sounds like a bad idea ... (4, Insightful)

gstoddart (321705) | about two weeks ago | (#47432095)

You don't monitor maintenance windows for when everything goes well and is all boring. You monitor them for when things go all to hell and someone needs to correct it.

In any organization I've worked in, if you suggested that, you'd be more or less told "too damned bad, this is what we do".

I'm sure your business users would love to know that you're leaving it to run unattended and hoping it works. No, wait, I'm pretty sure they wouldn't.

I know lots of people who work off-hours shifts to cover maintenance windows. My advice to you: suck it up, princess; that's part of the job.

This just sounds like risk taking in the name of being lazy.

Re:Sounds like a bad idea ... (1)

pr0fessor (1940368) | about two weeks ago | (#47433371)

I automate low risk maintenance, it doesn't alleviate the responsibility for prior testing, a roll back plan, or monitoring the results, but it does save time. If you refuse to automate any of your work you would never make a deadline and wouldn't last very long where I work.

Re:Sounds like a bad idea ... (1)

gstoddart (321705) | about two weeks ago | (#47433441)

Oh, I automate deployments, and I automate some monitoring. Don't get me wrong, I'm not opposed to automation.

Like all programmers, I'm lazy and would rather code it once instead of doing it by hand many many times.

That doesn't mean I'd walk away from it and leave it unattended. To me, that's just asking to get bit in the ass.

These days, anything which is low risk maintenance is stuff I do during the daytime because it's not Production. For our Production environments, everything is considered high risk because the systems are mission critical. Any change at all is high risk, because if it breaks, it costs the company large amounts of money to be down.

You have to understand what your threshold of risk is, and what your actual risks are before you do any automation. Some systems you can play fast and loose with. Others, not so much.

This is why you need.. (3, Insightful)

arse maker (1058608) | about two weeks ago | (#47432101)

Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.

Having someone with little or no sleep doing critical updates is not really the best strategy.

Re:This is why you need.. (5, Insightful)

Shoten (260439) | about two weeks ago | (#47432207)

Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.

Having someone with little or no sleep doing critical updates is not really the best strategy.

First off, you can't mirror everything. Lots of infrastructure and applications are either prohibitively expensive to do in a High Availability (HA) configuration or don't support one. Go around a data center and look at all the Oracle database instances that are single-instance...that's because Oracle rapes you on licensing, and sometimes it's not worth the cost to have a failover just to reach a shorter RTO target that isn't needed by the business in the first place. As for load balancing, it normally doesn't do what you think it does...with virtual machine farms, sure, you can have N+X configurations and take machines offline for maintenance. But for most load balancing, the machines operate as a single entity...maintenance on one requires taking them all down because that's how the balancing logic works and/or because load has grown to require all of the systems online to prevent an outage. So HA is the only thing that actually supports the kind of maintenance activity you propose.

Second, doing this adds a lot of work. Failing from primary to secondary on a high availability system is simple for some things (especially embedded devices like firewalls, switches and routers) but very complicated for others. It's cheaper and more effective to bump the pay rate a bit and do what everyone does, for good reason...hold maintenance windows in the middle of the night.

Third, guess what happens when you spend the excess money to make everything HA, go through all the trouble of doing failovers as part of your maintenance...and then something goes wrong during that maintenance? You've just gone from HA to single-instance, during business hours. And if that application or device is one that warrants being in an HA configuration in the first place, you're now in a bit of danger. Roll the dice like that one too many times, and someday there will be an outage...of that application/device, followed immediately after by an outage of your job. It does happen, it has happened, I've seen it happen, and no one experienced who runs a data center will let it happen to them.

Re:This is why you need.. (1)

MondoGordo (2277808) | about two weeks ago | (#47432417)

In my experience, if your load-balancing solution requires all your nodes to be available, and you can't remove one or more nodes without affecting the remainder, it's a piss-poor load balancing solution. Good load balancing solutions are fault tolerant up to, and including, absent or non-responsive nodes and any load balanced system that suffers an outage due to removing a single node is seriously under-resourced.

Re:This is why you need.. (0)

Anonymous Coward | about two weeks ago | (#47436795)

You're mixing up load balancing and high availability.

Re:This is why you need.. (1)

mlts (1038732) | about two weeks ago | (#47432941)

There is also the fact that some failure modes will take both sides down. I've seen disk controllers overwrite shared LUNs, hosing both sides of the HA cluster (which is why I try to at least quiesce the DB or application so RTO/RPO in case of that failure mode is acceptable.)

HA can also be located on different points on the stack. For example, an Oracle DB server. It can be clustered on the Oracle application level (active/active or active/passive), or it can be sitting in a VMWare instance, clustered using vSphere HA, where the DB itself thinks it is a single instance, but in reality, it is sitting active/passive on two boxes.

Even if the backup stays up, failing back can be an issue. I've seen HA systems where it will happily drop to the backup node... but failing back to the primary can require a lot of downtime. For active/active setups, it can require a performance hit for resyncing.

Re:This is why you need.. (0)

Anonymous Coward | about two weeks ago | (#47434643)

"First off, you can't mirror everything. Lots of infrastructure and applications are either prohibitively expensive to do in a High Availability (HA) configuration or don't support one"

I guess from this comment that you've never worked in an environment that does mirror everything. Like a bank where not only servers within a data center are mirrored, but whole data centers are mirrored, and failover implemented as automatic for everything. If you really need this you use stuff that supports it and pay a fortune to do so.

If you can't afford it, you are not a 24/7 operation, because your system can have outages. Believe me, after 20 years on both the can-afford-it and the can't-afford-it sides of the divide, that's the way this works.

Re:This is why you need.. (1)

Slashdot Parent (995749) | about a week ago | (#47449787)

Go around a data center and look at all the Oracle database instances that are single-instance...that's because Oracle rapes you on licensing

Then stop using Oracle if you can't afford RAC/GoldenGate/TAF/whatever. Use what you can afford in order to architect a proper redundant system. Running a database on a single instance is malpractice in 2014.

Re:This is why you need.. (0)

Anonymous Coward | about a week ago | (#47469229)

Properly designed redundant systems (load balanced, mirrored, or otherwise) are always a benefit for planned maintenance, and also for unplanned outages.
Successful upgrades to redundant systems build management confidence that planned maintenance can be done during peak/business hours. Having redundant systems means that for unplanned outages there is a chance of minimal or no service interruption for the users.
Not having redundant systems means planned maintenance can only be done during non-peak/after hours for critical systems. Non-redundant systems mean that for unplanned outages there is a guaranteed loss of service for users.

>>First off, you can't mirror everything.
Without getting too involved in the term "mirror", it is always achievable to maintain a redundant system; otherwise you would not even be able to create/maintain a DR instance and you would be in a bad spot. Many apps are designed with HA in mind, while many others, I think as you are alluding to, are extremely challenging to maintain even a DR instance for, let alone an HA instance, due to costs as you mentioned, application limitations, data integrity, etc. However, regarding uptime, it is always better to maintain a redundant instance of an application/system, regardless of what technology you use (load balancing, mirrored systems, virtualization, etc.). The redundant instance is better for unplanned outages, as well as planned maintenance.

>>Second, doing this adds a lot of work.
Building an HA/redundant infrastructure takes a lot of work up front, but it saves a lot of work, and creates a lot more uptime, down the road.

>>Third, guess what happens when you spend the excess money to make everything HA, go through all the trouble of doing failovers as part of your maintenance...and then something goes wrong during that maintenance?
Running a redundant system in a degraded state, even if it's down to a SPOF, is better than having a single, non-redundant system offline, regardless of whether the downtime was caused by planned maintenance or an unplanned outage.

Re:This is why you need.. (1)

CWCheese (729272) | about two weeks ago | (#47432287)

Several posts have alluded to high availability, mirroring, load balancing, etc., as being the solution to simply updating systems. The problem from a management point of view is to remain on guard for when a patch or upgrade goes bad. Having turned into one of those 'old guys', I'm quite sobered by the bad maintenance windows I've been a party to, and I will never consider unattended maintenance windows for my teams. It's better for me to schedule the work and let my folks adjust their work days so they get to the maintenance fully alert, aware, and in full attendance for that time when things don't go as planned.

Re:This is why you need.. (1)

smash (1351) | about two weeks ago | (#47432695)

Yeah, don't get me wrong (i've been posting about setting up a test lab using vSphere, vFilters and vlans) - you can't replace the need to have someone on call or watching in case it all fucks up. But you can generally reduce the outage window and risk significantly by actually testing (both the roll out and roll back) first. And if you've got it to the point where you can reliably test, you can work on your automation scripts, test the shit out of them, and having been tested with a copy of live using a copy of live data, be reasonably confident that they will work.

If they don't? Snapshot the breakage, roll back to pre-fuckup, and examine at your leisure. Then re-schedule once you know wtf happened.

Re:This is why you need.. (1)

CanHasDIY (1672858) | about two weeks ago | (#47432327)

Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.

Having someone with little or no sleep doing critical updates is not really the best strategy.

Oh my $deity, this!

I've worked in environments with test-to-live setups, and ones without, and the former is always, always a smoother running system than the latter.

Immutable Servers (2)

skydude_20 (307538) | about two weeks ago | (#47432105)

If these services are as critical as you say, I would assume you have some sort of redundancy, at least a 2nd server somewhere. If so, treat each as "throw away": build out what you need on the alternative server, swing DNS, and be done. Rinse and repeat for the next 'upgrade'. Then do your work in the middle of the day. See Immutable Servers: http://martinfowler.com/bliki/... [martinfowler.com]

Automate Out (2)

whipnet (1025686) | about two weeks ago | (#47432115)

Why would you want to automate someone or yourself out of a job? I realized years ago that Microsoft was working hard to automate me out of my contracts. It's almost done, why accelerate the inevitable?

Re:Automate Out (2)

smash (1351) | about two weeks ago | (#47432717)

This is why you move the fuck on and adapt. If your job is relying on stuff that can be done by a shell script, you need to up-skill and find another job. Because if you don't do it, someone like myself will.

And we'll be getting paid more due to being able to work at scale (same shit for 10 machines or 10,000 machines), doing less work and being much happier, doing it.

Re:Automate Out (1)

whipnet (1025686) | about two weeks ago | (#47433163)

Thank you, I did. I only have my baby toe dipped into the IT world now, in a business ownership role. Good to know there are still people out there who haven't worked in IT long enough to realize that IT IS NOT A CRAFT: you will never perfect anything, and you will keep having to learn things that won't mean a pile of beans just a couple of years later. Great gig for me in my 20s, not so much in my 40s and approaching 50. I'm old-school IT and done with it. Enjoy.

Automate successful execution as well (1)

Boawk (525582) | about two weeks ago | (#47432123)

Setting aside the wisdom (or lack thereof) of automating maintenance, you should also have some process external to the maintained machines that confirms that the maintenance worked. That confirmation could be something like testing that a Web server continues to serve the expected pages, some port provides expected information, etc. If this external process notes a discrepancy, it would page/text/call you.
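A minimal version of that external probe could look like the Python sketch below, meant to run from a box that is not being maintained; the URL, the expected page content, and the SMTP banner check are illustrative only, and a non-zero exit is what triggers the page/text/call:

    #!/usr/bin/env python3
    """Sketch of an external probe run from a box that is NOT being maintained.
    It verifies that the web server still serves the expected page content and
    that a port still greets us with the expected information; the URL, strings
    and ports are illustrative. A non-zero exit is the cue to page/text/call."""
    import socket
    import sys
    import urllib.request

    def page_contains(url, needle):
        """The web server should still serve the expected page."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return needle in resp.read().decode("utf-8", "replace")
        except OSError:
            return False

    def banner_contains(host, port, needle):
        """Some port should still provide the expected information."""
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                return needle in sock.recv(256).decode("ascii", "replace")
        except OSError:
            return False

    if __name__ == "__main__":
        ok = (page_contains("https://app.example.com/login", "Sign in")
              and banner_contains("mail.example.com", 25, "220"))
        sys.exit(0 if ok else 1)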

Re:Automate successful execution as well (1)

Neil Boekend (1854906) | about two weeks ago | (#47447379)

Disclaimer: I am not an IT professional.
Why not automate the deployment and go in via VPN afterwards to check if all is well?
Of course this should be done within driving range so that you can get there a couple of hours before business hours to fix the horrible, horrible mistakes that will be made from time to time. Or when the VPN doesn't respond.

Great idea! (0)

Anonymous Coward | about two weeks ago | (#47432137)

I am "the guy". The guy that your boss calls when your simple maintenance outage goes all sideways (and I like your idea). Positioning oneself so that any problem becomes a lingering outage that shakes your company's faith in your IT Director's ability to do their job competently is always a great idea. If you can chron it from work, why not chron it from an outsourced location? I mean, either it goes well and they don't need you or it goes sideways and they need me. Either way, you are screwed. PRO TIP: do not store your resume on any system that you chron after-hours updates to.

Fixing the wrong problem (1)

Zanthras (844625) | about two weeks ago | (#47432141)

By far the better solution is to figure out why that one specific server can't be offlined. It's far safer, regardless of the tests and validations, to work on a server that's not supposed to be running vs. one that is. It obviously takes a lot of work, but all your critical/important services should be running in some sort of HA scenario. If you can't take a 5-minute outage just after normal business hours, you absolutely cannot take a failure in the service due to any sort of hardware failure (which will happen). This is coming from years of experience at a Software as a Service company.

Slashdot is a Bad Place to Ask This (4, Interesting)

terbeaux (2579575) | about two weeks ago | (#47432147)

Everyone here is going to tell you that a human needs to be there because that is their livelihood. Any task can be automated at a cost. I am guessing that it is not your current task to automate maintenance; otherwise you wouldn't be asking. Somewhere up your chain they decided that for the uptime/quality of service it is more cost-effective to have a human do it. That does not mean you cannot present a case showing otherwise. I highly suggest you win approval and backing before taking time to try to automate anything.

Out of curiosity, are they VMs?

Re:Slashdot is a Bad Place to Ask This (0)

Anonymous Coward | about two weeks ago | (#47432505)

snapshots FTW!

Snapshots are great, for some things (1)

mveloso (325617) | about two weeks ago | (#47432829)

Snapshots are great, but they assume all your data is on the snapshot. It's harder to roll back if your new version goes ahead and corrupts some database or something on the NAS.

It's even harder to roll back if your data stores are on some multi-clustered beast that wasn't designed to be rolled back.

Of course, you should have caught that in test, right?

Re:Slashdot is a Bad Place to Ask This (0)

Anonymous Coward | about two weeks ago | (#47433345)

I thought so too, until I was using a certain SAN, did a snapshot of a LUN as a test... and the drive controllers rebooted dirty. No harm, no foul, as it was just test data, but had that been a production filesystem... I'd be stuck with a dirty production cluster (as the filesystem was shared), and no way to roll back.

Re:Slashdot is a Bad Place to Ask This (1)

gstoddart (321705) | about two weeks ago | (#47432629)

Everyone here is going to tell you that a human needs to be there because that is their livelihood.

No, many of us will tell you a human needs to be there because we've been in the IT industry long enough to have seen stuff go horribly wrong, and have learned to plan for the worst because it makes good sense.

I had the misfortune of working with a guy once who would make major changes to live systems in the middle of the day because he was a lazy idiot. He once took several servers offline for a few days because of this. I consider that kind of behavior lazy and incompetent, because I've seen the consequences of it.

If you consider "doing our jobs correctly, and mitigating business risk" to be job security, you're right. If you think we do these things simply to make ourselves look useful, you're clueless about what it means to maintain production systems which are business critical.

Part of my job is to minimize business risk. And people keep me around because I actually do that.

Re:Slashdot is a Bad Place to Ask This (1)

smash (1351) | about two weeks ago | (#47432803)

Alternatively, perhaps somewhere up the chain they have no idea what can be done (this IT shit isn't their area of expertise), and are not being told by their IT department how to actually fix the problem properly. Rather, they are just applying band-aid after band-aid for breakage that happens.

It is my experience that if you outline the risks, the costs, and the possible mitigation strategies to eliminate the risk, most sensible businesses are all ears. At the very least, if they don't agree on the spot, they are at least aware of what is possible, and when the inevitable happens they'll be more keen to fix the problem next time.

Downtime cost adds up pretty fucking quickly. For example, my company: we have 650 PC users. Pay rates probably range from 25 bucks an hour to 100 bucks an hour or more. Let's say the average is somewhere around 45 per hour.

1 hour of downtime, by 650 users, by 45 bucks per hour = $29,250 in lost productivity. Plus the embarrassment of not being able to deal with clients, etc. Plus potentially other flow on effects (e.g., in our case, possibly: maintenance scheduling for our mining equipment - trucks, drills, etc. didn't run. Plant therefore didn't get serviced properly, $500k engine dies).

If you fuck something up and are down for a day? Well... you can do the math.

Re:Slashdot is a Bad Place to Ask This (1)

grahamsaa (1287732) | about two weeks ago | (#47433081)

OP here. Yes, they are VMs in most cases. The only machines we don't virtualize are database servers.

Re:Slashdot is a Bad Place to Ask This (1)

Slashdot Parent (995749) | about a week ago | (#47449843)

Oh, a human definitely needs to be there for maintenance. You can't automate fixing up a screwup in the automation.

I just see no reason why maintenance windows have to be done at 1am. In today's world of redundancy and failover, there is just no reason for it. Every upgrade my team has done for as long as I can remember has been at 10am local time because we don't allow downtimes anyway. Why work at 1am?

Do a risk assessment. (0)

Anonymous Coward | about two weeks ago | (#47432149)

What's the impact if it all goes wrong and you're not there? If impact is huge and you're fired if it all goes bad, be there. If it doesn't matter and it can fail with no consequences, script it.

Disclaimer: Today (Friday) I found out my company is doing a DR exercise from 10PM tonight to 9AM tomorrow. I'm an ITSec manager, and they wanted to know if they could make a "few firewall changes if they need to". I said no, and told them I would stay up late to review and approve any emergency changes they want but they were NOT getting "carte blanche" with no ITSec oversight, as that would be really irresponsible and break SOX, etc... You do what you have to in order to get the job done properly!

(Posting Anonymous so !humblebragging.)

How about using a remote console? (0)

Anonymous Coward | about two weeks ago | (#47432155)

If you were to set up a hardware remote console, you could do it from home. So yeah, it's 15 minutes out of bed, but then it's right back to bed.

I have automated maintenances in the form of ... (2)

spads (1095039) | about two weeks ago | (#47432157)

...service bounces that are happening all the time. When one occurs, and/or if there are any other issues, I can send myself a mail. My BlackBerry has filters that allow an alarm to go off and wake me during the night. That would seem to meet your needs.

Re:I have automated maintenances in the form of .. (0)

Anonymous Coward | about two weeks ago | (#47432643)

I am updating your outward facing mail server, the update fails, where is your god/email now?

I've worked IT at all levels, from mom and pop shops to 5 9s (99.999% availability).

If it's a production system, it's pretty important to have hands and eyes on-site in case something goes wrong. ANYTHING can go wrong... The KVM has a hiccup and all of a sudden the PC thinks someone is pressing the space bar 24x7... guess what isn't booting? The last tech forgot to change the KVM to another input and left a disk in the drive... guess what server may not be booting....

There is plenty you can do without people on-site, but when you label something as production critical it is because you can't afford to wait that 30 minutes to an hour for someone to wake up and get on-site.

Re: I have automated maintenances in the form of . (1)

drummerboybac (1003077) | about two weeks ago | (#47432791)

Most of these patches are happening on systems that are in some remote data center that's not in your physical office location anyway, so I see no difference between connecting remotely to the servers from your house and connecting remotely from your office.

Re:I have automated maintenances in the form of .. (1)

Wycliffe (116160) | about two weeks ago | (#47435681)

I am updating your outward facing mail server, the update fails, where is your god/email now?

If at least some part of your paging and monitoring system isn't independent from your servers, you're doing it wrong. We use multiple third-party companies to monitor our website. They're high-level checks, but one of them verifies that our internal monitoring software is working. You can purchase third-party monitoring, or spin up an instance somewhere like Amazon or Digital Ocean for a few dollars a month; depending on how critical your systems are, you could spin up a few dozen. The point is that you should be monitoring your servers from outside your network, for two reasons: it doesn't really matter if everything is up if the outside world can't connect to it, and you still want to be paged if your entire datacenter goes up in smoke.
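
For illustration, a minimal sketch of that kind of outside-in check, run from a cheap box outside your own network; the URL, alert address, and timeout are placeholders, and mail delivery assumes a working local MTA:

    #!/usr/bin/env bash
    # External health check: run from a host *outside* your own network (e.g. a
    # small VPS), so you still get paged if the whole datacenter disappears.
    URL="https://www.example.com/healthz"   # placeholder endpoint
    ALERT="oncall@example.com"              # placeholder pager/email address

    # --max-time bounds how long a hung server can stall the check.
    if ! curl --silent --fail --max-time 10 "$URL" > /dev/null; then
        echo "External check failed for $URL at $(date -u)" \
            | mail -s "ALERT: $URL unreachable from outside" "$ALERT"
    fi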

Nature of the beast (1)

Danzigism (881294) | about two weeks ago | (#47432175)

Although I do feel this is the nature of the beast when working in a true IT position where businesses rely on their systems nearly 100% of the time, there are some smart ways to go about it. I'm not exactly sure what type of environment you're running, but if you use something like VMware's vSphere product or Microsoft's Hyper-V, both allow "live migrations". Why not virtualize all of your servers, make a snapshot, perform the maintenance, and live migrate the VMs? You could do it right in the middle of the day and nobody would even know. This kind of setup takes a lot of planning, however. I personally wouldn't want any maintenance performed on my servers without manual approval. Unattended maintenance sounds a bit too scary for my liking, and in my experience with even small security updates for both Linux and Windows servers, there's bound to be a point where something fails, and you could potentially get into a lot of legal trouble if you fail to meet your SLA, or cause a loss of profit for the business due to downtime.

It depends on the size of your operation... (4, Interesting)

jwthompson2 (749521) | about two weeks ago | (#47432183)

If you really want to automate this sort of thing you should have redundant systems with working and routinely tested automatic fail-over and fallback behavior. With that in place you can more safely set up scheduled maintenance windows for routine stuff and/or pre-written maintenance scripts. But if you are dealing with individual servers that aren't part of a redundancy plan, then you should babysit your maintenance. Now, I say babysit because you should still test and automate the actual maintenance with a script, to prevent typos and other human errors when you are doing the maintenance on production machines. The human is just there in case something goes haywire with your well-tested script.

Fully automating these sorts of things is out of reach for many small to medium sized firms because they don't want to, or can't, invest in the added hardware to build out redundant setups that can continue operating when one participant is offline for maintenance. So the size of your operation, and how much your company is willing to invest to "do it the right way", are the limiting factors in how much of this sort of task you are going to be able to automate effectively.
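
As a concrete example of the "automate the actual maintenance with a script" part, a minimal sketch using the OP's CentOS scenario; the package names and log path are placeholders, not anything from the original post:

    #!/usr/bin/env bash
    # Scripted maintenance window: the same steps a human would type, but written,
    # reviewed and tested in advance. Package names and paths are placeholders.
    set -euo pipefail
    exec >> /var/log/maint-window.log 2>&1   # keep a record of every step

    echo "=== maintenance started $(date -u) ==="
    yum -y update                  # upgrade the OS
    yum -y remove old-package      # placeholder: package being retired
    yum -y install new-package     # placeholder: its replacement
    echo "=== rebooting $(date -u) ==="
    shutdown -r +1 "Scheduled maintenance reboot"

The babysitting human then only has to confirm the box came back and the checks pass, rather than typing commands at 5am.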

Similar experiences ... (4, Insightful)

psergiu (67614) | about two weeks ago | (#47432187)

A friend of mine lost his job over a similar "automation" task on Windows.

The upgrade script was tested in a lab environment that was supposed to be exactly like production (it turned out it wasn't: someone had tested something there earlier without telling anyone and never reverted it). The upgrade script was then scheduled to run on production during the night.

Result - \windows\system32 dir deleted from all the "upgraded" machines. Hundreds of them.

On the Linux side, I personally had Red Hat make some "small" changes on the storage side, and PowerPath got disabled at the next boot after patching. Unfortunate, since all volume groups were using /dev/emcpower devices. Or Red Hat making "small" changes to the clustering software from one month to the next. No budget for test clusters. Production clusters refusing to mount shared filesystems after patching. Thankfully, in both cases the admins were up and online at 1AM when the patching started, and we were able to fix everything in time.

Then you can have glitchy hardware or software deciding not to come back up after a reboot. RHEL GFS clusters are known to randomly hang or crash at reboot. HP blades sometimes have to be physically removed and reinserted before they will boot.

Get the business side to tell you how much the downtime is going to cost the company until:
- Monitoring software detects that something is wrong;
- Alert reaches sleeping admin;
- Admin wakes up and is able to reach the servers.
Then see if you can risk it.

Re:Similar experiences ... (0)

Anonymous Coward | about two weeks ago | (#47432351)

I just modded you up on this, but I'd love to know more details on how an upgrade script can wipe out the core windows directory!

Re:Similar experiences ... (0)

Anonymous Coward | about two weeks ago | (#47432457)

I'd hate to have been the guy who wiped \windows\system32 on hundreds of servers...wow.

Re:Similar experiences ... (0)

Anonymous Coward | about two weeks ago | (#47432901)

Just happened to delete the system32 directory, huh? Bullshit, never happened.

Re:Similar experiences ... (1)

SuiteSisterMary (123932) | about two weeks ago | (#47433455)

I've had 'yum update' do things like completely change where the data files for a service are stored, update the configuration, but not move, link or otherwise do anything with the existing data. I've also had 'yum update' introduce kernel-level file system bugs that resulted in data corruption. Both on vanilla CentOS installs with no extra repos.

Prepare for failure (1)

davidwr (791652) | about two weeks ago | (#47432193)

One way to prepare for failure is to have someone there who can at least recognize the failure and wake someone up in time to fix it.

Another way to prepare for failure is to have a system that is redundant enough that a part could go down and it wouldn't be more than a minor annoyance to users or management.

There are other ways to prepare for failure, but these are two common ones.

Re:Prepare for failure (1)

gstoddart (321705) | about two weeks ago | (#47432325)

Some of us would argue that doing maintenance unattended is preparing for failure -- or at least giving yourself the best possible chance of failure.

I work in an industry where, if we did our maintenance badly and there was an outage, it would literally cost millions of dollars per hour.

If what you're doing is so unimportant that you can leave the maintenance unattended, there's probably no reason you couldn't do the outage in the middle of the day.

If it is important, you don't leave it to chance.

Sometimes the reasons aren't technical (1)

davidwr (791652) | about two weeks ago | (#47432499)

Maybe back when the maintenance window was created it was created for a valid technical reason, BUT technology moved on and management didn't.

In other words, in some environments, the technical people won't have a sympathetic ear if they ask to cancel the off-hours maintenance window simply because of local politics or the local management, BUT if the maintenance gets botched and services are still down or under-performing through normal business hours, nobody outside of IT will notice.

Re:Sometimes the reasons aren't technical (1)

gstoddart (321705) | about two weeks ago | (#47432555)

BUT if the maintenance gets botched and services are still down or under-performing through normal business hours, nobody outside of IT will notice

Then you're maintaining trivial, boring, and unimportant systems that nobody will notice. If your job is to do that ... well, your job is trivial and unimportant.

The stuff that I maintain, if it was down or under-performing during normal business hours ... we would immediately start getting howls from the users, and the company would literally be losing vast sums of money every hour. Because our stuff is tied into every aspect of the business, and is deemed to be necessary for normal operations.

Sorry, but some of us actually maintain stuff which is mission critical to the core business, and people would definitely notice it.

As one of the technical people who does cover after hours maintenance ... if a technical person suggested we automate our changes and not monitor them, they wouldn't get a sympathetic ear from me either.

There may be systems like you describe. And, as I said before, if that's the case, do your maintenance windows in the middle of the day.

Re:Sometimes the reasons aren't technical (1)

davidwr (791652) | about two weeks ago | (#47432947)

As I said, sometimes the problems are not technical in nature.

Set alarms (1)

MrL0G1C (867445) | about two weeks ago | (#47432203)

Can't you make some kind of setup that triggers if the update fails and alerts you / wakes you up with noise from your smartphone, etc.?

Or, like the other poster who beat me to it, off-load your work to someone in a country where your 5am is their midday.

Security Availability vs Availability Security (0)

Anonymous Coward | about two weeks ago | (#47432219)

While just about everybody puts availability over security, it really depends on what you sell your clients.
We at least apply all security updates fully automatically, without review or a maintenance window: whenever cron-apt picks up a new Debian security update, it gets installed automatically. If something goes wrong, our customers understand, because that is what they want and need: security, even if availability suffers.
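
For illustration, a hand-rolled approximation of that kind of unattended nightly update job (the parent uses cron-apt, and Debian's unattended-upgrades package is another packaged way to do this). The paths are placeholders, mail delivery assumes a working local MTA, and note that without a security-only sources list this upgrades everything, not just security fixes:

    #!/usr/bin/env bash
    # Sketch: /usr/local/sbin/nightly-upgrade.sh, called from cron, e.g.
    #   15 3 * * * root /usr/local/sbin/nightly-upgrade.sh
    # The dpkg options keep existing config files so no debconf prompt can hang the run.
    LOG=/var/log/nightly-upgrade.log
    {
        apt-get -qq update &&
        apt-get -y \
            -o Dpkg::Options::="--force-confdef" \
            -o Dpkg::Options::="--force-confold" \
            upgrade
    } >> "$LOG" 2>&1 || echo "Nightly upgrade failed on $(hostname), see $LOG" \
          | mail -s "apt upgrade FAILED on $(hostname)" root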

Perception of Necessity (1)

bengoerz (581218) | about two weeks ago | (#47432229)

By proving that your job can be largely automated, you are eroding the reasons to keep you employed.

Sure, we all know it's a bad idea to set things on autopilot because eventually something will break badly. But do your managers know that?

Re:Perception of Necessity (0)

Anonymous Coward | about two weeks ago | (#47432347)

Well, it could be that you'd actually spend a lot more time (though during regular hours) testing/tweaking the automated solution. So it might actually mean more work overall. Having said that, we feel happier having one or two people do most stuff like that, and that's partly due to not having the luxury of an exact replica as a test environment, nor the time to do it.

Re:Perception of Necessity (1)

smash (1351) | about two weeks ago | (#47432869)

Automating shit that can be automated, so that you can actually do things that benefit the business instead of simply maintaining the status quo, is not a bad thing. Doing automatable drudge work when it could be automated is just stupid. Muppets who can click next through a Windows installer or run apt-get, etc. are a dime a dozen. IT staff who can get rid of that shit so they can actually help people get their own jobs done better are way more valuable.

The job of IT is to enable the business to continue to function and improve. Never forget that. People don't spend up big on computer stuff just because. They do it in order to save money by improving process. Improving process is where you should be focused, anything to do with general maintenance of the status quo is dead time.

Re:Perception of Necessity (1)

bengoerz (581218) | about two weeks ago | (#47433301)

My point is not that you should never automate things. Rather, when you automate things, you should make sure your managers know (1) that you were smart enough to improve processes and are therefore valuable to future projects and (2) that the things you automate could one day break (process changes, etc.).

At the very least, the poster should be able to articulate a better reason for automation than "I wanted to sleep in".

Depends; and not like the adult diaper (0)

Anonymous Coward | about two weeks ago | (#47432251)

This has always been a point of contention. Some systems can be automated through SMS, System Center or even a VB script. However, I've had Windows updates corrupt IIS web servers before, requiring me to uninstall all the .NET frameworks, reinstall IIS, and reinstall the .NET framework. This is one of those situations you don't want to wake up to on Monday morning with customers down. For critical systems, I always test manually on test systems, push to production, and test after the updates are applied to make sure everything is running as intended. For low-impact updates like CCleaner, automated pushes are much more viable because the impact to the system is relatively low. So, as the subject says, "Depends". Hope this helps with your inquiry.

No. Do your maintenance *in* working hours. (0)

Anonymous Coward | about two weeks ago | (#47432261)

If you do your maintenance out of hours and something goes wrong, who's going to fix it? Some bleary-eyed administrator who's had 2 hours of sleep? If they need to escalate it, who are they going to call at 2am? Also, are these guys being paid double time to work these hours, given time off to compensate, or just expected to suck it up and work a normal day shift after working half the night? Whichever way you look at it, it's full of problems.

Instead, rearchitect your solution. If you care about a service enough to not take planned downtime in working hours, you probably care enough that unplanned downtime in working hours should not be business affecting either. So you should double up on servers (which should be pretty cheap if you're running everything in a virtualised environment) and arrange for services to fail over to a secondary if the primary is unavailable. If you're doing this in Windows (my condolences to you) it should mostly support this anyway. If you're doing it in something unix-like you can use things like keepalived to fail a service from one node to another.

Once you have a solution like this, maintenance is easy: you patch/upgrade/reboot your backup server, check that it's OK, promote it to primary, then do the other server, and then promote it back again. You do it *all* in working hours so that (a) people get a decent night's sleep and (b) if something goes wrong you can call on your support provider without having to pay over the odds for 24x7 support.
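
A rough sketch of that patch-and-promote flow for a keepalived-style active/standby pair, assuming keepalived moves the virtual IP when the active node's daemon stops; the hostnames, health check URL, and sleep times are placeholders:

    #!/usr/bin/env bash
    # Patch an active/standby pair during working hours, one node at a time.
    # Hostnames and the health check URL are placeholders.
    set -euo pipefail
    ACTIVE=web1.example.com
    STANDBY=web2.example.com

    # 1. Patch the standby first; '|| true' because the reboot drops the ssh session.
    ssh "$STANDBY" 'yum -y update && reboot' || true
    sleep 120
    curl --fail --max-time 10 "http://$STANDBY/healthz" > /dev/null   # abort here if it didn't come back

    # 2. Fail the VIP over to the freshly patched standby, then patch the old active node.
    ssh "$ACTIVE" 'systemctl stop keepalived'
    ssh "$ACTIVE" 'yum -y update && reboot' || true
    sleep 120
    curl --fail --max-time 10 "http://$ACTIVE/healthz" > /dev/null
    ssh "$ACTIVE" 'systemctl start keepalived'   # rejoin the pair (a no-op if it already starts at boot)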

Re:No. Do your maintenance *in* working hours. (1)

smash (1351) | about two weeks ago | (#47432877)

Yup. Same reason changes on the weekend are bad, as are changes on (in my opinion) Thursdays and Fridays.

Re:No. Do your maintenance *in* working hours. (0)

Anonymous Coward | about two weeks ago | (#47433605)

oh yeah that makes sense - because the cost of downtime during peak hours is far less than the cost of the business being down during off-peak hours...

as long as it's convenient for you though...

facepalm...

Testing fails (0)

Anonymous Coward | about two weeks ago | (#47432275)

In my experience, testing, no matter how thorough you think it is, will fail to account for all possibilities. That one possibility you missed will bite you in the ass when you automate your maintenance.

Don't automate yourself out of a job. (0)

Anonymous Coward | about two weeks ago | (#47432283)

It's wise to keep a human on-site because it maintains the employer's idea that you are worth keeping, rather than outsourcing your position to someone who can do the job from a distance.

I work for money, not some blithering ideal of efficiency which may not include me.

When I trim my trees I don't cut off the branch I'm sitting on.

Good for the Goose (1)

Cigamit (200871) | about two weeks ago | (#47432317)

Simple.

You stipulate that every maintenance requires full regression testing of any affected applications. You require the application owner, QA folks, and any other affected personnel to be online during and after the maintenance to test and ensure everything is working. Bonus points: require them to be on a conference call, and breathe heavily into the mic the entire time (maybe occasionally say "Oops"). When you have enough other people complaining about the 2am times instead of just you, the windows magically get moved to more sensible times in the late afternoon.

Your best bet is to get out of Managed Services and into Professional Services. You just build out new environments / servers / apps and hand them off to the MS guys. Once it's off your hands, you never have to worry about a server crashing, maintenance windows, or being on call. Plus, you are generally paid more.

Re:Good for the Goose (1)

gstoddart (321705) | about two weeks ago | (#47432813)

Your best bet is to get out of Managed Services and into Professional Services. You just build out new environments / servers / apps and hand them off to the MS guys. Once it's off your hands, you never have to worry about a server crashing, maintenance windows, or being on call. Plus, you are generally paid more.

In my experience (personal and professional), those people do a half assed job of building those systems, have no concept of what will be required to maintain them, and are then subsequently unavailable when their stuff falls apart.

They're hit and run artists.

But they sure do get paid lots of money.

Re:Good for the Goose (0)

Anonymous Coward | about two weeks ago | (#47436591)

Then your view is clouded by very limited experience.
The PS guys are the ones who specify the systems and do all the integrations, business analysis and capacity forecasts.
A true mark of a great professional is to recognize other great professionals, whatever their fields. If you are deluded into thinking that your little ant hill is the only place worth working, or that you are smarter than other professionals, then you are likely not one of those great professionals.
If you are talking about the average Accenture consultant, then yes, they're near worthless. But if we are talking about an excellent admin vs. an excellent professional services person, the comparison looks rather different.

Many of the consultants I know could easily maintain or manage a data center full of servers, but the opposite is rarely true.
Almost all of the few admins I have seen make the move into professional services end up failing and quitting. They have become accustomed to the routine job of maintaining systems, thinking PS is a greener and more exciting pasture. They very quickly find themselves in deep water, or bored. PS requires a more holistic view of the world, an analytical mind and a keen interest in how the business is run. Most admins are admins by choice. They have an aptitude for their job and can be very good at it, but they are for the most part ignorant about the business-critical applications running on the servers they maintain. I find the same is often true for DBAs. Most DBAs are only capable of keeping the database up and running. Few know how to tune it, or know the complete flow of the applications being served by their databases. Specialization is fine, but my view is that all of these specialists would do a much better job if they expanded their horizons just a little bit.

Its your network (1)

sasquatch989 (2663479) | about two weeks ago | (#47432335)

I think automating maintenance is a smart move, but it still requires you to be awake and available for it. The question is: do you want to be awake at work for 10 minutes or 2 hours? Plan accordingly.

Testing. Validation. (2)

mythosaz (572040) | about two weeks ago | (#47432353)

Do you plan on automating the end-user testing and validation as well?

Countless system administrators have confirmed that the system was operational after a change, without throwing it to real live testers, only to find that, well, it wasn't.

Testing. Validation. (0)

Anonymous Coward | about two weeks ago | (#47433069)

Yeah, I second that. You can automate testing also, but it should be thorough. An external system should be used for this and it should default to fail. Any failure should notify you so you can wake up and fix it.

Nope. (1)

ledow (319597) | about two weeks ago | (#47432365)

Every second you save automating the task will be taken out of your backside when it goes wrong (see the recent article where a university SCCM server formatted itself and EVERY OTHER MACHINE on campus) and you're not around to stop it or fix it.

Honestly? It's not worth it.

Work out of normal hours, or schedule downtime windows in the middle of the day.

Re:Nope. (2)

smash (1351) | about two weeks ago | (#47432895)

That example was due to incompetence, not due to automation. Whilst recovering from that would be a pain in the ass, if you are unable to recover at all, you have a major DR oversight.

Think of it a slightly different way (3, Informative)

thecombatwombat (571826) | about two weeks ago | (#47432369)

First: I do something like this all the time, and it's great. Generally, I _never_ log into production systems. Automation tools developed in pre-prod do _everything_. However, it's not just a matter of automating what a person would do manually.

The problem is that your maintenance for simple things like updating a package requires downtime. If you have better redundancy, you can do 99% of normal, boring maintenance with zero downtime. If you're in this situation, you need to think about two questions:

1) Why do my systems require downtime for this kind of thing? I should have better redundancy.
2) How good are my dry runs in pre-prod environments? If you use a system like Puppet for *everything*, you can run through your Puppet code as much as you like in non-production; then, in a maintenance window, you merge your Puppet code and simply watch it propagate to your servers (a rough sketch of that flow is below). I think you'll find reliability goes way up. A person should still be around, but unexpected problems will virtually vanish.

Address those questions, and I bet you'll find your business is happy to let you do "maintenance" at more agreeable times. It may not make sense to do it in the middle of the business day, but deploying Puppet code at 7 PM and monitoring is a lot more agreeable to me than signing on at 5 AM to run patches. I've embraced this pattern professionally for a few years now. I don't think I'd still be doing this kind of work if I hadn't.
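
As an illustration of the dry-run-then-merge flow described above, a minimal sketch using the standard Puppet command line; the environment names and manifest path are placeholders:

    #!/usr/bin/env bash
    # Dry-run Puppet changes in a non-production environment, then apply for real.
    # Environment names and the manifest path are placeholders.
    set -euo pipefail

    # Catch syntax errors before anything runs (done in the repo / on the master).
    puppet parser validate manifests/site.pp

    # Simulate the change on a pre-prod node: --noop only reports what *would* change.
    puppet agent --test --noop --environment preprod

    # After the branch backing production is merged, a normal agent run applies it
    # (or you just watch the next scheduled run propagate it).
    puppet agent --test --environment production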

Re:Think of it a slightly different way (1)

pnutjam (523990) | about two weeks ago | (#47432927)

Sounds awesome. I embraced solutions like that before I ended up in the over segmented large company. Now, not so much. I have to open a ticket to scratch my ass.

Re:Think of it a slightly different way (1)

0123456 (636235) | about two weeks ago | (#47432939)

1) Why do my systems require downtime for this kind of thing? I should have better redundancy.

True. Last year we upgraded all our servers to a new OS with a wipe and reinstall, and the only people who noticed were the ones who could see the server monitoring screens. The standby servers took over and handled all customer traffic while we upgraded the others.

Having problems with this on windows. (0)

Anonymous Coward | about two weeks ago | (#47432439)

We have many thousands of linux and windows desktop clients hosted in data centers accessed by thin client protocols. With linux, no problem. We have our update schedule and everything pretty much works.

However, on the Windows side we need to use a bunch of custom tools to try to beat the systems into line. We often have things blocked by pending reboots, Windows updates, and advertised software being pushed to the systems, so we end up with such a varied mix of systems to deal with that it doesn't always work well.

Please note we also do not have access to the SCCM backend (this has been outsourced). Any suggestions? Except maybe having a monthly window where we disable WSUS and the SMS host agent, reboot, do our updates, re-enable them and reboot again. It's clunky.

GF

Re:Having problems with this on windows. (1)

smash (1351) | about two weeks ago | (#47432905)

Get access to SCCM or get your outsourcer to fix the fucking problems they created.

If the machine is virtual.... (0)

Anonymous Coward | about two weeks ago | (#47432453)

If the machine is a VM, why not bring it down, take a snapshot, boot it up, do your update, etc., and then reboot? If the machine is not up within 10 minutes or so, boot up the snapshot you made. You can do all of this from an external machine using the Perl API to VMware, or the standard KVM/Xen virt tools. That way, if your maintenance fails, you can come in the next morning and figure out what went wrong. I think VMware actually provides a script called "snapshotmanager.pl" in its Perl SDK, so you don't need to write your own (if you're using VMware).
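
For the KVM case mentioned above, a minimal sketch with the standard virsh snapshot commands; the domain and snapshot names are placeholders:

    # Snapshot a KVM guest before unattended maintenance, so there is a known-good
    # state to fall back to. The domain name is a placeholder.
    virsh snapshot-create-as app-vm01 pre-maint "before maintenance window"

    # If the guest never comes back cleanly, roll back:
    #   virsh snapshot-revert app-vm01 pre-maint
    # Once the change has been verified, clean up:
    #   virsh snapshot-delete app-vm01 pre-maint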

Re:If the machine is virtual.... (1)

smash (1351) | about two weeks ago | (#47432911)

"It boots" does not necessarily constitute success. You really need a test environment. There's no real getting around it.

Convenience in place of Caution (1)

div_2n (525075) | about two weeks ago | (#47432495)

You're trading caution for convenience.

I have automated some things, such as overnight patch installation, only to wake up to a broken server, despite the patches being heavily tested and known to have worked in 100% of cases beforehand; they failed precisely when nobody was watching.

I urge you to only consider unattended automation overnight when it's for a system that can reasonably incur unexpected downtime without jeopardizing your job and/or the organization. If it's critical -- DO NOT AUTOMATE.

You've been warned.

Perl (1)

Murdoch5 (1563847) | about two weeks ago | (#47432511)

Just write a simple Perl script to handle it. It would take about an hour to develop and test, and you'd be good to go.

Lean Six Sigma to the rescue (0)

Anonymous Coward | about two weeks ago | (#47432561)

Hello OP,

    First things first, we need to discuss some of the less-fun stuff about maintenance windows in production environments. For starters, what is the process that your company follows, exactly? Do you test on a dev box beforehand? Do you have a rollback plan in case it goes to hell?

    Now for the less-fun part... you will want to be on-site, or at least have someone on-site, whenever you are doing any type of maintenance work on a production system. Not so much a problem nowadays, but a good example from the olden days: what if someone left a disk in the A drive? That means your box most likely won't boot, because it's most likely set to boot from A before C. Just an example, but it goes to show that having hands and eyes on-site is next to non-negotiable (there are some exceptions, like completely virtualized environments where you can use the controller to do whatever you need, but even then... like another poster said... Murphy will hunt you down and make you regret not having someone on-site).

    Now moving on to the next part, Lean Six Sigma... if you have no training in it, ask your company to pay for your courses. It's a really fun course and it applies to EVERYTHING. I know of a major VoIP solution that uses LSS (Lean Six Sigma) approaches to update systems that are at the 5 9s level (99.999% uptime). You end up breaking your maintenance down into a couple of different steps. Step one is about 7 days before the update (moving any required files to the machine, putting them in the right spots, blah blah blah). Step 2 is the day before: generally making sure no updates were released for the files you moved the previous week (most likely will not apply to you, but still a step nevertheless); this step normally also includes checking and documenting system health to make sure everything is up to snuff (for example, you wouldn't want to start an update if the RAID array is degraded, or with low disk space, high mem/CPU usage, or missing maintenance user accounts). Step 3 is the actual update, plus checking to make sure everything got the update and started back up. Step 4 is confirming the end result works as intended.

    Really you can make these into as many steps as you want, but the goal is to have as much ready and done before the actual update as possible, so that your workload at 4am is as small as possible.
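
A sketch of the kind of pre-maintenance health check described in step 2; the thresholds are placeholders and the checks are only examples:

    #!/usr/bin/env bash
    # Pre-maintenance health check: refuse to start the window if the box is
    # already unhealthy. Thresholds are placeholders.
    set -u
    FAIL=0

    # Any filesystem above 90% full?
    df -P | awk 'NR > 1 && $5+0 > 90 {print "Low disk space on " $6; exit 1}' || FAIL=1

    # Software RAID degraded? (mdraid marks missing members with "_" in /proc/mdstat)
    if grep -q '\[.*_.*\]' /proc/mdstat 2>/dev/null; then
        echo "RAID array is degraded"; FAIL=1
    fi

    # Less than ~200 MB in the "free" column of free -m?
    free -m | awk '/^Mem:/ && $4 < 200 {print "Free memory is very low"; exit 1}' || FAIL=1

    if [ "$FAIL" -ne 0 ]; then
        echo "Pre-flight checks failed on $(hostname); aborting maintenance" >&2
        exit 1
    fi
    echo "Pre-flight checks passed on $(hostname)"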

   

Somewhat (0)

Anonymous Coward | about two weeks ago | (#47432581)

While you can't quite afford to do it fully unattended, you can spin up another (presumably close-to-identical) machine with everything that needs to be on it, prepare a last sync, let it sit until the maintenance window, do the last sync, and swap out the boxes. Test; if it's not OK, swap back until next time. That way all the hard stuff gets moved out of the dark hours, and the rest you can do whenever.

Of course, this requires extra hardware or at least allocated virtualised resources, but since these are regarded as close to free these days, and spun-down instances can be re-used next time or for some other task, well, you know.

Updates can't be left unattended (1)

cloud.pt (3412475) | about two weeks ago | (#47432605)

I'm not familiar with CentOS or Red Hat, but in Debian it's not uncommon to get the odd update that requires a configuration wizard. There's no shortcut through those, and if it happens, you are going to have some early risers complaining.

And even the supposedly safe, unattended updates aren't that safe. For example, I updated to the latest linux-image from Debian's repos yesterday. I didn't expect some core services to depend on a reboot to start working again, but 5 minutes in, people were complaining that a JBoss web app wasn't working.

No single points of failure (1)

jader3rd (2222716) | about two weeks ago | (#47432625)

Are you talking about servers/services? If so, every service should have some sort of failover strategy to other hardware. That way anything you need to work on can be failed over during business hours and brought back.

I don't get paid for things that work right (1)

prgrmr (568806) | about two weeks ago | (#47432701)

I get paid for cleaning up after things that don't work right the first time.

3 am is better (1)

djupedal (584558) | about two weeks ago | (#47432743)

That way, when things go south, you have time to right the ship before the early birds start logging in at 5:30.

VMs help (1)

normanjd (1290602) | about two weeks ago | (#47432773)

Do you use VMs? ALL of our servers are now running on VMware at remote locations. I can't automate maintenance, but it does not matter if I do it from the office or at home as I am remoting either way... Set up a snapshot to roll back if there is a problem, and you can at least make it a bit more comfortable if you have to be up at odd hours...

Automation is necessary (2)

dave562 (969951) | about two weeks ago | (#47432799)

If you want to progress in your IT career, you need to figure out how to automate basic system operations like maintenance and patching. Having to actually be awake at 2:00am to apply patches is rookie status. Sometimes it is unavoidable, but it should not be the default stance.

My environment is virtual, so our workflow is basically snapshot VM, patch, test. If the test fails, rollback the snapshot and try again (if time is available) or delay until later. If the test is successful, we hold onto the snapshot for three days just in case users find something that we missed. If everything is good after three days, we delete the snapshot.

We have a dev environment that mirrors production that we can use for patch testing, upgrade testing, etc. Due to testing, we rarely have problems with production changes. If we do, the junior guys escalate to someone who can sort it out. Our SLAs are defined to give us plenty of time to resolve issues that occur within the allocated window. (Typically ~4 hours)

In the grand scheme of things, my environment is pretty small. We have ~1500 VMs. We manage it with three people and a lot of automation.

OS packaging for configuration management (0)

Anonymous Coward | about two weeks ago | (#47432801)

I have done it, on a large scale (20,000+ systems) using OS packaging for configuration management, Solaris JumpStart(TM), and Solaris Flash(TM) and Sun Cluster.

I have also done it with RPMs and Kickstart on CentOS, and on SUSE Linux Enterprise Server with AutoYaST and Kiwi.

Not only is it doable, it works beautifully. We are a 100% Puppet, CFengine and Chef-free environment - we do not use any such solution.

We can take a node down and "re-flash" it automatically while the other nodes in the cluster continue to provide the service without any interruptions whatsoever.

The entire environment supports rolling upgrades so we no longer need service windows. The entire environment is completely hands-off, automated. System administrators have no need to log in, unless they are doing software RAID administration or servicing hardware failure(s).

Scheduling (0)

Anonymous Coward | about two weeks ago | (#47432847)

Is it that difficult to juggle schedules for your IT staff?

Name things that shouldn't be automated. (1)

holophrastic (221104) | about two weeks ago | (#47432881)

Consider all of the tasks that you do as a part of your job. Identify which ones should absolutely never be automated -- maybe they're too dangerous, maybe the risk is too great, maybe they're too much fun. I'd bet that upgrading the OS would be pretty well the top of your never-automate-this list.

automate away (0)

Anonymous Coward | about two weeks ago | (#47432903)

CFEngine or Puppet would be best.

Easy To Do (0)

Anonymous Coward | about two weeks ago | (#47432985)

Server virtualization, redundancy, test and development environments, backups or snapshots before upgrades, testing, and automation with monitoring should allow you to either do the updates during work hours or sleep during the maintenance window in some update scenarios. Configure an external monitoring system to verify functionality when it's complete and notify you if anything fails. The notification should default to failure if anything goes wrong.

Reboot? - Load Balancers and multiple systems (3, Insightful)

Taelron (1046946) | about two weeks ago | (#47433019)

Unless you are updating the kernel, there are few times you need to reboot a CentOS box. Unless your app has a memory leak.

The better way to go about it has already been pointed out above. Have several systems, load balance them in a pool, take one node out of the pool, work on it, return it to the pool, then repeat for each remaining system. No outage time, and users are none the wiser to the update.
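
One way to script the take-a-node-out step, assuming an HAProxy load balancer with its admin socket enabled (e.g. "stats socket /var/run/haproxy.sock level admin" in haproxy.cfg); the backend/server names and hostname are placeholders:

    #!/usr/bin/env bash
    # Drain one backend server out of an HAProxy pool, patch it, then re-enable it.
    # Backend/server names, hostname and the socket path are placeholders.
    set -euo pipefail
    SOCK=/var/run/haproxy.sock
    NODE="web_backend/web1"

    echo "disable server $NODE" | socat stdio "$SOCK"   # stop sending new traffic to it
    sleep 60                                            # let existing connections drain

    ssh web1.example.com 'yum -y update'                # the actual maintenance

    echo "enable server $NODE" | socat stdio "$SOCK"    # put it back in the pool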

Re:Reboot? - Load Balancers and multiple systems (0)

Anonymous Coward | about two weeks ago | (#47434169)

Unless, of course, there's a software upgrade that makes that node incompatible with the rest of the pool...

fsck'd lately/ (0)

Anonymous Coward | about two weeks ago | (#47433129)

Whenever you reboot, you also have to take fsck into account. If the filesystem's periodic check is due (180 days was a common ext3 default), you'll get an fsck at boot. On a small filesystem, sure, that's just a few minutes, but I'd schedule an hour just in case. On a larger filesystem we could be talking 4-6 hours of downtime.
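
For ext2/3/4 filesystems, the mount-count and interval triggers behind that boot-time fsck can be inspected, and if you accept the risk, disabled, with tune2fs; the device name is a placeholder:

    # Inspect the current fsck triggers for an ext filesystem (device is a placeholder):
    tune2fs -l /dev/sda1 | grep -Ei 'mount count|check'

    # Optionally disable the count/interval based checks so a maintenance reboot is
    # not surprised by a multi-hour fsck (you then have to schedule checks yourself):
    tune2fs -c 0 -i 0 /dev/sda1

Disabling the automatic checks is a trade-off; only do it if you schedule fsck runs some other way.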

It can be done (1)

sentiblue (3535839) | about two weeks ago | (#47433159)

I definitely want to automate this... but not automation alone. I would build in some monitoring and notification, such as the result of each step summarized into an email/SMS report.

I would also use a remote host, to send an alert if the host in maintenance doesn't come back within the expected window....

And of course, even though I'm not going to be doing the maintenance activity manually/live, I would still want a proper way to know/confirm that the maintenance window was successful before sleeping all the way through.

Dumbass Company (0)

Anonymous Coward | about two weeks ago | (#47433231)

My last job, at a UK military contractor with 340+ workstations, was as IT Manager, though I had to take all direction from the IT Executive (a git with no actual IT knowledge), and we would be told to do only what he asked us to do; until then we just had to sit around doing nothing. One day the main server lost its RAID board and the IT Executive came storming into the office demanding to know why we were not fixing it, as the whole company was down. We replied that, under his express guidance, we had not been told to look at the issue, which drove him mad. Then the MD turned up to enquire what was happening, and when the IT Executive started to brown-nose the MD, we got out the department guidelines to show the MD that the IT Executive was lying. The MD told us to order a new RAID board, and I responded that a new RAID board had a 3-week lead time (which really upset the IT Executive and the MD), and the entire department was asked to convene in the IT Executive's office to continue this now heated discussion. Once there, I asked the MD to look in the IT Executive's in-tray, where he would find a requisition form for said RAID controller board, dated 4 weeks earlier. The MD signed the requisition form and gave it to me, and we were all asked to leave the office. We never did see the IT Executive again, and now I can sign requisitions and run the department...

Look into Continuous Deployment (0)

Anonymous Coward | about two weeks ago | (#47433343)

Continuous Deployment is, to some degree, what you want to do. Puppet, already mentioned, is one tool used in such scenarios. However, it is not sufficient just to run automatic updates and reconfigurations. If you also develop your own software, you should look at Jenkins' continuous deployment extensions and, in general, into DevOps.

Headline fail of 2014 (0)

Anonymous Coward | about two weeks ago | (#47433551)

Maintenance Windows sounds like upgrading to Windows 8.1 - this is one of the most unfortunate headlines of the year.

How about...

Ask Slashdot: Best time to do unattended maintenance on servers?

Dev Test Prod (0)

Anonymous Coward | about two weeks ago | (#47433591)

If you aren't phasing your patches and updates through a full-cycle test in Dev, testing in Test, fixing what needs to be done in Dev again, and eventually making it bullet proof enough for Prod, you're doing it wrong.

Puppet, Chef, Ansible aren't the answer.

They may enable you to manage the components, but they won't solve your problem. You need environments that exactly mirror your Prod.

If you've managed to bring this up, proper Disaster Recovery is a recipe/manifest away.

Ansible (1)

ewieling (90662) | about two weeks ago | (#47433809)

We use Ansible, it seems to fit well with our needs, but others use Puppet or Chef.

We do this all the time (1)

Bugler412 (2610815) | about two weeks ago | (#47433941)

All sorts of automated security updates and patches during the regularly scheduled maintenance window. A couple of key things make it work: 1. A valid and representative DEV environment or host(s) to vet and test-deploy the updates, using the same methods as the production hosts. 2. A solid alerting system for when the inevitable couple of hosts fail and need help to get running again. 3. A qualified and responsive on-call person to review the results at or near the end of the maintenance window, to make sure everything came back online properly and take action where necessary. It doesn't so much eliminate the after-hours work as reduce its volume to a level manageable by a single qualified tech.

VPN (1)

Crizzam (749336) | about two weeks ago | (#47434039)

Get yourself a VPN to your workstation and do it from home, in bed. If you can get a quicky while you're at it, good on you.

Missing the point (2)

ilsaloving (1534307) | about two weeks ago | (#47434185)

The OP is missing the point. Of *course* you can automate updates. You don't even need an automation system. It can be as simple as writing a bash script.

The point is... what happens when something goes wrong? If all goes well, then there's no problem. But if something does go wrong, you no longer have anyone able to respond because nobody's paying attention. So you come in the next morning with a down server and a clusterf__k on your hands.
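
Even the simple-bash-script approach can at least page a human instead of failing silently overnight; a minimal sketch, where the alert address, service name and post-change check are placeholders and mail assumes a working local MTA:

    #!/usr/bin/env bash
    # Minimal unattended-update script that pages someone when a step fails.
    # The alert address, service name and post-change check are placeholders.
    set -euo pipefail
    ALERT="oncall@example.com"

    notify_failure() {
        echo "Automated maintenance FAILED on $(hostname) at $(date -u), check the logs" \
            | mail -s "Maintenance failure on $(hostname)" "$ALERT"
    }
    trap notify_failure ERR    # fire the page if any command below fails

    yum -y update
    systemctl restart myservice            # placeholder service
    systemctl is-active --quiet myservice  # placeholder post-change check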

You have redundancy right? (1)

UrsaMajor987 (3604759) | about two weeks ago | (#47434253)

The last place I worked at had redundancy both within the data center and across data centers. That is they could survive the loss of a data center. If the service you are supplying is so critical you should have redundancy. This will give you a little more leeway on when maintenance is done.

Call me unsympathetic... (0)

Anonymous Coward | about two weeks ago | (#47434537)

5 and 6am maintenance windows would be a blessing after working 21 years for a large telco/ISP. We usually start at midnight or 2am, and there are typically one or two events going on in a week...

Don't let me down, Bruce.. (1)

tempest69 (572798) | about two weeks ago | (#47435009)

Always go in with a well considered plan, and be there when it happens.

Even if your planning is awesome, you'll look unprofessional not being in a position to fix a problem when it is most likely to occur.

If something does happen and you're not there... there will be crankiness.

Load Balancers (0)

Anonymous Coward | about two weeks ago | (#47435109)

You put all your shit behind load balancers. Then you tell the target machine to stop accepting new connections. When the existing connections die, you are free to do whatever upgrades are necessary. Swap the tested box back in, and move to the next target.

At the very least, you should configure two boxes such that the backup box has the same MAC so you can easily swap it in without anyone noticing. Then upgrade the backup, swap it in, upgrade the former primary. Probably good to leave things as-is so both boxes get equal time as primary.

Thanks for the feedback (OP response) (2)

grahamsaa (1287732) | about two weeks ago | (#47435115)

Thanks for all of the feedback -- it's useful.

A couple clarifications: we do have redundant systems, on multiple physical machines with redundant power and network connections. If a VM (or even an entire hypervisor) dies, we're generally OK. Unfortunately, some things are very hard to make HA. If a primary database server needs to be rebooted, generally downtime is required. We do have a pretty good monitoring setup, and we also have support staff that work all shifts, so there's always someone around who could be tasked with 'call me if this breaks'. We also have a senior engineer on call at all times. Lately it's been pretty quiet because stuff mostly just works.

Basically, up to this point we haven't automated anything that will / could be done during a maintenance window that causes downtime on a public facing service, and I can understand the reasoning behind that, but we also have lab and QA environments that are getting closer to what we have in production. They're not quite there yet, but when we get there, automating something like this could be an interesting way to go. We're already starting to use Ansible, but that's not completely baked in yet and will probably take several months.

My interest in doing this is partly that sleep is nice, but really, if I'm doing maintenance at 5:30 AM for a window that has to be announced weeks ahead of time, I'm a single point of failure, and I don't really like that. Plus, considering the number of systems we have, the benefits of automating this particular scenario are significant. Proper testing is required, but proper testing (which can also be automated) can be used to ensure that our lab environments do actually match production (unit tests can be baked in). Initially it will take more time, but in the long run anything that can eliminate human error is good, particularly at odd hours.

Somewhat related: about a year ago, my cat redeployed a service. I was up for an early morning window and had pre-staged a few commands chained with &&'s, went downstairs to make coffee, and came back to find that the work had been done. Too early. My cat was hanging out on the desk. The first key he hit was "enter", followed by a bunch of garbage, so my commands were faithfully executed. It didn't cause any serious trouble, but it could have under different circumstances. Anyway, thanks for the useful feedback :)

Re:Thanks for the feedback (OP response) (0)

Anonymous Coward | about two weeks ago | (#47448067)

You can avoid the cat-related issues by simply locking your desktop every time you get up.

You're a whiny-assed bitch (1)

msobkow (48369) | about two weeks ago | (#47435181)

The whole reason we used to get paid extra to provide support was to provide support. That meant weird hours, weekends, and late nights.

If you don't like it, get another job.

Silence or I will replace you with a shell script (0)

Anonymous Coward | about two weeks ago | (#47435299)

Automating server maintenance is something most IT departments try to do. But you never want to make it too automated.

How about meeting it half way with MOPs? (1)

xushi (740195) | about two weeks ago | (#47435605)

You said you had 24/7 personnel on call. Let's refer to them as the NOC.

Are they trained to type and follow commands, along with basic (I mean basic...) skills in *nix?

You could cut the cost of investing in automation, which can be considerable, and focus instead on well-documented (and, as far as possible, tested) MOPs (methods of procedure): step-by-step instructions they can follow, since they're up anyway. You can put some of the allocated cost into training the NOC a bit more on *nix, scripting, etc.

If anything goes wrong then they can call you and you can follow up. Depending on how good your steps are (and a bit of luck) you might end up waking up less than usual.

Of course, I ask that silly question up top because you don't want to be awakened at the start of the maintenance with "Hello sir, your MOP failed at `ls /vra/log`."

2 AM? (1)

PPH (736903) | about two weeks ago | (#47435637)

Sleep in?

I don't understand. This just means you swing by and do the update after they close the bar and throw you out.

That's SOP around here.

manual updates in 2014? (0)

Anonymous Coward | about two weeks ago | (#47435731)

That, to me, is a red flag all by itself.

the hard truth is:

if you can't be sure that your production system will work after the patch, it means that you don't know its state for sure. Which means someone has done manual updates prior to that. Which means your processes are generally not up to scratch for a production environment.

What you need is a fully automated build, integration and deployment process for the full system, from the OS to configuration to apps, database setup, etc. Basically, you should be able to issue a "deploy" command from some machine, targeting some other machine, and then be able to walk away in confidence. Post-installation, production test scripts should verify everything that you would check by hand, then let you know the results. Only then might you do some basic manual checks, just on the off chance something went horribly wrong and fooled the test scripts.

Btw, of course you will have tested the deploy commands plus all the scripts many times before even thinking of deploying to production.

I know this is far from being a standard approach, but really there is no excuse any more for not doing it this way.
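
A bare-bones sketch of that deploy-then-verify pattern over plain SSH; the hostname, script names and health check are placeholders (in practice this is what the CM/CD tools mentioned elsewhere in the thread wrap for you):

    #!/usr/bin/env bash
    # Push a pre-tested deploy script to a target host, run it, then run smoke tests.
    # Hostname, script names and the health check URL are placeholders.
    set -euo pipefail
    TARGET=app01.example.com

    scp deploy.sh smoke-test.sh "$TARGET:/tmp/"
    ssh "$TARGET" 'bash /tmp/deploy.sh'       # the exact script exercised in pre-prod

    # Verify on the box and from the outside.
    ssh "$TARGET" 'bash /tmp/smoke-test.sh'
    curl --fail --max-time 10 "http://$TARGET/healthz" > /dev/null
    echo "Deploy to $TARGET verified at $(date -u)"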

The cloud dude (0)

Anonymous Coward | about two weeks ago | (#47436327)

        uh.... one word. CLOUD. My idiot boss thinks there is a nebulous magic button called the cloud that can do it all...

Automate but cover your bases (1)

MatthiasF (1853064) | about two weeks ago | (#47436449)

Only automate tasks on systems that can be quickly snapshotted and simply QC'd using scripts.

For instance, if you have a web server you want to update weekly, then set up a script on the virtual host that snapshots the virtual machine before the upgrades and then runs a series of checks against the web server after the upgrades. If the web server does not respond as expected to the post-upgrade checks, the virtual host can revert to the pre-update snapshot and send you a message notifying you of the upgrade failure. You could also snapshot the failed virtual machine and spin it up on another machine or instance, without networking, to check the logs for any errors that occurred during the upgrades.

If the virtual machine is *nix based, you could mount the snapshot directly on the host and browse the logs as well, or even automate the collection of failed logs too.

Any upgrade procedure that cannot be easily scripted or delayed in such a fashion should be done manually and well attended by someone knowledgeable.
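
A sketch of that snapshot/check/revert loop for a KVM-hosted web VM, run from the virtual host; the domain name, guest address, check URL and timings are placeholders, and mail assumes a working local MTA:

    #!/usr/bin/env bash
    # Run on the virtual host: snapshot, upgrade the guest, check it, revert on failure.
    # Domain, guest address and check URL are placeholders.
    set -u
    DOMAIN=webvm01
    GUEST=webvm01.example.com
    SNAP="pre-upgrade-$(date +%Y%m%d)"

    virsh snapshot-create-as "$DOMAIN" "$SNAP" "before weekly upgrade"
    ssh "$GUEST" 'yum -y update && reboot' || true   # the reboot drops the ssh session

    sleep 180   # give services time to come back
    if ! curl --silent --fail --max-time 15 "http://$GUEST/" > /dev/null; then
        virsh snapshot-revert "$DOMAIN" "$SNAP"      # roll back to the pre-upgrade state
        echo "Upgrade of $DOMAIN failed post-upgrade checks; reverted to $SNAP" \
            | mail -s "Upgrade failure: $DOMAIN" root
    fi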

Two words: Time Zones (0)

Anonymous Coward | about two weeks ago | (#47436553)

You should have more than one admin anyway, in case one has to call in sick.
One on the west coast and one on the east coast would give you three extra hours of window to avoid the 2am and 7am slots.
Or one in Europe, and another in Asia.

Still, automating as much of it as possible is a good idea; it lets you test out the procedure before performing the maintenance, minimizing your chance of screwing something up.

It can be done... (0)

Anonymous Coward | about two weeks ago | (#47436639)

I do stuff like this a lot at my job.

What I'd do is this:
-Write a script to do the package stuff and the reboot...
-Write another script that's running on a completely different machine/VM... whatever... that pings/wgets/curls/nmaps whatever you need to see that the machine has indeed rebooted and the services you're expecting to be up are operating... like wget through part of your website for example...
-If the script detects an issue via the monitoring machine then it sends out an email to your email address and texts your cell phone
-set this up to run every 5 minutes in cron on your monitoring machine... and if you want, have it text you every 5 minutes at your house while you're sleeping, to wake you up....
-get your ass to work if you need to, but if all goes well you get to sleep in most times...
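
A sketch of the watchdog side of that setup, meant to run from cron on the separate monitoring machine; the target host, checks, and the carrier email-to-SMS address are placeholders:

    #!/usr/bin/env bash
    # Runs from cron on the *monitoring* machine, e.g.:
    #   */5 * * * * /usr/local/bin/watch-target.sh
    # Target host, checks and alert addresses are placeholders.
    TARGET=app01.example.com
    ALERTS="oncall@example.com 5551234567@txt.example.com"   # email plus a carrier SMS gateway

    FAILED=""
    ping -c 2 -W 3 "$TARGET" > /dev/null 2>&1                        || FAILED="$FAILED ping"
    curl --silent --fail --max-time 10 "http://$TARGET/" > /dev/null || FAILED="$FAILED http"
    nc -z -w 5 "$TARGET" 22 > /dev/null 2>&1                         || FAILED="$FAILED ssh"

    if [ -n "$FAILED" ]; then
        for addr in $ALERTS; do
            echo "Checks failed on $TARGET:$FAILED ($(date -u))" \
                | mail -s "ALERT: $TARGET failing checks" "$addr"
        done
    fi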

You are just doing it wrong (1)

elmer at web-axis (697307) | about two weeks ago | (#47436691)

With virtualization you should have no real need to do server upgrades out of hours. If you need to upgrade a package or service on offer, you should just spin up a brand new instance, have some automation piece install and configure everything that the instance needs, have some automated testing confirm that it's all there, then add the instance to the load balancer and decommission the old instance. No more out-of-hours work, unless you're dealing with hardware issues, and with HA those can usually be dealt with during business hours too. If you are restricted by a limit on resources, you should at least be using products like Docker or Solaris Zones to isolate guests from the core OS and separate application needs from core OS needs (the bulk of change usually happens in the application layer, so this separation again means less downtime out of hours). Need to update the hypervisor? Live migrate the guests to another piece of hardware and, again, do the maintenance during business hours. If you don't have the budget, you can always spin these kinds of solutions up yourself (DRBD/KVM work a treat). Or, as others have said, host everything in the cloud.

Do you get paid overtime? (1)

Ryanrule (1657199) | about two weeks ago | (#47436731)

If not, tell them you will quit.
When they call your bluff, quit.
Accept a 50% raise the next day.

Off-shoring (0)

Anonymous Coward | about two weeks ago | (#47443089)

It's interesting how people are not really discussing utilising off-shore resources. Most large organisations utilise employees in different time zones because of the cost savings and the possibility of having at least some of the team working at any given hour of the day. So you can be in bed in New York at midnight your time, while a worker in India does your maintenance activities, because it is only 9:30am their time.

Monitoring Team here (1)

weweedmaniii (1869418) | about two weeks ago | (#47444533)

I work for a monitoring team. We are 24/7, and I can guarantee you from experience this is a terrible idea. The first time the servers drop out of monitoring suppression and suddenly half a dozen alarms are going off, because your automated maintenance dragged down a series of other servers or killed the switches at the office, and I get to call you at 4AM, you are going to wonder why you didn't just catch a nap and go in. Anytime we get an e-mail from a "Senior Server Manager" stating that "a change will be made this weekend at 2AM but will not affect system uptime," we note it in our shift logs, because as sure as we are sitting there, Murphy will creep up, jump on that manager's back, and chew until someone can beat him off. Usually, to minimize the damage to our team, we will politely e-mail that manager and ask exactly what systems and what times are involved, as a warning that we really do not want to have to go through the late-night procedures to alert someone. Most managers with experience will actually send us a separate e-mail saying "server XYZ123 will be down from 1AM to 3AM; if we get it up sooner we will call to verify it is up on your end." We monitor 10 different companies of all sizes, from a single server room to worldwide systems, and Murphy is a board member at every one and always gets a vote.

Build a redundant infrastructure, upgrade whenever (0)

Anonymous Coward | about a week ago | (#47468663)

Unattended maintenance is not a good idea. Although in many cases you can automate everything, including the step that verifies that the maintenance was successful, you can't automate troubleshooting all the issues when it's unsuccessful. However, automation wherever you can is almost always a good idea, including for attended upgrades.

Work towards building a redundant (and/or highly available if possible) production and test infrastructure that minimizes downtime for users, regardless of whether the downtime was caused by unplanned outages, or planned maintenance. With that in place, you can build management confidence that upgrades can be performed during regular hours, since the expected user impact is minimal if any. However, in IT, there is always maintenance that will need to be performed during non peak/after hours, either because that particular user service is that critical, or because the maintenance is that risky, or because the expected downtime for the maintenance is too great.
