
Ask Slashdot: Getting a Grip On an Inherited IT Mess?

timothy posted more than 2 years ago | from the can-is-open-worms-are-everywhere dept.

Networking 424

First time accepted submitter bushx writes "A little over a month ago, I assumed the position of programmer and sole IT personnel at a thriving e-commerce company. All the documentation I have is of my own creation, as I've spent most of my time reverse-engineering the systems in place just so I can understand how everything works together. Since I've started, I've done everything from network and phone upgrades to database maintenance with Perl, and thus far it's been immensely rewarding. But as I dig deeper, I notice the alarming number of band-aids applied by my predecessor, and it seems like the entire company's infrastructure is just a few problems away from a total meltdown. The big question now is, how do I, as a single person, effectively audit the network, servers, databases, backups, and formulate a long-term plan that can be implemented by one person? Is it possible? Where do I begin?"




Explains a lot (5, Funny)

p43751 (170402) | more than 2 years ago | (#38281160)

You work at RIM?

Re:Explains a lot (5, Funny)

Anonymous Coward | more than 2 years ago | (#38281406)

So you are asking him if he got a RIM job?

methodically and late into the night (5, Insightful)

sentimental.bryan (2489736) | more than 2 years ago | (#38281170)

say goodbye to your life for the next year. hope you're getting paid to mislay it....

Re:methodically and late into the night (4, Insightful)

mattventura (1408229) | more than 2 years ago | (#38281780)

This. From what I've heard, it often involves weekends too.

Quit (0)

Anonymous Coward | more than 2 years ago | (#38281178)

Nuff Said.

Re:Quit (5, Insightful)

Anonymous Coward | more than 2 years ago | (#38281284)


This is actually the kind of career building stuff one should leap at. What would you rather say in an interview for your next job:
- I took this system that was falling apart and made it run like clockwork.. downtime and issue frequency went from "it's down again" to "been up all year" ..
- Yeah it was pretty good when I got there, and I maintained the status quo

My thoughts on original question:

First step is comprehension. You can't fix what you don't know you have or need. Identify the key components of your system. Then, for each key component, break it down into its parts and dependencies. Then break each of those out, and so on, until you have a pretty damn good idea of what you have.

Next part is assessment. For each component you've identified, what is its current state?

And then it’s time to do triage. Prioritize stuff by largest potential impact.

And finally carry out your well-thought-out pla.. ok, can't say that one with a straight face. Basically, try to fix stuff when you can, between putting out the daily fires.
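The comprehension / assessment / triage steps above can be sketched as a simple script. This is just an illustration in Python; the component names, states, and impact scores are all invented:

```python
# Key components, each broken down with its dependencies, current state,
# and potential impact if it fails (1-10). All entries are invented examples.
components = {
    "web-frontend": {"state": "fragile",            "impact": 9,  "depends_on": ["database", "network"]},
    "database":     {"state": "backups unverified", "impact": 10, "depends_on": ["storage"]},
    "phone-system": {"state": "ok",                 "impact": 4,  "depends_on": ["network"]},
    "network":      {"state": "undocumented",       "impact": 8,  "depends_on": []},
    "storage":      {"state": "near capacity",      "impact": 7,  "depends_on": []},
}

# Triage: work the list from largest potential impact down.
triage_order = sorted(components, key=lambda name: components[name]["impact"], reverse=True)
for name in triage_order:
    c = components[name]
    deps = ", ".join(c["depends_on"]) or "-"
    print(f'{c["impact"]:>2}  {name:<13} {c["state"]} (depends on: {deps})')
```

Even a throwaway list like this makes the "fix stuff between fires" step less random: the fires you choose to put out first are the ones at the top.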

Re:Quit (5, Insightful)

Anonymous Coward | more than 2 years ago | (#38281636)

I worked in this environment for one year, so as not to tarnish my resume. I toughed out the last 4 months absolutely burned out and bitter. You cannot communicate to management that outages and issues aren't your fault; they're inherited. When you fix things, you'll inevitably miss something (I did, because of the pace, which wasn't dictated by me). Get out. It's not worth the challenge of trying to get proper budgeting to put the right tools in place; if the organization as a whole cared, it wouldn't have let things get how they are in the first place. The business model I came from is failing. If you're good, there are better paying, more rewarding, less "heart and soul" companies out there. You're doing basically startup work for at-will employment pay.

Re:Quit (5, Insightful)

The Moof (859402) | more than 2 years ago | (#38281642)

I'd amend that to a big "maybe" for sticking around.

All of what you said (and the initial reaction in the GP to quit) hinges on the root cause of the mess. If it's a result of the predecessor not doing things correctly and flying by the seat of his pants, you're right to jump at the opportunity. However, if it's caused by management screwing IT every chance they get with poor timelines, lack of funding, no foresight, and so on, run like hell.

Re:Quit (5, Insightful)

Archangel Michael (180766) | more than 2 years ago | (#38281812)

It is probably a combination of the two, because MGMT always assumes IT can do something with very little, and often the impossible with nothing.

We are skilled problem solvers (most of us anyway), and they rely on that to function. I hate to say it, but the original question should be answered this way: HIRE outside consultants to evaluate your system(s) and give you a hard-copy report on their findings that you can present to MGMT.

If the situation is as I believe, it is worse than he even suspects. He needs more help than he can provide by himself to get ahead of the curve.

Re:Quit (5, Informative)

v1 (525388) | more than 2 years ago | (#38281758)

Yep. Hop into the waders and get to work. It can be a very rewarding experience turning a steaming pile back into a smooth running good looking machine.

To add to the above: document everything. Though it sounds like you're already doing that. Make sure it's documentation that works for anyone, not just you. Don't take anything for granted. Automate whatever you can, including problem detection and notification (save yourself from having to check things daily or weekly; have it shoot you an email or something if a common issue crops up again).
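The "have it shoot you an email" idea is just a cron job wrapped around a check and a mail call. A minimal Python sketch; the threshold, addresses, and SMTP host here are placeholders, not anything from the original post:

```python
import shutil
import smtplib
from email.message import EmailMessage

def disk_alerts(paths, threshold=0.90):
    """Return warning strings for any filesystem over the usage threshold."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)
        frac = usage.used / usage.total
        if frac >= threshold:
            alerts.append(f"{path} is {frac:.0%} full")
    return alerts

def notify(alerts, smtp_host="localhost", to="admin@example.com"):
    """Mail the alert list; does nothing when everything is healthy."""
    if not alerts:
        return
    msg = EmailMessage()
    msg["Subject"] = f"[monitor] {len(alerts)} issue(s) detected"
    msg["From"] = "monitor@example.com"
    msg["To"] = to
    msg.set_content("\n".join(alerts))
    with smtplib.SMTP(smtp_host) as conn:
        conn.send_message(msg)
```

Call `notify(disk_alerts(["/", "/var"]))` from cron; it stays silent unless a path crosses the threshold, which is exactly the "only bother me when something is wrong" behavior described above.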

Make sure your employer fully understands the situation you and they are in, so they don't expect you to be making improvements and striking things off the to-do list they were probably hoping you'd tackle the day you started. Get them a timeline as soon as you have something of a grip on the situation; tell them where you're going to be spending your time to start with, and why it's essential and going to delay their bells and whistles and the visible bang-for-the-buck of hiring you. Otherwise they may think you're just sitting on your butt because they're seeing no tangible benefits.

If you've got a LOT of things that need to be fixed, things closer to trained-monkey level, consider getting a temp assistant to help you dig out: someone to run around and reimage machines, fix networks, repair stations, do RMAs, etc., while you roll up your sleeves and unhack the servers. But if they're not in that big a hurry, this may not be appropriate.

Good luck with it, sounds like fun actually, a challenge at the least.

Re:Quit (3, Interesting)

KermodeBear (738243) | more than 2 years ago | (#38281816)

Best advice, right there. It's a challenge for certain, but making things better is the best thing you can do - for the company (ha) and, far more importantly, for yourself.

Hang in there!

And although it may feel like the whole place is going to fall apart any moment, it hasn't yet, you're in charge, and it sounds like you're gradually making it all better. Take a deep breath, Don't Panic, it'll be okay.

Re:Quit (3, Insightful)

1s44c (552956) | more than 2 years ago | (#38281438)

Quit? Do you give up on every task before you start?

Some of us like a challenge.

Re:Quit (2)

Jibekn (1975348) | more than 2 years ago | (#38281648)

As someone who's been fucked by a sociopath boss for trying to take on a similar project, I agree with the OP: QUIT, run far and fast. It's not worth it. Being able to say "I took a broken system and made it run like clockwork" is worthless at most job interviews; they're not hiring someone to fix their broken system, they're hiring an admin (from the impression I got). If a company knows its system is falling apart, there are contracting companies that specialize in that; unless you're trying to get into one of them, don't bother.

It's really not fun being blamed for a critical failure caused by the idiocy of a predecessor, and then being fired for it.

1 suggestion (5, Funny)

Anonymous Coward | more than 2 years ago | (#38281180)

start drinking

Configuration management (4, Informative)

Neil Watson (60859) | more than 2 years ago | (#38281182)

Automate your servers so you can focus your time elsewhere. I use Cfengine.

Re:Configuration management (4, Informative)

1s44c (552956) | more than 2 years ago | (#38281322)

Yes, automate everything, monitor everything, backup everything, document everything.

I used to use cfengine but find puppet an easier tool to work with. Nagios and BackupPC are also wonderful tools but you might want to choose alternatives if they better fit your needs.
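Nagios checks are just small programs that print one status line and exit with a conventional code: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN. A minimal custom check, sketched in Python (the host and port come from the command line; this is an illustration, not one of the stock plugins):

```python
import socket
import sys

# Nagios plugin convention: print one status line, exit with
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_tcp(host, port, timeout=5.0):
    """Return (exit_code, status_line) for a plain TCP connect check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} is accepting connections"
    except socket.timeout:
        return CRITICAL, f"CRITICAL - {host}:{port} timed out after {timeout}s"
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable: {exc}"

if __name__ == "__main__" and len(sys.argv) == 3:
    code, line = check_tcp(sys.argv[1], int(sys.argv[2]))
    print(line)
    sys.exit(code)
```

Anything that follows this print-and-exit-code contract can be dropped into Nagios as a `check_command`, which is what makes "monitor everything" cheap to extend.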

You might want to express some concerns to management, so that if something critical does fall over, you don't look quite so bad.

Re:Configuration management (4, Informative)

vajrabum (688509) | more than 2 years ago | (#38281598)

Lots of folks here have talked about backups, but if your company is really successful then restores could be more of a problem than backups. Large databases and system configurations can take a loooong time. Develop a plan for restores and execute it regularly as a test. Make sure management understands the time needed for restoration. Two other things: virtualize (that reduces the coefficient of friction for moving things considerably), and consider using Amazon or some other cloud provider in your restore plan in case your cage/server room/whatever burns. Some of those services are low or no cost until you start loading things up. If you go the cloud route, be sure to get a read on your traffic, storage, and other billable numbers. If that's the disaster plan and the numbers are of any size at all, you need to run the cost by the CFO to make sure it's sustainable.
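A restore drill only counts if the restored copy is verified against the source. One way to do that, sketched in Python (the paths are whatever your dump and restore actually produce):

```python
import hashlib

def sha256(path):
    """Hash a file in chunks so multi-gigabyte dumps don't eat RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(original, restored):
    """A restore test only passes if the restored copy matches the source."""
    return sha256(original) == sha256(restored)
```

Timing the whole drill, dump through `verify_restore`, is also what gives you the honest restoration number to put in front of management.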

Before you do that (1)

Colin Smith (2679) | more than 2 years ago | (#38281426)

Make sure you know what you've got. That means a DB and an automated inventory tool, installed and managed by the above management engine.

Then decide what services are missing.

Then automate the rest with the management engine.

Re:Configuration management (2)

SCHecklerX (229973) | more than 2 years ago | (#38281738)

Inherited my mess about a year ago. I've done much to clean it up and monitor it.

I may have to investigate Cfengine soon, but for now, since all of our servers are CentOS and I'm comfortable creating my own RPMs, I simply use yum with rpm. It works very nicely. If I make changes (I use git to track/branch/etc.), I then just rsync the repository to our production server once I'm happy that everything is correct. Building, git, etc. is all automated from within vim with some simple scripts that I wrote.

Nagios monitors the whole mess, including MySQL replication, DRBD clusters, backups, firewalls, mail relays, web servers, whether we somehow ended up on an RBL, etc.

Dunno how I got stuck doing it, not being a developer, but I will also be training our 'development' team (php web developers) on how to use git from a shared repository.

I'm managing approximately 50 enterprise server systems this way, and the load is no big deal for just me now that I'm slowly beginning to rein in our developers and my boss. They all still have root access :-( That's a political fight I'm not yet prepared to have. I was able to take it away on the web servers, at least, and that's the only thing our developers touch, so life is a bit better.

Wiki (0)

Anonymous Coward | more than 2 years ago | (#38281194)

Been there, done that. Start with a simple wiki. Document everything including lists of things that need to be done. Put time and dollar costs on everything along with your idea of the priority. Present the list to the CEO or whomever it is that is above you and work on prioritizing it, then get to work.

Re:Wiki (1)

1s44c (552956) | more than 2 years ago | (#38281370)

Also a list of what needs to be done can be used to justify extra people, new software, or new servers.

Re:Wiki (2)

yakatz (1176317) | more than 2 years ago | (#38281442)

Short of blowing it up and starting fresh, this is the best way. Kidding aside, I was in the same situation as you several years ago.
We happened to have Sharepoint already installed (as part of SBS2008), so we started using its Wiki feature for our documentation.
We use its lists feature to keep track of license keys and firewall settings (not in the same list of course).
Just make as comprehensive a list as possible.

you don't (1)

yincrash (854885) | more than 2 years ago | (#38281196)

you hire more people and do a thorough job of cleaning it up or rebuilding.

Re:you don't (3, Insightful)

Synerg1y (2169962) | more than 2 years ago | (#38281400)

Or bring in contractors/consultants and have them serve their part, then part ways. The biggest mistake you can make is taking everything on your own shoulders; that = loss of life & health. It's a job, and work != life.

Denning Cycle (1)

Colin Smith (2679) | more than 2 years ago | (#38281200)


Sorry; Demng Cycle (1)

Colin Smith (2679) | more than 2 years ago | (#38281302)


3rd time's a charm (1)

Colin Smith (2679) | more than 2 years ago | (#38281488)


We don't need to know you're single.... (0)

Anonymous Coward | more than 2 years ago | (#38281202)

We don't need to know you're single....

Good luck at fixing other people's mistakes...

"A little over a month ago I assumed the position" (5, Funny)

Anonymous Coward | more than 2 years ago | (#38281214)

Dude, that is too easy. There are serious wiseacres on this board.

any management software? (1)

alen (225700) | more than 2 years ago | (#38281220)

Is there any software that alerts you when something is wrong? At a minimum, something that tells you what is out there? What about the backup software reports?

I use NetBackup to back up our MS SQL servers and didn't like the built-in reporting. I wrote my own procedure to import a few tables from the msdb database (where the backup data is kept) into a central database, and I use SQL Server Reporting Services to send daily emails of the latest backup times of each server/database, along with a few alert reports of databases that were never backed up or haven't been backed up for 7 days.
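The never-backed-up-or-stale-for-7-days report boils down to a cutoff comparison. A sketch in Python, assuming you have already pulled a `{database: last_backup_time}` mapping out of msdb (a `None` value standing in for "no backup ever recorded"):

```python
from datetime import datetime, timedelta

def stale_backups(last_backup, max_age_days=7, now=None):
    """Return {db: last_backup_time} for databases never backed up (None)
    or not backed up within max_age_days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return {db: t for db, t in last_backup.items() if t is None or t < cutoff}
```

Feed the result straight into whatever sends your daily email; an empty dict means the alert report has nothing to say that day.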

Re:any management software? (2)

Synerg1y (2169962) | more than 2 years ago | (#38281290)

PRTG monitor.


related story (2, Funny)

shentino (1139071) | more than 2 years ago | (#38281232)

Did the last guy outsource everything to india?

One thing at a time (2)

UberJugend (2519392) | more than 2 years ago | (#38281238)

Assess your most vulnerable items. Whether that's a server, a network component, an application, a database, etc., give them all a criticality score. Share that list with your boss/manager and work the list one item at a time. You can't spread yourself too thin when working a project like this, so focus on one or two items at a time until you see a light at the end of the tunnel.
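The criticality score can be as simple as likelihood of failure times business impact. A Python sketch, with the items and scores invented for illustration:

```python
def criticality(likelihood, impact):
    # Simple risk score: how likely it is to fail times how much a failure hurts.
    return likelihood * impact

# (name, likelihood of failure 1-10, business impact 1-10) -- all invented.
inventory = [
    ("db-server",    9, 10),
    ("core-switch",  4,  9),
    ("order-app",    6,  8),
    ("phone-system", 3,  4),
]

ranked = sorted(inventory, key=lambda item: criticality(item[1], item[2]), reverse=True)
work_queue = ranked[:2]  # focus on one or two items at a time
```

The `[:2]` slice is the "one or two items at a time" rule made explicit: everything else stays on the shared list with the boss until a slot opens up.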

Re:One thing at a time (2)

RagingFuryBlack (956453) | more than 2 years ago | (#38281578)

Probably better to start with "Well, what are my assets?" What things are currently plugged into my network? What purpose do they serve in meeting the objectives of the business? Once you know what is on your network, then you can start assessing what items are vulnerable. I've used NeXpose for vulnerability assessment as of late, but I'm sure there are other solutions, both proprietary and FOSS that can do the job. Many of these products will give you a total risk score and a CVSS score of the vulnerabilities on the system. Of course, I'd just start using these as a justification for more staff. You've got much larger problems than just security vulnerabilities. You've got infrastructure issues from the sound of it that need some serious man hours to fix. Use the security portion to light a fire under management's ass and get some new staff under you. Management reacts pretty quickly when they realize their compensation list may be available for download by the outside world from their internal servers.

Escalate (5, Insightful)

sociocapitalist (2471722) | more than 2 years ago | (#38281240)

Brief your management on the situation. Explain what condition things are in and what is needed to get them into a manageable state. Give them a list of projects / tasks that you have to deal with and get them to prioritize.

Re:Escalate (2)

vikingpower (768921) | more than 2 years ago | (#38281586)

Mod parent up. Make sure you have management backing. But truly, I am serious: DO make sure you DO have management backing! Then get to work. Document as much as you can, then identify the hottest points and fix them, one after the other. As you make your way through the mess, your understanding will deepen. Ask if you can hire an intern. So, to summarize: 1) get political backing 2) document, document, document 3) fix, in single-tasking mode 4) if necessary, GOTO 1. For your next career move, this is great to have on your CV.

I wouldn't worry... (0)

Anonymous Coward | more than 2 years ago | (#38281242)

IMO, don't lose any sleep over it. The description covers most IT systems.

Re:I wouldn't worry... (3, Informative)

mikael_j (106439) | more than 2 years ago | (#38281366)

Sadly there's a lot of truth to this. In my experience the difference between most "good" and "bad" networks is whether the WTFs are vendor-blessed hacks or in-house hacks.

Of course, there are always those places where this is not the case but I've seen enough IT environments to believe that for a majority of companies this is sadly the state of things. If maintenance in the average factory was handled the same way IT is handled at the average company most machines would consist of approximately 30-50% duct tape, newspaper, string and glue...

Getting a Grip (1)

Dan East (318230) | more than 2 years ago | (#38281246)

Get a firm grip on your steering wheel, and keep your car pointed away from that company.

Re:Getting a Grip (3, Insightful)

Hadlock (143607) | more than 2 years ago | (#38281432)

This is the only solid advice I've read so far. Band-aid solutions are indicative of one of two things: an admin too shy to ask management for a bigger budget, or management's reluctance to grant one. Generally it is the latter.

Re:Getting a Grip (1)

berashith (222128) | more than 2 years ago | (#38281502)

Agreed. As soon as I saw this was an IT department of one, I could tell exactly how much management cares about getting things like this corrected. These things are in place because management does not want to provide what is needed. If they only want to pay for band-aids, that is all they will have.

Present a quick hit list of what is worst, how much it will cost, and how long it will take to correct. This will get lip service at best, and then, with a clear conscience, spend your working days getting paid to find a new job.

Re:Getting a Grip (5, Informative)

Atanamis (236193) | more than 2 years ago | (#38281660)

Agreed. As soon as I saw this was an IT department of one, I could tell exactly how much management cares about getting things like this corrected. These things are in place because management does not want to provide what is needed. If they only want to pay for band-aids, that is all they will have.

This isn't necessarily the case though. I have a friend who took over IT at a small business. When he walked in they were using pirated software and their IT was a complete mess. After he put in hours to get it fixed up (with personal support from the owner), they ended up offering him an executive position with a massive pay increase. Some small shops with one IT guy really just don't know what they are doing, and haven't had a person in the job to tell them what is being done wrong. Your advice is still good though. A person in that situation needs to test whether they have management support to do things better. If so, it can turn into a career making opportunity to turn things around. If you can't get the management on your side though, it very well could be time to start looking for another job with more supportive management.

Get management buy in... (5, Insightful)

Lumpy (12016) | more than 2 years ago | (#38281258)

You need to document it and get management to approve spending money.

I'll bet you $100.00 the band-aids are there because management refuses to spend money on infrastructure; that's why it is a mess, and why the guy who was there beforehand left.

99% of the time a hosed IT infrastructure is because management refused to spend any money so it had to be half assed.

Re:Get management buy in... (1)

javilon (99157) | more than 2 years ago | (#38281462)

And once you have confirmed this, you should start searching for jobs instead of trying to redo your company's full IT infrastructure by yourself.

Re:Get management buy in... (1)

Anonymous Coward | more than 2 years ago | (#38281510)

This. Present the business case. Use whatever statistics you can get, and get decent cost estimates. IT is traditionally a money sink. If you want the budget to get it fixed, you've got to either show management the liability of not fixing it or the profit in fixing it. Oh, by the way, the answer is likely "no, and don't ask," so you've got to put a decent sales job into getting your business case read.

Part B: Does "thriving" mean "not losing money today", "making money", "growing", or "exploding"? That hugely affects the business case for IT upgrades, maintenance, clean-ups, etc. Your business case needs to tie into the corporate growth plan. Read your management and see what they want to hear, then tailor your business case to that. In my case, that means writing the gold-plated business case that matches what management wants to believe they can achieve, then providing a business case that's the 80% solution when they balk at the bottom line. The 80% is what I need to do the job well and meet current needs; the gold-plated version assumes their growth predictions. Your management is (hopefully) different, so read what they want.

This probably means "as our sales increase, our IT support will follow this roadmap to increase profitability" instead of "this shit is all bandaids and about to fall apart." ... the first one makes you a team player, the second means you can't maintain the system that's obviously working, and should be replaced with someone who can.

Good hunting.

Re:Get management buy in... (0)

Anonymous Coward | more than 2 years ago | (#38281526)

This matches my experience as well. The best thing you can do is identify and document 2 or 3 potential business-stopping failures. Next, identify the most vulnerable systems and order them for a diagnostic process, most critical first. Prepare a plan with 2 alternatives: the initial plan is a fully funded review and upgrade of systems; Alt 1 is a review and repair of the most critical systems with assistance; Alt 2 is to continue with band-aids. Be sure to identify the business activities that will be affected by the stoppage of specific systems.

Present that to Sr Staff and let them choose. If they choose Alt 2, find another job while applying band aids.

Re:Get management buy in... (1)

Anonymous Coward | more than 2 years ago | (#38281632)

99% of the time a hosed IT infrastructure is because management refused to spend any money so it had to be half assed.

In my experience, nine out of ten companies I walk into are more than thrilled to spend metric assloads of money.

99% of the time a hosed IT infrastructure is because one's predecessors were fucking morons.

Re:Get management buy in... (1)

LostOne (51301) | more than 2 years ago | (#38281706)

The "refuse to spend money" thing assumes the company has the money to spend at all. This is often just not the case in the current economic climate. Consider how many small businesses are hanging on by a thread and simply do not have the resources to commit to even the most desirable improvements. And before you say that is obviously the mark of poor management, consider that many times the choice is between going bankrupt by spending money to fix a looming problem that has not yet materialized or staying in business another three months. How is choosing the bankruptcy option beneficial to anyone?

Sure, spending the money on a clean system that is not loaded with kludges, bandaids, and binder twine is the correct choice for the long term, but you cannot get to the long term if you completely ignore short term needs.

Money is the last step not the first (1, Insightful)

sjbe (173966) | more than 2 years ago | (#38281788)

99% of the time a hosed IT infrastructure is because management refused to spend any money so it had to be half assed.

It is certainly true that a great many companies are penny-wise, pound-foolish when it comes to IT, but it is VERY premature to jump to that conclusion here. I've seen almost as many cases where companies overspent on IT for things they didn't really need. My current company has a piece of accounting software that is seriously overkill for our relatively pedestrian needs. It cost our company $80,000 when $3,000 on QuickBooks Enterprise would have done the job fine. (Bought by the previous owners, who were all engineers without a lick of business savvy.)

In any case it is much more likely that any "half assed" solutions were due to a lack of competence rather than a lack of money. It sounds like this guy has done a lot to improve things without throwing big bucks at the problem so I'm inclined to suspect his predecessor was not especially gifted.

Throwing money at a problem should always be the solution of last resort. While it is certainly possible this company isn't spending enough, you don't spend money on anything without a reasonable expected ROI. Spending money as a first impulse usually means you haven't really thought about the problem sufficiently and are just assuming that a more expensive product will solve all your problems. If I hired an IT guy and the first thing out of his mouth was that I wasn't spending enough, I'd be seriously worried.

2 Steps (0)

Anonymous Coward | more than 2 years ago | (#38281266)

2 steps need to be done: 1. Tell your boss about these terrible problems. 2. Demand more money.

this is a majorly funny story (5, Insightful)

roman_mir (125474) | more than 2 years ago | (#38281270)


1. The job has lasted for 1 month so far.
2. The e-commerce company is apparently 'thriving'.
3. All of the systems have been "reverse engineered" in that 1 month.
4. All of the documents are written in that 1 month.
5. In 1 month there have been network and phone upgrades and database maintenance with Perl, and it has all been 'immensely rewarding'.
6. The entire infrastructure is 'a few problems away from a total meltdown'.
7. Single person IT operation to do everything.

Question: is this for real? What's the size of the company and what's the budget?

Re:this is a majorly funny story (2)

Firemouth (1360899) | more than 2 years ago | (#38281446)

Give the guy a break, it's probably still his first day! To accomplish all of that, he probably walked in the door a month ago and since then he hasn't seen the light of day, walked outside, slept, or even eaten. It's a cry for help, don't mock the poor guy! Somebody, get this man a pizza and some ambien!

Re:this is a majorly funny story (1)

roman_mir (125474) | more than 2 years ago | (#38281474)

I am not mocking him, I am just wondering if this isn't another one of those 'sixth sense' situations. Is he sure he is alive?

Re:this is a majorly funny story (1)

methano (519830) | more than 2 years ago | (#38281498)

There's something wacky here. A thriving e-commerce company with one IT guy, newly hired. Let's think of some more similar situations.

A thriving construction company with one hammer.
A thriving aerospace company with a big backyard.
A thriving hospital with a nurse and a veterinarian on staff.

Maybe I should read the article. Nah!

Re:this is a majorly funny story (1)

roman_mir (125474) | more than 2 years ago | (#38281530)

What? There is an article?

I see dead people.

Re:this is a majorly funny story (5, Insightful)

1u3hr (530656) | more than 2 years ago | (#38281618)

Question: is this for real?

It's an "Ask Slashdot". They're as real as "Letters to Penthouse": both carefully crafted to create a fantasy situation to excite readers. Read them if the subject is something you're interested in, but don't waste your time giving advice.

Re:this is a majorly funny story (1)

roman_mir (125474) | more than 2 years ago | (#38281714)

I just googled 'letters to Penthouse' and I don't know why, but I found them about as informative as this story, though definitely more titillating, and the pics are OK.

As for advice to the submitter of the story: I'd say never mind reverse engineering whatever gizmo-thingy is in that 'thriving' company. If it was thriving before you got there, nothing will change that, because if it's thriving, it's obviously not because of anything the IT department is doing.

Start a romance with somebody in sales, that would take the pressure off.

Re:this is a majorly funny story (2)

vlm (69642) | more than 2 years ago | (#38281626)

My gut-level guess is that my house's IT infrastructure is more elaborate/complicated. Admittedly, very little of my gross income depends on my home infrastructure.
My guess is he's a noob to IT. "A few problems away from a total meltdown" describes every IT infrastructure I've seen in the past 20 years, including Fortune 500 corporates. Nothing new there.

I'm serious about the house analogy. Just treat it like an extremely advanced home LAN, except you have more time and outages are much more costly.

I keep all my docs and manuals and data in git, since it's just me, and I replicate it all over. If there were more than just me, I'd probably put it on a wiki instance. At all costs, stay ahead of the curve. For example, one hour setting up munin/mrtg/nagios saves ten hours of outage time. If you're neck deep in that extra ten hours of outage and "can't" stop fighting fires, stay up all freaking night or whatever it takes to set up the monitoring system. Even if it's just you, set up a ticketing system to keep track of "forgotten" stuff; RT is free and works better than most commercial packages. Once you get the fires put out, start building up the replication: multiple DHCP servers, multiple replicated databases, multiple backup hosts. You probably can't buy new; the good news is that for a small network you don't need much power, so old/junk computers are OK. Less reliable? Who cares: I have four LXC hosts, each of which cost $99, any two of which can run my primary and backup MySQL DB image, and all four have a backup of the current running DB copied to them daily (and burned to cheap DVD), so I theoretically can't lose more than a day's data.
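The daily replicated dump described above is mostly a naming convention plus a copy loop. A Python sketch; the host names are invented, and mysqldump credentials are assumed to come from ~/.my.cnf:

```python
import datetime
import pathlib
import subprocess

BACKUP_HOSTS = ["backup1:/srv/dumps", "backup2:/srv/dumps"]  # hypothetical targets

def dump_name(db_name, out_dir, day=None):
    """One dated dump per day caps the worst-case loss at one day's data."""
    day = day or datetime.date.today()
    return pathlib.Path(out_dir) / f"{db_name}-{day.isoformat()}.sql"

def dump_and_replicate(db_name, out_dir):
    out = dump_name(db_name, out_dir)
    with open(out, "wb") as f:
        # Credentials are assumed to come from ~/.my.cnf.
        subprocess.run(["mysqldump", db_name], stdout=f, check=True)
    for target in BACKUP_HOSTS:
        # Fan the same dump out to every backup host, so any one can take over.
        subprocess.run(["rsync", "-a", str(out), target], check=True)
    return out
```

Run it from cron once a day; the dated filenames mean old dumps never get overwritten, which is what makes the "burn yesterday's to DVD" step trivial.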

In summary, copy advanced home users in all areas: monitoring is critical, replication is important. Back up everything so many times to so many places that you can't think of any more ways to back stuff up. And automate the heck out of all of it.

And keep spare parts for everything. Your DHCP server doesn't care if it's on a 3 GHz six-core or a 1 GHz single-core, but your users will care if there is none working at all because you couldn't keep spare hardware in a closet. If people are not accusing you of being a techno-hoarder, you're doing it wrong.

Re:this is a majorly funny story (4, Insightful)

KermodeBear (738243) | more than 2 years ago | (#38281686)

I don't see where he said that all systems have been reverse engineered and documented in one month; only that he is currently reverse engineering systems and documenting.

And, maybe this guy likes what he is doing, getting his hands dirty with network and phone stuff. And some people really like writing Perl (I don't; I think it's the devil's language). If he finds his work rewarding, who are you to mock him?

Develop multiple personalities (1)

StillNeedMoreCoffee (123989) | more than 2 years ago | (#38281276)

I refer you to Gilbert and Sullivan's Mikado.

My previous situation exactly (0)

Anonymous Coward | more than 2 years ago | (#38281278)

Work planning. First, what are your goals: how do you want everything to look for you to be happy and comfortable with it? High level, followed by more specific details for each. Ensure that your design is flexible enough to allow additional features and functionality to be incorporated without a total revamp. After that's mostly done (it won't be complete at this stage), start charting expected timelines for tasks, dependencies, and costs. Go back and revise your goals and the plan to account for everything you thought of once you went through the first time. And since you've never done this before, go through it a third time and a fourth and a fifth; continue until you can make a pass without modifying anything significant. At the end of the main planning effort, you should have a timeline of bite-sized tasks that must be completed to allow other bite-sized tasks to be performed, along with their associated costs. Present the timeline, cost, and benefits to your boss. Get approval for the budget, and get to work. Document everything you do, or (if you're short on time) the bare minimum is documentation of the final results of each task.
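The dependency charting is really just a topological sort. A toy sketch in Python (the task names and hour estimates are invented for illustration):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical task graph: task -> (set of prerequisites, rough hours).
TASKS = {
    "inventory hardware":  (set(), 8),
    "verify backups":      (set(), 4),
    "set up monitoring":   ({"inventory hardware"}, 6),
    "replicate database":  ({"verify backups"}, 12),
    "retire flaky server": ({"set up monitoring", "replicate database"}, 4),
}

def work_order(tasks):
    """Order tasks so every prerequisite comes first; graphlib
    raises CycleError if the plan contradicts itself."""
    deps = {name: prereqs for name, (prereqs, _hours) in tasks.items()}
    return list(TopologicalSorter(deps).static_order())

def total_hours(tasks):
    """The bottom line for the budget conversation with the boss."""
    return sum(hours for _prereqs, hours in tasks.values())
```

The revision passes then amount to editing the dict and re-running it until the order and the total stop surprising you.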

It's the Eye of the Tiger! (5, Funny)

anom (809433) | more than 2 years ago | (#38281288)

Just buy a few cases of your energy drink of choice and put Eye of the Tiger on repeat until you've got it all fixed.

I believe in you.

Re:It's the Eye of the Tiger! (1)

mtrachtenberg (67780) | more than 2 years ago | (#38281678)

I believe in you too! Quit immediately and form your own IT outsourcing company. You are worth millions.

Speak UP (2, Insightful)

Anonymous Coward | more than 2 years ago | (#38281292)

Tell/email/post your opinion and observations, as detailed as you can, along with your concerns. Make sure your managers see it. Do not expect them to do anything about it. Do it for your own reference, so you may continue working normally. Do not overwork or overworry yourself, for that will not bring you or the failing systems any closer to resolution. Do your normal job, stay cool, and speak up. You are in the driver's seat.

just wondering (1)

shortscruffydave (638529) | more than 2 years ago | (#38281316)

You say that this is a "thriving ecommerce company"...I'm just wondering how it's managed to achieve and maintain "thriving" status with a single member of IT personnel?

Wait a minute .... (4, Insightful)

CuriousGeorge113 (47122) | more than 2 years ago | (#38281328)

"I assumed the position of programmer and sole IT personnel at a thriving e-commerce company."

Wait.... a thriving e-commerce company has one IT person? Am I missing something here...? No wonder everything was band-aided together. They have one person doing everything.

You may want to consider hiring an outside firm to come in and do the audit for you. The last thing you need right now, on top of your daily workload, is to perform an audit. That, and a third party firm creates a sense of objectivity, and would eliminate the "The IT guy wants a new toy" response from the CFO.

Re:Wait a minute .... (1)

vlm (69642) | more than 2 years ago | (#38281722)

You may want to consider hiring an outside firm to come in and do the audit for you. The last thing you need right now, on top of your daily workload, is to perform an audit. That, and a third party firm creates a sense of objectivity, and would eliminate the "The IT guy wants a new toy" response from the CFO.

And make sure not to hire an outside firm that consults on outsourcing IT support. Security firms are pretty good at general IT auditing in addition to strictly security related analysis.

Re:Wait a minute .... (1)

pz (113803) | more than 2 years ago | (#38281814)

Agreed. And talk to the outside auditor to make sure they strongly recommend hiring at least one other IT person.

What happens when you're out sick?

What happens when you want to take a vacation?

What happens when the servers die at 4am?

What happens when Hotmail refuses to accept connections from your company, and then Google Analytics explodes, and then your merchant account service stops processing your transactions, and then the marketing DB goes down, and then the phone systems stop working?

You want AT LEAST one other person helping with your job.

Backups (2)

slazzy (864185) | more than 2 years ago | (#38281330)

Always start by making sure the backups are working properly.

Be a hero (0, Flamebait)

Anonymous Coward | more than 2 years ago | (#38281334)

Pull out a shotgun in the middle of a meeting with management and splatter your brains on the wall behind you to the shock of all your coworkers and management.

Report and plan (0)

Anonymous Coward | more than 2 years ago | (#38281354)

Report the problems to your supervisors immediately. Let them know what you plan to do to address the problems and if you need any additional resources (like extra staff, overtime or an adjustment to your schedule so you can make major changes when no one is using the system).

It's important to keep them in the loop so they can make decisions based on their budget. They need to be aware of their dependence on the system, its fragility, and the need to invest to ensure its continued survival.

At some point, decisions about where to invest need to be made, and keeping those decision-makers informed is important.

And (1)

NEDHead (1651195) | more than 2 years ago | (#38281364)

we've been running ever since..

Start at the bottom, work up (1)

gatkinso (15975) | more than 2 years ago | (#38281372)

By bottom I mean the servers that the least number of systems depend on.

Get them humming, with an eye toward your migration path.

Then, be methodical. Back up everything often, try not to do anything that isn't easily reversed, and let the folks know the true state of their systems.

This actually sounds like fun.

could be worse (0)

Anonymous Coward | more than 2 years ago | (#38281392)

All I can say is that it could be worse- at least the guy isn't still there defending his band-aids.

You dont ... (0)

Anonymous Coward | more than 2 years ago | (#38281412)

Advise your boss that an audit is in order due to these anomalies that you have found. Recommend to the financial controls / legal team that they bring in an outside firm to look over the infrastructure and, where needed, upgrade your code / configurations / infrastructure to industry best practice.

Relax and take small steps. (1)

MatrixRunnerWork (445953) | more than 2 years ago | (#38281416)

Relax; it has survived so far, so it will likely continue as long as you don't make huge changes without a back-out plan.

Get security scanner software or pay someone to audit the external-facing servers; that will either build confidence or scare you silly. Most things found in that kind of an audit are fairly easy to fix and low risk (patching software, limiting unneeded services and such).

Second, get everything possible into some form of revision control: source, images, web pages.

Back up everything into tarballs or zips so that if you make a change you can undo it quick if it goes poorly.

When it comes time to make changes do a small one first, like updating the copyright at the bottom of the page, re-deploy everything, test as best as you can. This is a confidence building measure then advance on to larger changes ...
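A minimal sketch of that quick-undo tarball idea in Python (source and destination paths are whatever fits your layout):

```python
import tarfile
import time
from pathlib import Path

def snapshot(src_dir, dest_dir):
    """Tar up src_dir before touching it; rolling a change back is
    then just a matter of unpacking the archive."""
    src = Path(src_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_dir) / f"{src.name}-{stamp}.tar.gz"
    dest.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(dest, "w:gz") as tar:
        tar.add(src, arcname=src.name)  # keep the top-level dir name
    return dest
```

Run it immediately before every change, however small, and keep the archives until the change has survived a while in production.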

Make sure your boss understands the current state; spin it positively, since after all they hired the person before you ...

Step by Step (0)

Anonymous Coward | more than 2 years ago | (#38281436)

The very first thing I would do is back everything up. Image every drive.

Next, do a Risk Assessment.

If this system goes TU, how bad will it be?
What steps can we take to help ensure that doesn't happen with this system?

Now come up with a plan of action.

Here are the show-stoppers I found. They generally fall into these categories:
  • This category can be solved this way, making for a more reliable system.
  • This category can be solved this way, giving us better capabilities.
  • This category can be solved this way, saving this much over time.

Here are all the band-aids I found. They generally fall into these categories:
  • This category can be solved this way, making for a more reliable system.
  • This category can be solved this way, giving us better capabilities.
  • This category can be solved this way, saving this much over time.

If it works, don't fix it. (2)

Okian Warrior (537106) | more than 2 years ago | (#38281464)

You're going to spend time rewriting things that currently work? That's a recipe for disaster.

Unless you can predict when something will fail (as in - the database uses 16-bit indexing, so when we hit 65,536 orders the database will crash), it's much more effective to leave things alone.

Wait until changes are needed, then straighten out only those pieces that you have to touch when implementing new functionality.

Work to a benefit. Unless you can point to some aspect which will change in a measurable way (it's crashing frequently, it will crash *less* when I'm done, it will cost less in terms of server rental, &c), leave it alone.

Easy on the finger pointing (2, Insightful)

Anonymous Coward | more than 2 years ago | (#38281476)

No offense, but if you don't have the necessary background to know what/where the tools are, who are you to say everything is band-aided? I see this a lot with new IT hires: they see something different than they would have done and instantly label their predecessor a moron, only to later make "their" change and break everything. Easy on the finger pointing.

The first thing you need to do is make a comprehensive assessment; don't jump in and start making changes until you have documented everything. If you can contact your predecessor and ask about design and/or documentation that may be stored in an industry standard tool that YOU are unaware of; do so. Once you know how all the pieces move, then start to plan how to improve/repair it. If you dive in and it breaks, you will be blamed; if it breaks and you fix it with minimal down time, you're the hero.

If you ask me... (0)

Anonymous Coward | more than 2 years ago | (#38281480)

The fact that you don't know the answer to any of these questions shows that you're really no more qualified to be the sole IT personnel than the last guy was.

Bandaids? (0)

Anonymous Coward | more than 2 years ago | (#38281516)

Are things working? What you call "bandaids" may be actually decent solutions you don't yet understand.

Stop worrying about a "meltdown" and just get to work.

1. Back up everything. Don't start messing with things unless you're sure you can revert them back to their working condition.

2. Document things as you understand them.

3. Pick just one network element at a time and see if you can simplify it. Clean it up. Remove unnecessary junk. Check to make sure it operates as your documentation indicates.

4. Test. Test. Test. Remove the element from the network. Does the network fail as expected? If not, figure out why.

5. After you feel good about the network element, move on to the next one.

6. Lather, rinse, repeat.

Just move carefully and methodically. You'll eventually have things cleared up.
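Steps 3 and 4 can be partly scripted: a rough sketch that compares what your documentation says should be listening against what actually answers (the hosts and ports here are invented):

```python
import socket

# Hypothetical inventory: what the documentation says should be
# listening, and where.
DOCUMENTED = {
    "web": ("shop.example.com", 443),
    "db":  ("db1.example.com", 3306),
}

def check_service(host, port, timeout=3):
    """True if something accepts a TCP connection at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def audit(services):
    """Documentation vs. reality: anything unreachable is either
    down or mis-documented, and both are worth knowing about."""
    return {name: check_service(h, p) for name, (h, p) in services.items()}
```

Run it before and after pulling an element out of the network; the diff of the two results tells you whether the failure matched your expectations.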

inherited it mess (0)

Anonymous Coward | more than 2 years ago | (#38281534)

Welcome to the real world. The fact that you are the only one who even appears to be concerned about the situation should be your first and only clue as to what your employer considers to be critical, essential, important, and, last but not least, profitable. Are you sure that they are profitable?

What, where, why... (5, Informative)

ScottyLad (44798) | more than 2 years ago | (#38281544)

I've spent the best part of my career undertaking tasks like this (as an external consultant), with my average time on an assignment lasting somewhere between 18 months and 3 years.

My aim on every project is to make myself obsolete - in that I try to get documentation up to a point where a suitably qualified individual could come in, read the documentation, and work the rest out for themselves.

My primary objectives are to implement some form of inventory control to document the what / where / why...

  • What - What have you got (servers, software, services, contracts, operating systems, databases, users)
  • Where - Where is it - where are your servers, what machine is this software licence running on?
  • Why - What is the Business Justification for this service - what is the Business Impact if this database stopped running tomorrow?
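A what/where/why register can start as something as simple as a list of records; a sketch (the field names and entries are my own invention, not any particular tool's):

```python
# Minimal what/where/why inventory -- grow fields as needed.
INVENTORY = [
    {"what":  "MySQL 5.1, orders database",
     "where": "srv-db01, rack 2",
     "why":   "order processing stops dead without it"},
    {"what":  "Exim mail relay",
     "where": "srv-mail01, rack 1",
     "why":   "order confirmations delayed; degraded but survivable"},
]

def impact_of(items, keyword):
    """Everything whose business justification mentions keyword --
    e.g. 'stops' to shortlist the hard outage risks first."""
    return [item["what"] for item in items if keyword in item["why"]]
```

Even this much, kept in revision control, answers the successor's first questions; a wiki or CMDB can come later.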

Once you've got to that stage, then you're ready to get in to the real technical details. Remember that you are pitching your documentation to your successor, or to some imaginary "suitably qualified individual", so documenting what a system does and why is a higher priority than commenting every line of code.

It is possible to do with one person, depending on the size of the organisation, and it can be particularly rewarding to do on your own; in a small business you often find some of the users have a good understanding of some of the systems, or are keen to learn.

You stated in your post that you've assumed the role of programmer and sole IT personnel - which means you need to learn to think like a manager as well as a techie (which is harder than most people imagine!). Once you learn to focus on the business priorities, you'll understand where to begin with the technical detail, and what level of documentation is required.

Start with the basics (0)

Anonymous Coward | more than 2 years ago | (#38281558)

Start at the basics of the network.

Switches / routers / firewalls in good condition? How are the logs?
Then move on to the server health. Event viewer good? Whatever server management hardware they have is happy? Warranties, all that stuff?

Just work your way down the path.

Speaking as someone in a similar situation (2)

bravecanadian (638315) | more than 2 years ago | (#38281562)

The situation being understaffed and underfunded but expected to keep everything working... my advice is get out while you can.

It just isn't worth it. The reason why the systems are all patch and duct tape is because they think cheap is good management - and the longer you keep it running the more it proves it to them.

And hey, their new boat they bought with their bonus for keeping expenses down is awesome!

Start over... slowly (5, Insightful)

rabenja (919226) | more than 2 years ago | (#38281580)

I was in much the same position 12 years ago at this company. I am now CIO with 7 people on my team with several business partners to help manage the infrastructure. My advice for what it is worth:
  • Take time every day to assess and analyze the bigger picture before allowing yourself to get drawn into the details.
  • Look at the entire system from a risk mitigation perspective. What areas are most likely to cause "meltdown"? Spend the most effort there.
  • What are incremental changes that can be made that improve the overall risk picture? Focus on the biggest bang for the buck.
  • Defer anything that works well enough for the time being.
  • Avoid big-bang solutions unless they can be contained and tested well, with the capability of rolling back.
  • Get help where necessary.

Upper Management Buy-In (1)

cogeek (2425448) | more than 2 years ago | (#38281590)

First thing I'd do would be to document the issues that you've found. Talk with your immediate supervisor, explain the issues and the plan that you've come up with to address it. Without upper management buy-in, you're doomed from the start. Look for free/low-cost management utilities out there. Prioritize the issues you've found and start tackling them one at a time. If you make your supervisor aware of the issues and provide an overview of how you plan to deal with them, they'll be a lot more understanding if something does break in the meantime than if it comes as a total surprise.

How big is the IT system (1)

Murdoch5 (1563847) | more than 2 years ago | (#38281594)

Explain to management that the total fruit cake in the job before you built the system in a state of suspended failure, and one wrong or random move will bring everything to a halt.

Once you have their attention, just start fresh: rebuild the most broken part of the network on a new system and slowly rebuild and re-factor from there. It will be a lot of work, but if done carefully it will save the current mess you're stuck with.

Plan out restructure (1)

Keiichi25 (2520526) | more than 2 years ago | (#38281600)

First thing I would recommend is to plan out a restructure and rebuild of the setup. List off the critical needs and why they need to be covered. The problem most companies do not understand about IT is that in the process of cost cutting, the critical structures your previous guy was forced to skimp on will cost the company in the long run due to you trying to maintain as much as you can. Underline the need for the restructure to avoid meltdown. Upper management needs to understand that while you can try to support with minimal costs, the catastrophic repercussions come up when you have no fallback due to cost cutting, and the rebuild of a system back to its level of functionality takes days once new hardware arrives. Expect a minimum of 2 days to get systems back to functionality, and a dead stop to any other support while the critical systems are being rebuilt. As someone else mentioned, hiring additional support will help, but it will not help in the situation of a production-level dead stop due to critical systems not having redundancy or planned upgradability. Lastly, underline the necessity of not cutting back on maintenance. Running at bare minimum with no maintenance support will cause long-term cost overruns, and being the only person who has to maintain it will also cause long-term burnout and higher turnover of IT workers, which will cost them again in the long run.

Re:Plan out restructure (1)

zlives (2009072) | more than 2 years ago | (#38281708)

Memos cover your ass. Document your concerns, create a timeline, formulate a plan, and then implement.

Rebuild (1)

goathumper (1284632) | more than 2 years ago | (#38281640)

The truth is if it's that fragile, then recovery or repair are not options because you never know when you'll be done. Your best strategy is to rebuild. Organize the rebuild jobs from smallest (simplest, or least-complex) to biggest, and start from the smaller ones.

Importantly, you need to understand what your infrastructure does and why (which you claim you're already trying to do). However, the most critical point is that your superiors understand what you're up against and the risks they bite into if they choose to not go forward with the rebuild(s).

Once you understand what it is you need to rebuild, then you can do it properly: document the strategy to be followed (and incredibly important is that you document the key reasoning points behind the decision process), and plan out the implementation. If your superiors find that it consumes too much of your time, try to talk them into hiring (one? two?) more folks to help you hold the fort while the rebuilds are in progress so the day-to-day isn't left in the lurch. I had to go through this type of a situation recently and the end result of the rebuilds was that the previously inevitable downtime went away almost completely (only ISP outages were an issue). Deployment of new servers was cut down by 95%, and tons and tons of other benefits. Biggest of all: by the time I was done, everything essentially ran itself and even on the end-user support things were almost automated (granted, 99% of my audience were tech-savvy so they didn't need much help anyway). 95%+ of my time was spent just scouring logs and servers to ensure everything was running smoothly (which it was).

Then again, the key point was selling my upper management on the fact that my predecessors had done such a lousy job of setting everything up that trying to fix it was more expensive than a from-scratch rebuild, and that they were one fly's fart away from a catastrophe. You don't need to scare them shitless, just point out where they are and what they're up against if a rebuild isn't even done (even rebuild of only SOME of the systems can make a huge difference). Make sure it's clearly stated in writing (a "big" e-mail explaining the situation clearly to get the ball rolling usually takes care of that).

Key thing: DO NOT try to fix or recover the old stuff - if it's really as messed up as you suggest, you will consume comparable amounts of time to a rebuild, with none of the benefits and the added risk that you didn't fix all the problems because you couldn't spot some of them.

One other thing that served me well in terms of plotting my strategy: take the approach that I'm building something and going to be fired the day I'm done, and whatever I build needs to be inheritable and clearly understandable by my potential successors. This angle will encourage you to keep it simple, stupid, well documented, and easy to maintain/audit. In the end, this is why your predecessors sucked: they didn't think they'd eventually (be) move(d) on - but in IT, that's the one constant: staff rotation.

Inventory and Asset Management (0)

Anonymous Coward | more than 2 years ago | (#38281646)

I have done this for a lot of companies that have been sold, gone out on their own, or been taken over, leaving a ton of IT stuff that one person needs to sort out. Get a list going of every network device, info on it, and what it is running. Once you accumulate that, you can get its version and age, take it to the relevant groups, and determine its critical need in the company. Sometimes things change, and a server that was once critical and that you would think needs to be upgraded or replaced can be decommissioned or moved to another cluster or server. Once you break up your equipment into groups and critical risk, you can then plan upgrades with the capital you have each year and possibly support contracts.

First things first. (0)

Anonymous Coward | more than 2 years ago | (#38281652)

This is a delicate balancing act, because by seeking out and acknowledging the problems you are essentially taking ownership of them.

The first thing you should do is let your supervisor know what you are planning to do and get a commitment from them for dedicated time to fix the problems. This is essential.

If you don't get this time commitment, you need to dial back your eager beaverness. Keep letting them know that the audit needs to be done and give a few examples of issues that need to be corrected. When you are working on other tasks, mention that this would be a lot easier if X was fixed, but that it needs to be fixed from the ground up, and again, push for time.

Otherwise what is going to happen is you are going to basically be building yourself a pile of work that is now deemed critical (especially in the event that something horrible does break), that your boss considers you responsible for, but no spare time to fix it. "Oh, just fix things as you see them!" they will say, which when some major infrastructure component and supporting services needs to be rebuilt, is completely impossible. Now instead of being the hero that helps them recover from the jerk that was there before you, you are a scoundrel who did not save them in time.

Get the time commitment for the full scope of the work that needs to be done. I'm telling you, this is important, because when it comes time for someone to take the blame, it isn't going to be your boss.

Backups... (1)

David_Hart (1184661) | more than 2 years ago | (#38281662)

As has been mentioned, begin by making sure that you have backups of EVERYTHING.

1. Backups: Backup, perform test restores to VMs, fix any backup issues, rinse, repeat. Make sure to examine backup logs every day for the first month or so, and at least once a week thereafter.

2. Monitoring: Implement basic monitoring, including your backup system.

3. Infrastructure: Use the monitoring to fix any infrastructure issues such as overloaded servers (high CPU, memory), overloaded network uplinks or slowdowns (high bandwidth usage, incorrect speed and duplex settings), etc.

4. Applications: Use the monitoring to find application issues. Some may go away as a result of fixing infrastructure issues. Others will require support calls to vendors.
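For step 2, even a few lines of stdlib Python run from cron beats no monitoring at all; a sketch (the 90% threshold is arbitrary):

```python
import shutil

def disk_alerts(paths, threshold=0.90):
    """Return the mount points whose usage is at or above threshold.
    Run it from cron and mail yourself the result; graduate to
    munin/nagios once the fires are out."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.used / usage.total >= threshold:
            alerts.append(path)
    return alerts
```

The same loop shape extends to whatever else you can cheaply measure (load average, backup-log age, certificate expiry).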

Me too (4, Interesting)

weave (48069) | more than 2 years ago | (#38281666)

I walked into a similar nightmare two years ago. Before I even took the job I assessed the situation and gave them a proposal for what needed to be done and a price estimate for the software and hardware. I told them I would not take the job unless they committed funds to support the function. I also warned them that there were numerous ticking time bombs and I'd defuse them as fast as possible, but there was no magic fix, it would take some time, and they could still have a disaster.

I then convinced them to only hire me part-time and to also hire a part-time desktop support person for a few reasons including they don't want to pay me to do that and having two IT people at least gives you some continuity. Even if the desktop support guy doesn't know the high-end stuff, if I leave the desktop person can still guide the new person and save them a lot of time I never got.

My line of attack was:

  1. Back up data. Wasn't easy. They had old cartridge tape drives that were problematic. I ended up getting cheap TB externals to at least make mirrored copies of things. But at least if there was a disaster, I'd have their data safe somewhere -- even if it took me weeks to reconstruct systems to use it.
  2. Secure data. Everything was wide open. All domain users WERE DOMAIN ADMINISTRATORS. Locking that down was a pain. An understanding of what would be impacted ahead of time would have taken months, so I didn't tell anyone what access they had, then started removing people from domain admins a few at a time and waited to hear what broke, then fixed access issues. Not user friendly, but getting that under control fast was necessary.
  3. Renovated room with servers in it (that were 5+ year old deskside servers) so as to accommodate a rack with proper A/C flow, electrical feed, and physical security.
  4. Had them throw ~$50k into a virtual infrastructure and SAN, then virtualized all their old deskside servers until I could migrate the apps on them to fresh OS installs. Used vSphere's DRS product to back up the OS images and data to another system I had them buy for their other site (thankfully not too far away and connected by fiber).
  5. Identifying all in-house written programs and finding turnkey solutions to them, preferably cloud-based to reduce their dependency on in-house IT staff in future.
  6. Documenting everything as best I can as I go.

Getting back to original point, a one-person IT shop is suicide. Them having a two person part-time crew is better because if one leaves, at least the other can provide some sort of continuity -- and that happened already. The fairly young guy I hired for desktop support two years ago died last month :-(

how do you as a single person? (1)

nimbius (983462) | more than 2 years ago | (#38281684)

It's simple: you can't. You can, however, make a resounding case to your employers that you will need more help. Learn more about the business, how it works, and relate your infrastructure to their bottom line in order to secure extra funding and more hands. Management is tasked with ensuring that you as an engineer have everything you need to do the job, and if your job scope has grown then so too must your resources.

Do not try to handle the entirety of the infrastructure on your own: help desk, development, and sysadmin. I pulled that plate-balancing kind of act for the first three years of my career and it amounted to a thanklessly low-paying job, a long commute, and an unrelenting amount of stress. If there are band-aids everywhere, it's because the last guy couldn't communicate the things he really needed (servers, cooling, switches, a real lunch break with actual hot food).

Get help (1)

iB1 (837987) | more than 2 years ago | (#38281698)

If the systems are really that close to collapsing, then you know as well as I do that it'll happen at the worst possible moment, and you will get all of the blame for it.

Talk to your boss ASAP and highlight where the issues are and explain to him in monetary terms what will happen if the system screws up / how much time will be lost.

Push to see if you can afford to get someone else hired - even if it's a junior network engineer. You need to share the pain before it consumes you.

Define your goals. (2)

scamper_22 (1073470) | more than 2 years ago | (#38281734)

The first step is to define your goals. What do you want out of this?

1. a job
2. learning new skills
3. leadership
4. a chance to grow in the company

If you are the sole IT/programmer person, this is a company in dire need of management with a clue about IT. You could be that change and end up being the manager of IT for this company. You would have to work your butt off, fixing things, dealing with budgets, and hiring staff. Can you deal with upper management to accomplish everything? That's up to you to decide.

What I won't recommend is killing yourself for a company that is unwilling to learn from its mistakes and do it right. In that case, just treat it as a good learning opportunity, but don't kill yourself. They won't always be able to hire a superhero to come in and keep things running. Or if they do, it will be a well-paid consultant, and they will quickly learn their lesson about how much it costs.

There is a reason this company has such poor IT systems. You could end up being the IT guy in a long line of IT idiots.

If you have to "ask slashdot" (1)

Anonymous Coward | more than 2 years ago | (#38281752)

It sounds like you're already fucked.

Find another job (0)

Anonymous Coward | more than 2 years ago | (#38281770)

You're the sole programmer and IT guy for a "thriving" company? That's not thriving; that's life support. Find another company and let nature take its course with these guys.
