Anatomy of rebuilding a Windows 2003 server
So, I get a gig trying to rebuild a Windows 2003 server that has recently had problems with such trivial tasks as adding machines to the network, and booting. It's interesting how the progression of events leads to a better and better understanding of what's going on, and how things actually work. I'd like to share the experience:
- Find that the system fails to boot reliably, and can't do anything that involves COM applications. Booting into Linux to diagnose the hardware shows a clear problem with the SATA-RAID card losing interrupts: card interoperability problems. Booting to a recovery console, and watching it blue screen in the driver every time chkdsk hits a certain spot, confirms this.
- Breaking the mirror, and dumping the disk to a regular IDE disk shows that it has tons of bad sectors. Card firmware has poor handling of partial disk loss.
- Using the other disk in the mirror to create the PATA disk works like a charm. Booting off of this disk brings the system up reliably, but the COM errors are still there. Disk corruption probably led to the COM errors.
- Using MS's instructions to rebuild a COM catalog (the less destructive way) causes the system to blue screen on boot. COM is used in the boot process by something.
- I used the in-place upgrade facility (AKA Repair) of 2K3 Server to fix the COM error, to no avail: phase 2 of the install just resulted in more blue screening. The upgrade is not thorough in fixing COM, probably because it would have to reinstall all the previously installed apps. I'd also like to add that this added an hour to the rebuild time.
- You now have to boot to a recovery console in order to go the full-out, destructive COM wipe route in hopes of making the system boot, which could probably have been done before you tried re-installing Windows. Windows makes me cry, sometimes.
- After going to a recovery console, and renaming the c:\windows\registration directory, the system continues to phase 2 of the upgrade/repair.
- "Approximately 12 minutes left" means suck my left testicle, luser!
- You get real punchy at 5am.
Anyway, this should get finished up now. There's no way I'm shooting for any more than "the way it was before", but I think it'll come up OK, once it's done registering components.
Ya, 12 minutes... my ass.
Maintenance gone sour
When I first signed on to a job as sys admin, I thought I'd be in for the glamor-filled world of big iron, and big monitors, and I was. I was cleaning them, and pushing them around, and occasionally found myself with the ability to log into them. How times have changed.
I find myself with the ability to wipe out entire companies' systems within 3 mouse clicks. Hell, the servers have mice! However, it still amazes me that with all those niceties, my old rules of sys adminning still come into play:
- Anything that uses an adaptor has a 50% chance of turning into a cluster fuck. If you find yourself needing to use adaptors, there is a very good chance that not all possibilities and avenues were thoroughly explored. Consequently, the first adaptor is just the figurative air horn of the 18 wheeler recklessly careening towards you.
As an example, today I wound up having to enlarge a Windows 2000 boot disk on a machine from the days when 18GB was considered generous. The plan was easy: mirror the disk using Linux and 'dd', then use Partition Magic to move the stuff around on a larger disk. When the new disk arrived (the one Dell had chosen as the replacement part for this machine), I discovered that the SCA 80 pin connector on the HD failed to mate with the USCSI 68 pin HD connector on the ribbon cable. Enter the SCA -> 68 pin adaptor.
At this point, I should have blocked off the 8 hours, but I hoped. Alas, the problems mounted. Partition Magic does not run on 2K Server... The adaptor increased the length of the cable run just enough to require a new cable... I couldn't guarantee success from a backup/restore over a new install of Windows... the CD-ROM wouldn't read my rescue CD... you get the idea.
Since this was supposed to be a 'by the numbers' move, it wasn't well planned.
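For the record, the mirroring half of that plan is the boring part. Here's a minimal sketch of what it amounts to, written in Python rather than as a real 'dd' invocation: copy the source to the target in chunks, then read back only as many bytes as were copied (the target disk is larger) and compare checksums. The name clone_and_verify and the chunk size are made up for illustration; this is a sketch under those assumptions, not the command that was actually used.

```python
import hashlib


def clone_and_verify(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src onto dst in chunks, then re-read dst and compare SHA-256.

    Hypothetical helper for illustration. Returns (bytes_copied, matches).
    """
    src_hash = hashlib.sha256()
    copied = 0

    # Pass 1: stream the source onto the target, hashing as we go.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            src_hash.update(chunk)
            dst.write(chunk)
            copied += len(chunk)

    # Pass 2: hash only the first `copied` bytes of the target, since a
    # larger target disk has extra space past the end of the copy.
    dst_hash = hashlib.sha256()
    remaining = copied
    with open(dst_path, "rb") as dst:
        while remaining > 0:
            chunk = dst.read(min(chunk_size, remaining))
            if not chunk:
                break
            dst_hash.update(chunk)
            remaining -= len(chunk)

    return copied, src_hash.hexdigest() == dst_hash.hexdigest()
```

On a real disk the read-back can be served from cache rather than from the drive itself, so treat a match as a sanity check, not proof. None of this helps when the connector doesn't fit.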
- If your task list/checklist isn't fine grained, you won't pay attention to it. When doing a maintenance window, your checklist had better be consulted every 10-15 minutes, or you'll just forget about it, and do what's in your head. Problem is, what's in your head at 2am in a loud co-lo isn't as well thought out as what was in your team's head the week before at 11am with all the info available. You were probably more sober at 11am, too.
- Document the custom stuff, or replace it with something off-the-shelf. Sure, your 5-way replicated cluster with management scripts is functional, and really clever. Too bad you'll forget how to use it in a year when it breaks, or the next guy will spend his time reading code in order to figure it out, or just toss it.
I once got paid to take a bunch of machines, and turn them into various different servers for a company. My idea was to make a standard install, with standard packages on all the systems, and have different configuration directories for all the different roles, then have all those directories replicated to all the machines for maximum redundancy, and the ability to take any machine, and quickly switch it into any role.
Worked great. No docs. The next guy took a look, scratched his head, shrugged, and replaced it with his own, undocumented system. I'm sure the guy after that did the same thing. I'm glad the company never figured out how much money they blew on that project.
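To make the idea concrete, here's a minimal sketch of the role-switching part, assuming a layout the original system may or may not have used: one directory per role under a common parent, and a symlink named 'active' pointing at the current role. The switch_role function, the 'active' link name, and the directory layout are all hypothetical, and the replication of those directories between machines is left out entirely.

```python
import os


def switch_role(roles_dir, role, link_name="active"):
    """Point roles_dir/<link_name> at roles_dir/<role>.

    Hypothetical layout: every machine carries every role's config
    directory, and services read their config through the symlink.
    Returns the resolved path of the newly active role directory.
    """
    if role == link_name:
        raise ValueError("role name collides with the link name")
    target = os.path.join(roles_dir, role)
    if not os.path.isdir(target):
        raise ValueError(f"unknown role: {role!r}")

    link = os.path.join(roles_dir, link_name)
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)  # clean up a leftover from an interrupted switch

    # Create the new link under a temporary name, then rename it over the
    # old one. On POSIX the rename replaces the old link in a single step.
    os.symlink(role, tmp)  # relative link, resolved inside roles_dir
    os.replace(tmp, link)
    return os.path.realpath(link)
```

With something like this, turning a web box into a mail box is switch_role("/etc/roles", "mail") plus a service restart. Twenty lines of code, and it still needs a README, which was the whole point.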
Well, there's more, but I won't bore you with the rest.
So here I am on a holiday, currently on Plan E, as A, B, and C didn't work, D was so close, and yet not so much, and Plan E is my only hope before packing it in, and trying again....