Clean Sheet Day

So there I was… cold sweat running down my left cheek, tingling sensation in my lips, clammy hands, and freezing fingertips. A dim screen in front of me displayed details of the server startup procedures. There were many green “OK” messages scrolling up almost too quickly to read as the programs started up. Suddenly a red “FAILED” slid past, then another, and another, and then the words that would make any sysadmin forget about all the sand in his shoes… kernel panic.

Servers Front

Server maintenance needs to be an integral part of any system administrator’s life. It means taking down the services for as brief a time as possible to apply fixes, updates, changes, and miscellaneous tasks. I do this much less often than I should, but soon … that’s going to change.

I’m Martin Lehner and I do systems programming and system administration for the Center for Teaching and Learning at MCC. You might have seen my visage on obscure web pages or bulletin boards alongside such fluff as the marshmallow ads or classified posts for mid-80s computer hardware, but then again, maybe not. I primarily write Java code and try my hardest to keep the online services in the CTL up and running so nobody knows I exist. If people know who I am, then something is probably horribly wrong.

No, really ;]

Friday, March 2nd

1:30pm

Paul Hickey (my mostly-windows sysadmin buddy) and I were setting up to begin our maintenance on the CTL servers. I grabbed a pad of paper and a pen to write down fun scribbles and took note of the tasks ahead of me for the next (hopefully pleasant) six hours or so…

  • patches for the production server (apps.mc.maricopa.edu, AKA: ctl.mc, keeptoo.mc, and dltutorials.mc, Master War chief, Raider of the Seven Dimensions, etc…).
  • update secure shell (ssh) and secure sockets layer (ssl) on ALL *nix (linux/unix/solaris) machines.
  • rewire the power cables to balance the servers between two backup batteries.
  • rewire all the network cables for cleaner organization (pictures later)
  • rewire kvm cables for cleaner organization.
  • apply patches to all windows servers.
  • test as much as possible after maintenance to make sure things work.

3:??pm

I spent the first hour or two installing patches on apps and upgrading ssh/ssl on the *nix machines. Apps is the toughest since I have to be extremely careful with it. Changing little things can affect a lot. First I ran a program which told me which updates were available. Then I went through and applied each one, trying to make sure they didn’t conflict with anything. After each set I had to check configuration files which may have changed to be sure nothing was lost during the updates. That took about an hour, with some dead time in there for letting the packages compile. You see… apps runs a distribution of linux called gentoo. It’s extremely configurable and fast, but that’s sometimes a curse in that it takes much more careful planning to keep it stable. When it works though, it’s a champion of speed and stability like you wouldn’t believe. Upgrading ssh and ssl was quick on all machines except prana (streaming media), which required me to upgrade some older libraries first. Installing ssh/ssl means downloading the packages, configuring them for my OS and hardware, compiling them (like compiling a computer program on any computer), and installing them via install scripts. At this point everything looked good, so I went on to the next thing.

4pm?

Lost track of time by that point. I remember something about cables? Oh yeah, shut down everything (no printing, no file shares, no website, no in-house tools, no machine logins, people are pretty much out of luck on accessing anything interesting that we host on our servers) which took twenty minutes or so by itself. Paul had applied the windows server patches before we did this. Next we unplugged ALL the cables from ALL the servers and untangled them all (see the picture below of the mess).

Server Mess Floor

Server Mess

After unplugging everything and untangling it, we started by carefully redoing the power cable connections. There was some debate among us at first about whether to account for rack sliding, but we decided against it. Rack mount servers are attached to the rack via rails which allow the server to be pulled out to be serviced without removing it completely from the rack. This requires some extra planning on cables though, because a server that slides out needs slack on the cables attached to it. We decided that since all repairs were done in our office with the server removed from the rack, it was more important to stick to clean cables. After the power cables came the kvm cables and then the network cables. That took us to nearly 6pm, and apart from my knees hurting I think it looked much better. We also decided to completely replace the network cables next week as they are WAAAAAY too long for being used on a rack mount like this. We ordered some new cable to make small 2 or 3ft network cables. That should improve the setup on network cables in the room significantly.

Server Cleaner

6pm-ish

I began turning servers back on. This is the moment of truth for me. My hands get clammy and I cross my fingers that we don’t have any hardware failures or bizarre OS failures when the servers start coming up. As Paul finished the network cables I slowly turned the machines back on. We did have a few where the network cable was plugged into the wrong port (totally my fault), but we quickly fixed them. Tally at this point:

  • Me: 4 Horrible-failures: 0

Then, as I was watching apps start up jboss (the application server which runs our registration system, softsense, and some other java tools), something caught my eye… it couldn’t resolve “apps.mc.maricopa.edu”. I leaned in over the console on the rack and rested my head against its cold metal frame to let out a long sigh. Here we go…

Luckily this one was a quick fix. The hosts file had some settings from back when I set the server up that hadn’t been completely removed. The hosts file is a local list of names and the addresses they map to, including the machine’s own name, and in this case, for some reason, the file had “sorcerer” in it instead of “apps”. After fixing this I got it working.
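For the curious, a leftover hostname like this is easy to catch with a few lines of Python. This is only a sketch under assumptions, not what was actually run that night: the function names are made up for illustration, and the sample contents (including the 10.0.0.5 address) are hypothetical stand-ins for what the file might have looked like. On a linux box the real file usually lives at /etc/hosts.

```python
def hostnames_in_hosts(text):
    """Map every name found in hosts-file text to the address on its line."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        address, aliases = fields[0], fields[1:]
        for name in aliases:
            # Keep the first address seen for a name, like most resolvers do.
            names.setdefault(name.lower(), address)
    return names


def missing_names(text, expected):
    """Return the expected names that do not appear anywhere in the text."""
    found = hostnames_in_hosts(text)
    return [name for name in expected if name.lower() not in found]


# Hypothetical file contents: the old name is still there, the new one is not.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
# left over from the initial setup (address is made up)
10.0.0.5    sorcerer
"""

print(missing_names(SAMPLE_HOSTS, ["localhost", "apps.mc.maricopa.edu"]))
```

Anything printed in that list is a name the machine cannot look up locally, which is roughly the situation jboss complained about.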

Monday, March 5th

Creeping doom

Server problems don’t tend to expel their cold viscera until you least expect it. This was the case today. A few problems were noticed…

  • registration not working
  • drake file shares not accessible
  • breeze failed to start
  • ctl blogs failed to work
  • keeptool failed to work

Jboss figured it would be fun to shut down some time after I left on the 2nd. Don’t know why, no errors in the log or anything. Fluke? Turned on now and it works fine. The drake file shares are totally my fault again, forgot to put the windows file sharing service in automatic startup. Don’t know what’s going on with breeze, I don’t directly administer that machine so I can’t comment on it much. Ask Jeff Anderson if you need to gripe about that one ;p. Blogs were a mystery to us so we just reinstalled the files and changed the database connection to “localhost” instead of “apps”. Could have something to do with the weird hostname startup issue that affected jboss earlier. Won’t ever know for sure I’m guessing, at least not until I restart it again in six months after forgetting all this. Keeptool is a maze of chaos in itself and will probably be down until the next version reaches me, which incidentally should be a week or two.
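A service that never started after a reboot, like the drake file shares, is the kind of thing a quick reachability check right after maintenance could flag. Here is a minimal sketch in Python, assuming each service listens on a TCP port. The function names and the CHECKS list are hypothetical, and the port numbers are assumptions (80 for a web server, 445 for windows file sharing), not a record of how our servers are set up. It would not have caught jboss, which went down some time after everything had come up fine.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and names that do not resolve.
        return False


def failed_checks(checks):
    """Return the labels of all (label, host, port) checks that did not connect."""
    return [label for label, host, port in checks if not port_open(host, port)]


# Hypothetical checklist; hosts and ports are assumptions for illustration.
CHECKS = [
    ("apps web", "apps.mc.maricopa.edu", 80),
    ("drake file shares", "drake", 445),
]
```

Calling failed_checks(CHECKS) once everything is powered back on would return the labels of whatever did not answer. An open port only shows that something is listening, so it is a first pass and not a replacement for actually trying registration or the blogs.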

Summary

So, there you have it… a letter opener. Seriously though, server maintenance this time was pretty tame. I’ve had MUCH worse days where I’ve had 10 people staring at me asking when this or that will be back online and the only thing I’ve got for them is a *shrug* and my assurance that I’m working on it.

<insert witty ending here>

I can hear the ocean in the distance,

Martin.

2 Comments

  1. Shelley Rodrigo says:

    Martin,
    I greatly appreciate you sharing this with us. Many of us have no idea what you do…the story, and pictures, help give us some perspective!
    Shelley

  2. […] and in honor of his style of getting things done under frustration at times, I ran his narrative blog post this week through the pirate talk translator: […]
