Looking at it with hindsight, I should have reconsidered.
I should have invented time travel, so that I could go back and warn myself. ‘Remember when the floppy disk for the RAID drivers failed, and you had a bad feeling about replacing the Batch Executive… well, you were right’.
But I didn’t, so I sat in the server room, surrounded by the debris of the old server, unable to get the new one to work. And as I got phone calls every 5 minutes (often with a second line ringing while I was talking), I had the gut-wrenching certainty that I was really, really in trouble if the scheduled production run could not start on time… Yea, I met the fifth horseman, and his name was despair.
The high availability server
The most critical server on our network is the Batch Executive. It is the master scheduler in charge of everything batch-related that happens on the network. It loads software into controllers, resolves equipment conflicts, prompts operators, … And it has a critical design flaw: it loses state when the application terminates unexpectedly. It maintains batch journals, but these are only up to date if the application shuts down properly. Because of that, the vendor recommended running this mission-critical piece of software on a high availability server.
This HA server has 2 of everything: CPUs, memory banks, disks, Ethernet cards, … It uses a special motherboard that makes sure all redundant components execute in lockstep, and it examines the results and progress every clock cycle. If the two components disagree, it will try to determine which one is right and shut the other one down. And all along it will keep on running. Theoretically, you could fire a bullet through the thing and it would keep running, as long as one of each component survived.
Unfortunately, this machine is a bitch to set up and maintain. But if that was the only problem, we could live with it.
What is worse is that the machine vendor is being hit hard by the rise of virtualization and clustering. So they closed their office in the Benelux area and left us orphaned. We should be able to get support through the French or German office, but all our requests for maintenance and disk replacement fell on deaf ears. The vendor who recommended that HA solution had the same problem.
On top of that, the Stratus had caused a number of problems already, and if we didn’t replace it soon, there was a chance that it would act up at the wrong time and possibly ruin a product run, with grave consequences.
Originally we were scheduled to replace it on Friday, and we would have had the time to prepare everything in advance. But due to various circumstances, it boiled down to ‘now or not’. Had I chosen ‘not’, we would have had to rely on the HA server for 2 more months, something I didn’t have faith in. And because we had detailed written procedures in place to perform the replacement, I chose ‘now’.
I was going to replace the HA machine with a DELL 2900, 2×4 core, 4 GB RAM, and 15K rpm SAS RAID5 disks. This is the standard machine for all our servers, and we are very satisfied with them. We’ve never had any problems with those machines, other than a failed disk. We have to buy those servers through the software vendor, because they qualify only certain servers, with specific BIOS versions, NICs, and drivers.
Of course they charge a premium for this, on the justification that this way, you get a server that is guaranteed to work with their software (remember this).
The Windows 2003 install went well. We installed the drivers according to the procedures we had tested before. For the rest of the installation we needed the network, so it was time to install the server in the rack.
That is where we had the first problems. Installing or removing a DELL server from a rack takes 2 minutes, because everything slides and locks without screws. The designers of the HA server and rails obviously did not believe in ease of installation. Removing the blades was easy enough. But the rails… not so much. The frame was bolted into the rack with lots of Phillips-head bolts. When we finally got rid of them, we discovered that the frame did not fit through the 19″ opening anymore. So we had to disassemble the frame itself within the rack, and remove it in pieces.