One of our servers was very sick. Terminally so. Let’s call him Bob for now.
Bob was one of the most important servers on the network. Without Bob, the automation engineers can’t automate, the calibration engineers can’t calibrate, the validation engineers can’t validate and the sysadmins… Well, the sysadmins were not too happy about it either, because Bob is their charge.
Only a month ago, Bob was leading the happy life of an application server whose only job was to be a terminal server. Automation engineers used Bob to work on the process control software from their own desks.
Bob liked it that way, and Bob was very happy.
The death of Bill
But then one fateful day, Bill died, and he did it rather messily. Bill was the long-term memory and master of all important configuration data of the network. In a production environment, Bill would have had an easy life, living out his retirement serving the occasional configuration change. But in a development environment in the last stages before production, Bill was having a rough time. Over 20 engineers were harassing Bill with file I/O through an outdated database.
Fetch this… No put it away… No I was here first… No you have to put this away first… No now you have to flush the whole database to the complete controller network and all operator stations…
Bill was already beyond retirement age, and his disks were none too fast. Bill’s disk IO queues were continuously at 70 or 80 IO items, sustained for tens of minutes on end. This is considered VERY bad.
On top of this, due to some special circumstance that could not be changed, Bill had to be rebooted at least thrice a day. Bill took longer and longer to reboot with each passing week.
We were already actively looking for a young, healthy and strong successor to Bill, but it was too late. That fateful day several weeks ago, Bill finally couldn’t take it anymore. He went to sleep and never woke up. His RAID set was completely shot.
To make matters worse, he had had several small brain infarcts in the weeks leading up to his demise, so his Active Directory was corrupted and too much had changed to use any of the backups.
The rise of Bob
Bob and Bill started their lives at the same time, as brothers. One of them led a life of relative comfort while the other was worked to death. But with Bill gone, Bob had to come to the rescue. We tossed the yoke on his shoulders and put him to work.
It took over a full day (24 hours) to bring him up, because with Bill’s RAID set gone and his Active Directory corrupted we had to do it the hard way: start from scratch and rebuild the network manually.
Eventually, Bob was plodding along as Bill had before him and life returned to normal.
Bob’s last stand, Ted to the rescue
However, Bob soon started coughing just as Bill had, mere weeks before Bill passed away. Yesterday he refused to wake up after his 6 AM reboot. After we allowed him some rest, unplugged from mains power, he managed to struggle upright and get moving once again, but it was clear that he was going the way of the dodo any day now.
Measures were taken, a great powwow was organized and, as luck would have it, Ted was delivered the very next day. Ted is a strong young lad with lots of everything, and a RAID set with 15k RPM SAS drives.
Ted came preconfigured by the vendor of the process control software, so we got a head start: the software was already installed and we could immediately begin moving data from Bob to Ted.
Unfortunately, at 15:00 we discovered a problem with the pre-configuration that was done by the vendor. I am not going to comment on what exactly, but it turned out that we had to go back to line 1, page 1 of the 14-page procedure.
To make matters even more interesting, there was a batch of new product in the plant that should not be disturbed under any circumstances. If that batch were lost, it would mean a major financial loss (lots of zeros) as well as a slip in the schedule to start production.
At 17:00 we had to bring Bob back online because something needed to be done to that batch for which Bob was critical. Bob had had a day of rest, and when we explained to him how important it was, he valiantly cried ‘Once more unto the breach!’ and kicked his hard drives into action for the last time.
Meanwhile we went ahead with importing all the data into Ted’s databases.
At 20:45 we got the go-ahead to relieve Bob of his immediate duties. Bob – proud to have fulfilled his duties until the very last moment – breathed a sigh of relief and gladly powered down.
Ted was brought online, and at first glance everything seemed fine. The final problem now was that each computer on the network (including all other servers) had to have its workstation configuration loaded again, because they all identified Bob by his SID. They didn’t know Ted yet.
I could have changed Ted’s SID to Bob’s, but the vendor does not support that. I didn’t fancy explaining to my boss why our vendor does not support us anymore, so we followed the procedure.
Mysterious Mr. X
At this point I should introduce Mr. X.
Mr. X has 2 of everything: 2 power supplies, 2 sets of 2 processors, 2 memory banks, 2 RAID sets, 2 sets of network interfaces, and a very special motherboard with redundant custom chipsets. All parts of the system work completely in lockstep. With each instruction that gets executed or memory address that is read or written, the chipset verifies that both redundant parts give equal results. If there is a problem, the faulty hardware gets disabled and can be replaced and synchronized at runtime without the operating system knowing anything about it.
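For the technically curious, the lockstep idea can be sketched in a few lines of Python. This is a toy model under my own assumptions, not how the real custom chipset works: the names `run_in_lockstep`, `healthy_unit` and `flaky_unit` are made up for illustration, and the "units" are plain functions rather than hardware.

```python
from typing import Callable, List, Optional, Tuple


def run_in_lockstep(
    unit_a: Callable[[int], int],
    unit_b: Callable[[int], int],
    instructions: List[int],
) -> Tuple[List[int], Optional[int]]:
    """Feed every instruction to both units and compare the results.

    Returns the agreed results so far, plus the index of the first
    instruction where the two units disagreed (None if they never did).
    """
    results: List[int] = []
    for index, instruction in enumerate(instructions):
        a = unit_a(instruction)
        b = unit_b(instruction)
        if a != b:
            # Divergence: stop trusting the pair at this point. Real hardware
            # would now work out which half is faulty and disable it.
            return results, index
        results.append(a)
    return results, None


def healthy_unit(x: int) -> int:
    """Hypothetical well-behaved unit: doubles its input."""
    return x * 2


def flaky_unit(x: int) -> int:
    """Hypothetical faulty unit: gives a wrong answer for input 3."""
    return -1 if x == 3 else x * 2


if __name__ == "__main__":
    print(run_in_lockstep(healthy_unit, healthy_unit, [1, 2, 3, 4]))
    print(run_in_lockstep(healthy_unit, flaky_unit, [1, 2, 3, 4]))
```

Note that with only two units a mismatch tells you that something is wrong, not which half is wrong; how Mr. X decides which part to disable is something this sketch does not attempt to model.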
The reason for all this reliability is that Mr. X is the quintessential mastermind that pulls the strings and runs the entire show. Mr. X is a very very very expensive high availability server that controls the entire factory like a general commands his troops.
If Bill, Bob and Ted are the information store of the network, Mr. X is like the brain with the central nervous system.
Batch administrators upload recipes for high level steps in the production process to Mr. X, who then uses those recipes to tell the controllers what to do, when to do it and how to do it. If Mr. X goes down, so will the process. If a batch gets aborted unexpectedly it will have to be scrapped.
Given that some batches cost a large amount of money (at least one zero more than Mr. X itself), the price you have to pay for someone with Mr. X’s capabilities suddenly becomes insignificant.
Anyway, Mr. X presented us with a serious problem during our server swap-out procedure. You see, Mr. X is currently working on a very important batch. One that must not be lost under any circumstance whatsoever.
Technically we’d have to reconfigure Mr. X, because the rest of the network can’t see him and no new software can be downloaded to him.
On the other hand, Mr. X failed to notice the trick we pulled off. The important batch is still running and being processed in little pieces, despite the fact that the rest of the network (including Ted) is deaf, dumb and blind to anything coming from him. We had actually feared the batch was lost, but Mr. X’s capabilities exceeded even our expectations.
Running the configuration utility to bring Mr. X back into the fold might distract him enough (or reset his working data) to lose the batch. Neither I nor my colleague fancied getting yelled at by the top brass, or worse, so we unanimously decided not to do that for now. If it’s working, don’t fix it. Especially if so much is at stake.
This caused a side effect, because none of Mr. X’s minions on the network was able to tell him that there was still a license dongle on the network. Because of this, Mr. X (being the ultimate professional mastermind) informed us that we’d better renew his license, or else…
To work around this we spared one of his worker drones on the network the reconfiguration and shoved a second licensing dongle in its back. You see, Mr. X is far too busy or important to occupy himself with licensing drudgery. He is not equipped for it, so he needs a subordinate for that.
The end, for now
In any case, for now the network is running. The fact that it is not yet fully reconfigured means the system as a whole has a split personality, so I am pretty certain that this saga will get another chapter, but for now we are safe.
I discovered that the time management system gets confused if you leave work on a different day than you arrived. :-) Luckily I live close to work, because I was hammered after an 18-hour working day.
I guess those things come with the territory.