The following is an article recently published about clustering and recovering from a failure:
I would like to point out a few things. First go read the article (leave it open, so you can flip back to it below). Then come back here for my review.
First mistake – Running an Active/Active (A/A) cluster is a very bad idea. Period! Since the cluster is on Windows Server 2003, a third node that can handle a failure of either Exchange or SQL should be added. As is, this implementation is not worthy of mission-critical work. The biggest issue here would be memory fragmentation. Over time, Windows learns how to page and react to the applications it is running, so when one A/A server fails and the survivor suddenly has to run both workloads, it takes a very big performance hit.
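To put rough numbers on that performance hit, here is a minimal sketch. It looks only at raw load, not at the memory fragmentation issue, and every name and figure in it is made up for illustration (loads are percentages of one node's capacity); nothing here comes from the article itself.

```python
def peak_load_after_failure(workloads, failed_node, passive_nodes=0):
    """Return the highest load (percent of one node's capacity) on any
    node that is still running after `failed_node` goes down.

    workloads: dict of node name -> load in percent (hypothetical numbers).
    passive_nodes: number of idle spare nodes available for failover.
    """
    survivors = {n: load for n, load in workloads.items() if n != failed_node}
    orphaned = workloads[failed_node]

    if passive_nodes > 0:
        # An idle spare picks up the orphaned workload by itself.
        return max(list(survivors.values()) + [orphaned])

    # No spare: assume the least loaded survivor absorbs the orphaned workload.
    target = min(survivors, key=survivors.get)
    survivors[target] += orphaned
    return max(survivors.values())


# Hypothetical A/A pair: one node runs Exchange, the other runs SQL.
cluster = {"exchange_node": 60, "sql_node": 55}

aa_peak = peak_load_after_failure(cluster, "sql_node")
spare_peak = peak_load_after_failure(cluster, "sql_node", passive_nodes=1)
print(aa_peak, spare_peak)
```

With these invented loads, the two-node A/A cluster ends up with one server asked to carry more than it has, while the same failure with a passive third node leaves every server at its normal load.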
Second mistake – Much like the first mistake: installing the Exchange and SQL bits (binaries) on the same machine. Yes, I know that Microsoft does this with Small Business Server, but they tweak things so the two work nicely together. Never do this for any reason. With a third node, both would be installed on it, but only one would be running at a time.
Third mistake – Going into a major (or minor) outage without backups and a fully tested restore procedure. On page 2, the article states that the current backups were not getting everything, and they only figured this out during this fire drill. Shame on them for not confirming exactly what was required or, heaven forbid, actually testing a restore.
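Checking that a backup is "getting everything" does not have to wait for a fire drill. The sketch below compares a list of required components against what a backup job reports it captured. The component names and the manifest are placeholders I made up for illustration; substitute whatever your own environment actually requires and however your backup tool reports its contents.

```python
def missing_components(required, backed_up):
    """Return the required components absent from the backup, sorted by name."""
    return sorted(set(required) - set(backed_up))


# Hypothetical checklist, not an official list.
required = ["system_state", "quorum", "sql_databases", "exchange_stores"]

# Hypothetical manifest, as if read from last night's backup job report.
last_backup = ["system_state", "sql_databases"]

gaps = missing_components(required, last_backup)
if gaps:
    print("Backup is incomplete, missing:", ", ".join(gaps))
else:
    print("All required components are in the backup.")
```

A check like this only proves the pieces were captured. It is not a substitute for actually performing a test restore.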
Ka . . . . boom? What does that mean? What really happened? What lesson was learned from the Ka . . . . boom? Can you prevent it from happening again? Sure, you can restore/rebuild, but can you prevent it? You should always learn something from a disaster or you are doomed to repeat it.
Option 1 – Several freely available tools were not mentioned. http://www.microsoft.com/technet/prodtechnol/windowsserver2003/library/TechRef/7e782055-450b-46dd-a0a4-164eebf2ae18.mspx lists one of my favorites – the Server Cluster Recovery Utility.
10 hours to get SQL back up and running? What the heck!!! That would be an install and attach method. If you have a proper backup (and a tested restore), it is pretty easy: restore SQL and the System State and use it :) The whole process will not take that long!
Again I feel it’s important to note that with the Cluster Recovery Utility you can get the signature back again. The process takes seconds.
4th paragraph – Having a dead cluster database on one node DOES NOT mean it is dead on the others. The information is stored in the registry of each node and within the quorum. I have seen the quorum corrupted, but I have not seen the cluster hive of the registry corrupted. I have only seen it become out of date (as in, a server was turned off when something happened).
You can rebuild the quorum. http://www.microsoft.com/technet/prodtechnol/windowsserver2003/library/ServerHelp/c9fe11a9-97c0-496a-9223-ed4b77786368.mspx lists the 4 areas that need to be backed up. You do not have to use Automated System Recovery (ASR) to back up and restore your cluster. It is nice and works great, but it is not the only way. Here is a third-party article on recovery with normal System State backups (assuming total hardware failure): http://seer.support.veritas.com/docs/262709.htm.