Got a call from a friend yesterday. He had some problems with his network which essentially revolved around Active Directory being messed up. The exact details are a little unclear, but the long and short of it was that around a week back, their main DC had some hardware problems, so they resolved it by transferring the system over to Hyper-V (not sure how they did this either). The AD problems continued and they dug deeper. Along the way (and I don't know where/when) they decided to DCPromo two of their non-virtualised DCs down to member servers (one of them was their Exchange server). Problems persisted. I was asked to look into it some more and found a few things. Netdiag was reporting all kinds of problems with DNS (which was AD integrated), and in the Directory Service event log we found just one error, which suggested that they were in a USN Rollback scenario. USN Rollback scenarios are discussed here http://support.microsoft.com/?kbid=875495 .
The USN (Update Sequence Number) is an internal number that allows domain controllers to track where they are at with respect to replication of Active Directory information. If a DC detects that it has rolled back then it will stop replicating with other DCs. It will also put the NETLOGON service into a paused state. It does this to protect the rest of the network. Ok so my friend, being a developer (and having a couple of developers with him), saw the NETLOGON service was paused, so what did they do? They wrote a script to restart the NETLOGON service so it would not stay paused. Sheesh, NEVER LET A DEVELOPER RUN YOUR NETWORK.
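To make the USN idea a bit more concrete, here is a toy Python sketch of why a rollback is so nasty. It is not how Active Directory is actually implemented (the real thing tracks more state, such as an invocation ID per database instance), and every name in it (DomainController, pull_from, restore_image) is made up for illustration. It only assumes that each DC numbers its own changes with an increasing USN and that each partner remembers the highest USN it has already pulled.

```python
from dataclasses import dataclass, field


@dataclass
class DomainController:
    """Toy model of a DC; all names here are hypothetical."""
    name: str
    usn: int = 0                                    # highest USN assigned locally
    changes: dict = field(default_factory=dict)     # usn -> description of change
    watermarks: dict = field(default_factory=dict)  # partner name -> highest USN pulled

    def write(self, description):
        # Every local change gets the next USN.
        self.usn += 1
        self.changes[self.usn] = description
        return self.usn

    def pull_from(self, partner):
        # Ask the partner only for changes above the remembered watermark.
        seen = self.watermarks.get(partner.name, 0)
        new = {u: c for u, c in partner.changes.items() if u > seen}
        if new:
            self.watermarks[partner.name] = max(new)
        return new

    def snapshot(self):
        # Stand-in for an image taken while the DC is live.
        return (self.usn, dict(self.changes))

    def restore_image(self, snap):
        # Stand-in for restoring that image: the USN counter goes backwards.
        self.usn, self.changes = snap[0], dict(snap[1])


dc1 = DomainController("DC1")
dc2 = DomainController("DC2")

dc1.write("create user alice")      # USN 1
dc1.write("create user bob")        # USN 2
image = dc1.snapshot()              # image taken at USN 2
dc1.write("reset alice password")   # USN 3
dc2.pull_from(dc1)                  # DC2 now remembers watermark 3 for DC1

dc1.restore_image(image)            # DC1 is back at USN 2

# The partner has already seen a higher USN than DC1 now claims to have.
detected = dc1.usn < dc2.watermarks["DC1"]

dc1.write("create user carol")      # reuses USN 3
dc1.write("create user dave")       # USN 4
missed = dc2.pull_from(dc1)         # only USN 4 is above the watermark

print("rollback detected:", detected)
print("DC2 received:", missed)
```

In this toy run DC2 never hears about carol, because her change reuses a USN that DC2 believes it already has. That kind of silent divergence is what the replication stop and the paused NETLOGON service are there to prevent, and why scripting NETLOGON back to life only hides the symptom.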
Ok, so the way to fix a USN Rollback is to dcpromo the affected server down and then back up to a domain controller OR restore a system state backup. Only problem was that this was their last DC, and I was uncertain of the last system state backup. It turned out that it was only done AFTER they moved the DC from physical into virtual. Ouch. Digging deeper, it turned out that they did some form of image backup on the physical system while it was live. I'm unsure of what tool was used, but doing something like this is NEVER a good idea in a multiple domain controller environment.
It's now a few days since I started writing this blog post and my friend has had to accept defeat. He finally bit the bullet and called MS for support. Their only response was as above: restore the AD from backup, which he didn't have a good copy of at all. He had to accept that he needed to rebuild his entire domain from scratch.
What caused it? Well I suspect that the way he moved the physical DC into a virtualised environment was the start of the problems. Not ensuring he had good / tested backups along the way was also part of the problem. Not calling on experienced resources early in the piece was a big problem too.
Long story short – Microsoft have said that they don’t support imaging of DCs live – there are reasons behind that – DON’T DO IT. If you are out of your depth – CALL FOR HELP. Backups are useful… during a problem situation you can NEVER have too many of them.