Last week I was redlining SOPs (Standard Operating Procedures) on a test network for some of the Windows domain-specific situations we sometimes encounter on our process control network.
This specific SOP handled Domain Controller promotion and demotion for our process network. It should be noted that in our process control network, Domain Controllers are also DNS servers.
I made a mistake and completely screwed up the Active Directory. Instead of reloading the network, I decided to figure out what was wrong and solve it properly, since there was no rush.
How I screwed up the domain controllers
Just before I promoted the 2nd DC, I went to the system properties tab, where I made a crucial mistake. My procedure said to clear the DNS suffix, and I cleared the checkbox that said ‘change DNS suffix when domain membership changes’ instead.
With 20/20 hindsight this was a pretty stupid thing to do. Clearing that checkbox prevents the new DC from getting the name ‘DC2.networkname.companyname.local’ and instead lets it keep its old name, ‘DC2’.
This problem is known as a disjoint namespace, where the FQDN of a server does not match the domain of which it is a member. Active Directory looks for FQDNs when it needs to replicate or do other things, so without an FQDN you get all sorts of helpful errors like ‘The RPC server is unavailable, this might be a DNS problem’.
Again looking with hindsight, the error details and the requested names should have made it obvious that DNS was working just fine, but that the specified name was indeed not available in DNS.
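The disjoint namespace check described above boils down to comparing a machine's DNS suffix with the name of the domain it belongs to. Here is a minimal Python sketch of that comparison. The function names are my own invention for illustration, and it assumes the simple case where the suffix is supposed to be identical to the AD domain name.

```python
def primary_dns_suffix(fqdn: str) -> str:
    """Return everything after the first label, lower-cased, without a trailing dot.

    A bare host name such as 'DC2' has no suffix, so this returns ''.
    """
    name = fqdn.strip().rstrip(".").lower()
    _, _, suffix = name.partition(".")
    return suffix


def is_disjoint(fqdn: str, ad_domain: str) -> bool:
    """True when the host's DNS suffix differs from the AD domain name.

    Assumption: the suffix is expected to equal the domain name exactly
    (case-insensitive), which is the default Windows behaviour.
    """
    return primary_dns_suffix(fqdn) != ad_domain.strip().rstrip(".").lower()


if __name__ == "__main__":
    domain = "networkname.companyname.local"
    # The broken state from this post: the DC kept its short name.
    print(is_disjoint("DC2", domain))
    # The state the promotion should have produced.
    print(is_disjoint("DC2.networkname.companyname.local", domain))
```

On a live machine you could feed it the result of `socket.getfqdn()` from the standard library, although what that returns depends on how the local resolver is configured.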
I tried various things to solve this problem, but the most stupid one was probably renaming DC2. I still don’t know why I thought this was a good idea. Perhaps to force a new name registration in AD?
Of course this failed because Active Directory didn’t replicate, so the name change never propagated either. I had only made the problem worse. DC1 still thought that DC2 had its original name, and it wouldn’t even try to find DC2 at its new name.
What I did to make it right again
Making things right when you hose Active Directory is not easy (or sometimes downright impossible perhaps) but there is a way out of the aforementioned mess.
When that was done, I renamed the DC2 computer object on DC1 (which still had its old name) to the new name of DC2. I changed the TCP/IP settings of both DCs so that DC1 became the preferred DNS server for DC2, and vice versa. This was to ensure that they could resolve each other.
I fixed the records in the DNS so that DC2's original name was not mentioned anymore, and I verified that the GUID associated with the DC2 alias record was indeed correct.
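The record check I did by hand can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout and the host names in the example are all hypothetical. It relies on one thing that is true of Active Directory: each DC registers an alias (CNAME) of the form `<GUID>._msdcs.<forest root domain>` that should point to the DC's current FQDN.

```python
def check_dc_records(records, old_name, new_fqdn, dsa_guid, forest_domain):
    """Return a list of problems found in a snapshot of DNS records.

    records: dict mapping record name -> target (hypothetical layout,
    e.g. exported by hand from the DNS console).
    """
    def norm(name):
        return name.strip().rstrip(".").lower()

    problems = []
    old_label = norm(old_name)

    # 1. No record name or target may still contain the old host name
    #    as a complete DNS label.
    for name, target in records.items():
        labels = norm(name).split(".") + norm(target).split(".")
        if old_label in labels:
            problems.append(f"stale reference to {old_name}: {name} -> {target}")

    # 2. The GUID alias must point at the DC's new FQDN.
    alias = f"{norm(dsa_guid)}._msdcs.{norm(forest_domain)}"
    targets = {norm(k): norm(v) for k, v in records.items()}
    if targets.get(alias) != norm(new_fqdn):
        problems.append(f"alias {alias} does not point to {new_fqdn}")

    return problems


if __name__ == "__main__":
    domain = "networkname.companyname.local"
    guid = "11111111-2222-3333-4444-555555555555"  # made-up GUID
    snapshot = {
        f"dc2.{domain}": "10.0.0.2",             # made-up address
        f"{guid}._msdcs.{domain}": f"dc2.{domain}",
    }
    # 'DC2B' is a made-up new name for the renamed controller.
    for problem in check_dc_records(snapshot, "DC2", f"DC2B.{domain}", guid, domain):
        print(problem)
```

Matching on whole labels rather than substrings matters here, because a new name that merely starts with the old one would otherwise be flagged as stale.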
At this point I could already synchronize from DC1 to DC2, but not the other way around. Running netdiag on DC1 I was informed that LDAP still had a reference to the original name of DC2, and that some DNS records of this DC were not registered correctly on the DNS running on DC2. It told me to wait 30 minutes in order for DNS server replication to succeed.
I used adsiedit.msc on DC1 to throw away a couple of references in AD to the original name of DC2, as well as the FRS settings for DC1 and DC2. There wasn't much more to do, so I decided to wait for the DNS replication.
One cup of coffee later netdiag gave no errors anymore, and I could successfully replicate between all Domain Controllers.
Moral of the story
There are several conclusions I can draw from this:
- DNS and Active Directory are complex things, and I don’t know enough about them yet. As long as everything keeps working, it is easy enough to administer a Windows Domain, but as soon as there is a serious problem, the administrative wheat is separated from the chaff, so to speak.
- If you are running an important network, you have to have SOPs for all the things you do on the network, no matter how simple, and you have to follow them. Just as important: test them so that you know they will work.
- Have a test environment where you can horse around and experiment to your heart’s content. Knowing what could go wrong if you make mistakes, and knowing how to solve the mess can be invaluable. If you don’t have lots of hardware lying around to simulate your environment, try to run a virtual network using VMware and an old server. I use a Dell 2900 with 4 GB RAM, 4 cores and a RAID 5 configuration of 15K RPM SAS drives which we have on standby as a spare.
- But the most important conclusion: in this business you are never done learning. It will take a lot of effort to bring my understanding of Active Directory and DNS to the same level as my understanding of C++ and software development. Lucky for me that I like the world of Systems Engineering.