One of my colleagues reported an issue at a customer this past weekend: every time he transferred FSMO roles, MOM reported that the MS DTC (Distributed Transaction Coordinator) service had terminated unexpectedly on every domain controller in the domain. At this particular customer that brought us about 350 emails from MOM, since the roles were transferred twice over the weekend in each domain. For reference, it’s a highly distributed Windows Server 2003 SP1 environment with a mix of x86 and x64 installations.
A quick look at MOM and the event viewer on a suspect machine showed a standard event from the SCM and an MSDTC event about a DC promotion/demotion:
Event Type: Information
Event Source: MSDTC
Event Category: SVC
Event ID: 4145
Time: 4:54:10 PM
MS DTC has been notified that a DC Promotion/Demotion has happened. It is shutting down as a result.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
And from MOM:
Severity: Service Unavailable
Source: Service Control Manager
Name: The service terminated unexpectedly.
Description: The Distributed Transaction Coordinator service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 1000 milliseconds: Restart the service.
Time: 8/2/2008 13:07:58
(Yes the times don’t match – just grabbed the first samples I could find)
I decided to take a look at this today, and bounced the FSMOs around until I determined that moving the PDCe specifically was the trigger. I was able to kill the MSDTC service on every DC in the domain by moving the PDCe around. This ruled out the possibility that it was some weird quirk from the patching activities that were going on this weekend when the symptom presented.
My first troubleshooting step was to see if the service was actually crashing, so I set up Dr. Watson to collect full dumps, installed it as the default debugger on a problem machine, and moved the PDCe. The service terminated as expected, but nothing showed up in Dr. Watson. This was annoying but not entirely unexpected. If a service just exits like a normal process does, we’d get this same SCM event too, with no crash for Dr. Watson to catch.
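To make that last point concrete, here is a rough Python analogy, not MSDTC code: a stand-in "service" whose background thread ends the whole process on purpose. The names `CHILD_SOURCE` and `run_child` are made up for this sketch. From the outside, all a supervisor sees is a process that went away with an exit code; there is no exception for a post-mortem debugger to catch.

```python
import subprocess
import sys
import textwrap

# Hypothetical stand-in for a service: the main thread "serves" while a
# background worker decides the whole process should go away.
CHILD_SOURCE = textwrap.dedent("""
    import os
    import threading
    import time

    def worker():
        # os._exit ends the whole process immediately, from any thread.
        os._exit(0)

    threading.Thread(target=worker).start()
    time.sleep(30)  # the main thread would keep going if the worker did not exit
""")


def run_child(timeout=10):
    """Run the stand-in service and report how it ended."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    code, err = run_child()
    print("exit code:", code)
    print("stderr empty:", err == "")
```

The child is gone long before its 30 second sleep finishes, yet nothing crashed. That is the situation where a post-mortem debugger stays silent and you have to attach a live debugger instead.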
I xcopy’ed the x64 debug package to this particular machine from a utility box that had the debug tools installed (note you can just copy and paste the tools – no need to run the MSI) and fired up windbg. If you’re following along at home, do this:
- Press F6 and find msdtc.exe and select it
- The debugger will break in and likely complain about symbols
- Issue a .symfix C:\symbols
- Issue a .reload
- Issue a g
At this point you’re ready to go and when something interesting happens the debugger should break in (you’ll know when the textbox at the bottom of the screen becomes enabled). I wanted to collect a process dump for this and I had a suspicion I knew what was happening, so, I also did this:
- Issue a bp ntdll!NtTerminateProcess
This tells the debugger to break in on a call to NtTerminateProcess. As soon as I transferred the PDCe, my breakpoint got hit. I saved a dump (.dump /mf c:\msdtcissue.dmp), as this particular environment has Internet issues and I was having problems getting to the symbol server. When I opened it up on my workstation (press Ctrl+D and browse in windbg), I located this stack (issue a k):
11 Id: 1550.1590 Suspend: 1 Teb: 000007ff`fff9a000 Unfrozen
Child-SP RetAddr Call Site
00000000`0165fd78 00000000`77d5a316 ntdll!ZwTerminateProcess
00000000`0165fd80 000007ff`7fc4069b kernel32!ExitProcess+0x25
00000000`0165fed0 000007ff`7fc40863 msvcrt!_crtExitProcess+0x3b
00000000`0165ff00 000007ff`66fd25f5 msvcrt!cinit+0x143
00000000`0165ff40 00000000`77d6b69a msdtctm!DCPromoThreadFunction+0x124
00000000`0165ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a
Note: if you’re wondering how to find the correct thread, you can issue a ~*k to dump the stack of every thread. To switch to the thread (11 here), you’d do a ~11s.
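If the process has a lot of threads, eyeballing the ~*k output gets tedious. Below is a minimal Python sketch that scans saved ~*k text for a symbol and returns the matching thread numbers. The function name `threads_calling` is hypothetical, the regex assumes thread headers look like the one shown above (optionally prefixed with `.` or `#`), and thread 0 in the sample is invented filler; only thread 11 comes from the real dump.

```python
import re

# Matches a thread header such as "  11  Id: 1550.1590 Suspend: 1 ...".
# Assumption: the header may carry a leading "." or "#" marker.
THREAD_HEADER = re.compile(r"^\s*[.#]?\s*(\d+)\s+Id:\s*[0-9a-fA-F]+\.[0-9a-fA-F]+")


def threads_calling(dump_text, symbol):
    """Return debugger thread numbers whose stack mentions `symbol`."""
    hits = []
    current = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = int(header.group(1))
            continue
        if current is not None and symbol in line and current not in hits:
            hits.append(current)
    return hits


SAMPLE = """\
   0  Id: 1550.1554 Suspend: 1 Teb: 000007ff`fffde000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`0012fd18 00000000`77d6b69a ntdll!ZwWaitForSingleObject
  11  Id: 1550.1590 Suspend: 1 Teb: 000007ff`fff9a000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`0165fd78 00000000`77d5a316 ntdll!ZwTerminateProcess
00000000`0165fd80 000007ff`7fc4069b kernel32!ExitProcess+0x25
00000000`0165fed0 000007ff`7fc40863 msvcrt!_crtExitProcess+0x3b
00000000`0165ff00 000007ff`66fd25f5 msvcrt!cinit+0x143
00000000`0165ff40 00000000`77d6b69a msdtctm!DCPromoThreadFunction+0x124
00000000`0165ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a
"""

if __name__ == "__main__":
    print(threads_calling(SAMPLE, "ExitProcess"))
```

Feed it the text you copied out of the debugger window and then ~Ns to whichever thread number it reports.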
I got zero hits searching Google for this, so I had a quick chat with a PSS friend who dug something up. It’s a known bug in Windows Server 2003 and presently there’s no QFE. MSDTC subscribes to dcpromo notifications (which I knew), but because of the manner in which it does this, it also catches PDCe changes. This behavior is fixed in Windows Server 2008, though. If you’ve got a good reason you can call and try to make the case for a QFE, but seeing as the service restarts straight away, I’m just going to tweak my monitoring so MOM ignores this.