Practical ATL: Solving the race condition in CAtlExeModuleT

When I was spelunking through the ATL header for my previous articles (the class object plumbing) I discovered that there is a serious race condition in the server lifetime management.


I contacted the Microsoft C++ folks, and someone from the libraries group told me that this problem has already been fixed in the next release of Visual C++.


But since the issue still exists in all current ATL projects, I decided to write something about it and publish the fix which will circumvent the problem in the recent versions of ATL (7.0 and upwards if I am correct).


The problem


The module lock count of an ATL COM server is managed in the CAtlExeModuleT<T> class. It is this lock count that keeps the server alive as long as it is nonzero.


The CAtlExeModuleT<T> class manages its reference count via the  CComGlobalsThreadModel class increment and decrement methods. CComGlobalsThreadModel is a typedef for CComMultiThreadModel so these methods map to InterlockedIncrement and InterlockedDecrement


The last module lock is released if
- the class object itself has no more external connections (IExternalConnection), and
- there are no more object instances.


As soon as the final module lock is released (module refcount is 0), the shutdown of the server is initiated, which will happen through either a WM_QUIT message or a win32 event object. In both cases there is a time window between the triggering of the shutdown, and the actual call to CoRevokeClassObject. This time window can be several seconds.


This time window causes a race condition between the server (which is already dying at that point) trying to revoke the class object, and a hypothetical client, requesting an interface pointer to the class object. Depending on how the race turns out, the client might end up with an S_OK HRESULT from CoGetClassObject, and an invalid interface pointer. This will cause a critical problem in the client, and may cause an application  crash.


The solution


The proper way to manage the server lifetime is to use CoAddRefServerProcess and CoReleaseRefServerProcess to manage the module lockcount. Those functions will (according to MSDN http://msdn.microsoft.com/en-us/library/aa910552.aspx) suspend all class objects when the reference count drops to 0, in an atomic operation.


In that case, the server starts dying AFTER all class objects are suspended. Any incoming client request for the class object will spawn a new server process after the dying server has finished revoking its class objects.


The fix


The code for fixing this is simple. I simply created a new class that inherits from CAtlExeModuleT<T> and then overrides the Lock and Unlock methods.


template <typename T>


class CAtlExeModuleT_NoRace : public CAtlExeModuleT<T>


{


public:


  typedef CAtlExeModuleT<T> base;


 


#if _ATL_VER < 0x0A00


  LONG Lock(void)


  {


    CoAddRefServerProcess();


    return base::Lock();


  }


 


  LONG Unlock(void)


  {


    CoReleaseServerProcess();


    return base::Unlock();


  }


#else


  //removed compiler messages


#endif


};


I use CoxxxxServerProcess functions to explicitly manage the class object registration, and then I delegate to the base class which maintains its own reference count.


The reason I do that is that the base::Unlock function is the one which triggers the module shutdown, and I don’t want to mess with that. And Unlock works only when paired with Lock, so the only thing I really need to do is to call the CoxxxxServerProcess routines to prevent possible activation requests. The rest is still done by CAtlExeModuleT<T>.


I put a compiler condition around my implementation of the new Lock and Unlock methods, because they are only needed for current versions of ATL. The next version of ATL won’t have this problem, so it would be dangerous to meddle with the reference counting, especially since I cannot predict what side effects this may have.


The conditional compilation allows you to safely upgrade the project to the next version of ATL. The compiler messages which show up in the compiler output inform you that it is no longer needed to inherit from CAtlExeModuleT_NoRace as shown below:


class CStuff_ServerModule : public CAtlExeModuleT_NoRace< CStuff_ServerModule >


{


public :


  DECLARE_LIBID(LIBID_Stuff_ServerLib)


  DECLARE_REGISTRY_APPID_RESOURCEID(IDR_STUFF_SERVER, “{597B5DFF-DACE-42B0-9E49-54A09C10B2BB}”)


};


The only thing you have to do to use my new module class is to include the header and change CAtlExeModuleT to CAtlExeModuleT_NoRace.


And if you upgrade the project to VC10, then you can simply change it back as indicated by the compiler output messages that are printed out when the compiler parses my class.


Conclusion


I was surprised that this bug was still in the ATL headers after all these years. When I contacted the VC++ team I was half expecting them to tell me that I had missed something. This issue is what caused CoAddRefServerProcess to be created after all, over a decade ago, so a COM programmer implementing lifetime management should be aware of it.


However, the modular and hierarchical nature of ATL made it easy to fix this problem, so I guess that is a plus for ATL. And mistakes happen. It is always a bit painful if it happens in widely used libraries, but I too have had my share of whoopsies over the years. It is sad but it happens.


The header file with the fix is attached to this blog post under the MIT license.


 

Could not create NTDS settings on domain controller…

Could not create NTDS settings yadayadayada on domain controller CN=yadayadayada. The RPC server is unavailable.


This was the error message greeting me (with some meaningful text instead of the yadayadayada of course :)) when I tried to add a backup domain controller to the domain of our test and development network. It was the first time I encountered an error at the DC promotion stage.


I checked the usual things (network connections, privileges, etc) but nothing jumped out at me. Now, if an error dialog mentions RPC, then the usual error is either a DNS error, or a DNS server that has not yet refreshed its zone information.


So I opened the DNS config, and discovered that only 1 NIC of our the new DC was registered in DNS. Our networks are multihomed (3 networks in parallel) and yet only 1 address was registered for the NIC. This was a bit odd. There should be 2 (one network has no DNS on purpose).


A quick ping revealed that one of the NICs had no connectivity. It turned out that the primary DC was a bad network connection, due to some wiggling with the cables. Plugging it all the way and then re-registereing the NICs of the new DC with DNS solved the problem.


I still don’t understand why we got that problem, because the disconnected NIC was last in the binding order, so the new DC should have used the available connection anyway. My guess is that it retrieved the address via DNS, and got the address of the NIC that was disconnected. Ah well. Live and learn. 

The kerberos client received a KRB_AP_ERR_MODIFIED error

This is what I got in the event logs yesterday afternoon:


Event Type: Error
Event Source: Kerberos
Event Category: None
Event ID: 4
Computer: SE-SMURF01
Description:
The kerberos client received a KRB_AP_ERR_MODIFIED error from the server PC-BLABLA09$.  The target name used was . This indicates that the password used to encrypt the kerberos service ticket is different than that on the target server. Commonly, this is due to identically named  machine accounts in the target realm (FOO.BAR.STRIPE.LOCAL), and the client realm.   Please contact your system administrator.


Event Type: Error
Event Source: Kerberos
Event Category: None
Event ID: 4
Computer: SE-SMURF01
Description:
The kerberos client received a KRB_AP_ERR_MODIFIED error from the server PC-BLA09$.  The target name used was RPCSS/PC-BLA10. This indicates that the password used to encrypt the kerberos service ticket is different than that on the target server. Commonly, this is due to identically named  machine accounts in the target realm (FOO.BAR.STRIPE.LOCAL), and the client realm.   Please contact your system administrator.


I had replaced those machines a week ago, and everything seemed to work fine. So I didn’t understand why these errors were suddenly popping up. The applications running on those computers where throwing a wobbler as well. Some googling later I found 2 remarks that were useful.


The first one was that someone fixed it by taking the computer out of the domain, renaming it, changing the SID, and changing the IP address. While this is overkill on the scale of killing a mouse with a thermonuclear weapon, it pointed in the direction of a network level problem.


The second remark was by a Microsoft employee who explained that DNS misconfiguration can be the source of problems like this. If kerberos thinks it is communicating with pcA it encrypts the kerb ticket with the password of pcA. but if the ticket then ends up on pcB because of the DNS mismatch, the above events will be logged.


At that moment I realized that I had changed the IP address of an adapter on PC-BLA10 because it conflicted with PC-BLA09. The reason everything worked fine initially was because that port had been left disconnected until 2 days ago when I configured the correct IP address. The conflict was resolved and the DNS information was updated, but that didn’t mean that the DNS caches were up to date. So I cleared the DNS cache of the DNS server, and used ipconfig /flushdns to clear the resolver cache on the domain controller and PC-BLA10, and the problem disappeared.