When I was spelunking through the ATL header for my previous articles (the class object plumbing) I discovered that there is a serious race condition in the server lifetime management.
I contacted the Microsoft C++ folks, and someone from the libraries group told me that this problem has already been fixed in the next release of Visual C++.
But since the issue still exists in all current ATL projects, I decided to write something about it and publish the fix which will circumvent the problem in the recent versions of ATL (7.0 and upwards if I am correct).
The module lock count of an ATL COM server is managed in the CAtlExeModuleT<T> class. It is this lock count that keeps the server alive as long as it is nonzero.
The CAtlExeModuleT<T> class manages its reference count via the CComGlobalsThreadModel class increment and decrement methods. CComGlobalsThreadModel is a typedef for CComMultiThreadModel so these methods map to InterlockedIncrement and InterlockedDecrement
The last module lock is released if
- the class object itself has no more external connections (IExternalConnection), and
- there are no more object instances.
As soon as the final module lock is released (module refcount is 0), the shutdown of the server is initiated, which will happen through either a WM_QUIT message or a win32 event object. In both cases there is a time window between the triggering of the shutdown, and the actual call to CoRevokeClassObject. This time window can be several seconds.
This time window causes a race condition between the server (which is already dying at that point) trying to revoke the class object, and a hypothetical client, requesting an interface pointer to the class object. Depending on how the race turns out, the client might end up with an S_OK HRESULT from CoGetClassObject, and an invalid interface pointer. This will cause a critical problem in the client, and may cause an application crash.
The proper way to manage the server lifetime is to use CoAddRefServerProcess and CoReleaseRefServerProcess to manage the module lockcount. Those functions will (according to MSDN http://msdn.microsoft.com/en-us/library/aa910552.aspx) suspend all class objects when the reference count drops to 0, in an atomic operation.
In that case, the server starts dying AFTER all class objects are suspended. Any incoming client request for the class object will spawn a new server process after the dying server has finished revoking its class objects.
The code for fixing this is simple. I simply created a new class that inherits from CAtlExeModuleT<T> and then overrides the Lock and Unlock methods.
template <typename T>
class CAtlExeModuleT_NoRace : public CAtlExeModuleT<T>
typedef CAtlExeModuleT<T> base;
#if _ATL_VER < 0x0A00
//removed compiler messages
I use CoxxxxServerProcess functions to explicitly manage the class object registration, and then I delegate to the base class which maintains its own reference count.
The reason I do that is that the base::Unlock function is the one which triggers the module shutdown, and I don’t want to mess with that. And Unlock works only when paired with Lock, so the only thing I really need to do is to call the CoxxxxServerProcess routines to prevent possible activation requests. The rest is still done by CAtlExeModuleT<T>.
I put a compiler condition around my implementation of the new Lock and Unlock methods, because they are only needed for current versions of ATL. The next version of ATL won’t have this problem, so it would be dangerous to meddle with the reference counting, especially since I cannot predict what side effects this may have.
The conditional compilation allows you to safely upgrade the project to the next version of ATL. The compiler messages which show up in the compiler output inform you that it is no longer needed to inherit from CAtlExeModuleT_NoRace as shown below:
class CStuff_ServerModule : public CAtlExeModuleT_NoRace< CStuff_ServerModule >
The only thing you have to do to use my new module class is to include the header and change CAtlExeModuleT to CAtlExeModuleT_NoRace.
And if you upgrade the project to VC10, then you can simply change it back as indicated by the compiler output messages that are printed out when the compiler parses my class.
I was surprised that this bug was still in the ATL headers after all these years. When I contacted the VC++ team I was half expecting them to tell me that I had missed something. This issue is what caused CoAddRefServerProcess to be created after all, over a decade ago, so a COM programmer implementing lifetime management should be aware of it.
However, the modular and hierarchical nature of ATL made it easy to fix this problem, so I guess that is a plus for ATL. And mistakes happen. It is always a bit painful if it happens in widely used libraries, but I too have had my share of whoopsies over the years. It is sad but it happens.
The header file with the fix is attached to this blog post under the MIT license.