How The Race Was Lost

I’ve been told by a reader of this modest little web log that I have shirked my duties by not getting this post up before midnight (my time, which is GMT-6). Because of this oversight on my part, I have failed to make my minimum one post per day. My heartfelt apologies to all three of you who read this blog at this point; I’ll try to never let it happen again. 🙂

I promised you yesterday that I’d describe some of the Hell of the Cancel Race. In fact, I hope you never have to care about this stuff because you’ve taken my advice and used CSQs in your driver. With that said, the idea basically goes like this.

Say you are a driver that needs to keep a hold of IRPs as they come in. For some reason or other, you are either unable or unwilling to complete the IRPs in their original thread contexts (i.e. synchronously). Therefore, you need a place to stash them before you return to the caller (with STATUS_PENDING). Say you use a trivial linked list to queue up these IRPs for later processing (using the Tail.Overlay.ListEntry generously supplied by Microsoft). For one reason or another, your driver needs to wait a “long” time before it will finally get around to servicing these queued IRPs, so they stay queued for a while. What will happen if someone above you decides he’s tired of waiting for you? He will, of course, cancel his request.

To cancel his request, the originator of the IRP will call IoCancelIrp(). This, in turn, causes the IO manager to look inside the IRP for a CancelRoutine (look up Cancel in the DDK for details). If this routine is present, the IO manager will do some bookkeeping and call the routine. That routine is the way a driver finds out that someone above it wants to cancel that IRP. This driver would then presumably dequeue the IRP from the list and complete it (STATUS_CANCELLED) up the chain.

The problem arises because your cancel routine can be called at essentially any time after it is set on the IRP (via IoSetCancelRoutine()). It will want to manipulate the same linked list that your dispatch routines are using to queue and dequeue IRPs normally. Furthermore, there are races with the IO manager between the calling of IoSetCancelRoutine() and enqueuing the IRP, and between dequeuing the IRP and calling IoSetCancelRoutine() to clear the cancel routine. Remember, multiple processors can be manipulating the queue simultaneously, so you could have 3 or 4 enqueue/dequeue operations going on at a time, and a couple of cancellations pending.

One detailed example: Suppose you receive an IRP that you decide to queue. If you set the cancel routine first, the IO manager might call the routine before you queue the IRP. Then your cancel routine gets called and tries to dequeue an IRP that isn’t on the queue at all. On the other hand, if you queue the IRP first and then set the cancel routine, your dequeuing thread might dequeue the IRP and complete it before you get a chance to set the cancel routine. You then set a cancel routine, and the IO manager savagely rips your IRP out from under you while you’re processing it. Have fun running down that crash!

You can manage the situation appropriately with locks, but organizing your use of those locks in such a way that you won’t race is exactly what is so difficult about this problem. The proper solution winds up requiring an interlocked pointer exchange with the CancelRoutine pointer in the IRP, and determining if it was previously NULL (which signals that the IO manager has called the CancelRoutine already). You also have to properly handle the BOOLEAN Cancel flag in the IRP, which has its own semantics.
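To make the pointer-exchange idea a little more concrete, here is a rough user-mode sketch in C11. It is not driver code: sim_irp, sim_io_cancel, sim_arm_cancel, and sim_claim_for_completion are all made-up names, the atomics stand in for the kernel's interlocked exchange, and the queue and its lock are left out entirely. The only thing it tries to show is the ownership rule: whoever swaps the routine pointer to NULL and gets a non-NULL value back is the one allowed to complete the IRP.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-mode stand-in for an IRP; NOT the real kernel structure. */
typedef struct sim_irp sim_irp;
typedef void (*sim_cancel_fn)(sim_irp *);

struct sim_irp {
    _Atomic(sim_cancel_fn) cancel_routine; /* stands in for the CancelRoutine pointer */
    atomic_bool cancel;                    /* stands in for the IRP's cancel flag */
    int cancel_completions;                /* times the cancel routine completed it */
};

typedef enum {
    SIM_QUEUED,                 /* routine is armed, IRP stays on the queue */
    SIM_COMPLETE_IT_YOURSELF,   /* cancelled before arming; caller completes it */
    SIM_CANCEL_ROUTINE_OWNS_IT  /* the cancel routine ran (or is running) */
} sim_arm_result;

static void sim_irp_init(sim_irp *irp) {
    atomic_init(&irp->cancel_routine, (sim_cancel_fn)0);
    atomic_init(&irp->cancel, false);
    irp->cancel_completions = 0;
}

/* The cancel routine. By the time it runs, the pointer has already been
   cleared by the canceller, so nobody else can claim the IRP. */
static void sim_on_cancel(sim_irp *irp) {
    assert(atomic_load(&irp->cancel_routine) == NULL);
    irp->cancel_completions++; /* a driver would dequeue and complete as cancelled */
}

/* Roughly what the canceller does in this model: set the flag, swap the
   routine out, and call it only if one was there. */
static bool sim_io_cancel(sim_irp *irp) {
    atomic_store(&irp->cancel, true);
    sim_cancel_fn prev = atomic_exchange(&irp->cancel_routine, (sim_cancel_fn)0);
    if (prev == NULL) return false;
    prev(irp);
    return true;
}

/* Dequeue side: true means we cleared the routine first and own the IRP.
   False means the previous value was NULL, so the cancel routine has it. */
static bool sim_claim_for_completion(sim_irp *irp) {
    return atomic_exchange(&irp->cancel_routine, (sim_cancel_fn)0) != NULL;
}

/* Enqueue side: arm the routine, then re-check the flag in case the
   cancel arrived before the routine was set. */
static sim_arm_result sim_arm_cancel(sim_irp *irp) {
    atomic_store(&irp->cancel_routine, sim_on_cancel);
    if (!atomic_load(&irp->cancel)) return SIM_QUEUED;
    if (atomic_exchange(&irp->cancel_routine, (sim_cancel_fn)0) != NULL)
        return SIM_COMPLETE_IT_YOURSELF;
    return SIM_CANCEL_ROUTINE_OWNS_IT;
}
```

In this toy model, if sim_claim_for_completion() runs first, a later sim_io_cancel() finds no routine and leaves the IRP alone; if the cancel runs first, the claim fails and the dequeuing code must not touch the IRP. A real driver still has to do all of this while coordinating with the lock that protects its queue, which is where most of the pain lives.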

Instead of all of that work, why not just call IoAcquireCancelSpinLock() and IoReleaseCancelSpinLock()? Well, a couple of reasons. First, even *that* locking mechanism can be used incorrectly, leading to another tricky race. But even more than that, the cancel lock is a system-wide resource – it is the #1 hot lock in the entire OS. Think about it – the cancel lock has to be acquired by the IO manager every time an IRP is cancelled, at least for a little while (i.e. until the cancel routine calls IoReleaseCancelSpinLock()). Contention for this lock can become a serious bottleneck in your driver’s performance. Much better to wait on a driver-owned lock, or even better, an in-stack queued spinlock (more on that another day).

This is really just a start; the only way to really wrap your brain around the cancel races is to try and code cancel logic yourself. Read the DDK docs on all of these functions mentioned here, as well as the general sections on IRP queuing and cancellation. There are other resources on the Internet as well (google for “cancel-safe queues”). Finally, once again, Walter Oney has an excellent IRP cancellation chapter in his WDM book, and (IIRC) even provides source to his queuing logic.

Next up: an example of CSQ usage.

Happy hacking!

Better Than Chocolate: Cancel-Safe Queues

One of the earliest things I remember discovering about the difficulties of programming in kernel mode (right after “What the @!$ is with all of these UNICODE_STRINGs?!”) is how easy it is to get yourself into race conditions.

All of the standard practices about multi-threaded programming apply when writing a driver, but there’s another kicker: you actually have to *care* how many CPUs you have in your system. Well, in particular, you have to care if you have more than one. As I said yesterday, you can officially assume from now on that every stupid Pentium 4-based computer from the local CompUSA is a dual-processor box, so you have to take this seriously. Add to that the subtleties of dealing with spontaneous IRQL raises, interrupts, running in arbitrary thread contexts, and so on, and life gets interesting.

One of the most common race conditions is the IRP cancellation race. It’s also one of the trickiest to deal with, even if you generally know what you’re doing. Cancellation has changed over the years from the original design, partly due to the change in devices themselves, and partly due to OS optimization. The original mechanism the OS provided for managing the cancel race was based on using StartIo routines, and in fact, the latest DDKs still recommend using a StartIo routine for IRP queue management. It certainly works, for what it was designed to do, but it’s not optimal for a number of reasons. Software-only drivers (“virtual” drivers of various sorts, filesystems, etc.) frequently find that the StartIo model is insufficient. Besides, the cancel lock is one of the hottest locks in the system, so staying as far away from it as possible is always a good idea. Walter Oney has a good description of IRP queuing and cancellation in his WDM book, in which he details other reasons he doesn’t typically use StartIo-based queuing.

With that said, rolling your own IRP queuing logic is very difficult. The races are subtle, and unless you’ve made a lot of these mistakes before, you’re highly likely to do it wrong, no matter how much you think you have gotten it right. Trust me, I know. 🙂 Fortunately, Microsoft has provided a reusable queuing framework called Cancel-Safe Queues. It is implemented in the kernel on XP+ and is available as a static library for all previous OSes. With the advent of the CSQ, there is no reason to ever write custom IRP-queuing logic again.

CSQ is easy to use, and has the distinct advantage of being massively re-used, so it’s likely to be bug-free. Tomorrow I’ll talk about the race conditions in more detail, and later I’ll provide an example of how to use CSQ in your driver.


Growing Pains

There seem to be technical difficulties with the feedback link, among others. I’m working with the staff to get this resolved. Meanwhile, if you have a burning desire to make yourself heard, whack the contact button at the top of the page.

Sorry for the inconvenience; we now return you to your regularly-scheduled blogging.


The Best Driver Testing Box Ever!

I posted on the PCAUSA NDIS Driver Discussion List a couple of weeks ago about my new driver-testing computer. Since then, I have gotten a couple of questions about the details of the setup. Since everyone needs one of these anyway 🙂 I thought I’d post some details.

I am currently running my driver tests on a home-grown dual-AMD64. Reasons I went with this setup:

  • All drivers need to be extensively tested for race conditions, etc., on a 2+ processor system before foisting them on the unsuspecting public. Using a pair of Opteron 240 CPUs, I can get into a true* dual-processor system for a reasonable amount.
  • Driver writers should be looking in the direction of 64-bit compatibility, starting now. Peter Viscarola from OSR pointed out in an NT Insider issue months ago that AMD64 boxes are, in fact, the coolest things since sliced bread (more on that later). Going with Opteron CPUs now should help work out any 64-bit compatibility problems.
  • In my experience, MP boards tend to be higher-quality, leading to fewer mysterious hardware-based bugs than on your run-of-the-mill junk mobo. This has its downsides, though: the IWill board that I chose requires registered RAM, and only specific chips and manufacturers are listed on IWill’s compatibility matrix.
  • Back to that IWill board: it has all of the new toys that current motherboards have: firewire-800 (for kernel debugging, assuming I ever get it working), usb1.1 and usb2.0 ports, built in GigE on copper, plenty of RAM slots (8!), Serial ATA and standard IDE connections, on-board SATA RAID, and lots of other little bells and whistles that aren’t coming to mind atm. This thing has more toys even than my wife’s Mac G5, which I thought was pretty well loaded at the time.
  • It is known (by me, at least!) to work with the AMD64 preview of XP.

Installation was a bit of a pain – I finally had to use an IDE disk as primary/master to get it working with our version of Ghost (although our IT staff tells me that the new Ghost should support my SATA controller).

Once the hardware configuration was ironed out, installation of the 64-bit OS went perfectly. I started with a release mode version, just to see if it was fast. It is. 😀 It is the fastest Windows computer I’ve ever had a console on, in terms of app responsiveness. IO is lightning fast. The best part is that VMWare already has native amd64 support. That, plus all of the standard development environment (psdk, ddk, cygwin/vim/cvs/grep/blah/blah), yields a fully-useful dev box.

For testing, VMWare is sufficient for 32-bit SP OSes, but that wasn’t really the point of all this, was it? Ghosts of the 64-bit checked OSes will round out the system nicely, once I get around to building them, and since I don’t really use the release side for development anyway, that is perfect for testing on a release build.

Anyone who wants more details on the parts that went into this box is welcome to contact me, and I’ll be glad to send along part names/numbers, etc.

Happy Hacking.


* Intel Pentium 4 CPUs do something called “hyper-threading”. This basically presents a dual-processor interface to the OS, resulting in the same sorts of race conditions that you get on boxes with two+ real chips. This is a VERY GOOD REASON to test every driver on a 2-proc+ chip, as every stupid computer sold at BestBuy/CompUSA/etc., is now a multiprocessor box.