Somehow, I managed to catch a cold this weekend, and can already feel the NyQuil starting to kick in. If this post seems, well, a little druk — that’s why. The good news is that I’m having fun writing it!
Today I think I’ll talk a bit about the kernel-mode stack. There are a few interesting issues in play here for kernel-mode developers.
The kernel stack is small. It is usually 3 pages or so, which means 12K on x86. Believe it or not, this is bigger than it used to be – it was 8K in NT4 and earlier. My guess is that Microsoft increased the kernel stack size on Windows 2000 because of the deeper layering of drivers brought about by WDM (more on that in a sec). If you use more than 12K of stack, you’ll hit a guard page, which probably means an instant double-fault bluescreen. The reason that this is a double-fault is that the CPU tries to build an exception record on the stack for transfer to the exception handler, and when it tries to push the record onto the overflowed stack, it faults again. The only thing the kernel can do at that point is die a painful death. It’s worth noting that almost every time I’ve had a double-fault bluescreen, it’s been a stack fault.
Why is the stack so small? Well, this was an early design decision made by the kernel team. First off, kernel stacks are generally not pageable. That means that you’re gobbling up 12K of physical memory for each thread running in the system. You have to keep that stack sitting around forever, even if the thread is in user mode, until it exits. With the hundreds of threads in systems today, that memory starts to add up fast. Additionally, because kernel code can execute at raised IRQL, kernel stacks cannot take page faults on access. That means that the traditional way of automatically growing the stack cannot be implemented in the kernel. Because the OS has to pick a fixed amount of memory for the stack up front, and because stack memory is a scarce resource, 12K was the compromise the kernel team landed on.
This means a few things: first, you need to be conservative with local variables. No more 64K arrays on the stack, for one thing. Be aware of the fact that you probably exist within a driver stack, and the drivers above and below you would be grateful for some stack space that they could use for themselves, thank you very much. In addition, you shouldn’t ever use recursion in your driver, unless You Know What You Are Doing. Most recursive algorithms can be rewritten iteratively without sacrificing too much. I know your search only goes log(n) levels deep, but don’t make me stress about whether or not it’s ever going to exhaust the available stack. Finally, avoid architectures with lots of deeply-nested functions. This isn’t an excuse to practice bad design, but it’s an encouragement to keep things relatively flatter than perhaps you otherwise would have.
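To make the recursion point a bit more concrete, here is a minimal sketch of a binary search written as a loop, so its stack usage stays constant no matter how big the table gets. The name find_index and the sorted int table are made up for the example; it is plain C with nothing driver-specific in it.

```c
#include <stddef.h>

/*
 * Iterative binary search over a sorted array of ints.
 * Returns the index of `key`, or -1 if it is not present.
 * Uses a fixed handful of locals instead of one stack frame per level.
 * (find_index is a hypothetical name used only for this sketch.)
 */
long find_index(const int *table, size_t count, int key)
{
    size_t lo = 0;
    size_t hi = count;                    /* search the half-open range [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;  /* avoids overflow of lo + hi */

        if (table[mid] == key) {
            return (long)mid;
        } else if (table[mid] < key) {
            lo = mid + 1;                 /* key can only be in the upper half */
        } else {
            hi = mid;                     /* key can only be in the lower half */
        }
    }
    return -1;                            /* range is empty: not found */
}
```

The recursive version of this is only a few lines shorter, and the loop never needs more than one frame, which is the whole point.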
Going back to the layering thing for a second, this is an area with which filter drivers often have a lot of trouble. There have been versions of popular antivirus scanners, implemented as filter drivers, that simply cannot be installed on systems with any other filesystem filter drivers. They just use up too much stack, so stack faults are common. Don’t be a Bad Filter – be conservative with stack space, and remember that users will associate any bluescreens with your driver if it’s the last driver they installed.
One final note: there are a couple of system-supplied functions that allow you to use the stack more carefully. IoGetStackLimits() will let you check on the lower and upper bounds of the stack; IoGetInitialStack() will give you back the base address of the thread’s stack; and IoGetRemainingStackSize() can be called to find out how many bytes of stack are left. These functions should be used whenever a design contemplates recursion, whenever you are passed an address on the kernel stack, or in general, whenever you’re trying to hunt down a stack overflow bug.
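One way these might be put to use is a guard that refuses to go any deeper when the stack is running low. What follows is only a sketch under some loud assumptions: the DRIVER_BUILD macro, the remaining_stack_bytes() wrapper, the MIN_STACK_BYTES threshold and the stack_ok_to_descend() helper are all names invented for the example. The non-driver branch is a stand-in so the logic can be compiled and poked at outside the kernel. In a real driver, the wrapper would simply call IoGetRemainingStackSize().

```c
#include <stddef.h>

#ifdef DRIVER_BUILD   /* hypothetical macro: define it when building the real driver */
#include <ntddk.h>

/* In a driver, ask the system how many bytes of kernel stack are left. */
static size_t remaining_stack_bytes(void)
{
    return (size_t)IoGetRemainingStackSize();
}
#else
/* Stand-in for trying the logic outside the kernel: a value the caller can set. */
size_t fake_remaining_bytes = 12 * 1024;

static size_t remaining_stack_bytes(void)
{
    return fake_remaining_bytes;
}
#endif

/* Assumed safety margin; the right number depends on your driver and its neighbors. */
#define MIN_STACK_BYTES 2048

/* Returns 1 if at least MIN_STACK_BYTES of stack remain, 0 otherwise. */
int stack_ok_to_descend(void)
{
    return remaining_stack_bytes() >= MIN_STACK_BYTES;
}
```

A recursive routine would call stack_ok_to_descend() before each nested call and bail out with an error status when it returns 0, rather than marching on into the guard page.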