Driver Developer’s Toolbox, Part 4: The Checked Build

Fresh off of two days worth of board meetings at my company, and two days out from leaving town for two weeks on vacation (Switzerland, Germany, and France), I’m exceedingly low on time, so please accept my apologies in advance for being slow to post. During my absence, I have a couple of guest bloggers lined up to discuss Interesting Things(TM) to tide you over until I get back.

Today I want to talk about the checked build. As you may know, there are two different builds of the OS, the free build and the checked build. The difference is that the checked build has some additional checks compiled in (usually in the form of ASSERT() macros) and lots of debug logging (via KdPrint()). You can get a good feel for how checked build code differs from free build code simply by reading the Microsoft-supplied samples in the DDK. Obviously, these extra checks can be a big help during driver development.

The problem with the checked build is that it’s a pain to deal with. It is hard to find, although it is available on all MSDN subscriptions from “Operating Systems” on up. Also, all service packs are released with checked build counterparts that can typically be downloaded from microsoft.com. Once you have the build, you have a couple of options for how to install it. I prefer to run with a full checked build at this point on one of my test boxes, but the full checked build has the disadvantage of being considerably slower than the free build, so it takes a lot of horsepower to run. The other problem with the full checked build is that debug messages can get to be amazingly verbose, which is oftentimes not helpful.

The solution to these problems is to use a partial checked build. This means using a checked kernel and hal (the kernel and the hal must always go together – they’re a matched pair), and any additional checked kernel components that are relevant. For example, when I’m developing an NDIS driver, I typically run with checked versions of ndis.sys, tdi.sys, tcpip.sys, and afd.sys. FSD and FS filter development call for checked versions of ntfs and fastfat. Use your head; the right checked binaries are usually obvious.
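One common way to run a checked kernel and HAL without overwriting the free binaries is boot.ini's /KERNEL and /HAL switches. A sketch of what that entry might look like; the file names ntoschk.exe and halchk.dll are just whatever you named your copies in system32, and the ARC path is machine-specific:

```ini
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS

[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows XP" /fastdetect
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows XP Checked" /fastdetect /KERNEL=ntoschk.exe /HAL=halchk.dll /DEBUG /DEBUGPORT=COM1 /BAUDRATE=115200
```

This only covers the kernel and HAL; other checked components (ndis.sys, etc.) still have to be swapped in by hand.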

Getting the checked build onto the system is another matter, however. Due to the Fantastic Miracle of System File Protection, driver development has become slightly more painful in this area. To get Windows to allow you to replace the free files with checked ones, you must disable SFP and reboot with a debugger attached. The good news is that the kind folks at OSR have a tool that twiddles the registry keys for SFP automatically, and seems to work across lots of versions of the OS.

Once you have SFP disabled, simply back up ntoskrnl, hal, and whatever other binaries you’re replacing, and copy over the new ones. Keep a debugger attached to the system at all times, as things just won’t work right without it. ASSERT() macros will crash the system with a bugcheck if there is no debugger, which is seldom helpful. There’s always some idiotic driver (VMWare, are you listening?) that ASSERTs in ntio during boot-up, so you’ll need the debugger to dismiss the assert.

I find that developing and testing very early on with a checked build can be a big help in preventing the introduction of bugs in your drivers. And, what’s more, the newer you are at driver development, the bigger the payoff.

Suggestion Box

Please leave suggestions for topics as feedback on this thread. I’m also happy to take submissions for articles — please use the contact link to mail me if you’d like to submit something.

Driver Developer’s Toolbox, Part 3: Verifier

After getting a lot of feedback for the “Introduction to WinDBG” article I posted a few weeks ago, I thought I’d follow it up with another in the series, this time about Driver Verifier. All driver developers need to know a few basic things about the tools of the trade, and one of the most important tools for driver development that Microsoft ships is Driver Verifier.

Driver Verifier is essentially a library of routines that cross-check your interaction with the OS in a stricter way than normal. The design philosophy of the operating system is that kernel-mode components should trust one another. This works well in practice, as it provides significant performance improvements on often-used code paths. However, this trust is exactly what you don’t want during driver development, as it can let subtle errors go unnoticed until your software is in your customers’ hands.

There are a couple of tools designed to enforce stricter checking in the OS. One of these tools, the checked build, I’ll talk about in another article. Driver Verifier is the other major runtime driver validation tool. Verifier ships with the operating system, and changes with each release. The configuration interface is presented in a user-mode GUI app, but the code that does the real work is embedded in ntoskrnl. Also, note that Verifier and the checked build are unrelated; you really need both, but you can use either alone.

Verifier can catch tons of little errors once you turn it on. It can check for proper use of IRQLs and spin locks, proper implementation of the DMA protocol, correct handling of IRPs, and so on. It can even test your driver in a low-memory simulation, randomly failing memory allocation requests. Best driver development practice dictates that all development testing be done with full driver verifier turned on (possibly with the exception of low resources simulation). By following this rule, you’re sure to catch as many mistakes as possible, before they are covered up by more layers of your code. In fact, one developer at Microsoft told me that he routinely runs the entire OS under verifier.

The description that follows is done on the current 64-bit XP preview release, but other OSes are similar. To enable this magic, it’s easiest to start verifier.exe from the Start->Run box. Although there is another registry-based way to configure verifier, I won’t go into it here – run Regmon if you’re curious. Choose “Create custom settings” and click Next, and “Select individual settings from a list”, and Next. At this point, you’re prompted with a list of verifications. The easiest and most comprehensive thing is to check them all, but you may want to leave low resources simulation off the list during development and early testing. Also, IRQL checking has a sizeable performance impact, due to the fact that it invalidates all pageable pages in the driver before each call into your code. Still, this is an invaluable test, particularly in combination with the PAGED_CODE macro. Once you finish with options, you are prompted to select which drivers to verify.

Once verification is started on your driver, be sure to have a kernel debugger hooked up, as any violations that Verifier finds will turn into breaks into the debugger. If you don’t have a debugger attached, the system will just bugcheck with verifier’s own bugcheck code. Usually, verifier is pretty clear about what has gone wrong with your driver, so the problems (if not the fixes) are pretty obvious. In my experience, Verifier doesn’t catch many false positives – if Verifier breaks in, there is a very high probability that you have a real bug, and you should fix it.

One other thing – I occasionally find drivers that have clearly not been tested against verifier, because they trip it off during whole system verification. If you find one of these drivers, be a good citizen and drop a friendly note to the company that is responsible for it. Bug reports from clueful developers are always appreciated. And, if that doesn’t work, there’s always public shame. 🙂

A Tale Of Two Laptops

I’m a firm believer in broadening horizons. I love alternatives and underdogs. Variety is the spice of life. With all of that in mind, I ordered a couple of new laptops for my development team that came in this week.

The first one was a Sager – a brand that I had never heard of before a few months ago. Sager makes an impressive box. In fact, it’s by far the most impressive laptop I’ve ever run across, feature-wise. One lucky developer wound up with a Sager NP4750, which is an AMD64-based box. In addition to having every option I’ve ever heard of in a computer — seriously, check that link if you don’t believe me — the AMD64 setup seems to be pretty solid. As you know if you’ve been reading my blog for a while, I’m a big fan of the AMD64, and of 64-bit computing in general. Other than a few minor gotchas with the XP 64-bit preview release (bluescreen on trying to install the wrong VMWare, sound drivers don’t quite work all of the time, etc), it looks good. We’re still working on the set-up, so I’ll let you know if any of those problems don’t get resolved. FWIW, my dual Opteron wound up with zero problems, and I couldn’t be happier with it. Since I’d never heard of Sager before this, I invested in the best warranty coverage they could offer. It can’t be any worse than the WinBooks that are on their way out!

Because of the success and ultra-coolness of that Sager laptop, I decided to do the obvious thing and buy an Apple PowerBook G4. After having run Linux on my laptops for many years, I decided it was time to upgrade to a slightly more usable UNIX laptop. My wife has had a PowerMac G5 for over a year now, and it’s been fantastic. Everything works right, the UI is the most beautiful graphics work I’ve ever seen (non-art graphics, anyway), and in general, the Mac Mystique is real. This laptop does nothing to diminish my happiness with Apple. Seriously, if you’ve never bought a piece of Apple hardware before, go treat yourself to an iPod or something and marvel at the amazingly good packaging and perfect out-of-box experience. As someone who has been in the product business for a few years, I have learned to really appreciate the companies in the world that do an outstanding job on fit and finish.

Anyway, you might be wondering what a Windows driver developer is doing running all of this wacky hardware. The answer is simple: the more different environments you use, the better you get at using all of them. The more different operating systems you expose yourself to, the better you get at improving any of them. My Macintosh experiences (and my Linux experiences) have been invaluable when it comes to improving my Windows products.

I’m still in the process of setting up the Mac, but I have Microsoft VirtualPC 6.1 and Microsoft Office 2004, so I have everything I need to do driver development the way I always have. Emulation speed isn’t as good in VPC, however, so for testing, I use Microsoft Remote Desktop to get into my aforementioned dual Opteron. It’s so much faster than any laptop I’ve ever seen (even the Sager) that laptop-based testing just doesn’t make sense to me any more.

Some Follow-Up To Previous Comments

A couple of things:

– Further offline discussion with Wayne points out that, if there are memory barrier issues in Java (in its current incarnation), they are JDK problems, not language problems per se, due to the fact that Java guarantees “program order” (causing a permanent performance penalty). The example he gave turned out to need some re-working to really test this correctly.

– The Java synchronization stuff posted doesn’t actually do anything for memory barriers at all in theory, although as it happens, all of the underlying OS synchronization primitives provide implicit memory barriers. Java on an architecture in which synchronization primitives are implemented differently might have a problem.

Rod posted a cool link about Java memory issues. It’s a good thing that I Hate Java, or else I’d have to be concerned about stuff like this. 🙂

– As far as memory barrier references go, there are few. There is some discussion in Dekker and Newcomer’s Writing Windows NT Device Drivers, which is old and out of date, but a great book nonetheless. Wikipedia has an article about memory barriers, and they’re covered in the processor manuals for the Pentium 4, Itanium, and AMD64. Note that they’re also sometimes referred to as “fences”. Adrian Oney from Microsoft knows about them; that’s as much as I can say about that though. 🙂 I would really appreciate any additional resources you find.

Intel And Multi-Core Chips

One of my daily reads is ArsTechnica. Hannibal, their CPU guy, does an amazing job of talking about the low-level stuff in such a way that it’s easily understandable, even to people with limited neural matter such as myself.

His review of the Intel Developer Forum (just concluded) has some very interesting stuff in it with regard to dual-core CPUs. As if it weren’t already important enough to design software in an MP-safe way, now we’re reaching the point that software *must* be designed to take advantage of multiple CPUs, or else a significant chunk of your average microprocessor will go unused.

If you have 5 minutes after you’re done with your real work (i.e. reading this blog), head on over there and take a peek, and then make sure you re-test all of your drivers on MP boxes for good measure.

Memory Barriers Wrap-up

Hello blogosphere! I hope everyone had a great time this weekend puzzling through the mysteries of memory barriers. Personally, I spent the weekend coding and reading about relativity (a recent post by Raymond Chen got me re-re-re-re-re-started on physics again).

In addition to the above-mentioned nonsense, I got some time to drag out the intel manuals to see what they had to say about x86 memory barriers. For the curious, the details can be found in section 7.3 of the 3rd volume of the Intel Pentium 4 manuals.

The situation is slightly different between the {i486, P5} and P6+ (Pentium Pro, Pentium II, Xeon, etc.) processors. The first group of chips enforces relatively strong program ordering of reads and writes at all times, with one exception: read misses are allowed to go ahead of write hits. In other words, if a program writes to memory location 1 and then reads from memory location 2, the read is allowed to hit the system bus before the write. This is because the execution stream inside the processor is usually totally blocked waiting for reads, whereas writes can be “queued” to the cache somewhat more asynchronously in the core without blocking program flow.

The P6-based processors present a slightly different story, adding support for out-of-order writes of long string data and speculative read support. In order to control these features of the processor, Intel has supplied a few instructions to enforce memory ordering. There are three explicit fence instructions – LFENCE, SFENCE, and MFENCE.

  • LFENCE – Load fence – all pending load operations must be completed by the time an LFENCE executes
  • SFENCE – Store fence – all pending store operations must be completed by the time an SFENCE executes
  • MFENCE – Memory fence – all pending load and store operations must be completed by the time an MFENCE executes

These instructions are in addition to the “synchronizing” instructions, such as interlocked memory operations and the CPUID instruction. The latter cause a total pipeline flush, leading to less-efficient utilization of the CPU. It should be noted that the DDK defines KeMemoryBarrier() using an interlocked store operation, so KeMemoryBarrier() suffers from this performance issue.

This story changes on other architectures, as I’ve said before, so the best practice is still to code defensively and use memory barriers where you need them. However, it doesn’t look like you’re likely to run into these situations in x86-land.

Memory Barriers, Part 2

So my question du jour is, “Is anyone still not using Firefox?” I have been getting sick in recent months of friends and family calling me and complaining about spyware, pop-ups, viruses, and so on. Amazingly enough, simply installing Firefox has dropped my personal support call volume to near-zero. I’ve also been using Firefox exclusively for months, except for accessing certain MS sites that require IE, and have been thrilled. YMMV, of course, but the newly-released 1.0 preview release runs amazingly well on both Linux and Windows. It’s actually more stable on my AMD64 box than either the 32-bit or 64-bit versions of IE.

OK, so yesterday, I posted an extra-credit assignment. Nobody tried it, so I’m going to elaborate on it a bit. If you haven’t read yesterday’s post yet, scroll down and do so before trying to go at this one.

int a = 0;
int b = 1;

f()
{
        for(;;)
        {
                ASSERT(a < b);
        }
}

g()
{
        for(;;)
        {
                b++;
                a++;
        }
}

This code is similar to code I once saw a Microsoft person scribble on a whiteboard, and I thought it was a really interesting way to frame the memory barrier problem. Say you create both threads f() and g() on a dual-proc computer and then just walk away and let it run. Will the ASSERT ever fire? According to the MS guy, the answer is “yes”, and the reason is that the a++ can be committed to RAM before the b++, making a == b.

Consider the values of a and b after a few revolutions. There are a couple of different scenarios:

     case 1                  case 2
   (expected)          (reordered writes)
    a | b                   a | b
    -----                   -----
    0 | 1                   0 | 1   (initial)
    0 | 2                   1 | 1   (after b++; #2 re-orders write)
    1 | 2                   1 | 2

There are other sequences too, for example: a++, b++, b++, a++; and b++, a++, a++, b++.

There are a couple of interesting things to think about here. The first is that this happens in a loop. That effectively gives you two places to put memory barriers: between b and a, like so:

g()
{
        for(;;)
        {
                b++;
                KeMemoryBarrier();
                a++;
        }
}

or between a and b:

g()
{
        for(;;)
        {
                b++;
                a++;
                KeMemoryBarrier();
        }
}

These two placements look interchangeable at first glance, but they actually solve slightly different problems, as outlined in the sequences given above.

So, over the weekend, here are three more things to ponder:

  1. What impact does the fact that a++ is actually a read/update/write operation have on this? Is the effect architecture-specific?
  2. Are the reordering issues different between on-chip reordering and compiler-generated reordering? Is this also architecture-specific? (think 64-bit computing here)
  3. What would the tables look like under the various possible sequences with and without the barriers in either or both places?

Have a good weekend!

Memory Barriers

Sorry for the long break in blogging; I’ve been catching up on 1001 things at work, and getting ready for an upcoming trip to the Old World. I promise I won’t let it happen again! <g>

I first heard about memory barriers from Ed Dekker’s book on NT Device Drivers. This is still probably my favorite overall book on driver-writing, even though it’s getting to be badly out of date. Ed can tell stories with the best of ’em, and his is one of the few books that really has a personality. He addressed the concept of memory barriers in conjunction with the (slightly oddball) Alpha processor port of NT.

First, consider the following code:

int a = 0;
int b = 0;

f()
{
        while(a == 0)
        {
        }

        ASSERT(b == 1);
}

g()
{
        b = 1;
        a = 1;
}

Assume f() and g() are two threads started simultaneously. Will that ASSERT() ever fire? The naive answer is “no”. However, modern super-scalar processors sometimes re-order memory accesses for various reasons, and if that happens, the ASSERT can be tripped. This is subtle; think about it for a second if it’s not immediately obvious.

This problem can be fixed with a memory barrier. A memory barrier is an explicit instruction to the CPU that orders reads and writes to memory. In other words, it requires that any outstanding read or write accesses to memory be completed, processor-wide. The implementation of a memory barrier is CPU-specific, as are the situations in which one might be needed. IA-64 write combining presents different issues to programmers than normal x86 semantics, for example.

On an x86 chip, any interlocked operation will force an implied memory barrier, and if you’re using a new enough DDK, you can call KeMemoryBarrier() to make your intentions obvious. The above code would be fixed, for example, by changing g() as follows:

g()
{
        b = 1;
        KeMemoryBarrier();
        a = 1;
}

So how can you tell when you need one? Well, the good news is that it doesn’t seem like I run into many situations where this is an issue. The operating system protects you with implicit memory barriers included in all locks, and if you always protect shared memory with a lock of some sort, you’re safe. However, if you try to minimize the use of locks in your code, this can jump up and bite you.

There is one other source of re-ordering that you should be aware of, as well: compilers tend to re-order things in certain cases, and while the rules for re-ordering are subtle and complex, you can always protect yourself using a combination of the “volatile” keyword in C code and compiler-specific intrinsics.

For more information on memory barriers, check out this paper at WHDC.

Extra credit: analyze the following code, in light of memory barriers:

int a = 0;
int b = 1;

f()
{
        for(;;)
        {
                ASSERT(a < b);
        }
}

g()
{
        for(;;)
        {
                b++;
                a++;
        }
}

When Does “Output” Mean “Input”?

After more philosophization on the meaning of direct IOCTL codes, I came to the conclusion that I’ve never used METHOD_IN_DIRECT in a driver. Naturally, I wondered if it was any different than METHOD_OUT_DIRECT. Boy, was that an interesting investigation.

To start off with, you have to know a thing or two about how an IRP works. An IRP is the basic data structure passed into all driver dispatch routines. It contains all of the caller’s parameters, as well as an associated data structure that replaces the traditional stack used during function calls. In particular, IRPs have a member called MdlAddress. Note that it doesn’t say “InMdlAddress” and “OutMdlAddress” – it’s just MdlAddress.

After some consideration, I determined that when a usermode app calls DeviceIoControl() or NtDeviceIoControlFile() on a METHOD_IN_DIRECT code, it must just pass its data in the InputBuffer into the driver at MdlAddress. I put together a quick test driver to verify this fact. Nope, wrong.

The next step was to look around for any sample code that calls DeviceIoControl() with METHOD_IN_DIRECT. I searched my DDKs for about 5 minutes and finally gave up – the only samples I found were calling from the kernel, and not calling NtDeviceIoControlFile().

After fiddling with the code for long enough to convince myself that I wasn’t crazy (riiiiight), I decided to do what any sane developer would do in a similar situation: I broke out WinDbg. Knowing that all IOCTL requests from user mode end up calling NtDeviceIoControlFile, I disassembled that function:

kd> ln nt!NtDeviceIoControlFile
(8052af7e)   nt!NtDeviceIoControlFile   |  (8052afaa)   nt!NtFsControlFile
Exact matches:
    nt!NtDeviceIoControlFile = 
kd> u 8052af7e 8052afaa
nt!NtDeviceIoControlFile:
8052af7e 55               push    ebp
8052af7f 8bec             mov     ebp,esp
8052af81 6a01             push    0x1
8052af83 ff752c           push    dword ptr [ebp+0x2c]
8052af86 ff7528           push    dword ptr [ebp+0x28]
8052af89 ff7524           push    dword ptr [ebp+0x24]
8052af8c ff7520           push    dword ptr [ebp+0x20]
8052af8f ff751c           push    dword ptr [ebp+0x1c]
8052af92 ff7518           push    dword ptr [ebp+0x18]
8052af95 ff7514           push    dword ptr [ebp+0x14]
8052af98 ff7510           push    dword ptr [ebp+0x10]
8052af9b ff750c           push    dword ptr [ebp+0xc]
8052af9e ff7508           push    dword ptr [ebp+0x8]
8052afa1 e84ea70000       call    nt!IopXxxControlFile (805356f4)
8052afa6 5d               pop     ebp
8052afa7 c22800           ret     0x28

It looks like NtDeviceIoControlFile just hops directly to IopXxxControlFile(), which is not exported. Disassembling that function in WinDbg shows that this is where the real magic happens. Some selected lines:

kd> ln nt!IopXxxControlFile
(805356f4)   nt!IopXxxControlFile   |  (80535dac)   nt!IopInitializeBootLogging
Exact matches:
    nt!IopXxxControlFile = 
kd> u 805356f4 80535dac
8053579f e846befdff       call    nt!ProbeForWrite (805115ea)
805357ed e8409ff6ff       call    nt!ObReferenceObjectByHandle (8049f732)
805358da e85408efff       call    nt!IoGetRelatedDeviceObject (80426133)
805358e4 e86f06efff       call    nt!IoGetAttachedDevice (80425f58)
80535aea e84deaeeff       call    nt!IoAllocateIrp (8042453c)
80535c16 e84cebeeff       call    nt!IoAllocateMdl (80424767)

etc…

OK, so now I know we’re in the right function. Now I look for what happens to METHOD_IN_DIRECT, which (according to the DDK) is type 1. That IoAllocateMdl call looks promising, too, as we know that the function should only be allocating a MDL for DIRECT I/O. Some exploration yields:

80535c0c 53               push    ebx
80535c0d 6a01             push    0x1
80535c0f 56               push    esi
80535c10 ff752c           push    dword ptr [ebp+0x2c]
80535c13 ff7528           push    dword ptr [ebp+0x28]
80535c16 e84cebeeff       call    nt!IoAllocateMdl (80424767)

Now, remember that arguments are pushed on the stack backwards, so ebp+0x28 will be VirtualAddress, ebp+0x2c will be Length, esi (which is xor’d to 0) represents a FALSE for SecondaryBuffer, 0x1 is TRUE for ChargeQuota, and ebx holds the address of the IRP (which I know is correct, because it was set to the return value of IoAllocateIrp()).

The interesting point is that this is the *only* call to IoAllocateMdl in the entire function. In fact, it’s the only call to any MDL-related function, so that must be what’s used to set MdlAddress. A little exploration confirms that:

kd> dt nt!_IRP
   +0x000 Type             : Int2B
   +0x002 Size             : Uint2B
   +0x004 MdlAddress       : Ptr32 _MDL
...

80535c16 e84cebeeff       call    nt!IoAllocateMdl (80424767)
80535c1b 894304           mov     [ebx+0x4],eax

Here, I used the dt command to tell me the offset of the MdlAddress member of the IRP struct. Then, I looked at what happened to the return value (eax), and sure enough, it’s a match. Remember that we determined above that ebx is our IRP.

So, only one question remains: what data is mapped into that MDL? Here’s the interesting part: those arguments provided to IoAllocateMdl are statically defined. They’re not dependent on the transfer method. In other words: no matter what transfer method you choose, if you get to the IoAllocateMdl() call, you’re getting the same buffer mapped into the MDL. Which buffer is it?

To find that out, we have to identify ebp-28 and ebp-2c. Looking back at the way this function was called, we should be able to figure out what happens. The good news here is that this function uses the standard stack frame pointer, which is set up at the top of the function:

kd> u 805356f4 80535dac
nt!IopXxxControlFile:
805356f4 55               push    ebp
805356f5 8bec             mov     ebp,esp

This means we only have to look at whatever is +28 in the caller’s frame. Remember that the push we just did above is the first thing on the stack, and the return address will be next. So, we just go back to the caller’s string of pushes and look for the one at +20, which will be the 9th argument. That turns out to be ebp+0x28 as well. Using the same logic, we see that our argument is the 9th argument to NtDeviceIoControlFile. Now, we just crack open our copy of Nebbett’s Native API book, and find that the 9th argument to NtDeviceIoControlFile() is OutputBuffer!

Well, that certainly explains a lot. No matter whether you specify METHOD_IN_DIRECT or METHOD_OUT_DIRECT, it looks like Windows will just build a MDL on OutputBuffer. After this little revelation, I went back and tried to figure out what happened to InputBuffer, which is the 7th argument, at offset ebp+0x20. I didn’t have to look far – immediately above the IoAllocateMdl() stuff is this:

80535bca 397520           cmp     [ebp+0x20],esi
80535bcd 7435             jz      nt!IopXxxControlFile+0x510 (80535c04)
80535bcf 68496f2020       push    0x20206f49
80535bd4 ff7524           push    dword ptr [ebp+0x24]
80535bd7 ff75d8           push    dword ptr [ebp-0x28]
80535bda e8075ceeff       call    nt!ExAllocatePoolWithQuotaTag (8041b7e6)
80535bdf 89430c           mov     [ebx+0xc],eax
80535be2 8b4d24           mov     ecx,[ebp+0x24]
80535be5 8b7520           mov     esi,[ebp+0x20]
80535be8 8bf8             mov     edi,eax
80535bea 8bc1             mov     eax,ecx
80535bec c1e902           shr     ecx,0x2
80535bef f3a5             rep     movsd
80535bf1 8bc8             mov     ecx,eax
80535bf3 83e103           and     ecx,0x3
80535bf6 f3a4             rep     movsb
80535bf8 c7430830000000   mov     dword ptr [ebx+0x8],0x30
80535bff 33f6             xor     esi,esi
80535c01 8b4d2c           mov     ecx,[ebp+0x2c]

Remember that esi is still 0. This code allocates a buffer of ebp+0x24 (i.e. InputLength) bytes and sets it to Irp->AssociatedIrp.SystemBuffer (also found with the dt command). It then does what boils down to RtlCopyMemory(), x86-style, from source ebp+0x20 (InputBuffer) to dest SystemBuffer, length ebp+0x24 (InputLength). In other words, the system always double-buffers InputBuffer on NtDeviceIoControlFile().

OK, so I know you really have to be a geek to find this fascinating, but I really didn’t gather that this was the case just from reading the documentation, although it’s certainly possible that I missed it. The lack of samples seems to indicate that this isn’t a commonly-used code path, either.

The bad news is that this post has taken over 2 hours to write, and now it’s likely that I’m going to be late to work. See you on the flip side.