Low-Level Coding

I have always been an advocate of programmers knowing how things work under the hood. There are lots of reasons for this, and lots of examples of others thinking the same thing. Most CS schools offer (require?) a course in compiler design, for example, so that students get a feel for what goes on under the hood of a programming language. Raymond Chen mentions that he does most of his debugging in assembly language, and Larry Osterman says the same thing. I’ve long had a habit of having coders that work for me learn x86 assembly if they didn’t know it already, and I prefer to hire people who do.

While assembly is near and dear to my heart, there are other low-level topics that I have hardly given a second thought to over my career, one of which is the Hell of Floating Point Math. However, I just took the time out to read Eric Lippert’s five-part series on Fun With Floating-Point Arithmetic. It’s well-written and worth a read, even if you don’t usually run into floating-point math (and particularly if you don’t).
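
If you’ve never been bitten by floating point, here’s a quick illustration of the kind of surprise the series explains. This is just my own throwaway example in plain C, nothing driver-related:

#include <stdio.h>

int main(void)
{
    double d = 0.0;
    int i;

    /* 0.1 has no exact binary representation, so a tiny error accumulates. */
    for (i = 0; i < 10; i++) {
        d += 0.1;
    }

    printf("%.17g\n", d);      /* prints 0.99999999999999989, not 1 */
    printf("%d\n", d == 1.0);  /* prints 0 */
    return 0;
}

If that output surprises you, all the more reason to go read Eric’s posts.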

Another of my favorite resources is DataRescue’s fabulous Interactive Disassembler, commonly known as IDA. It’s really one of the tools that I’ve come to rely on to get an idea of what happens to my nice, tidy (riiiight…) C code once cl.exe is done with it. Although the same thing can be accomplished with WinDBG, using IDA to examine your code (or someone else’s!) is much easier, because it can properly handle C-style constructs such as structs, unions, switch statements, etc. You can also supply names for variables, which is a major help in understanding what’s going on. I have no commercial interest in IDA, for what that’s worth.
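
To give a contrived example of what I mean (this is purely illustrative, not code from anything I ship), consider a function that pokes at a couple of structure members:

struct CONNECTION {
    unsigned long State;     /* offset 0x00 on 32-bit x86 */
    unsigned long RefCount;  /* offset 0x04 */
    void         *Buffer;    /* offset 0x08 */
};

void Disconnect(struct CONNECTION *Conn)
{
    Conn->State = 0;
    Conn->RefCount--;
}

In a raw disassembly, those member accesses show up as bare offsets, something like mov dword ptr [eax], 0 and dec dword ptr [eax+4]. Define the structure in IDA and apply it to the operands, and the same instructions display as [eax+CONNECTION.State] and [eax+CONNECTION.RefCount], which makes the listing dramatically easier to follow.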

Real-Life Debugging, Final Answer

After much frustration following the dismal failure of my naming theory, we decided to do the unthinkable and actually test the code in the field, with the affected customer. We got exceedingly lucky this time around – the administrator at the customer site (in southern Mexico!) was amazingly helpful and willing to jump through all of the stupid hoops I asked him to.

So, we went back to the drawing board and got a full read-out from the customer on what actually happens when things break. He mentioned a detail that I had missed the first time around, and it made all the difference: he said that when the computer drops off the network, existing drive mappings break in addition to new connection attempts. OK, so this probably isn’t name resolution after all.

Although this technique is probably offensive to folks like Raymond Chen, I decided the best course of action here would be to sweet-talk the admin into putting a remote control application (I use TightVNC a lot, including this time) on the affected file server. I had him set up a packet capture using Ethereal, but when things broke, all I got were a bunch of TCP RST packets, indicating that the server decided that there was no longer a listening service at that port. Weird.

Then another engineer decided to poke around in the server’s event log. We started seeing messages from SRV (the file sharing service) complaining about failing to allocate a work item. This time we thought for sure we were onto something. We set up a reproduction environment, and this time we stressed the network (as opposed to just letting it sit there for a while, waiting for name resolution to stop working), and under stress, the box froze in mere minutes, putting the same message into the event log.

After doing some comparisons with the Passthru intermediate driver sample, another engineer found the bug: we had allocated a packet pool of only 32 packets, as opposed to the 65,535 that Passthru allocated. Packets were getting mishandled internally during periods of high stress, and somehow, this was causing SRV to bomb out and just quit listening on the ports.
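
For the curious, the difference boils down to something like the following. This is a reconstruction using the Passthru sample’s constants, not a paste from our actual source, and PACKET_RESERVED stands in for whatever per-packet context structure the driver keeps:

#define MIN_PACKET_POOL_SIZE 0x000000FF
#define MAX_PACKET_POOL_SIZE 0x0000FFFF

NDIS_STATUS Status;
NDIS_HANDLE SendPacketPoolHandle;

//
// What we had: a fixed pool of 32 packet descriptors, which simply runs
// dry when the box is under heavy load.
//
// NdisAllocatePacketPool(&Status, &SendPacketPoolHandle, 32,
//                        sizeof(PACKET_RESERVED));
//
// What Passthru does: a pool that starts small but can grow to 65,535
// descriptors by handing out overflow descriptors as needed.
//
NdisAllocatePacketPoolEx(&Status,
                         &SendPacketPoolHandle,
                         MIN_PACKET_POOL_SIZE,
                         MAX_PACKET_POOL_SIZE - MIN_PACKET_POOL_SIZE,
                         sizeof(PACKET_RESERVED));
if (Status != NDIS_STATUS_SUCCESS) {
    // Fail driver initialization rather than limping along with no pool.
}

Once the pool can no longer grow, packets start getting dropped or mishandled, and as we found out, the rest of the system doesn’t always degrade gracefully.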

A new driver was built and sent to the client, who ran with it for a few days and declared victory. This whole experience took several days, but once again, it just goes to show that there is no substitute for getting close to the problem, either with a reliable reproduction or with close customer contact during debugging.

Real-Life Debugging, Part 2

On Thursday, I talked a little about a problem I’m facing with one of my drivers, which causes the machine running my code to disappear from the (Windows file-sharing) network. My first theory on this had to do with name resolution.

Windows uses an old, complicated process to make name resolution work, dating waaaay back to the Windows for Workgroups and Lan Manager days. Each computer got (and still gets) a name called a “NetBIOS Name”, which is up to 15 characters in length. Larry Osterman has more information about it in a recent post.

The issue here is how these names are resolved. The original way to make this resolution work was to just broadcast a datagram on the local subnet asking who owned that particular name. This, of course, had the effect of limiting NetBIOS-based networks to a single physical subnet (lmhosts files notwithstanding). In conjunction with this, Microsoft implemented a “browser” protocol that collected names and kept them around for resolution on the subnet. The details are esoteric, but suffice it to say that name resolution can take a while to get working, or to break after a computer goes offline, due to the protocol used to update browsers. All of this theorizing was done in a vacuum, with another engineer and me just kind of thinking about it and wondering what could be up.
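
If you want to poke at this machinery yourself, the standard tools expose most of the moving parts. Purely as an aside (FILESRV01 is a made-up machine name):

C:\> nbtstat -n              (NetBIOS names this machine has registered)
C:\> nbtstat -c              (the local NetBIOS name cache and remaining TTLs)
C:\> nbtstat -a FILESRV01    (a remote machine's name table, queried directly)
C:\> nbtstat -R              (purge the cache and reload #PRE entries from lmhosts)

Watching the cache fill and expire gives you a feel for just how much lag is built into this scheme.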

So, we went looking through the code to see if we could figure out what was going on. Indeed, we found some old code that had been put in to eliminate 137/UDP datagrams, due to an issue with WINS registration that I’ll go into in another post someday. We took that code out of the driver and triumphantly handed the new build to the affected customer.

And we were wrong.

So much for debugging in a vacuum. Next time: the road to the final solution (we think).

A Real-life Debugging Challenge

My company typically sells to customers that are nowhere near its headquarters. This presents some special, real-world problems when it comes to debugging drivers. We’re currently in the middle of debugging a real booger of a bug, and in this case, the bug only shows up at low latitude.

One of our customers is in Mexico, and in this particular network, the installation of our software causes his file “server” (Windows XP SP2) to suddenly drop off the network. Nobody could manage to reproduce the problem internally, and only two other customers in the history of the company had reported the problem.

The product has two kernel-mode drivers: a virtual Ethernet miniport and an NDIS intermediate driver.

Sooo, what would cause a Windows computer to drop off the network an hour or two after boot, every time, but only when these two drivers are installed? I’ll discuss my first theory in the next post, but for an advance hint, go check out Larry Osterman’s blog – he has a couple of relevant posts in the last day or three.

Security: Handing a Kernel Buffer to Usermode

Peter Viscarola, one of the Cadre of Bright Guys and Gals that work at OSR, posted a note on NTDEV this week that I think bears repeating and explaining a bit, related to a subtle driver security point.

The note that prompted the response had the following code in it:

deviceExtension->VirtualAddress = ExAllocatePool(PagedPool, NumberOfBytes);
if (deviceExtension->VirtualAddress) {
   deviceExtension->Mdl = IoAllocateMdl(deviceExtension->VirtualAddress, NumberOfBytes, FALSE, FALSE, NULL);
   MmProbeAndLockPages(deviceExtension->Mdl, KernelMode, IoModifyAccess);  
   ...
}

The driver eventually wound up mapping deviceExtension->VirtualAddress into a usermode address and returning it to an application. Peter mentioned that there is a security flaw in this architecture, and when asked, explained it as follows:

It is possible for a malicious application to clone your address space, thus getting a pointer into pool. That's all well and good... until the original app exits, and your driver returns the block of pool. At which time the malicious app would have access to pool that's being used for ... anything ... perhaps for system purposes.

I admit that I had never thought of this, perhaps in part because I never give apps direct access to pool. There are a number of problems with this design, including problems associated with process context; see previous Kernel Mustard articles for more details on this.

But the scarier part is the security problem. The rest of this may just be bad architecture, but there aren’t many things worse than a security flaw, especially one that gives the user access to kernel memory. A good rule of thumb is to NEVER map kernel-mode pages into usermode addresses for any reason. This should be an architectural red flag! Far better to let the I/O manager map the user’s buffer for you, via ReadFile/WriteFile or an IOCTL that uses METHOD_IN_DIRECT or METHOD_OUT_DIRECT.
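
As a concrete (and heavily trimmed) sketch of what I mean, here’s roughly what the direct-I/O version of such an interface looks like. The IOCTL code, the MY_STATS structure, and the g_Stats variable are all made up for illustration:

#define IOCTL_MYDEV_GET_STATS \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)

NTSTATUS MyDeviceControl(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PIO_STACK_LOCATION irpSp = IoGetCurrentIrpStackLocation(Irp);
    NTSTATUS status = STATUS_INVALID_DEVICE_REQUEST;
    ULONG_PTR bytes = 0;

    UNREFERENCED_PARAMETER(DeviceObject);

    if (irpSp->Parameters.DeviceIoControl.IoControlCode == IOCTL_MYDEV_GET_STATS) {
        //
        // With METHOD_OUT_DIRECT, the I/O manager has already probed and
        // locked the caller's output buffer and built an MDL for it, so no
        // kernel-mode memory ever gets exposed to the application.
        //
        PVOID buffer = NULL;
        ULONG length = irpSp->Parameters.DeviceIoControl.OutputBufferLength;

        if (Irp->MdlAddress != NULL) {
            buffer = MmGetSystemAddressForMdlSafe(Irp->MdlAddress,
                                                  NormalPagePriority);
        }

        if (buffer == NULL) {
            status = STATUS_INSUFFICIENT_RESOURCES;
        } else if (length < sizeof(MY_STATS)) {
            status = STATUS_BUFFER_TOO_SMALL;
        } else {
            RtlCopyMemory(buffer, &g_Stats, sizeof(MY_STATS));  // copy out driver data
            bytes = sizeof(MY_STATS);
            status = STATUS_SUCCESS;
        }
    }

    Irp->IoStatus.Status = status;
    Irp->IoStatus.Information = bytes;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}

The application just calls DeviceIoControl with an ordinary buffer it owns, and the lifetime problem Peter describes never comes up.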

Peter continues:

You really might want to re-think the whole idea of mapping POOL into an application's address space. This approach can open a complex security loophole that can endanger the system. I'd recommend allocating memory in a section, and using that. This isn't entirely free of potential problems, but it's better.

The whole architecture sounds suspect to me, and as I said above, I’d just fix the design to not require kernel-mode mapping of memory that gets handed to the user. The kernel itself is the only software that should perform this task, because it has all of the necessary checks to ensure system security.

Thanks to Peter for the excellent point. Now, for 10 Brownie Points, who can explain what problems he was talking about with regard to the section object solution?

I Am A Two-Timer!

My wife found out tonight that I am a two-timer: Microsoft has been kind enough to award me the MVP award for the second year in a row. I’d like to thank them for their vote of confidence, in spite of my (obviously) slow autumn, with two vacations and a huge pile of projects that were/are due to complete at the end of the year.

Thanks again to all of my readers for helping make this new blog a success. Here’s hoping for a productive and innovative 2005.