Cool Hack Of The Week

Here’s an e-mail from Jamey Kirby, posted to NTDEV, on how to keep Explorer from showing a drive. I love it!

Use numbers for the drive letters. You can access the drive via
3:\yadayadayada\yadayadayada. You can even park a CMD prompt on it, but
explorer and associated APIS will not enumerate them.

Jamey

Jamey gets a Gold Star(TM) for pointing this out.

(Stack) Size Matters

Somehow, I managed to catch a cold this weekend, and I can already feel the NyQuil starting to kick in. If this post seems, well, a little drunk — that’s why. The good news is that I’m having fun writing it!

Today I think I’ll talk a bit about the kernel-mode stack. There are a few interesting issues in play here for kernel-mode developers.

The kernel stack is small. It is usually 3 pages or so, which means 12K on x86. Believe it or not, this is bigger than it used to be – it was 8K in NT4 and earlier. My guess is that Microsoft increased the kernel stack size in Windows 2000 because of the deeper layering of drivers brought about by WDM (more on that in a sec). If you use more than 12K of stack, you’ll hit a guard page, which probably means an instant double-fault bluescreen. The reason this is a double fault is that the CPU tries to build an exception record on the stack for transfer to the exception handler, and when it tries to push the record onto the overflowed stack, it faults again. The only thing the kernel can do at that point is die a painful death. It’s worth noting that almost every time I’ve had a double-fault bluescreen, it’s been a stack fault.

Why is the stack so small? Well, this was an early design decision made by the kernel team. First off, kernel stacks are generally not pageable. That means that you’re gobbling up 12K of physical memory for each thread running in the system. You have to keep that stack sitting around for the life of the thread, even while it’s in user mode, until it exits. With the hundreds of threads in systems today, that memory starts to add up fast. Additionally, because kernel code can execute at raised IRQL, kernel stacks cannot take page faults on access. That means that the traditional way of automatically growing the stack cannot be implemented in the kernel. Because the OS has to pick a fixed stack size up front, and because stack memory is a scarce resource, 12K was the compromise the kernel team landed on.

This means a few things: first, you need to be conservative with local variables. No more 64K arrays on the stack, for one thing. Be aware of the fact that you probably exist within a driver stack, and the drivers above and below you would be grateful for some stack space that they could use for themselves, thank you very much. In addition, you shouldn’t ever use recursion in your driver, unless You Know What You Are Doing. Most recursive algorithms can be expressed iteratively without sacrificing too much. I know your search only goes log(n) levels deep, but don’t make me stress about whether or not it’s ever going to exhaust the available stack. Finally, avoid architectures with lots of deeply-nested functions. This isn’t an excuse to practice bad design, but it is an encouragement to keep things relatively flatter than perhaps you otherwise would have.

Going back to the layering thing for a second, this is an area where filter drivers often have a lot of trouble. There have been versions of popular antivirus scanners, implemented as filter drivers, that simply cannot be installed on systems with any other filesystem filter drivers. They just use up too much stack, so stack faults are common. Don’t be a Bad Filter – be conservative with stack space, and remember that users will associate any bluescreens with your driver if it’s the last driver they installed.

One final note: there are a few system-supplied functions that allow you to use the stack more carefully. IoGetStackLimits() will let you check the lower and upper bounds of the stack; IoGetInitialStack() will give you back the base address of the thread’s stack; and IoGetRemainingStackSize() can be called to find out how many bytes of stack are left. These functions should be used whenever a design contemplates recursion, whenever you are passed an address on the kernel stack, or, in general, whenever you’re trying to hunt down a stack overflow bug.
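
Just to make that concrete, here’s roughly what a defensive check might look like before descending into something stack-hungry. This is a sketch, not gospel – the function name and the 1KB threshold are mine, so pick numbers that make sense for your own call depth:

#include <ntddk.h>

// Hypothetical helper: refuse to descend into deeply-nested work if the
// remaining kernel stack looks too thin. The 1KB threshold is made up.
#define MIN_FREE_STACK 1024

NTSTATUS MyDeepOperation(VOID)
{
    ULONG_PTR lowLimit, highLimit;

    // The bounds of the current thread's kernel stack; handy for debugging
    // and for sanity-checking pointers you've been handed.
    IoGetStackLimits(&lowLimit, &highLimit);
    DbgPrint("stack base %p, limits %p - %p\n",
             IoGetInitialStack(), (PVOID)lowLimit, (PVOID)highLimit);

    if (IoGetRemainingStackSize() < MIN_FREE_STACK) {
        // Too close to the guard page -- bail out (or post to a work item)
        // rather than risk the double-fault bugcheck described above.
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    // ... proceed with the recursive or deeply-nested work ...
    return STATUS_SUCCESS;
}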

URL Me, Baby!

Good news: I have finally managed to procure a real URL for this blog: www.kernelmustard.com, active as of this morning. I’m very excited – I was getting tired of spelling out “msmvps.com” for people. You’d be surprised how easy it is to screw that up. Anyway, tell your friends, tell your colleagues, tell your co-workers – kernelmustard.com is ready for prime time!

Raymond Chen had a really interesting article on The Old New Thing today about alignment on 64-bit platforms. Worth a read if you’re not used to that sort of thing. I actually posted a little preview of a marshalling discussion as well, which is relevant to kernel-mode people.

Continuing vaguely along the theme from yesterday, I thought I’d plug a book that I refer to often when I have to work on Actual Hardware:

Developing Windows NT Device Drivers: A Programmer's Handbook
by Edward N. Dekker, Joseph M. Newcomer 
Addison-Wesley Publishing Co.; ISBN: 0201695901 

This is an oldie but a goodie. It’s written for Windows NT 4.0, but it’s still quite relevant in a couple of ways. In particular, it has some of the best hardware interfacing material I’ve ever encountered. Some of it is old now (e.g. the HAL calls, the DMA calls, etc.), but the concepts are there, and that’s what’s important. Also, Ed Dekker is a walking repository of Bad Hardware Stories, some of which come through in the book.

It’s not current at all when it comes to things like PnP and power management, but the rest of the concepts are still just about right-on. When you read it in the right light, it’s a valuable resource. It’s hardbound and printed on nice paper, with a great index – all features that make it actually usable for a practicing programmer.

Anyway, next time you feel the urge to expand your brain, put on some Ben Folds music, grab Developing Windows NT Device Drivers, and reminisce about the time you got your new Pentium Pro server with 48MB RAM and installed NT4 for the first time. 🙂

Driver Developer’s Toolbox, Part 1: The DDK

There are a lot of basic questions asked in the various driver development
forums that basically reduce to “how should I go about developing a driver?”.
Every so often, I’ll post a quick discussion of one of the tools I use every
day to do driver development. Try though they might, Microsoft doesn’t exactly
hammer you over the head with information like this – a lot of this know-how
really just comes from doing the job.

At the risk of starting too basic, when you sit down to build a driver, the
first tool you’ll need is the Windows Driver Development Kit (DDK). The DDK has
changed a lot over the years, but the last couple of releases have finally
started to settle down from a consistency standpoint. The current released DDK
is the Windows Server 2003 DDK, which represents build 3790 of the OS. It also
includes full development environments for Windows XP and Windows 2000. The
DDK is (almost) free from Microsoft – go to WHDC for more information on how to
get it.

Installation of the DDK takes forever. I’ll never understand why these things
take sooooo long to uncompress and copy files, but perhaps I’m just a fundamentally
impatient person. Yep, that’s it. 🙂 At any rate, do yourself a favor and install
absolutely everything in the DDK. The whole mess – all of the build environments,
all of the tools, all of the samples, all of the docs. There are a couple of reasons
for this: first, you need to have the older build environments if you’re planning on
releasing drivers for the older platforms. The “wnet” build environment isn’t always
backward-compatible with the “w2k” environment, so you need both. Unless, of course,
you are one of the lucky few that don’t have to support older OSes. I’m jealous.

The other reason is that the sample code is by far the best documentation in the whole
kit. You might find yourself wondering about the semantics of a particular API in a
particular situation, and it really helps to be able to grep through all of the samples for
usage examples. I recently was in a situation where I needed to get a file handle back
from a file object, so I looked at all of the samples I could find for examples of
ObOpenObjectByPointer(). There were *none* in any of the files, prompting me to
redesign the driver to not need that API. I’m better off now, in that the overall architecture
wound up being much cleaner.

The DDK includes the latest authoritative documentation from Microsoft on how
to build drivers. It includes both “design guides” and reference material.
The design guides tend to read more like reference material than tutorials, so I’d still
recommend one of a few third-party driver development books. The reference
sections provide lots of detail about every public function provided for use by
driver writers. Most of this documentation (all?) is also available online at
OSR’s website www.osronline.com, which
also has tons of other resources for driver writers. For those who didn’t
know, OSR is the company that hosts the NTDEV and NTFSD mailing lists.
Wherever you get it from, I’d recommend becoming very familiar with the DDK
documentation.

Also included with the DDK is an entire build environment. All of the headers,
libraries, compiler tools, and support infrastructure needed to build your
driver are included. In particular, you don’t need Visual Studio to compile
your drivers. In fact, using Visual Studio directly is not supported by
Microsoft for driver development. Mark Roddy (www.hollistech.com) has a script called
DDKBUILD that can be used to integrate the DDK with Visual Studio, but unless
you’re really glued to its editor (and believe me, there are better ones), there’s no reason to use it.
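
If you’ve never seen the build utility in action, a minimal project looks something like the sketch below. The file names “makefile” and “sources” are the real conventions; the project name and source file names are made up, and you should crib from one of the DDK samples rather than from my memory:

# makefile -- always this one line; the real settings live in "sources"
!INCLUDE $(NTMAKEENV)\makefile.def

# sources -- hypothetical project
TARGETNAME=whizbang
TARGETTYPE=DRIVER
TARGETPATH=obj
SOURCES=driver.c \
        pnp.c

# Then, from (say) the Windows XP Checked Build Environment prompt:
#     cd \projects\whizbang
#     build -ceZ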

Finally, don’t miss the array of testing and troubleshooting tools present in
the DDK. There are so many of them, and some of them are so important to doing
the job correctly, that I’ll post specifically about them another time.

Innies And Outies

Lots of people lately have been trying to do kernel mode file I/O, and running
into their share of problems in the process. I was just involved in a virtual
disk project that relied heavily on file I/O in the driver, so I thought I’d
post a quick tutorial while it’s fresh in my mind.

First things first, try not to do this. Someone from Microsoft made the point
on NTDEV that you really don’t want to have to do file I/O from kernel mode if
you can avoid it. There are security issues to consider (e.g. opening files
that the user wouldn’t have had access to), and besides that, it’s just
tricky.

With that said, if kmode file I/O is what you need to do, there are two basic
ways to do it: IRP-based and function-based. The easier of the two methods is
the function-based method, which uses the Zw APIs for file manipulation.
Typically, files are opened with ZwCreateFile, read and written with ZwReadFile and
ZwWriteFile, perhaps queried with ZwQueryInformationFile, and closed with ZwClose. This
method of file manipulation is geared toward using handles, so the standard warnings
about kernel-mode handle use apply. If you’re running on a newer OS, specifying
OBJ_KERNEL_HANDLE in your OBJECT_ATTRIBUTES is always a good idea: it keeps the handle
usable from kernel mode regardless of the current process context, while making it
inaccessible from user mode. Other than some basic API differences (i.e. OBJECT_ATTRIBUTES
structures, UNICODE_STRING strings, etc.), this should feel quite a lot like Win32 access.
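
To make that concrete, here’s a bare-bones sketch of the handle-based approach – open (or create) a file, write a buffer, close up. The function and file names are hypothetical, error handling is minimal, and this must run at PASSIVE_LEVEL:

#include <ntddk.h>

NTSTATUS WriteLogRecord(PVOID Buffer, ULONG Length)
{
    UNICODE_STRING name;
    OBJECT_ATTRIBUTES oa;
    IO_STATUS_BLOCK iosb;
    HANDLE handle;
    NTSTATUS status;

    RtlInitUnicodeString(&name, L"\\??\\C:\\whizbang.log");

    // OBJ_KERNEL_HANDLE keeps the handle usable from any process context
    // in kernel mode and out of reach of user mode.
    InitializeObjectAttributes(&oa, &name,
                               OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE,
                               NULL, NULL);

    status = ZwCreateFile(&handle, GENERIC_WRITE | SYNCHRONIZE, &oa, &iosb,
                          NULL, FILE_ATTRIBUTE_NORMAL, 0, FILE_OPEN_IF,
                          FILE_SYNCHRONOUS_IO_NONALERT | FILE_NON_DIRECTORY_FILE,
                          NULL, 0);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    // Synchronous write at the current file position.
    status = ZwWriteFile(handle, NULL, NULL, NULL, &iosb,
                         Buffer, Length, NULL, NULL);

    ZwClose(handle);
    return status;
}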

The one big caveat with function-based file I/O is that it cannot be done from any
IRQL > PASSIVE_LEVEL. This, in particular, includes APC_LEVEL. If you happen to be
sitting below a filesystem driver, for example, you may find yourself called back at
APC_LEVEL, and it is incorrect to use any of the Zw* file manipulation functions at that
IRQL. The reason has to do with I/O completion, which I’ll get into another day. The
right thing to do here is to post the I/O to a worker thread and wait for it to complete.

By the way – if you are using PAGED_CODE() to assert IRQL at the tops of your functions –
which you should be doing, by the way – remember that this will still pass even if you
are called back at APC_LEVEL, so you will have to either do an explicit IRQL check, or
better yet, just post all I/O off to a thread.

There is some debate as to whether you should simply use system work items or create a
dedicated worker thread. If you go the latter route, remember that there is a nontrivial
cost in setting up a new thread, and you have to be careful about how you kill it off –
you don’t just want to terminate it, because it won’t be cleaned up properly. Instead,
you should have an event that you set when you want the thread to exit.
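
Here’s one way the dedicated-thread approach often shapes up – a sketch with hypothetical names, the work-queue plumbing left out, and error handling trimmed. The important parts are the shutdown event and the fact that unload waits for the thread instead of terminating it:

#include <ntddk.h>

typedef struct _WORKER_CONTEXT {
    KEVENT   ShutdownEvent;     // signaled when the thread should exit
    KEVENT   WorkReadyEvent;    // signaled when there is file I/O to do
    PKTHREAD Thread;            // referenced thread object, for the final wait
    // ... your queue of pending work would live here too ...
} WORKER_CONTEXT, *PWORKER_CONTEXT;

VOID FileIoWorker(PVOID StartContext)
{
    PWORKER_CONTEXT ctx = (PWORKER_CONTEXT)StartContext;
    PVOID events[2] = { &ctx->WorkReadyEvent, &ctx->ShutdownEvent };

    for (;;) {
        NTSTATUS wait = KeWaitForMultipleObjects(2, events, WaitAny,
                                                 Executive, KernelMode,
                                                 FALSE, NULL, NULL);
        if (wait == STATUS_WAIT_0 + 1) {
            break;              // shutdown event signaled
        }
        // ... drain the work queue, doing the Zw* file I/O at PASSIVE_LEVEL ...
    }

    PsTerminateSystemThread(STATUS_SUCCESS);
}

NTSTATUS StartWorker(PWORKER_CONTEXT ctx)
{
    HANDLE threadHandle;
    NTSTATUS status;

    KeInitializeEvent(&ctx->ShutdownEvent, NotificationEvent, FALSE);
    KeInitializeEvent(&ctx->WorkReadyEvent, SynchronizationEvent, FALSE);

    status = PsCreateSystemThread(&threadHandle, THREAD_ALL_ACCESS, NULL,
                                  NULL, NULL, FileIoWorker, ctx);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    // Keep a referenced pointer so unload can wait for the thread to die.
    status = ObReferenceObjectByHandle(threadHandle, THREAD_ALL_ACCESS, NULL,
                                       KernelMode, (PVOID *)&ctx->Thread, NULL);
    ZwClose(threadHandle);
    return status;
}

VOID StopWorker(PWORKER_CONTEXT ctx)
{
    // Ask the thread to exit, wait for it, and drop the reference --
    // never just terminate it from the outside.
    KeSetEvent(&ctx->ShutdownEvent, IO_NO_INCREMENT, FALSE);
    KeWaitForSingleObject(ctx->Thread, Executive, KernelMode, FALSE, NULL);
    ObDereferenceObject(ctx->Thread);
}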

The other method for doing file I/O is simply to build and send IRPs down to the filesystem
drivers themselves. This is less documented but not difficult to do. Instead of handles,
here you’ll need the device object of the FSD and the file object representing the opened
file. In general, the idea is that you call IoBuildAsynchronousFsdRequest() with appropriate
parameters, and then attach the file object to the next stack location. If you don’t do
that latter step, you’ll see very odd crashes in the FSD. I hope to have an example of
this method posted within the week; check back if you’re curious.

With either method, there are serious deadlock issues. Without going into detail, if you
believe you will re-enter the FSD (as in the case of a virtual disk driver backed by a
file), your read and (especially) write I/O needs to be noncached, or you’ll get
into a difficult race with the cache manager.

Double, Double, Toil And Trouble

Note to self… write these posts in a decent authoring tool and *then* put
them online. A reboot just ate my first attempt at an AMD64 article, with no
recovery file. Live and learn. Now you’re stuck with the 3:00AM version,
stream-of-consciousness stizz. Enjoy 😉

Last year at Driver DevCon, the folks from AMD were out in force, armed with a
dizzying array of goodies emblazoned with their logo. What were they pushing,
you ask? The AMD64 architecture, of course, and well worth it too. AMD64 is a
64-bit extension of the x86 CPU architecture (from a programming interface
standpoint, anyway; I’m sure they’re nothing alike in the core). For
those who haven’t been keeping score at home, Intel introduced the IA-64
architecture a couple of years ago with the Itanium chip. Itanium is a cool
hunk of silicon, but it’s fundamentally incompatible with x86. AMD saw their
chance to fill a big need that Intel didn’t touch, and boy, did they do a good
job. It’s one year after DevCon, and it’s currently possible to head down to
the local computer parts store and pick up an Athlon 64 board/chip for cheap
and drop it into an otherwise-x86 system. Opteron-based servers are gaining
huge market share compared to Itanium sales, and (as attentive Kernel Mustard
readers know) I can say from first-hand knowledge how cool my dual-Opteron test
box is. When I first started looking into 64-bit chips, I was confused about a
few things. Here are some of the points I picked up along the way, so that you
might not have to duplicate my quest for enlightenment.

So, what is AMD64? In a nutshell*, this is the name of AMD’s new x86-based
64-bit CPU architecture. It was designed from the ground up to be a totally
new internal architecture, but with a mandate to change as little as possible
about the programming interface. What changes they did make are completely
obvious and intuitive if you have any understanding of x86 CPU history – the
registers have been extended from 32-bit (i.e. eax) to 64-bit (“rax”; I guess
“r” stands for ReallyExtended or something :P), and a new opcode prefix has
been added that signals the chip that a 64-bit operand is on the way. Other
than that, plus a few new general-purpose registers and some other details that
Really Don’t Matter, it looks and feels just like an x86. The long addresses
look a little weird in the debugger, but you get used to them eventually. IDA
already disassembles AMD64 binaries perfectly.

One item up for discussion at the moment is what the final name of the
architecture will be in the Geek World. AMD is partial to AMD64, of course,
and frankly, I agree – Intel got to have IA-32 and IA-64, so I don’t see why it
would hurt to have an architecture named after AMD. However, Microsoft seems
to be backing away from AMD64, and (if you can believe MSDN Subscriber
Downloads), looks to favor the shorter “x64” name. Another name that gets
thrown around occasionally is “x86-64”. The reason this matters at all is that
Intel announced (nine months ago!) that it will also produce an x86-based
64-bit chip. Details are still sketchy, but as usual, ArsTechnica has tons of good information.
My money is on “x64” winning out eventually, as it sounds the most like “x86”.
That won’t stop me from referring to it as AMD64 for now, though. I’m a slow
learner.

So why go to AMD64? Well, for starters, it’s really fast today. Forget this
emulation crap – these chips run native x86/win32 apps today, and fast.
This makes the upgrade path seamless, and at the prices these chips are selling
for, it’s now officially foolish to invest in additional 32-bit chips. Add to
this the fact that you get all of the benefits of other 64-bit architectures,
and you have a killer proposition.

There are two main variants of AMD64 chips currently in production, both of
which are made by AMD. They are:

  • Athlon 64 – This is the main consumer chip that AMD is marketing.
    Variants of this chip run on everything from notebooks to low-end home boxes to
    business-class workstations. These chips seem to be labeled with AMD’s
    traditional performance index (Athlon-64 3700, for example). They have
    versions of the chip marketed to the Celeron crowd, and even a mobile processor
    based on the 64-bit core. Not even Apple, my favorite hardware company, can
    manage that with a 64-bit G5 yet, after a year of shipping G5 workstations.

  • Opteron – This is the high-end chip, targeted at servers and
    workstations. It uses the same core and same programming interface as the
    Athlon chips, but includes higher-end features like lots of cache and the
    ability to run multiprocessor setups. The Opteron chips connect to one another
    on a motherboard via AMD’s proprietary HyperTransport bus. As I understand it,
    this bus is designed to work in conjunction with the Opterons’ built-in memory
    controllers to reduce memory bus contention. Opterons are numbered differently,
    with the base Opteron 140 at the low end. The 142, 144, 146, and so on, are
    faster versions of the 140, but all 1XX chips are designed for uniprocessor
    implementations only. The 240 series is identical to the 140, but can be run
    with a maximum of two CPUs on a motherboard. The numbering scheme continues on
    upward as speed and maximum scalability increase.

More on the AMD64 architecture can be found at
WHDC, AMD,
and of course at ArsTechnica. There is also
a previous Kernel Mustard post about my adventures with my dual Opteron 240, below.

One parting shot – There are a couple of blogs worth pointing out, in case
you’re not reading them yet. One is The
Old New Thing
by Raymond Chen, which covers a wide variety of topics that I almost
always find interesting (well, at least until he starts pontificating about GUIs!).
Another is what promises to be an excellent new
blog
by Thomas Divine. Thomas spends a lot of time doling out free advice
on the PCAUSA NDIS discussion list (which I also post to on occasion), and is
the leading non-Microsoft source for network driver information.

Now, back to beating my head against the Hell Of Filesystem Locks. See you
tomorrow.


* It is worth pointing out that this is a lie; I have never really been able to say anything in a nutshell.

Twelve Programming Truths

I’ve posted an article here that I’ve been tinkering around with for a while. It’s an adaptation of RFC1925 to computer programming. Feedback welcome, of course.

The Twelve Programming Truths

Steve Dispensa
CTO, Positive Networks
MVP, Microsoft Windows DDK

August 1, 2004

The Twelve Networking Truths (also known as the Twelve Fundamental Truths) were
first collected and presented in RFC 1925.
These were meant to apply to the
practices of network architecture and engineering, and they are fantastic in
that context. As my career transitioned from network architecture to driver
development, I kept coming back to the Twelve Truths every so often, and each
time I noticed their remarkable value even as adapted to software development.

Like the author of RFC 1925, I have by no means discovered or invented these
truths, but after having trained a fair number of new software developers, it
has become obvious that these principles, as applied to computer programming,
are generally useful and valuable.

With that in mind, I will discuss each of the Twelve Truths as applied to the
craft of software engineering. It is my hope that you find as much value in
them as I have. I hope that I haven’t destroyed their impact with my attempt
to adapt them to another discipline, although I’m afraid something is always
lost in the translation.

Two acknowledgments must be made: first, to Ross Callon, who
first assembled the Truths into RFC 1925 on April 1, 1996, and to Dr. Deep Medhi,
head of the computer networking department at the University of Missouri – Kansas
City, who first introduced me to the RFC.


   (1)  It Has To Work.

You’d think this Truth would go without saying, but in fact, sometimes it seems
like the last thing on the minds of a room full of software engineers during
design of a program. Much (most?) software development is done for money.
Clients, employers, and customers all expect the code to work. They don’t
usually even care how it works (sometimes to the chagrin of the developers!),
but It Has To Work.

There are a couple of subcases of It Has To Work. The first case is
understanding the definition of “work”. This word, like “done”, gets thousands
of connotations and definitions in various contexts, and there is nobody like a
programmer to confuse these definitions. Your client may not have any use for
your word processor if the spell checker gets exponentially slower with every
word typed. Yes, the check may complete eventually, but it doesn’t “work” for
the customer.

Another aspect of “working” is completeness. I’m not the first to quote this
concept, but my empirical observations have led me to conclude that it takes
50% of the time to do the first 97% of the work, and 50% of the time to do the
last 3%. Around my office, we have the “Last 3% Rule” when it comes to project
planning. It may be that the encryption algorithm works perfectly, but until
the user interface “works” – all the buttons do the right things, the
appearance is polished and professional-looking, it doesn’t suffer from state
problems, etc. – it’s not done.

At my office, we don’t declare something Working until it has passed our
Quality Assurance testing. The QA team is a separate group of folks who are
responsible for thoroughly verifying the software’s functionality. It
doesn’t Work until they say it does.

There’s a reason this is Truth #1. Nothing else matters if it doesn’t Work, in
every sense of the word. Above and beyond everything else, Working software
yields happy customers.

   (2)  No matter how hard you push and no matter what the priority,
        you can't increase the speed of light.

        (2a) (corollary). No matter how hard you try, you can't make a
             baby in much less than 9 months. Trying to speed this up
             *might* make it slower, but it won't make it happen any
             quicker.

The original intent of this Truth is to illustrate that there are laws (some
from Physics, some from The Reality Of Life) that cannot be broken. In
networking, system throughput is eventually totally bound by the speed of
light, and there’s nothing that even the most talented engineer can do about
it, other than work around these inherent limitations.

The obvious application to software engineering is that software development
takes time. It usually takes more time than the customer would like, and often
takes more time than the developer would like (in particular, the Last 3% is
usually pretty boring work). However, failing to understand this fact can only
lead to one of three outcomes:

  • The deadline arrives, and your software doesn’t Work
  • Your software Works, but you’ve blown the deadline
  • (in some cases) The deadline arrives and the software Works, but you do not, due to chronic lack of sleep and food.

None of these are good outcomes, and all of them, in a bad enough case, can put
a serious and permanent dent in your ability to earn money as a programmer.
Also, that last case can only happen if the planning isn’t off by more than the
amount that you can make up by your superhuman effort.

Nobody is a perfect predictor of the time it takes to develop software, but you
owe it to yourself as a programmer to give yourself a fighting chance of
hitting your deadlines. This does *not* mean that you should plan on having to
work tons of extra hours just to hit an un-hittable date. Rather, it means
that you must work with customers, project managers, and anyone else who is
interested, to make sure that expectations are set correctly up-front. There
is no way out of this task, and all previous known attempts to get around
expectation management have failed miserably.

   (3)  With sufficient thrust, pigs fly just fine. However, this is
        not necessarily a good idea. It is hard to be sure where they
        are going to land, and it could be dangerous sitting under them
        as they fly overhead.

I’ve heard a lot of bad ideas in my career. These ideas are usually the result
of an overzealous engineer who doesn’t quite understand the problem at hand,
and doesn’t really want to figure out the Right Way To Do It. In almost all
cases, there is a right way and a large number of wrong ways to accomplish a
given task. Wrong ways include everything from hooking Windows system calls in
production code to hacking in a new feature in the 11th hour with insufficient
design documentation.

Sometimes these tactics might work, for a while, but you will always suffer in
the end.

   (4)  Some things in life can never be fully appreciated nor
        understood unless experienced firsthand. Some things in
        networking can never be fully understood by someone who neither
        builds commercial networking equipment nor runs an operational
        network.

Find an expert in any field of software engineering. There is virtually a 100%
chance that he or she has written production code in that field. Code that has
been sold to someone, or on which someone else depends in one way or another.
If this seems obvious to you, consider the number of Monday morning
quarterbacks. Software engineering has them too. Sometimes they are students,
sometimes they are managers, sometimes customers, sometimes even professors.
Whoever they are, if they haven’t written production code they won’t understand
the practice of software engineering. If you are one of these people,
recognize this fact and have some faith in your programmers.

There is a tangible difference between writing code for a class project, or
even “just for fun”, and writing production code. The best way by far to
become a better programmer is to write more production code. Until you’ve
gotten tech support calls, bug reports, feature requests, and (perhaps
inevitably) patched vulnerabilities in your code, you’re just practicing.

   (5)  It is always possible to aglutenate [sic] multiple separate problems
        into a single complex interdependent solution. In most cases
        this is a bad idea.

Good design is hard to find. It is a balance between generalization and
perfection in the expression of a set of abstract ideas on the one hand, and
being down-to-earth enough so as to avoid requiring the coding of lots of
exception cases on the other. Good designs are possible with any design
paradigm, and some are better suited to certain problems than others. Whatever
your design paradigm, don’t over-complicate the situation. As my grandfather
used to say, “Don’t make a Rembrandt!”

   (6)  It is easier to move a problem around (for example, by moving
        the problem to a different part of the overall network
        architecture) than it is to solve it.

        (6a) (corollary). It is always possible to add another level of
             indirection.

Rule 6a is one of the wisest things ever said. Many of the design problems I
alluded to in Rule 5 above are really just this in disguise. Just remember –
you have to pay the price for bad design eventually. If you find yourself
encountering Rule 6, it’s probably a pretty good sign that you need a
re-design.

For added entertainment, notice that Rule 6a (about indirection) is itself a
sub-rule. 🙂

   (7)  It is always something

        (7a) (corollary). Good, Fast, Cheap: Pick any two (you can't
            have all three).

These Truths call to mind Rule #2 – there is no substitute for planning.
Furthermore, there are problems that no amount of planning can solve. This
is not really a re-formulation of Murphy’s Law (although you should keep that
in mind as well!), but merely a description of the peculiar art of software
engineering. Things happen. Designs change. APIs don’t work as documented.
Oddball platforms don’t support otherwise-universal interfaces. The list goes
on, and by definition (and by Rule 7), you can’t predict them. Better to keep
this fact of life in mind during the planning stage of the project, and you’ll
have happier customers, spouses, partners, managers, etc.

Rule 7a needs no elaboration, other than to say that I have found it to be true
over and over again. As a general principle, you cannot optimize Rule 7a away.

   (8)  It is more complicated than you think.

Much like Rule 7, this rule reminds us that human nature is to assume the
straight-ahead case, minimizing the possibility that things could Go Wrong.
People are just wired that way; there’s no way around it. As a software
engineer, you must be vigilant against the feeling that you completely
understand your problem.

It was often said about the Theory of Relativity that only a couple of people
in the entire world understood it during Einstein’s time, despite the belief on
the part of many that they had their brains wrapped around it. The bummer
about this Truth is that it is usually not empirically obvious until after the
painful process of learning has occurred.

One specific point about Rule 8 relates to the area of software security.
Here, it takes more than best practices to write secure code. You must always
maintain an attitude of extreme caution. It’s extremely difficult to know if
your code is secure as written. Lots of code that is secure in one environment
turns out to have big holes in it in other environments. While there is no
guarantee that this will prevent vulnerabilities, defensive coding practices
and a general attitude of suspicion about everything you write will go a long
way. I know I’m suspicious of everything I write. 🙂

   (9)  For all resources, whatever it is, you need more.

       (9a) (corollary) Every networking problem always takes longer to
            solve than it seems like it should.

While Rule #9 has its obvious applications to software engineering, there is
one particular uniqueness about software: often it is impossible to just “add
resources” to a programming project. One reason is that the ramp-up time is
often very long, even with an experienced coder. Another is that the
complexity of adding a developer to a project tends to be O(n) with n = the
number of programmers on the project. Furthermore, that implies that the
“total complexity” of the project is O(n^2) [1]. Both of these numbers can be
improved upon in practical cases with good design and organization of the
project, but there’s no doubt that small teams have an easier time being highly
efficient.

   (10) One size never fits all.

This is another admonition against over-ambitious design. The natural tendency
of the human (programmer’s) brain is to generalize. Triumphs of generalization
are everywhere around us, particularly in the realm of physical science. Other
things, such as advances in hardware engineering, would be impossible without
the art of generalization.

But, like anything, too much of a good thing ceases to be a good thing. One
database developer I know is fond of saying that “the design cannot be so
abstract that the database doesn’t hold anything!” Often, inexperienced
software architects over-generalize a design in their eagerness to arrive at
the “perfect” solution. However, this is seldom possible. Software
engineering is a real-world activity, where imperfections in design are a fact
of life.

Sometimes it’s okay to have one-off solutions. A skilled software architect
develops a sense for how to minimize exception cases while maximizing the
elegance of a design.

   (11) Every old idea will be proposed again with a different name and
        a different presentation, regardless of whether it works.

        (11a) (corollary). See rule 6a.

As the rule itself states, this is just another way to say Rule 6a. Be on
guard against getting dragged down the same old path again and again. It is a
waste of time and money.

   (12) In protocol design, perfection has been reached not when there
        is nothing left to add, but when there is nothing left to take
        away.

Possibly the most valuable axiom ever contributed to the world’s body of
engineering knowledge. The same goes for software design – there is no need
to overcomplicate software, and there is much need to keep it simple. Good
software must be maintainable, and the less of it there is, the easier the
task of maintenance.

Again, this rule runs against human nature – we like to build things –
elaborate things – but nothing could be worse for software engineering. More
than anywhere, software engineering requires the KISS principle.


Summary

I have long relied on the Twelve Truths, first in network engineering, and now
in software engineering. There are few documents with as high a concentration
of practical wisdom as this one, and new engineers would do well to read and
understand RFC 1925.


[1] “Proof” of the complexities of adding developers to a project:

Every coder has a certain amount of undocumented “state” in his or her head. This
state is often a requirement for writing successful code, or at least is required to
do it efficiently. It is this state that makes a group software development project
complex.

Call the “amount” of this state X. If all n developers have X state
in their heads, there is nX amount of undocumented state in the project. Each additional
developer brings X more state to the project.

Each developer will eventually have to know the state in the heads of all of the other
developers on the project (n-1)(X), requiring a total amount of “learning” of state of
n(n-1)(X).

Therefore, adding a developer to the project (who starts with no state, but develops it over
time) requires that developer to acquire (n-1)(X) amount of state, or O(n). The total amount
of complexity in the project then rises to n(n-1)(X), as that developer adds his state to
the project, which is O(n^2).

Driver DevCon and FAST_MUTEXes and ERESOURCES, oh my!

Hello world. As you may know, I live in Kansas, so I’m allowed to use a title like the one this post has. Deal with it. If you don’t get the reference, you’re really better off.

Bad news – I just got word today that the Driver DevCon conference (redundant?) has been pushed off until Spring 2005. Bad news, yo! That was the only conference I was actually looking forward to this year. Sigh…. too bad, I’ll just have to wait another six months. This conference was supposed to be the first few days of November, but that is kind of an uncomfortable time anyway, as the US General Election is that Tuesday. Still, as one of my favorite bloggers would say… NO BUZZ!

So, how many of you have ever actually used a fast mutex in your kmode code? I’m sure most people know what they are if they have written any kmode code (if for no other reason than searching the Ex functions in the DDK index window always seems to land on the Fast Mutex stuff first). Well, I was going through some old vids of Microsoft driver development conferences the other day, and David Goebel (one of the almost-original NT guys) was talking about when and why to use them.

He made a point that I’d heard before but had never really taken to heart: you should use the least impactful synchronization primitive that you can get away with, for any given situation. We all know and love the KSPIN_LOCK, and I’m sure most device-type people get away with using only these locks. However, sometimes, using a spinlock is (to quote my grandfather) like killing an ant with a sledgehammer. There are other, more friendly locks that can be used, for the sake of yourself and for the rest of the system.

To review, the way a spinlock works is as follows: a memory region (the KSPIN_LOCK object you allocated and initialized) is “spun” on by the CPU using an atomic test-and-set operation until it finds that it now owns the lock. You could imagine pseudocode like this:


VOID AcquireLock(LONG *lock)
{
        while(InterlockedExchange(lock, 1));
}

VOID ReleaseLock(LONG *lock)
{
        *lock = 0;
}

(Five points to the first person to point out the weakness in this code). In addition to spinning on the processor, the kernel raises the IRQL to DISPATCH_LEVEL. In fact, on a uniprocessor kernel/HAL, *all* that KeAcquireSpinLock() does is raise the IRQL to DISPATCH_LEVEL. If you don’t see why this is, keep thinking about the meaning of IRQL. Maybe I’ll post about it another day.

Anyway, the point here is that grabbing (or trying to grab) a spinlock is not a nice thing to do to a computer. You might be ticking along just fine at PASSIVE_LEVEL (or maybe APC_LEVEL, or even IRQL 0.5 (ask me later)), and you decide you need a lock. Now, the fair thing to do would be to get a lock at low IRQL, so that when your quantum expires, you will get context-switched and someone else can do some work. If you’re at DISPATCH_LEVEL, it’ll never happen – you *own* the CPU (interrupts aside) until you drop your lock. How rude! But it’s actually even worse than that: while you’re at DISPATCH_LEVEL, you’re not just blocking context-switching – you’re actually blocking all DPCs on that chip. This is even worse, as the “standard model” (to borrow Walter Oney’s phrase) for interrupt processing involves queuing a DPC to do the real work. That’s right – you’re keeping other device drivers from doing their jobs. If you do this often enough or long enough, users will notice, and they’ll associate the degraded performance with the installation of your driver (“it was fine until I installed my WhizBang 3000!”).

So what’s a driver writer to do? To re-state David’s general principle, you should use the least amount of lock necessary to synchronize with threads at the highest IRQL you can be called at. If you are synchronizing DPCs against each other, you’re going to need spinlocks, because DPCs cannot block, they can only spin. That has the effect of requiring every locker to spin, whether or not they could have blocked. This is also the situation when synchronizing with your ISR (via KeSynchronizeExecution()). However, if you are a software-only driver of some sort and you know you won’t ever be trying to grab a lock at DISPATCH_LEVEL, you don’t need spinlocks at all. In these cases, you have lots of other choices, but my two favorites are FAST_MUTEXes and ERESOURCEs.

Fast Mutexes are manipulated via the ExInitializeFastMutex/ExAcquireFastMutex/ExReleaseFastMutex functions. They are very fast and very small. Also, since they aren’t dispatcher objects, the kernel doesn’t have to grab the dispatcher lock to service you. This is a Good Thing(TM), as the dispatcher lock is right up there with the cancel lock in terms of contention. Fast Mutexes do only what you think they should, as minimally as possible, and as efficiently as possible.
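
Usage is about as simple as it gets. A quick sketch (the device extension and field names are made up):

#include <ntddk.h>

typedef struct _DEVICE_EXTENSION {
    FAST_MUTEX ConfigLock;     // protects ConfigValue
    ULONG      ConfigValue;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

VOID InitConfig(PDEVICE_EXTENSION ext)
{
    ExInitializeFastMutex(&ext->ConfigLock);    // once, e.g. in AddDevice
    ext->ConfigValue = 0;
}

VOID UpdateConfig(PDEVICE_EXTENSION ext, ULONG newValue)
{
    // Callers must be below DISPATCH_LEVEL; acquisition raises us to APC_LEVEL.
    ExAcquireFastMutex(&ext->ConfigLock);
    ext->ConfigValue = newValue;
    ExReleaseFastMutex(&ext->ConfigLock);
}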

ERESOURCEs are sometimes known as reader/writer locks or shared/exclusive locks. If you can use them, you should, as they will generally have the effect of reducing contention and increasing concurrency in your driver. ERESOURCEs are suitable whenever you have operations that only need a lock to keep stuff from getting changed out from under them, as opposed to needing to actually make changes themselves. The Windows filesystem drivers make extensive use of ERESOURCEs.
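
The shared/exclusive pattern looks like this in practice – a sketch with a made-up table. Note the critical-region calls: ERESOURCEs must be acquired at IRQL <= APC_LEVEL with normal kernel APCs disabled.

#include <ntddk.h>

ERESOURCE TableLock;    // ExInitializeResourceLite(&TableLock) at startup,
                        // ExDeleteResourceLite(&TableLock) at unload
ULONG     TableValue;   // stand-in for whatever the lock protects

ULONG ReadEntry(VOID)
{
    ULONG value;

    KeEnterCriticalRegion();
    ExAcquireResourceSharedLite(&TableLock, TRUE);    // many readers at once

    value = TableValue;

    ExReleaseResourceLite(&TableLock);
    KeLeaveCriticalRegion();
    return value;
}

VOID WriteEntry(ULONG value)
{
    KeEnterCriticalRegion();
    ExAcquireResourceExclusiveLite(&TableLock, TRUE); // writers get it alone

    TableValue = value;

    ExReleaseResourceLite(&TableLock);
    KeLeaveCriticalRegion();
}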

Someone recently asked on NTDEV about dropping the IRQL after you grab a lock. I’ll refer you there for the details (I believe it was Doron Holan who posted the answer), but to make a long story short, never EVER do this. You can basically only call KeLowerIrql() if you previously called KeRaiseIrql(), and even then you can lower the IRQL no lower than where it was when you called KeRaiseIrql().

A couple of other points about FAST_MUTEXes. First, they are *not* dispatcher objects. That means that you cannot call KeWaitFor***() on them, and in particular, you cannot wait for multiple objects with them. If that’s what you want to do, they all have to be dispatcher objects. Also, unless you are very sure you know what you’re doing, don’t try to use ExAcquireFastMutexUnsafe(). Another point to remember is that acquisition of a fast mutex will raise the IRQL to APC_LEVEL, which isn’t bad most of the time, but does block all APCs, so there are some things that the thread you’re running in won’t be able to do until you drop it again. Finally, fast mutexes are not reentrant. You will deadlock if you try to acquire a fast mutex that you already own. Again, if recursive acquisition is what you need, go look at regular dispatcher MUTEXes.

All of this basically just boils down to being a good citizen and obeying the Law Of IRQL Minimization at all times. Don’t lock if you don’t need to, don’t lock for longer than you need to, and use the right tool for the job.

CSQ Wrap-up: Implementation Notes

By now, I assume you’re sold on the notion of converting to CSQ, and waiting with bated breath for details on how to go about it. If so, this article is for you!

Implementing CSQ in your driver project is really easy to do. There are two basic steps:

  1. Implement a few callbacks to pass to the CSQ library
  2. Implement the queuing system in your project

CSQ Callbacks

There are six callbacks that you have to implement. In many cases, the implementations will be completely trivial – Microsoft just provided lots of knobs in order to make the library as generally useful as possible. The other nice thing about these routines is that you basically don’t have to worry about synchronization at all, as they are called with the CSQ lock held when need be (but remember that queue-manipulation functions are called at DISPATCH_LEVEL!). All of these routines are documented in the DDK, but here’s a quickie implementor’s guide.

I like to put all of my CSQ support stuff into a single file. There are a couple of file-globals that I usually use: a LIST_ENTRY to serve as the head of my queue of IRPs, a KSPIN_LOCK to acquire and release when CSQ tells me to, and of course an IO_CSQ struct.

By way of function implementations (there’s a quick sketch of the whole support file after the list):

  • CsqRemoveIrp – typically just RemoveEntryList. Note that you don’t need an interlocked operation here, because the CSQ lock is already held.
  • CsqInsertIrp – typically just InsertTailList.
  • CsqPeekNextIrp – this function returns a pointer to the next IRP in the list to be removed. This is the least straightforward function, and is worth a read of the DDK for reference. In general, you can return NULL if the list is empty, the head of the queue if no starting IRP is specified, or the IRP itself if one is specified.
  • CsqAcquireLock – typically just KeAcquireSpinLock.
  • CsqReleaseLock – typically just KeReleaseSpinLock.
  • CsqCompleteCanceledIrp – a mini-completion routine. You may just be able to complete the IRP with STATUS_CANCELLED.
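
For the plain FIFO case, the whole support file ends up looking something like this. It’s a sketch written from memory of the DDK’s cancel sample, so compare it against the real thing before trusting it (UNREFERENCED_PARAMETER noise omitted):

#include <ntddk.h>

static IO_CSQ     Csq;          // handed to IoCsqInitialize()
static LIST_ENTRY IrpQueue;     // head of the IRP queue
static KSPIN_LOCK QueueLock;    // the lock CSQ asks us to acquire/release

VOID CsqInsertIrp(PIO_CSQ csq, PIRP Irp)
{
    // Called with the CSQ lock held -- no interlocked variant needed.
    InsertTailList(&IrpQueue, &Irp->Tail.Overlay.ListEntry);
}

VOID CsqRemoveIrp(PIO_CSQ csq, PIRP Irp)
{
    RemoveEntryList(&Irp->Tail.Overlay.ListEntry);
}

PIRP CsqPeekNextIrp(PIO_CSQ csq, PIRP Irp, PVOID PeekContext)
{
    // FIFO with no filtering: return the entry after Irp, or the head of
    // the queue if no starting IRP was given; NULL means nothing queued.
    PLIST_ENTRY next;

    next = (Irp != NULL) ? Irp->Tail.Overlay.ListEntry.Flink : IrpQueue.Flink;
    if (next == &IrpQueue) {
        return NULL;
    }
    return CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);
}

VOID CsqAcquireLock(PIO_CSQ csq, PKIRQL Irql)
{
    KeAcquireSpinLock(&QueueLock, Irql);
}

VOID CsqReleaseLock(PIO_CSQ csq, KIRQL Irql)
{
    KeReleaseSpinLock(&QueueLock, Irql);
}

VOID CsqCompleteCanceledIrp(PIO_CSQ csq, PIRP Irp)
{
    Irp->IoStatus.Status = STATUS_CANCELLED;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}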

That, friends, is all that it takes. This doesn’t really take more than a couple of hundred lines of code, and the best part is that once you build your CSQ support file, you can re-use it over and over for similar projects.

Using The Queuing System

Using the CSQ library for the queuing operations in your driver is really a piece of cake. The first step, of course, is to initialize the library. This is typically done during DriverEntry, but can be done wherever is convenient for you. Initialization is done by passing pointers to your callbacks to IoCsqInitialize(). Once that is done, and your LIST_ENTRY and KSPIN_LOCK are initialized, you’re ready to use the queue. This is as easy as calling IoCsqInsertIrp and IoCsqRemoveNextIrp. If your queue management needs are more complex than just running the queue in FIFO fashion, other IoCsq APIs allow more control over queue operations.
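
Here’s the hook-up, continuing the sketch above (same file-scope variables and callback names; error handling trimmed):

NTSTATUS InitQueue(VOID)    // call from DriverEntry or wherever is convenient
{
    InitializeListHead(&IrpQueue);
    KeInitializeSpinLock(&QueueLock);

    return IoCsqInitialize(&Csq,
                           CsqInsertIrp, CsqRemoveIrp, CsqPeekNextIrp,
                           CsqAcquireLock, CsqReleaseLock,
                           CsqCompleteCanceledIrp);
}

// A dispatch routine just parks the IRP on the queue and returns pending.
NTSTATUS DispatchDeviceControl(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    UNREFERENCED_PARAMETER(DeviceObject);

    IoMarkIrpPending(Irp);          // harmless even if CSQ also marks it
    IoCsqInsertIrp(&Csq, Irp, NULL);
    return STATUS_PENDING;
}

// Elsewhere -- a worker thread, a DPC, a timer -- pull the next IRP and
// finish it. IoCsqRemoveNextIrp returns NULL when the queue is empty.
VOID ProcessOneIrp(VOID)
{
    PIRP irp = IoCsqRemoveNextIrp(&Csq, NULL);

    if (irp != NULL) {
        // ... do the actual work here ...
        irp->IoStatus.Status = STATUS_SUCCESS;
        irp->IoStatus.Information = 0;
        IoCompleteRequest(irp, IO_NO_INCREMENT);
    }
}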

Summary

This implementation guide covers the simple (and common) case of just needing a single basic FIFO queue to store IRPs during processing. More advanced scenarios are possible, of course. In particular, one of the most common IRP-queuing scenarios involves the use of the StartIo model. If your driver uses StartIo for IRP queuing, CSQ may be just what the doctor ordered for cleaner queue management and better overall performance. Microsoft has provided a special CSQ sample in the DDK for you; do have a look. The major tweak involves duplicating the semantics of insertion in your CsqInsertIrp callback.

I hope some of you have found this series helpful. Feel free to post questions or comments, and of course corrections. I’d be particularly interested to hear about anyone else’s experience in converting to CSQ – problems encountered, performance differences, etc.