After much frustration following the dismal failure of my naming theory, we decided to do the unthinkable and actually test the code in the field, with the affected customer. We got exceedingly lucky this time around – the administrator at the customer site (in southern Mexico!) was amazingly helpful and willing to jump through all of the stupid hoops I asked him to.
So, we went back to the drawing board and got a full read-out from the customer on what actually happens when things break. He mentioned a detail that I had missed the first time around, one that made all the difference: when the computer drops off the network, existing drive mappings break in addition to new connection attempts. Existing connections have long since resolved the server's name, so this probably isn't name resolution after all.
Although this technique is probably offensive to folks like Raymond Chen, I decided the best course of action would be to sweet-talk the admin into putting a remote control application (I tend to use TightVNC, and did so this time) on the affected file server. I had him set up a packet capture with Ethereal, but when things broke, all I got was a bunch of TCP RST packets, indicating that the server had decided there was no longer a listening service at that port. Weird.
Then another engineer decided to poke around in the server's event log. We started seeing messages from SRV (the file sharing service) complaining about failing to allocate a work item. Now we thought for sure we were onto something. We set up a reproduction environment, and instead of just letting the machine sit there waiting for name resolution to stop, we stressed the network; under load, the box froze in mere minutes, putting the same message into the event log.
After doing some comparisons with the Passthru intermediate driver sample, another engineer found the bug: we had allocated a packet pool of only 32 packets, as opposed to the 65,535 that Passthru allocated. Under high stress the tiny pool was exhausted and packets were getting mishandled internally, and somehow this was causing SRV to bomb out and just quit listening on its ports.
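To make the shape of the fix concrete, here is a minimal sketch modeled on how the Passthru sample sizes its packet pool. The function name, the reservation struct, and the commented-out 32-packet call are illustrative reconstructions rather than our actual driver source; the sizing constants and the SEND_RSVD layout are the ones the Passthru sample defines.

```c
#include <ndis.h>

/* Pool sizing constants as defined in the Passthru sample. */
#define MIN_PACKET_POOL_SIZE 0x000000FF
#define MAX_PACKET_POOL_SIZE 0x0000FFFF  /* 65,535 packets */

/* Per-packet reservation area; matches Passthru's SEND_RSVD. */
typedef struct _SEND_RSVD {
    PNDIS_PACKET OriginalPkt;
} SEND_RSVD, *PSEND_RSVD;

NDIS_STATUS
AllocateSendPacketPool(
    OUT PNDIS_HANDLE PoolHandle  /* receives the pool handle */
    )
{
    NDIS_STATUS Status;

    /*
     * What the buggy driver did, roughly: a fixed pool of only
     * 32 packet descriptors, exhausted as soon as the network
     * came under real load.
     *
     *   NdisAllocatePacketPool(&Status,
     *                          PoolHandle,
     *                          32,
     *                          sizeof(SEND_RSVD));
     */

    /*
     * The fix, following Passthru: start with a modest number of
     * descriptors, but allow overflow descriptors up to a ceiling
     * of 65,535 so the pool does not run dry under stress.
     */
    NdisAllocatePacketPoolEx(&Status,
                             PoolHandle,
                             MIN_PACKET_POOL_SIZE,
                             MAX_PACKET_POOL_SIZE - MIN_PACKET_POOL_SIZE,
                             sizeof(SEND_RSVD));

    return Status;
}
```

When the pool runs out, NdisGetPoolFromPacket-style consumers upstream simply fail to get a descriptor, which is consistent with SRV's "failed to allocate a work item" complaints and the RSTs we captured once the listener gave up.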
A new driver was built and sent to the customer, who ran with it for a few days and declared victory. The whole experience took several days, but it once again goes to show that there is no substitute for getting close to the problem, whether with a reliable reproduction or with close customer contact during debugging.