Saturday, January 28, 2006

Debugging Story One - First Pass

[Our previous posting covered the symptoms of this problem]

So, the usual stuff:

1. Get the hardware from the client (in this case just a bare circuit board from inside the machine).
2. Make sure we can see the same problem as they can...
3. Clone a nice clean VMWare XPSP2 machine for testing with.
4. Hook-up Windbg to VMWare as a kernel mode debugger, via a named pipe.
5. Get their drivers installed on the VMWare machine. Snapshot it...

Then, I plugged and unplugged the device. (Be careful about VMWare focus when you're testing USB devices with it.) As expected, I saw the same problem in VMWare that had been seen elsewhere.

So, with Device Manager 'stuck', and plug/unplugs not being recognised, I broke-in with Windbg.

The first step is to look for system-wide locks:

kd> !locks
KD: Scanning for held locks......

Resource @ nt!IopDeviceTreeLock (0x805584e0) Shared 1 owning threads
Threads: 81bc9020-01<*>
KD: Scanning for held locks.

Resource @ nt!PiEngineLock (0x80558460) Exclusively owned
Contention Count = 1
Threads: 81bc9020-01<*>
KD: Scanning for held locks...

The first one it finds is called "IopDeviceTreeLock". I don't know of any public documentation which describes this lock, but it sounds relevant to being allowed to look at the device tree.

Let's look at the thread which is holding this lock:

kd> !thread 81bc9020
THREAD 81bc9020 Cid 0004.003c Teb: 00000000 Win32Thread: 00000000 WAIT: (Executive) KernelMode Non-Alertable
81858e5c NotificationEvent
IRP List:
818a0008: (0006,01b4) Flags: 00000000 Mdl: 00000000
Not impersonating
DeviceMap e10010a8
Owning Process 81bcc830 Image: System
Wait Start TickCount 14706 Ticks: 2591 (0:00:00:40.484)
Context Switch Count 567
UserTime 00:00:00.0000
KernelTime 00:00:02.0703
Start Address nt!ExpWorkerThread (0x804e4196)
Stack Init f9eab000 Current f9eaaa8c Base f9eab000 Limit f9ea8000 Call 0
Priority 13 BasePriority 12 PriorityDecrement 0 DecrementCount 16
ChildEBP RetAddr Args to Child
f9eaaaa4 804dc0f7 81bc9090 81bc9020 804dc143 nt!KiSwapContext+0x2e (FPO: [EBP 0xf9eaaad8] [0,0,4])
f9eaaab0 804dc143 81858e5c 81858ac8 00000000 nt!KiSwapThread+0x46 (FPO: [0,0,0])
f9eaaad8 f9da9110 00000000 00000000 00000000 nt!KeWaitForSingleObject+0x1c2 (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
81858a10 00000000 8196b030 00000000 00000000 CARD+0x6110

There's the 3rd-party smartcard driver at the bottom - it's holding a lock on the device tree, and it's waiting for something.
There's an IRP active on this thread too - let's have a look at that:

kd> !irp 818a0008
Irp is active with 9 stacks 9 is current (= 0x818a0198)
No Mdl Thread 81bc9020: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
>[ 1b, 2] 0 0 81858a10 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000

Urgh! What a horribly deep stack... I won't bother including all those zeros next time.

It's an IRP with major function 0x1b, minor function 0x2 - this is IRP_MJ_PNP, IRP_MN_REMOVE_DEVICE.

So, the smartcard driver is stuck in its REMOVE_DEVICE handler, blocking-up the whole of the device tree.
That doesn't really strike me as startlingly useful info, except that it does confirm that the problem with the unloading is to do with the HIDCOM/CARD.SYS stack.

Let's look for other IRPs:

kd> !irpfind
unable to get large pool allocation table - either wrong symbols or pool tagging is disabled

Searching NonPaged pool (811b0000 : 81c00000) for Tag: Irp?

Irp [ Thread ] irpStack: (Mj,Mn) DevObj [Driver] MDL Process
81852880 [00000000] irpStack: ( 0, 0) 818516d0 [ \Driver\HidCom]
818a0008 [81bc9020] irpStack: (1b, 2) 81858a10 [ \Driver\CARD]

(I've snipped dozens of irrelevant ones)

There we see the PNP IRP we've already looked at, and another one which has been sent to our serial port driver.

kd> !irp 81852880
Irp is active with 10 stacks 9 is current (= 0x81852a10)
No Mdl Irp count = 81858884 Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context

(Lots of empty stack positions snipped)

>[ 0, 0] 0 e0 818516d0 00000000 f9e460f4-81851788 Success Error Cancel
\Driver\BcHidCom BcHidCom
Args: 00000000 00000000 00000000 00000000
[ e, 0] 0 e1 818516d0 00000000 f9da763c-81858ac8 Success Error Cancel pending
\Driver\BcHidCom GemPC413
Args: 81851b10 00000000 001b0048 00000000

This is of major type 0xe (IRP_MJ_DEVICE_CONTROL), which corresponds to the WIN32 DeviceIOControl() function, or to about half a dozen of the serial port API functions.

The IOCTL code is the third argument - 0x1b0048.

IOCTL's are tedious to decode at a glance, but the 0x1b tells us it's a serial port IOCTL, and the 0x48 is four times the IOCTL index number, which is 18 (decimal). This is IOCTL_SERIAL_WAIT_ON_MASK, which is the kind of thing you get from the WIN32 function WaitCommEvent. When you send IOCTL_SERIAL_WAIT_ON_MASK to a serial device, it holds that IRP pending until one of a number of configurable events happens on a port - for example a byte is received at the serial port.

At this point, we couldn't understand why HIDCOM still had an IRP pending after a surprise removal - drivers are supposed to fail any pending IO when they receive a surprise removal. Looking at the relevant HIDCOM device object:

kd> !devobj 818516d0
Device object (818516d0) is for:
HidCom1 \Driver\BcHidCom DriverObject 81af6af0
Current Irp 00000000 RefCount 0 Type 0000001b Flags 0000204c
Dacl e137aae4 DevExt 81851788 DevObjExt 81851e98
ExtensionFlags (0x00000008) DOE_REMOVE_PROCESSED
AttachedDevice (Upper) 81885c10 \Driver\serenum
AttachedTo (Lower) 81869030 \Driver\usbccgp
Device queue is not busy.

gives the helpful sounding DOE_REMOVE_PROCESSED, which certainly sounds like the system thinks it's sent all the REMOVE-like IRPs, but I don't know of any documentation which confirms that it means that.

The two obvious ways to confirm which PNP IRPs had been sent to our device are either to set breakpoints in our IRP dispatch routine, or to install a checked build of our driver - it traces things like that to the debug stream.

I find breakpoints like that rather tedious and easy to get confused in - and the fact that there are two instances of the device loaded makes them particular confusing. So we installed the checked build of our driver, and waded through lots of spew to find the PNP irps. This revealed that our HIDCOM devices were receiving and handling IRP_MN_SURPRISE_REMOVAL and we knew that contained code to our queues of read and write IRPs:

IrpFailQueue(&pExt->writeIrpQueue, STATUS_DELETE_PENDING);
IrpFailQueue(&pExt->readIrpQueue, STATUS_DELETE_PENDING);

And of course, that's the problem. The stuck IRP is neither a read nor a write - it's an IOCTL. The IRP for IOCTL_SERIAL_WAIT_ON_MASK was pending, and our driver wasn't completing it on a removal. So a simple addition of

WaitCommEventRelease(pExt, STATUS_DELETE_PENDING);

to the relevant section of our PNP code, and suddenly everything sprang into life. Now when I unplugged the device, I saw our device unload (my favourite bit of trace from this driver is the line in the DriverUnload handler...).

This was about 8.30 on a Friday night, and I was delighted to be able ship this version to the client for them to try on Monday morning.

And it took all the way until Monday afternoon before they were back with the next problem. I'll write that up in a minute.

1 comment:

Edward Choh said...

Great writeup, I'm troubleshooting something very similar with a TDI client and your article helped quite a bit.