Saturday, February 11, 2006

WHQL and patching

The smartcard driver which appears in the previous posts raises an interesting point about WHQL signing. The supplied smartcard driver was WHQL signed - it was signed back in 2001, despite having the serious IRP completion bug.

Of course, the patched, working driver is no longer WHQL signed.

So which is the higher quality driver?

And, if we ever were able to get through the 'null' support at the vendor to offer them a suggested improvement, which would be the bigger barrier to them implementing this -recompiling the driver with one line of code extra, or retesting it and doing a WHQL resubmission?

Wednesday, February 01, 2006

Debugging Story One - Second Pass

[See previous posts for earlier details about this problem]

Ok, so we fixed the problem with the WaitCommEvent IRP not being failed on a SURPRISE_REMOVAL. I'd always prefer the bugs to be found in someone else's code, but really I'm just happy when they're fixed.

So I was somewhat dischuffed when the client came back (after having made initial sounds of delight) to report that they could only unplug and replug the device four times. After the fourth time, the smartcard driver would appear 'banged out' in Device Manager, with a 'Code 10' report on it.

Dead strange this - clearly something was remembering to count to four somewhere and something was imposing a limit of four on itself. The biggest problem was that I couldn't reliably reproduce this - some machines did it, sometimes, but nothing which I was prepared to run a kernel debugger on would demonstrate the problem at all consistently.

Whenever I ran this on a machine where I had good visibility, the driver stack would unload properly and, as you'd expect, this didn't leave anything around to remember that it was supposed to be counting to four. And yet the client was reporting this on every machine they tested on, both XP and W2K.

Eventually, after the client rejected several attempts by me to admit defeat, we jointly worked out that the problem only showed-up on a machine where the Windows Smart Card service was running. (Can you tell I'm not a smart-card expert yet?)

A quick 'net start "smart card"' and there we were, with a driver stack which didn't unload properly.

Unlike the last time, we weren't stuck with the PnP system all locked-up, so we could re-plug the device again successfully - but each time we did this, another pair of HIDCOM devices and another CARD device were created. The CARD driver sets itself a limit of four devices - after that it fails to create any more. The HIDCOM device cheerfully goes on forever adding devices.

Stopping the smart card service after unplugging didn't help - once the devices were stuck, they were there for good.

So for more detail, back to Windbg:

kd> !irpfind

Irp [ Thread ] irpStack: (Mj,Mn) DevObj [Driver] MDL Process
[snip]
8a638840 [8a559478] irpStack: ( e, 0) 8a5acb48 [\Driver\CARD]
[snip]


kd> !irp 8a638840
Irp is active with 9 stacks 9 is current (= 0x8a6389d0)
No Mdl System buffer = 8a0ed9e8 Thread 8a559478: Irp stack trace.
cmd flg cl Device File Completion-Context

[snip]

>[ e, 0] 5 0 8a5acb48 8a29d2f8 00000000-00000000
\Driver\CARD
Args: 00000004 00000000 00310038 00000000

As you can see, there's an IRP pending for the smartcard driver. It's IRP_MJ_DEVICE_CONTROL, and corrsponds to IOCTL_SMARTCARD_GET_STATE (that's the 00310038 ).

So, again the problem is a dangling IRP. There shouldn't be any pending IRPs at this point; the driver should have completed them all during its remove processing.

This didn't really look like a problem with our driver (HIDCOM) - it looked like a problem with the smart card driver, for which we had neither source nor symbols. And our client said of the smartcard system vendor "their support is null". After another half-hearted (and rejected) attempt to have the client abandon the job, we fired-up Ida Pro (www.datarescue.com - superb product and service) and set off into CARD.SYS.

You get a pretty gentle start into reverse engineering device drivers - DriverEntry is the DLL's entry point, and the filling-in of the function pointers in the DriverObject is easy to dissassemble and gives you a nice initial set of named functions with known signatures:

mov     ecx, offset sub_14464
mov [eax+38h], ecx
mov [eax+40h], ecx
mov dword ptr [eax+80h], offset loc_14592
mov dword ptr [eax+70h], offset loc_13FDE
mov dword ptr [eax+0A4h], offset sub_1423C
mov dword ptr [eax+90h], offset loc_14410
mov dword ptr [eax+94h], offset loc_13FB6

quickly becomes

mov ecx, offset DispatchCreateClose
mov [eax+DRIVER_OBJECT.MajorFunction], ecx
mov [eax+(DRIVER_OBJECT.MajorFunction+8)], ecx
mov [eax+(DRIVER_OBJECT.MajorFunction+48h)], offset DispatchCleanup
mov [eax+(DRIVER_OBJECT.MajorFunction+38h)], offset DispatchDeviceControl
mov [eax+(DRIVER_OBJECT.MajorFunction+6Ch)], offset DispatchPnp
mov [eax+(DRIVER_OBJECT.MajorFunction+58h)], offset DispatchPower
mov [eax+(DRIVER_OBJECT.MajorFunction+5Ch)], offset DispatchSystemControl

As an aside - windbg will also give you the major function pointer offsets like this:

kd> !drvobj CARD 7
Driver object (8ab086e0) is for:
\Driver\CARD
Driver Extension List: (id , addr)

Device Object list:
8a5acb48

DriverEntry: b4c1c360 CARD
DriverStartIo: 00000000
DriverUnload: b4c1c1c2 CARD

Dispatch routines:
[00] IRP_MJ_CREATE b4c1a464 CARD+0x4464
[01] IRP_MJ_CREATE_NAMED_PIPE 804fcab1 nt!IopInvalidDeviceRequest
[02] IRP_MJ_CLOSE b4c1a464 CARD+0x4464
[03] IRP_MJ_READ 804fcab1 nt!IopInvalidDeviceRequest
[04] IRP_MJ_WRITE 804fcab1 nt!IopInvalidDeviceRequest
[05] IRP_MJ_QUERY_INFORMATION 804fcab1 nt!IopInvalidDeviceRequest
[06] IRP_MJ_SET_INFORMATION 804fcab1 nt!IopInvalidDeviceRequest
[07] IRP_MJ_QUERY_EA 804fcab1 nt!IopInvalidDeviceRequest
[08] IRP_MJ_SET_EA 804fcab1 nt!IopInvalidDeviceRequest
[09] IRP_MJ_FLUSH_BUFFERS 804fcab1 nt!IopInvalidDeviceRequest
[0a] IRP_MJ_QUERY_VOLUME_INFORMATION 804fcab1 nt!IopInvalidDeviceRequest
[0b] IRP_MJ_SET_VOLUME_INFORMATION 804fcab1 nt!IopInvalidDeviceRequest
[0c] IRP_MJ_DIRECTORY_CONTROL 804fcab1 nt!IopInvalidDeviceRequest
[0d] IRP_MJ_FILE_SYSTEM_CONTROL 804fcab1 nt!IopInvalidDeviceRequest
[0e] IRP_MJ_DEVICE_CONTROL b4c19fde CARD+0x3fde
[0f] IRP_MJ_INTERNAL_DEVICE_CONTROL 804fcab1 nt!IopInvalidDeviceRequest
[10] IRP_MJ_SHUTDOWN 804fcab1 nt!IopInvalidDeviceRequest
[11] IRP_MJ_LOCK_CONTROL 804fcab1 nt!IopInvalidDeviceRequest
[12] IRP_MJ_CLEANUP b4c1a592 CARD+0x4592
[13] IRP_MJ_CREATE_MAILSLOT 804fcab1 nt!IopInvalidDeviceRequest
[14] IRP_MJ_QUERY_SECURITY 804fcab1 nt!IopInvalidDeviceRequest
[15] IRP_MJ_SET_SECURITY 804fcab1 nt!IopInvalidDeviceRequest
[16] IRP_MJ_POWER b4c1a410 CARD+0x4410
[17] IRP_MJ_SYSTEM_CONTROL b4c19fb6 CARD+0x3fb6
[18] IRP_MJ_DEVICE_CHANGE 804fcab1 nt!IopInvalidDeviceRequest
[19] IRP_MJ_QUERY_QUOTA 804fcab1 nt!IopInvalidDeviceRequest
[1a] IRP_MJ_SET_QUOTA 804fcab1 nt!IopInvalidDeviceRequest
[1b] IRP_MJ_PNP b4c1a23c CARD+0x423c

We're interested in the handling of IOCTLs, so let's have a look at DispatchDeviceControl - it starts like this:

loc_13FDE:                         
mov eax, [esp+4]
push ebx
push ebp
push esi
mov esi, [eax+28h]
push edi
mov eax, [esi+2C4h]
cmp dword ptr [eax+0A4h], 0
jnz short loc_14002
mov eax, 0C00002B6h
jmp loc_14059
loc_14002:
...

I shan't bore everyone with the whole function, but that jump takes us to the epilogue of the function where it cleans up the stack frame, and returns the value in EAX. We know the signature of this function - the return is an NTSTATUS value. Looking in ntstatus.h, we find that 0xC00002B6 corresponds to STATUS_DEVICE_REMOVED.

So clearly this bit of code

cmp     dword ptr [eax+0A4h], 0
jnz short loc_14002
mov eax, 0C00002B6h
jmp loc_14059

is saying something along the lines of

if(!DeviceIsPresent)
{
return STATUS_DEVICE_REMOVED;
}

which sounded sensible.

So it seems that the author of the IRP_MJ_DEVICE_CONTROL handler has at least thought about device removal. We decided to set a breakpoint on the start of the handler and see what happened when the device was unplugged. There is a slow but steady stream of IOCTLS sent to the smart card driver, presumably by the smart card service, so you do break on entry to this function every so often.
But the interesting break occurred after our HIDCOM driver had processed SURPRISE_REMOVAL and failed all the existing I/O which CARD.SYS had pending. At this point, CARD.SYS received another IOCTL, and this time passed through the piece of code which returns STATUS_DEVICE_REMOVED.

And, of course, that's the problem. Unless you return STATUS_PENDING with aspirations of completing the IRP in the future, you must complete it in the dispatch routine. In fact, Driver Verifier does catch this bug - but in all the hassle of struggling to repro thia problem, there had somehow never been the right combination of service running, verifier running, problem occurring, etc, etc. Sigh.

Now we were faced with a driver with no access to the source code, in which we needed to complete an IRP. There's not much code involved in completing an IRP (at least for the driver), and very similar code ends-up all over the place. To our great good fortune, another error condition in the same dispatch routine was handled properly - complete the IRP and return STATUS_DEVICE_REMOVED.

It took a one-byte patch to the jmp loc_14059 instruction to make it jump to the code which did the correct complete-and-return. This seemed to work beautifully - everything unloading fine. (It took another one-byte patch to the checksum on the driver to make it load again.)

Then I couldn't resist changing the version resource on the driver to get our name on it, so we had to have a two-byte patch to the checksum...

Feeling pretty smug now, I zipped-up the patched driver and sent it off to the client. It was another Friday night, and a satisfactory end to what had been a very frustrating time, largely spend trying to repro the problem rather than trying to fix it.

This time we didn't even make it until Monday lunchtime before they were back to us...

Saturday, January 28, 2006

Debugging Story One - First Pass

[Our previous posting covered the symptoms of this problem]

So, the usual stuff:

1. Get the hardware from the client (in this case just a bare circuit board from inside the machine).
2. Make sure we can see the same problem as they can...
3. Clone a nice clean VMWare XPSP2 machine for testing with.
4. Hook-up Windbg to VMWare as a kernel mode debugger, via a named pipe.
5. Get their drivers installed on the VMWare machine. Snapshot it...

Then, I plugged and unplugged the device. (Be careful about VMWare focus when you're testing USB devices with it.) As expected, I saw the same problem in VMWare that had been seen elsewhere.

So, with Device Manager 'stuck', and plug/unplugs not being recognised, I broke-in with Windbg.

The first step is to look for system-wide locks:


kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks......

Resource @ nt!IopDeviceTreeLock (0x805584e0) Shared 1 owning threads
Threads: 81bc9020-01<*>
KD: Scanning for held locks.

Resource @ nt!PiEngineLock (0x80558460) Exclusively owned
Contention Count = 1
Threads: 81bc9020-01<*>
KD: Scanning for held locks...


The first one it finds is called "IopDeviceTreeLock". I don't know of any public documentation which describes this lock, but it sounds relevant to being allowed to look at the device tree.

Let's look at the thread which is holding this lock:


kd> !thread 81bc9020
THREAD 81bc9020 Cid 0004.003c Teb: 00000000 Win32Thread: 00000000 WAIT: (Executive) KernelMode Non-Alertable
81858e5c NotificationEvent
IRP List:
818a0008: (0006,01b4) Flags: 00000000 Mdl: 00000000
Not impersonating
DeviceMap e10010a8
Owning Process 81bcc830 Image: System
Wait Start TickCount 14706 Ticks: 2591 (0:00:00:40.484)
Context Switch Count 567
UserTime 00:00:00.0000
KernelTime 00:00:02.0703
Start Address nt!ExpWorkerThread (0x804e4196)
Stack Init f9eab000 Current f9eaaa8c Base f9eab000 Limit f9ea8000 Call 0
Priority 13 BasePriority 12 PriorityDecrement 0 DecrementCount 16
ChildEBP RetAddr Args to Child
f9eaaaa4 804dc0f7 81bc9090 81bc9020 804dc143 nt!KiSwapContext+0x2e (FPO: [EBP 0xf9eaaad8] [0,0,4])
f9eaaab0 804dc143 81858e5c 81858ac8 00000000 nt!KiSwapThread+0x46 (FPO: [0,0,0])
f9eaaad8 f9da9110 00000000 00000000 00000000 nt!KeWaitForSingleObject+0x1c2 (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
81858a10 00000000 8196b030 00000000 00000000 CARD+0x6110


There's the 3rd-party smartcard driver at the bottom - it's holding a lock on the device tree, and it's waiting for something.
There's an IRP active on this thread too - let's have a look at that:


kd> !irp 818a0008
Irp is active with 9 stacks 9 is current (= 0x818a0198)
No Mdl Thread 81bc9020: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
>[ 1b, 2] 0 0 81858a10 00000000 00000000-00000000
\Driver\CARD
Args: 00000000 00000000 00000000 00000000


Urgh! What a horribly deep stack... I won't bother including all those zeros next time.

It's an IRP with major function 0x1b, minor function 0x2 - this is IRP_MJ_PNP, IRP_MN_REMOVE_DEVICE.

So, the smartcard driver is stuck in its REMOVE_DEVICE handler, blocking-up the whole of the device tree.
That doesn't really strike me as startlingly useful info, except that it does confirm that the problem with the unloading is to do with the HIDCOM/CARD.SYS stack.

Let's look for other IRPs:


kd> !irpfind
unable to get large pool allocation table - either wrong symbols or pool tagging is disabled

Searching NonPaged pool (811b0000 : 81c00000) for Tag: Irp?

Irp [ Thread ] irpStack: (Mj,Mn) DevObj [Driver] MDL Process
81852880 [00000000] irpStack: ( 0, 0) 818516d0 [ \Driver\HidCom]
818a0008 [81bc9020] irpStack: (1b, 2) 81858a10 [ \Driver\CARD]


(I've snipped dozens of irrelevant ones)

There we see the PNP IRP we've already looked at, and another one which has been sent to our serial port driver.


kd> !irp 81852880
Irp is active with 10 stacks 9 is current (= 0x81852a10)
No Mdl Irp count = 81858884 Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context

(Lots of empty stack positions snipped)

>[ 0, 0] 0 e0 818516d0 00000000 f9e460f4-81851788 Success Error Cancel
\Driver\BcHidCom BcHidCom
Args: 00000000 00000000 00000000 00000000
[ e, 0] 0 e1 818516d0 00000000 f9da763c-81858ac8 Success Error Cancel pending
\Driver\BcHidCom GemPC413
Args: 81851b10 00000000 001b0048 00000000


This is of major type 0xe (IRP_MJ_DEVICE_CONTROL), which corresponds to the WIN32 DeviceIOControl() function, or to about half a dozen of the serial port API functions.

The IOCTL code is the third argument - 0x1b0048.

IOCTL's are tedious to decode at a glance, but the 0x1b tells us it's a serial port IOCTL, and the 0x48 is four times the IOCTL index number, which is 18 (decimal). This is IOCTL_SERIAL_WAIT_ON_MASK, which is the kind of thing you get from the WIN32 function WaitCommEvent. When you send IOCTL_SERIAL_WAIT_ON_MASK to a serial device, it holds that IRP pending until one of a number of configurable events happens on a port - for example a byte is received at the serial port.

At this point, we couldn't understand why HIDCOM still had an IRP pending after a surprise removal - drivers are supposed to fail any pending IO when they receive a surprise removal. Looking at the relevant HIDCOM device object:


kd> !devobj 818516d0
Device object (818516d0) is for:
HidCom1 \Driver\BcHidCom DriverObject 81af6af0
Current Irp 00000000 RefCount 0 Type 0000001b Flags 0000204c
Dacl e137aae4 DevExt 81851788 DevObjExt 81851e98
ExtensionFlags (0x00000008) DOE_REMOVE_PROCESSED
AttachedDevice (Upper) 81885c10 \Driver\serenum
AttachedTo (Lower) 81869030 \Driver\usbccgp
Device queue is not busy.


gives the helpful sounding DOE_REMOVE_PROCESSED, which certainly sounds like the system thinks it's sent all the REMOVE-like IRPs, but I don't know of any documentation which confirms that it means that.

The two obvious ways to confirm which PNP IRPs had been sent to our device are either to set breakpoints in our IRP dispatch routine, or to install a checked build of our driver - it traces things like that to the debug stream.

I find breakpoints like that rather tedious and easy to get confused in - and the fact that there are two instances of the device loaded makes them particular confusing. So we installed the checked build of our driver, and waded through lots of spew to find the PNP irps. This revealed that our HIDCOM devices were receiving and handling IRP_MN_SURPRISE_REMOVAL and we knew that contained code to our queues of read and write IRPs:


IrpFailQueue(&pExt->writeIrpQueue, STATUS_DELETE_PENDING);
IrpFailQueue(&pExt->readIrpQueue, STATUS_DELETE_PENDING);


And of course, that's the problem. The stuck IRP is neither a read nor a write - it's an IOCTL. The IRP for IOCTL_SERIAL_WAIT_ON_MASK was pending, and our driver wasn't completing it on a removal. So a simple addition of


WaitCommEventRelease(pExt, STATUS_DELETE_PENDING);


to the relevant section of our PNP code, and suddenly everything sprang into life. Now when I unplugged the device, I saw our device unload (my favourite bit of trace from this driver is the line in the DriverUnload handler...).

This was about 8.30 on a Friday night, and I was delighted to be able ship this version to the client for them to try on Monday morning.

And it took all the way until Monday afternoon before they were back with the next problem. I'll write that up in a minute.

Debugging Story One - Initial Bug Report

I spent a chunk of the last week tracking down a couple of Windows driver bugs. I rather like reading debugging war stories, and I wondered if there might be any interest in this one...

This debugging was done on behalf of a client who has a USB peripheral, based slightly on the Cypress Semiconductor CY4601 reference design. (See a previous post for our connection to this product.) In this application, our USB-Serial WDM driver is known as HIDCOM.SYS.

The client's product is a big piece of machinery, with a USB interface for connection to a Windows PC. Their device is configured so that it appears to be three USB peripherals:

  • An imaging device, which loads the Windows USBSCAN.SYS driver.

  • A first serial port, which is attached to a smartcard reader built into the machine. This uses our HIDCOM.SYS driver.

  • A second serial port, which is used for control functions within the machine. This also uses HIDCOM.SYS.


So, it's quite complicated already, but onto this lot, we add the following:

SERENUM.SYS - this is the MS-provided serial device enumeration driver. It loads as an upper filter on to serial port drivers, and is responsible for detecting PnP-compliant serial devices which are connected, and then loading drivers for them. Both HIDCOM devices have SERENUM layered on top by the HIDCOM.INF file.

CARD.SYS - In this application the smartcard reader was PnP compliant, so SERNUM would load another driver on top of one of the HIDCOM devices - we'll call this CARD.SYS. It's written by the manufacturer of the smartcard reader.

So, quite a stack, and unusual (c.f. other HIDCOM applications) in having a KM driver talking to us rather than a user mode application.

The problem the client had was that when they unplugged the device, their devices would NOT disappear from Device Manager. In fact, Device Manager would no longer do anything without hanging. USB devices were not recognised when connected or disconnected. The only option at that point was to reboot the PC.

The discovered that if they went to Device Manager and disabled the smartcard device BEFORE unplugging, then everything was OK.

I'll describe the start of the investigation in another post.

Our connection with CY4601

The Cypress Semiconductor CY4601 reference design is a 'USB to serial converter' solution. There are a number of such products around - some, like the FTDI solution, are based on an ASIC and don't have a microcontroller at the perpheral end at all. Others, like the Cypress version, use a general-purpose USB peripheral microcontroller with appropriate firmware.

It's surprisingly difficult to emulate a PC serial port with a USB peripheral. People who haven't tried rarely believe that, but it's true for a variety of reasons. Maybe I'll post about that another time. But I think most 'power users' will have come across USB-Serial widgets at one time or another, and the experience is often imperfect.

Anyway, our connection with the product is that a long time ago, we wrote the WDM driver which ships with the Cypress reference design. The version which you can actually download from Cypress is fairly rudimentary - since that particular contract with Cypress came to an end, we have continued to improve the driver and have built a number of custom versions for various other clients. The improvements have tended to be either to application compatibility, or to general robustness. Several versions of the driver have been WHQL signed for clients.

Prior to last week, we thought the current driver was pretty much there...