Ok, so we fixed the problem with the WaitCommEvent IRP not being failed on a SURPRISE_REMOVAL. I'd always prefer the bugs to be found in someone else's code, but really I'm just happy when they're fixed.
So I was somewhat dischuffed when the client came back (after having made initial sounds of delight) to report that they could only unplug and replug the device four times. After the fourth time, the smartcard driver would appear 'banged out' in Device Manager, with a 'Code 10' report on it.
Dead strange this - clearly something was remembering to count to four somewhere and something was imposing a limit of four on itself. The biggest problem was that I couldn't reliably reproduce this - some machines did it, sometimes, but nothing which I was prepared to run a kernel debugger on would demonstrate the problem at all consistently.
Whenever I ran this on a machine where I had good visibility, the driver stack would unload properly and, as you'd expect, this didn't leave anything around to remember that it was supposed to be counting to four. And yet the client was reporting this on every machine they tested on, both XP and W2K.
Eventually, after the client rejected several attempts by me to admit defeat, we jointly worked out that the problem only showed-up on a machine where the Windows Smart Card service was running. (Can you tell I'm not a smart-card expert yet?)
A quick 'net start "smart card"' and there we were, with a driver stack which didn't unload properly.
Unlike the last time, we weren't stuck with the PnP system all locked-up, so we could re-plug the device again successfully - but each time we did this, another pair of HIDCOM devices and another CARD device were created. The CARD driver sets itself a limit of four devices - after that it fails to create any more. The HIDCOM device cheerfully goes on forever adding devices.
Stopping the smart card service after unplugging didn't help - once the devices were stuck, they were there for good.
So for more detail, back to Windbg:
Irp [ Thread ] irpStack: (Mj,Mn) DevObj [Driver] MDL Process
8a638840 [8a559478] irpStack: ( e, 0) 8a5acb48 [\Driver\CARD]
kd> !irp 8a638840
Irp is active with 9 stacks 9 is current (= 0x8a6389d0)
No Mdl System buffer = 8a0ed9e8 Thread 8a559478: Irp stack trace.
cmd flg cl Device File Completion-Context
>[ e, 0] 5 0 8a5acb48 8a29d2f8 00000000-00000000
Args: 00000004 00000000 00310038 00000000
As you can see, there's an IRP pending for the smartcard driver. It's IRP_MJ_DEVICE_CONTROL, and corrsponds to IOCTL_SMARTCARD_GET_STATE (that's the 00310038 ).
So, again the problem is a dangling IRP. There shouldn't be any pending IRPs at this point; the driver should have completed them all during its remove processing.
This didn't really look like a problem with our driver (HIDCOM) - it looked like a problem with the smart card driver, for which we had neither source nor symbols. And our client said of the smartcard system vendor "their support is null". After another half-hearted (and rejected) attempt to have the client abandon the job, we fired-up Ida Pro (www.datarescue.com - superb product and service) and set off into CARD.SYS.
You get a pretty gentle start into reverse engineering device drivers - DriverEntry is the DLL's entry point, and the filling-in of the function pointers in the DriverObject is easy to dissassemble and gives you a nice initial set of named functions with known signatures:
mov ecx, offset sub_14464
mov [eax+38h], ecx
mov [eax+40h], ecx
mov dword ptr [eax+80h], offset loc_14592
mov dword ptr [eax+70h], offset loc_13FDE
mov dword ptr [eax+0A4h], offset sub_1423C
mov dword ptr [eax+90h], offset loc_14410
mov dword ptr [eax+94h], offset loc_13FB6
mov ecx, offset DispatchCreateClose
mov [eax+DRIVER_OBJECT.MajorFunction], ecx
mov [eax+(DRIVER_OBJECT.MajorFunction+8)], ecx
mov [eax+(DRIVER_OBJECT.MajorFunction+48h)], offset DispatchCleanup
mov [eax+(DRIVER_OBJECT.MajorFunction+38h)], offset DispatchDeviceControl
mov [eax+(DRIVER_OBJECT.MajorFunction+6Ch)], offset DispatchPnp
mov [eax+(DRIVER_OBJECT.MajorFunction+58h)], offset DispatchPower
mov [eax+(DRIVER_OBJECT.MajorFunction+5Ch)], offset DispatchSystemControl
As an aside - windbg will also give you the major function pointer offsets like this:
kd> !drvobj CARD 7
Driver object (8ab086e0) is for:
Driver Extension List: (id , addr)
Device Object list:
DriverEntry: b4c1c360 CARD
DriverUnload: b4c1c1c2 CARD
 IRP_MJ_CREATE b4c1a464 CARD+0x4464
 IRP_MJ_CREATE_NAMED_PIPE 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_CLOSE b4c1a464 CARD+0x4464
 IRP_MJ_READ 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_WRITE 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_QUERY_INFORMATION 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_SET_INFORMATION 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_QUERY_EA 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_SET_EA 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_FLUSH_BUFFERS 804fcab1 nt!IopInvalidDeviceRequest
[0a] IRP_MJ_QUERY_VOLUME_INFORMATION 804fcab1 nt!IopInvalidDeviceRequest
[0b] IRP_MJ_SET_VOLUME_INFORMATION 804fcab1 nt!IopInvalidDeviceRequest
[0c] IRP_MJ_DIRECTORY_CONTROL 804fcab1 nt!IopInvalidDeviceRequest
[0d] IRP_MJ_FILE_SYSTEM_CONTROL 804fcab1 nt!IopInvalidDeviceRequest
[0e] IRP_MJ_DEVICE_CONTROL b4c19fde CARD+0x3fde
[0f] IRP_MJ_INTERNAL_DEVICE_CONTROL 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_SHUTDOWN 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_LOCK_CONTROL 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_CLEANUP b4c1a592 CARD+0x4592
 IRP_MJ_CREATE_MAILSLOT 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_QUERY_SECURITY 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_SET_SECURITY 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_POWER b4c1a410 CARD+0x4410
 IRP_MJ_SYSTEM_CONTROL b4c19fb6 CARD+0x3fb6
 IRP_MJ_DEVICE_CHANGE 804fcab1 nt!IopInvalidDeviceRequest
 IRP_MJ_QUERY_QUOTA 804fcab1 nt!IopInvalidDeviceRequest
[1a] IRP_MJ_SET_QUOTA 804fcab1 nt!IopInvalidDeviceRequest
[1b] IRP_MJ_PNP b4c1a23c CARD+0x423c
We're interested in the handling of IOCTLs, so let's have a look at DispatchDeviceControl - it starts like this:
mov eax, [esp+4]
mov esi, [eax+28h]
mov eax, [esi+2C4h]
cmp dword ptr [eax+0A4h], 0
jnz short loc_14002
mov eax, 0C00002B6h
I shan't bore everyone with the whole function, but that jump takes us to the epilogue of the function where it cleans up the stack frame, and returns the value in EAX. We know the signature of this function - the return is an NTSTATUS value. Looking in ntstatus.h, we find that 0xC00002B6 corresponds to STATUS_DEVICE_REMOVED.
So clearly this bit of code
cmp dword ptr [eax+0A4h], 0
jnz short loc_14002
mov eax, 0C00002B6h
is saying something along the lines of
which sounded sensible.
So it seems that the author of the IRP_MJ_DEVICE_CONTROL handler has at least thought about device removal. We decided to set a breakpoint on the start of the handler and see what happened when the device was unplugged. There is a slow but steady stream of IOCTLS sent to the smart card driver, presumably by the smart card service, so you do break on entry to this function every so often.
But the interesting break occurred after our HIDCOM driver had processed SURPRISE_REMOVAL and failed all the existing I/O which CARD.SYS had pending. At this point, CARD.SYS received another IOCTL, and this time passed through the piece of code which returns STATUS_DEVICE_REMOVED.
And, of course, that's the problem. Unless you return STATUS_PENDING with aspirations of completing the IRP in the future, you must complete it in the dispatch routine. In fact, Driver Verifier does catch this bug - but in all the hassle of struggling to repro thia problem, there had somehow never been the right combination of service running, verifier running, problem occurring, etc, etc. Sigh.
Now we were faced with a driver with no access to the source code, in which we needed to complete an IRP. There's not much code involved in completing an IRP (at least for the driver), and very similar code ends-up all over the place. To our great good fortune, another error condition in the same dispatch routine was handled properly - complete the IRP and return STATUS_DEVICE_REMOVED.
It took a one-byte patch to the jmp loc_14059 instruction to make it jump to the code which did the correct complete-and-return. This seemed to work beautifully - everything unloading fine. (It took another one-byte patch to the checksum on the driver to make it load again.)
Then I couldn't resist changing the version resource on the driver to get our name on it, so we had to have a two-byte patch to the checksum...
Feeling pretty smug now, I zipped-up the patched driver and sent it off to the client. It was another Friday night, and a satisfactory end to what had been a very frustrating time, largely spend trying to repro the problem rather than trying to fix it.
This time we didn't even make it until Monday lunchtime before they were back to us...