Hang in UsbEhci after a call to CancelIo

I have a simple driver for our USB device, based on the OSR USB sample
in the DDK.

The driver has been in use for over 8 years on thousands of instruments.

Recently, and only on one of our applications, we are getting system
lockups in the USB communication chain, apparently related to
CancelIo calls.

Of course, it doesn’t happen frequently or in any easily repeatable way…

My first idea was to port the driver from the WDM platform to the KMDF platform.

Under KMDF 1.7, when the driver hangs the only recourse is to turn off the PC.
If I try to shut down the application using the driver it never closes.

Under the WDM model, and under KMDF 1.9, if I unplug the device and then
plug it back in, the PC recovers from the error.

I grabbed a memory dump from when the I/O was hung, and after opening it
in windbg, it appears to be hung in \Driver\UsbEhci.

Irp is active with 5 stacks 2 is current (= 0x8c59af6c)
Mdl=89907f58: System buffer=898e00a0: Thread 89b506a0: Irp stack trace.
cmd flg cl Device File Completion-Context
[0, 0] 0 10 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000

[f, 0] 0 e1 89931030 00000000 8066510e-8c59af90 Success Error Cancel pending
\Driver\usbehci nt!IovpInternalCompletionTrap
Args: 893426dc 00000000 00220003 00000000
[f, 0] 0 e0 89c06c38 00000000 804f0800-89b334d0 Success Error Cancel
\DRIVER\VERIFIER nt!IopUnloadSafeCompletion
Args: 893426dc 00000000 00220003 00000000
[e, 0] 5 e1 898fe020 89928a38 8066510e-8c59afd8 Success Error Cancel pending
\Driver\MyDriver nt!IovpInternalCompletionTrap
Args: 00020400 00000004 00222001 00000000
[e, 0] 5 0 89bf6020 89928a38 00000000-00000000
\DRIVER\VERIFIER
Args: 00020400 00000004 00222001 00000000

Any ideas?

Thanks,
Peter Steinberg

xxxxx@thermofisher.com wrote:

> I have a simple driver for our USB device, based on the OSR USB sample
> in the DDK.
>
> The driver has been in use for over 8 years on thousands of instruments.
>
> Recently, and only on one of our applications, we are getting system
> lockups in the USB communication chain, apparently related to
> CancelIo calls.

That’s not an API one uses in the normal course of operation. Does this
only happen at shutdown or abnormal termination?

> Under KMDF 1.7, when the driver hangs the only recourse is to turn off the PC.
> If I try to shut down the application using the driver it never closes.
>
> Under the WDM model, and under KMDF 1.9, if I unplug the device and then
> plug it back in, the PC recovers from the error.
>
> I grabbed a memory dump from when the I/O was hung, and after opening it
> in windbg, it appears to be hung in \Driver\UsbEhci.

OK, so your driver has received one of your own ioctls (222001), and has
sent it down as an URB (220003). What URB were you sending here? Are
you checking for cancellation before sending the IRP down to USBD?
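Tim's reading of the dump can be checked by hand. In the !irp output, major function 0xe is IRP_MJ_DEVICE_CONTROL and 0xf is IRP_MJ_INTERNAL_DEVICE_CONTROL, and the third Args value is the control code. Control codes use the bit layout of the WDK's CTL_CODE macro, so a few lines of C can split them into fields. This is a sketch; the struct and function names are made up for illustration and are not WDK definitions.

```c
#include <stdint.h>
#include <stdio.h>

/* Fields of an I/O control code, using the bit layout of the WDK CTL_CODE macro:
 * bits 16-31 device type, 14-15 required access, 2-13 function, 0-1 transfer method. */
typedef struct {
    uint32_t device_type;
    uint32_t access;
    uint32_t function;
    uint32_t method;
} ioctl_fields;

static ioctl_fields decode_ioctl(uint32_t code)
{
    ioctl_fields f;
    f.device_type = code >> 16;
    f.access      = (code >> 14) & 0x3;
    f.function    = (code >> 2) & 0xFFF;
    f.method      = code & 0x3;
    return f;
}

/* Transfer method names, indexed by the two low bits of the code. */
static const char *method_name(uint32_t method)
{
    static const char *names[] = {
        "METHOD_BUFFERED", "METHOD_IN_DIRECT", "METHOD_OUT_DIRECT", "METHOD_NEITHER"
    };
    return names[method & 0x3];
}

static void print_ioctl(uint32_t code)
{
    ioctl_fields f = decode_ioctl(code);
    printf("0x%06X: device 0x%X, access %u, function 0x%X, %s\n",
           (unsigned)code, (unsigned)f.device_type, (unsigned)f.access,
           (unsigned)f.function, method_name(f.method));
}
```

Decoding 0x222001 gives device type 0x22 (FILE_DEVICE_UNKNOWN), function 0x800 and METHOD_IN_DIRECT, which is consistent with the MDL shown in the dump. Decoding 0x220003 gives device type 0x22, function 0 and METHOD_NEITHER, which is how IOCTL_INTERNAL_USB_SUBMIT_URB is encoded, so the IRP sitting in usbehci is carrying a URB.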


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

In general, the user application is waiting for the completion of an IOCTL (222001), which is turned into a read request (via WdfUsbTargetPipeFormatRequestForRead).

This is so that the device can notify the application if it needs attention (when we get to revise the device firmware, this will become a USB interrupt pipe).

If the application needs to send commands to the device, it calls CancelIo to cancel the outstanding read request.

On a different USB endpoint, when we are collecting large amounts of data, we keep multiple read requests outstanding so that the driver does not lose any data while waiting for the user code to resubmit a read request. When data collection is stopped, CancelIo is called to cancel those outstanding read requests.

The driver translates IOCTL 222001 into a bulk read request, and handles it by calling:
WdfRequestRetrieveInputBuffer to get the endpoint to use
WdfRequestRetrieveOutputMemory to get the buffer to transfer data into
WdfUsbTargetPipeFormatRequestForRead to build the read request
WdfRequestSetCompletionRoutine to set the I/O completion routine
WdfRequestSend to send the request down the chain

The I/O target passed to WdfRequestSend is the one returned by WdfUsbTargetPipeGetIoTarget.

No special checks are made for IRP cancellation. I can easily call WdfRequestIsCanceled before the call to WdfRequestSend if you think that would be helpful.

Thanks,
Peter

xxxxx@thermofisher.com wrote:

> In general, the user application is waiting for the completion of an IOCTL (222001), which is turned into a read request (via WdfUsbTargetPipeFormatRequestForRead).
>
> This is so that the device can notify the application if it needs attention (when we get to revise the device firmware, this will become a USB interrupt pipe).
>
> If the application needs to send commands to the device, it calls CancelIo to cancel the outstanding read request.

Why? You can send commands to other pipes while a read request is
outstanding.

> No special checks are made for IRP cancellation. I can easily call WdfRequestIsCanceled before the call to WdfRequestSend if you think that would be helpful.

I would certainly try that, but it’s superstition and voodoo (neither of
which I am particularly opposed to when coding). I sincerely doubt this
is your problem.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

The instrument has a command/response interface. We have one pipe for all commands going to the instrument, and one pipe for all responses coming from the instrument.

We cancel the outstanding read request so that the thread that is sending the command can properly get the reply from the instrument…

The actual interface is slightly more complicated. Since we have multiple threads and processes able to use the driver, we use a system global semaphore to only allow one thread to access the instrument at a time.

When a specific thread wants to use the instrument, it posts a global event that it wants to use the instrument, and then waits to acquire the global semaphore controlling instrument access.

When the thread waiting for the read to finish gets the event notification, it cancels the IO request and releases the semaphore so the other thread can get the instrument.
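The hand-off described above can be written out as a small state machine to make the ordering explicit. The following C sketch is a toy, single-threaded model, not the actual implementation: the global semaphore becomes an owner id, the global event becomes a flag, and "call CancelIo, then wait for the cancelled read to complete" collapses into one step. All names are hypothetical.

```c
#include <stdbool.h>

#define NO_OWNER (-1)

/* Toy model of the instrument hand-off. Hypothetical names throughout. */
typedef struct {
    int  owner;            /* id of the thread holding the global semaphore */
    bool access_requested; /* the global "I want the instrument" event */
    bool read_pending;     /* notification read outstanding in the driver */
    int  reads_cancelled;  /* number of reads the reader has cancelled */
} instrument_state;

static void instrument_init(instrument_state *s)
{
    s->owner = NO_OWNER;
    s->access_requested = false;
    s->read_pending = false;
    s->reads_cancelled = 0;
}

/* Reader thread: takes the semaphore and posts the notification read. */
static bool reader_start(instrument_state *s, int reader_id)
{
    if (s->owner != NO_OWNER)
        return false;
    s->owner = reader_id;
    s->read_pending = true;
    return true;
}

/* Command thread: signals the event, then tries to take the semaphore.
 * Returning false stands in for blocking on the semaphore. */
static bool commander_acquire(instrument_state *s, int id)
{
    s->access_requested = true;
    if (s->owner != NO_OWNER)
        return false;
    s->owner = id;
    s->access_requested = false;
    return true;
}

/* Reader thread, on seeing the event: cancel the pending read, then release.
 * In the real code CancelIo only requests cancellation (and only for I/O
 * issued by the calling thread), so the reader would have to wait for the
 * read to complete before releasing; this model treats that as immediate. */
static void reader_on_event(instrument_state *s, int reader_id)
{
    if (!s->access_requested || s->owner != reader_id)
        return;
    if (s->read_pending) {
        s->read_pending = false;
        s->reads_cancelled++;
    }
    s->owner = NO_OWNER;
}
```

A typical sequence is reader_start, a commander_acquire that has to wait, reader_on_event, and then a commander_acquire that succeeds. Keeping "cancel" and "release" as separate steps makes it clear that the command thread only gets the instrument after the cancelled read is out of the way.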

Your mechanism is convoluted and may suffer from data loss:

“When the thread waiting for the read to finish gets the event notification, it
cancels the IO request and releases the semaphore so the other thread can get
the instrument.”

When a data-in URB is canceled, the size of the data transferred, as reported in the URB, may not be reliable.

You may also want to disable selective suspend for your device. It gives more trouble than it is worth.

wrote in message news:xxxxx@ntdev…
> The actual interface is slightly more complicated. Since we have multiple
> threads and processes able to use the driver, we use a system global
> semaphore to only allow one thread to access the instrument at a time.
>
> When a specific thread wants to use the instrument, it posts a global
> event that it wants to use the instrument, and then waits to acquire the
> global semaphore controlling instrument access.
>
> When the thread waiting for the read to finish gets the event
> notification, it cancels the IO request and releases the semaphore so the
> other thread can get the instrument.
>

Without any actual code it’s only possible to guess where the problem lies;
possibly it’s in the synchronization that allows only one thread at a time to
access the instrument. Are you by any chance using (Ke)PulseEvent for event
notification? I ask because that’s broken.
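Daniel's warning refers to a documented problem: PulseEvent only releases threads that are waiting at the instant of the pulse, and Microsoft's documentation says it is unreliable and should not be used, because a waiting thread can be momentarily taken out of its wait (for example by a kernel-mode APC) and miss the pulse. The toy model below, in C with made-up names, contrasts that with a stateful event that stays signalled until a waiter consumes it. It is a single-threaded illustration of the lost wake-up, not a synchronization primitive.

```c
#include <stdbool.h>

/* Toy event, used two ways: pulsed (like PulseEvent on a manual-reset event)
 * or set (like SetEvent on an auto-reset event). Hypothetical names. */
typedef struct {
    bool signaled;  /* state remembered by a set event */
    int  waiting;   /* threads blocked in the wait right now */
    int  released;  /* wake-ups actually delivered */
} toy_event;

/* Pulse: releases only the threads waiting at this instant and leaves the
 * event non-signalled, so a thread that arrives later sees nothing. */
static void toy_pulse(toy_event *e)
{
    if (e->waiting > 0) {
        e->released += e->waiting;
        e->waiting = 0;
    }
}

/* Set (auto-reset): wakes one waiter, or stays signalled until one arrives. */
static void toy_set(toy_event *e)
{
    if (e->waiting > 0) {
        e->waiting--;
        e->released++;
    } else {
        e->signaled = true;
    }
}

/* A thread reaches the wait. Returns true if it proceeds immediately,
 * false if it has to block. */
static bool toy_wait(toy_event *e)
{
    if (e->signaled) {
        e->signaled = false;
        e->released++;
        return true;
    }
    e->waiting++;
    return false;
}
```

With the same timing (signal first, wait second), the pulsed event leaves the waiter blocked while the set event lets it through, which is exactly the window in which a pulse-based notification scheme can hang.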

When you say the only recourse is to turn off the PC, do you mean you cannot
break in with the debugger? You might want to add some sanity checking and
data collection to your code, then break into the debugger and check the state
of things to investigate the problem further.

//Daniel

xxxxx@broadcom.com wrote in news:xxxxx@ntdev:

> Your mechanism is convoluted and may suffer from data loss:
>
> “When the thread waiting for the read to finish gets the event
> notification, it cancels the IO request and releases the semaphore so
> the other thread can get the instrument.”
>
> When a data-in URB is canceled, the size of the data transferred, as
> reported in the URB, may not be reliable.
>
> You may also want to disable selective suspend for your device. It
> gives more trouble than it is worth.

I inherited the mechanism and it’s not feasible to change at the moment.

When the URB is cancelled, its data is ignored.
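The policy described here, discarding whatever a cancelled read returned, comes down to one check in the completion path. In a KMDF completion routine the status would come from CompletionParams->IoStatus.Status and the length from CompletionParams->Parameters.Usb.Completion->Parameters.PipeRead.Length, as in the OSR USB sample; the sketch below takes those as plain arguments so it can run outside a driver. The function name, the SKETCH_ macro names and the clamp to the buffer size are illustrative assumptions, not part of the driver discussed here.

```c
#include <stddef.h>
#include <stdint.h>

/* NTSTATUS values from ntstatus.h, redefined here so the sketch builds
 * outside the WDK. */
#define SKETCH_STATUS_SUCCESS   0x00000000u
#define SKETCH_STATUS_CANCELLED 0xC0000120u

/* Returns how many bytes of a completed bulk-in read may be passed up to the
 * application. Anything other than STATUS_SUCCESS (including a read completed
 * as STATUS_CANCELLED) contributes nothing, whatever length was reported. */
static size_t usable_read_length(uint32_t status, size_t reported_length,
                                 size_t buffer_length)
{
    if (status != SKETCH_STATUS_SUCCESS) {
        return 0;                /* cancelled or failed: ignore the data */
    }
    if (reported_length > buffer_length) {
        return buffer_length;    /* defensive clamp; should never trigger */
    }
    return reported_length;
}
```

Treating the reported length as meaningless whenever the status is not success sidesteps the question of whether the URB's transfer length is trustworthy after cancellation.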

I haven’t played with selective suspend specifically, but the setting on the USB Root Hub “Allow the computer to turn off this device to save power” is disabled.

wrote in news:xxxxx@ntdev:

> Without any actual code it’s only possible to guess where the problem
> lies; possibly it’s in the synchronization that allows only one thread
> at a time to access the instrument. Are you by any chance using
> (Ke)PulseEvent for event notification? I ask because that’s broken.
>
> When you say the only recourse is to turn off the PC, do you mean you
> cannot break in with the debugger? You might want to add some sanity
> checking and data collection to your code, then break into the debugger
> and check the state of things to investigate the problem further.

We’re not using (Ke)PulseEvent.

I can break in with the debugger; that’s where the initial WinDbg output came from.

I need to turn off the PC because the process with the IRP cannot be killed. Task Manager and Process Explorer both fail. The PC cannot shut down either: it closes most processes and then hangs forever.

I don’t know how to cancel the IRP through the debugger.

I’ve logged the various calls, at both the user level and the driver level, and everything looks fine.

Any code needed is available.

Thanks,
Peter

.reboot in WinDbg will work very well.

Gary G. Little
H (952) 223-1349
C (952) 454-4629
xxxxx@comcast.net

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@thermofisher.com
Sent: Thursday, September 23, 2010 11:52 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Hang in EsbEhci after a call to CancelIo



NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer