Terminate system thread hangs

Hi all

I’m having a few problems when I terminate a system thread and would appreciate any clues as to how to find the cause.

I create multiple threads for different purposes with PsCreateSystemThread using ObReferenceObjectByHandle to give me a reference. Later, I use an event to kill the thread and wait on the thread reference with KeWaitForSingleObject for the thread to terminate via PsTerminateSystemThread.

The problem is that occasionally, although debug prints show that PsTerminateSystemThread is being called, the thread reference does not get signalled and KeWaitForSingleObject times out after maybe 5 or 10 seconds.

This is a driver for a USB device and, although I am sending IRPs to the device from some of the threads, WinDbg !irpfind does not report any IRPs at all if I break in during the timeout. I assume that Windows is waiting for something to complete or be destroyed or dereferenced, but I don’t know enough to know where to begin looking.

Any advice or suggestions appreciated.

Sean

Create:

PsCreateSystemThread(&ThreadHandle, (ACCESS_MASK)0L, 0, 0, 0, TimerThread, pdlThread);
ObReferenceObjectByHandle(ThreadHandle,(ACCESS_MASK)0L,NULL,KernelMode,&pdlThread->ThreadRef, NULL);

Terminate:

KeSetEvent(&pdlThread->ThreadKillEvent, IO_NO_INCREMENT, FALSE);
KeWaitForSingleObject(pdlThread->ThreadRef, Executive, KernelMode, FALSE, &Timeout);

Thread:

while(!quit)
{
if ( KeWaitForSingleObject(&pdlThread->ThreadKillEvent, Executive, KernelMode, FALSE, &Timeout) != STATUS_TIMEOUT )
quit = TRUE;
}
PsTerminateSystemThread(STATUS_SUCCESS);

In the debugger, check where the thread is hanging. It may be waiting for all thread-associated IRPs to complete, if you issue any, though async USB IRPs you build yourself are not usually associated with a thread.

I think you are missing a call to ZwClose(handle) after getting the
object reference.

Mark Roddy

On Mon, Dec 6, 2010 at 1:04 PM, wrote:
> PsCreateSystemThread

> I think you are missing a call to ZwClose(handle) after getting the object reference.

The only thing that may result from failing to close a handle is a non-paged memory leak (because the ETHREAD that gets allocated from non-paged pool will stay on the thread list until its refcount goes down to zero). It is not going to prevent the ETHREAD from getting into the signaled state, is it? Therefore, no matter how serious this bug is in itself, I think the OP’s problem arises for some other reason…

Anton Bassov
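
As an illustration of the point above (this is not code from the thread), here is a minimal user-mode sketch in plain C. It assumes a heavily simplified model in which handle and pointer references are lumped into a single counter; `ToyThread`, `toy_release` and `toy_thread_lifetime` are hypothetical names, and nothing here reflects the real ETHREAD layout. It only shows that, in this model, a forgotten handle close keeps the object allocated but does not stop it from becoming signaled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a thread object; NOT the real ETHREAD layout. */
typedef struct {
    int  refs;      /* outstanding references (handle and pointer refs lumped together) */
    bool signaled;  /* becomes true when the thread terminates */
    bool freed;     /* becomes true when the last reference is dropped */
} ToyThread;

static void toy_release(ToyThread *t)
{
    assert(t->refs > 0);
    if (--t->refs == 0)
        t->freed = true;
}

/* Walk the create / reference / terminate / dereference sequence,
 * optionally closing the creation handle as ZwClose would. */
ToyThread toy_thread_lifetime(bool close_handle)
{
    ToyThread t = { 0, false, false };

    t.refs = 2;          /* creation: one ref for the running thread, one for the handle */
    t.refs++;            /* ObReferenceObjectByHandle: the driver's pointer reference */
    if (close_handle)
        toy_release(&t); /* ZwClose(ThreadHandle) */

    t.signaled = true;   /* thread terminates: signaled regardless of the count */
    toy_release(&t);     /* the thread's own reference goes away on exit */

    toy_release(&t);     /* ObDereferenceObject once the wait has been satisfied */
    return t;
}
```

With `close_handle` set to false the object ends up signaled but never freed (one reference is left over); with it set to true the object is both signaled and freed.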

Hi all

Thanks for the comments. Just to confirm that the problem persists if I call ZwClose(ThreadHandle) after ObReferenceObjectByHandle. I did have this call originally but had commented it out whilst debugging this issue.

If I break in before the timeout expires and check the !running threads, I get the following output which may be a clue? MyDriver!fault is worrying.

Many thanks

Sean

Child-SP RetAddr Call Site
fffff88009758570 fffff80002ec0d57 nt!MiUnlinkPageFromLockedList+0x359
fffff88009758600 fffff80002eab506 nt!MiResolveTransitionFault+0x167
fffff88009758690 fffff80002ea94d1 nt!MiDispatchFault+0x946
fffff880097587a0 fffff80002e8cfee nt!MmAccessFault+0x8f1
fffff88009758900 fffff800030f7d6c nt!KiPageFault+0x16e
fffff88009758a90 fffff800031715f7 nt! ?? ::NNGAKEGL::string'+0x1e9c7
fffff88009758ac0 fffff800031752a1 nt!PspExitThread+0x2c7
fffff88009758b90 fffff8000311f884 nt!PspTerminateThreadByPointer+0x4d
fffff88009758be0 fffff88008243da8 nt!PsTerminateSystemThread+0x24
fffff88009758c10 fffff88008248b80 MyDriver!fault+0x4598
fffff88009758c18 fffffa8008f5cb60 MyDriver!fault+0x9370
fffff88009758c20 0000000000000065 0xfffffa8008f5cb60
fffff88009758c28 0000000000000003 0x65
fffff88009758c30 fffff88009758c40 0x3
fffff88009758c38 fffffa8008c39950 0xfffff88009758c40
fffff88009758c40 fffffffffffe7960 0xfffffa8008c39950
fffff88009758c48 0000000000000000 0xffffffff`fffe7960

Hmmm…

I can’t seem to repeat that last trace, so maybe it is a red herring. I feel sure that Windows is waiting for 5s, either for something to be released before signalling, or because whatever thread does the signalling is being held off. I am killing the thread in response to IRP_MJ_CLOSE, by the way.

Regards

Sean

You need to make symbols for your driver available for the debugger. The call stack shows that the driver symbols are not loaded.

What calls do you use to build the USB IRPs?

Hi Alex

I never seem to have all the symbols working correctly, seems to take ages loading them too under x64.

Mostly just use IoAllocateIrp and UsbBuildInterruptOrBulkTransferRequest to make up the USB calls.

Even though I abort and reset the USB pipes long before all this thread termination stuff happens, I wonder if there are still some pending async transfers that I haven’t accounted for… Perhaps that would explain the problem.

Regards

Sean

A little more information…

The whole shutdown sequence begins when the USB device is “surprise” unplugged. The very first thing I know about it is when an IoCallDriver, which is called in my thread, returns STATUS_NO_SUCH_DEVICE. Some short time later I get an IRP_MN_SURPRISE_REMOVAL as normal. Some time after that I close the app, which results in an IRP_MJ_CLOSE during which I try to kill the problem threads.

So, I start the threads running during the IRP_MN_START_DEVICE, stop talking to the device after IRP_MN_SURPRISE_REMOVAL but terminate them during IRP_MJ_CLOSE. Could there be a problem as although the driver is still there, the device has gone?

I hope this is making some sense…

Cheers

Sean

xxxxx@deva.co.uk wrote:

> A little more information…
>
> The whole shutdown sequence begins when the USB device is “surprise” unplugged. The very first thing I know about it is when an IoCallDriver, which is called in my thread, returns STATUS_NO_SUCH_DEVICE. Some short time later I get an IRP_MN_SURPRISE_REMOVAL as normal. Some time after that I close the app, which results in an IRP_MJ_CLOSE during which I try to kill the problem threads.

Why not just kill the threads during surprise removal? If the threads
are talking to the device, then they won’t have anything to do after a
surprise removal.

I don’t see anything immediately wrong with what you have described.
Note, however, that you should not submit any new URBs after you get the
surprise removal notification. URBs that are in-flight should complete
with errors, but new URBs submitted after that point can cause problems.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks for the comments Tim

I just changed the code around to kill the threads during surprise removal and I still get the same problem. Also, there aren’t (any longer…) any URBs sent after the surprise removal, so this doesn’t seem to be the cause.

I just wish I could work out what was stopping the thread reference from being signalled.

I actually have four threads, three of which touch the USB and it is these threads that have the problem.

However, the simplest of these threads just passes a few bytes of data from a queue in the device extension to the USB device and returns a few more bytes of data from the device back to a similar queue. It doesn’t mess about with any objects or do anything “fancy”. Just a synchronous bulk transfer, in fact. I’m struggling to see how this can stop the thread terminating.

Regards

Sean

Hi

This just gets better… I simplified the driver down to one simple thread, where now only 1 time in 40 or so a surprise removal results in the thread reference not being signalled.

In the case where it fails, when the device is unplugged the completion routine for my USB transfer reports USBD_STATUS_XACT_ERROR in pUrb->UrbBulkOrInterruptTransfer.Hdr.Status, rather than the USBD_STATUS_DEVICE_GONE I see in the cases where it all works fine.

It seems unplugging the device can result in one of the two errors, but when I get the USBD_STATUS_XACT_ERROR something is left dangling that prevents the thread reference being signalled when the thread tries to terminate. My code doesn’t differentiate the errors in any way.

Any ideas guys? I’m out of my depth…

Cheers
Sean

xxxxx@deva.co.uk wrote:

> This just gets better… I simplified the driver down to one simple thread, where now only 1 time in 40 or so a surprise removal results in the thread reference not being signalled.
>
> In the case where it fails, when the device is unplugged the completion routine for my USB transfer reports USBD_STATUS_XACT_ERROR in pUrb->UrbBulkOrInterruptTransfer.Hdr.Status, rather than the USBD_STATUS_DEVICE_GONE I see in the cases where it all works fine.
>
> It seems unplugging the device can result in one of the two errors, but when I get the USBD_STATUS_XACT_ERROR something is left dangling that prevents the thread reference being signalled when the thread tries to terminate. My code doesn’t differentiate the errors in any way.

What do you do in response to the errors? Do you delete the URB and
complete the IRP back to user mode? Or is it your own set of URBs? How
is the thread involved here? Usually, in a continuous reader situation,
you’d resubmit in the completion handler. What’s the purpose of the thread?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Hi Tim

Thanks for the reply but I’m off home now… The thread just sends a few bytes out to the device and reads a few bytes back every 100ms. The whole scenario is very similar to scenario 11. http://support.microsoft.com/kb/326315/

I need to do some more debug in the morning but it may be linked to how I call abort and reset pipe to try and clear errors…

Cheers

Sean

Does it do this ‘sending and receiving’ using its own IRPs that might
be incomplete and associated with the thread object?
Mark Roddy

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer


In my experience, abort and reset pipe URBs (as well as many other special URBs) had to be sent at PASSIVE_LEVEL. If you by any chance send them at DISPATCH_LEVEL, things may get ugly. I don’t know if this requirement still holds in Vista+.

Anyway, please confirm that you don’t call the IoBuildDeviceIoControlRequest or IoBuildSynchronousFsdRequest functions in your thread. These functions require special considerations in the IRP handling.

AFAIK, the IRQL requirements for reset and abort are still that they must be sent at PASSIVE_LEVEL. KMDF goes through some pain to make sure this happens when errors occur on the continuous readers.

d


Hi all

Thanks for your comments, I just wrote and posted a lengthy reply and explanation but it seems to have disappeared… not my week…

The short version of the earlier reply is that yes, it turns out we are calling IoBuildDeviceIoControlRequest to issue an abort and reset from our thread at IRQL 0. This only happens on the rare occasion an unplug causes the xact error instead of the device removed error (as detected in the completion routine). This is because in this case the call to IoCallDriver (in the thread) returns STATUS_UNSUCCESSFUL instead of STATUS_NO_SUCH_DEVICE, which leads us to think we can abort/reset and retry. The abort succeeds but the IoCallDriver for the reset fails, and we panic and don’t complete the IRP.

Bingo! This is the bug that must lead to the uncompleted IRP that I can’t find in WinDbg.

The abort/reset code has been around a long time and I haven’t been focussed on looking for problems in that area. If I complete the IRP after IoCallDriver fails, then first indications are that it is now working.

However, I don’t understand why calling the IRP complete function doesn’t result in the completion routine running if the IoCallDriver fails.

Thanks again for your help. It was the witch hunt for IoBuildDeviceIoControlRequest that eventually uncovered the problem with our SendAwaitUrb function.

Sean

PS: This is the function used to send the URB. When IoCallDriver fails, it now completes the IRP but doesn’t hang around for the completion routine.

NTSTATUS SendAwaitUrbOnComplete(PDEVICE_OBJECT pdo, PIRP Irp, PVOID pev)
{
    KeSetEvent((PKEVENT) pev, IO_NO_INCREMENT, FALSE);

    if(!NT_SUCCESS(Irp->IoStatus.Status))
    {
        ser_print("SendAwaitUrbOnComplete failed %lx\r\n",Irp->IoStatus.Status);
    }

    return STATUS_MORE_PROCESSING_REQUIRED;
}

NTSTATUS SendAwaitUrb(IN PDEVICE_OBJECT LowerDeviceObject, PURB urb)
{
    KEVENT event;
    IO_STATUS_BLOCK iostatus={0};
    PIRP Irp;
    PIO_STACK_LOCATION stack;
    NTSTATUS ntStatus;

    KeInitializeEvent(&event, NotificationEvent, FALSE);

    Irp = IoBuildDeviceIoControlRequest(IOCTL_INTERNAL_USB_SUBMIT_URB, LowerDeviceObject, NULL, 0, NULL, 0, TRUE, &event, &iostatus);
    if (!Irp)
    {
        ser_string("Unable to allocate IRP for sending URB\n");
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    IoSetCompletionRoutine(Irp, SendAwaitUrbOnComplete, (PVOID) &event, TRUE, TRUE, TRUE);

    stack = IoGetNextIrpStackLocation(Irp);
    ASSERT(stack != NULL);
    stack->Parameters.Others.Argument1 = (PVOID) urb;

    ntStatus = IoCallDriver(LowerDeviceObject, Irp);

    if(!NT_SUCCESS(ntStatus))
    {
        ser_print("IoCallDriver failed %lx\r\n",ntStatus);
        ser_print("IoCallDriver failed ioStatus %lx\r\n",iostatus.Status);
        ser_print("IoCallDriver failed urb->UrbHeader.Status %lx\r\n",urb->UrbHeader.Status);

        // KeClearEvent(&event);
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        // ser_string("SendAwaitUrb calling KeWaitForSingleObject\r\n");
        // ASSERT_KEWAIT_IRQL(0);
        // KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
        // ser_string("SendAwaitUrb KeWaitForSingleObject returned\r\n");

        return ntStatus;
    }

    if (ntStatus == STATUS_PENDING)
    {
        LARGE_INTEGER Timeout;
        Timeout.QuadPart = -1 * 10000000;

        ASSERT_KEWAIT_IRQL(Timeout.QuadPart);
        if ( KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, &Timeout) == STATUS_TIMEOUT)
        {
            IoCancelIrp(Irp); // okay in this context
            //ser_string("SendAwaitUrb calling KeWaitForSingleObject\r\n");
            ASSERT_KEWAIT_IRQL(0);
            KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
            //ser_string("SendAwaitUrb KeWaitForSingleObject returned\r\n");

            ntStatus = STATUS_IO_TIMEOUT; // presumably we haven't had an error as the irp hasn't completed
        }
        else
        {
            ntStatus = iostatus.Status;
        }

        if (!NT_SUCCESS(ntStatus))
        {
            ser_print("SendAwaitUrb KeWaitForSingleObject iostatus 0x%08lx\r\n",ntStatus);
        }
    }

    KeClearEvent(&event);
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    ser_string("SendAwaitUrb calling KeWaitForSingleObject\r\n");
    ASSERT_KEWAIT_IRQL(0);
    KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
    ser_string("SendAwaitUrb KeWaitForSingleObject returned\r\n");

    return ntStatus;
}
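
A side note on the `Timeout.QuadPart = -1 * 10000000;` line in the function above: kernel wait timeouts are counted in 100-nanosecond units, and a negative value means "relative to now", so that literal is a one-second relative timeout. The small helper below is only a sketch of that arithmetic in portable C; `relative_timeout_from_ms` and `TICKS_PER_MS` are hypothetical names, not WDK ones.

```c
#include <assert.h>
#include <stdint.h>

/* One millisecond is 10,000 units of 100 nanoseconds. */
#define TICKS_PER_MS 10000LL

/* Build the value that would go into LARGE_INTEGER.QuadPart for a
 * relative wait: a negative count of 100 ns units. */
int64_t relative_timeout_from_ms(int64_t milliseconds)
{
    assert(milliseconds >= 0);
    return -(milliseconds * TICKS_PER_MS);
}
```

For example, `relative_timeout_from_ms(1000)` gives the same value as the literal used in SendAwaitUrb.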

Assuming I understand your question correctly: Completing an IRP with a completion routine registered, will NEVER cause the completion routine of the driver calling IoCompleteRequest to run. This is, according to the architecture, “how things work.”

By way of implementation, it so happens that when you register a completion routine, that completion routine is set in the NEXT I/O stack location of the IRP. And IoCompleteRequest starts looking for completion routines to call in the CURRENT I/O stack location. But, again, that’s implementation.

Does that answer your question? Sorry if I misunderstood, I’ve only been tangentially following this thread.

Peter
OSR
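
The stack-location behaviour described above can be modelled in a few lines of user-mode C. This is a toy sketch only, under loud assumptions: `ToyIrp` and every `toy_*` function are hypothetical stand-ins, there is a single lower driver, and details such as STATUS_MORE_PROCESSING_REQUIRED are left out. It shows just the one point at issue: a completion routine is stored in the next (lower) slot, while completion starts walking at the current slot.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_STACK_COUNT 1   /* assumption: one driver below the IRP's creator */

typedef struct ToyIrp ToyIrp;
typedef void (*ToyCompletionRoutine)(ToyIrp *irp);

struct ToyIrp {
    /* Slots 1..TOY_STACK_COUNT are used; slot 0 is padding so indices match. */
    ToyCompletionRoutine routine[TOY_STACK_COUNT + 1];
    int current;        /* TOY_STACK_COUNT + 1 means "still with the creator" */
    int routines_run;   /* how many completion routines have been invoked */
};

static void toy_on_complete(ToyIrp *irp) { irp->routines_run++; }

static void toy_init(ToyIrp *irp)
{
    for (int i = 0; i <= TOY_STACK_COUNT; i++)
        irp->routine[i] = NULL;
    irp->current = TOY_STACK_COUNT + 1;
    irp->routines_run = 0;
}

/* Like IoSetCompletionRoutine: stores the routine in the NEXT (lower) slot. */
static void toy_set_completion(ToyIrp *irp, ToyCompletionRoutine r)
{
    assert(irp->current > 1);
    irp->routine[irp->current - 1] = r;
}

/* Like IoCallDriver: the next slot becomes the current one. */
static void toy_call_driver(ToyIrp *irp)
{
    assert(irp->current > 1);
    irp->current--;
}

/* Like IoCompleteRequest: walks upward, starting at the CURRENT slot. */
static void toy_complete(ToyIrp *irp)
{
    while (irp->current <= TOY_STACK_COUNT) {
        ToyCompletionRoutine r = irp->routine[irp->current];
        irp->current++;
        if (r != NULL)
            r(irp);
    }
}

/* The creator registers a routine and then completes the IRP itself. */
int toy_creator_completes_directly(void)
{
    ToyIrp irp;
    toy_init(&irp);
    toy_set_completion(&irp, toy_on_complete);
    toy_complete(&irp);
    return irp.routines_run;
}

/* The creator registers a routine, sends the IRP down, and the lower driver completes it. */
int toy_lower_driver_completes(void)
{
    ToyIrp irp;
    toy_init(&irp);
    toy_set_completion(&irp, toy_on_complete);
    toy_call_driver(&irp);
    toy_complete(&irp);
    return irp.routines_run;
}
```

In this model `toy_creator_completes_directly()` runs no completion routine at all, while `toy_lower_driver_completes()` runs exactly one, which matches the behaviour described in the reply above.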

xxxxx@deva.co.uk wrote:

> However, I don’t understand why calling the irp complete function doesn’t result in the completion routine running if the IoCallDriver fails.

There is still a pretty serious problem in your code. You should never,
ever complete an IRP that you created. What IoCompleteRequest does is
pass the request back up to the next driver in the stack. For requests
you create, there IS no next driver. You run off the end of the IRP
stack, and disaster ensues. A completion routine is called when the
next driver DOWN completes the request.

When you create an IRP, and the IRP comes back to you (either through an
IoCallDriver error or through a completion routine), its life is over.
For an IoCallDriver error, you just need to delete the IRP. In a
completion routine, you delete the IRP and return
STATUS_MORE_PROCESSING_REQUIRED so that no one else tries to touch it.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.