System hang - at wit's end

Hi,

I’m having absolutely no success tracking down a system hang, nor can I
explain the behaviour. I’m wondering if it comes down to an assumption
I’ve made about the WDF framework?

It’s a KMDF driver and the hang is in an IOCTL callback event handler.
The callback first acquires and releases a mutex (though I can’t see
that this is important), then clears and then waits on a notification
event. The wait is Executive, KernelMode, Alertable=FALSE, Timeout=NULL
which, AFAIK is safe for IRQL<=DISPATCH_LEVEL.

The wait on the notification event appears to hang the entire system -
mouse and all - although I can still break in with WinDbg.

The notification event is set in a system thread
(PsCreateSystemThread), ultimately signalled from an InterruptDPC
callback. FWIW this thread also happens to briefly acquire the same
mutex as the IOCTL callback - but since there’s never two resources held
at once, there should be no deadlock.

No other threads wait on this notification event.

I’d use WinDbg but, regardless of which symbol pack I download and
install, it insists that the ^%!$# kernel &^@#$ symbols are &^%@#$
wrong. Therefore I can’t use any of the windbg extensions that might
shed some light on this problem.

Anyone got any ideas of where to go from here?

Regards,


Mark McDougall, Engineer
Virtual Logic Pty Ltd
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255 Fax: +612-9599-3266

Mark McDougall wrote:

> The wait on the notification event appears to hang the entire system -
> mouse and all - although I can still break in with WinDbg.

Oh yeah, the notification event object is declared as a static global in
the module in which it’s used. No, it’s not my code. As far as I know,
that should be non-pageable storage.

Regards,


Mark McDougall, Engineer

> The wait is Executive, KernelMode, Alertable=FALSE, Timeout=NULL
> which, AFAIK is safe for IRQL<=DISPATCH_LEVEL.

This is wrong…

If you want to call KeWait…() at DISPATCH_LEVEL without a BSOD, then *Timeout must be zero, i.e. the thread does not enter the wait state even if the event is unsignalled. If you pass a NULL pointer, the wait is not satisfied until the event is signalled, so your code is bound to crash if it runs at DISPATCH_LEVEL…

> I’d use WinDbg but, regardless of which symbol pack I download and
> install, it insists that the kernel symbols are wrong.

How can that possibly happen??? Are you sure that you specify the right symbol path in WinDbg???

Concerning the rest, it is hard to say anything without actually seeing your code - there is a good chance that it is just buggy…

Anton Bassov

> It’s a KMDF driver and the hang is in an IOCTL callback event handler.
> The callback first acquires and releases a mutex (though I can’t see
> that this is important), then clears and then waits on a notification
> event. The wait is Executive, KernelMode, Alertable=FALSE, Timeout=NULL
> which, AFAIK is safe for IRQL<=DISPATCH_LEVEL.

You should not wait in a dispatch routine for a long time. Pend the IRP instead.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

xxxxx@hotmail.com wrote:

> If you want to call KeWait…() at DISPATCH_LEVEL without BSOD, then
> *Timeout must be zero, i.e. thread does not enter the waiting state
> even if event is in unsignalled state. If you specify NULL pointer,
> wait is not satisfied until the event is signalled, so that your code
> is bound to crash if you run it at DISPATCH_LEVEL…

Hmmm… that’s not how I read the documentation.
The doco states “if Timeout <> 0” rather than “*Timeout”…
FWIW this is code that has been running for many, many months in a
non-KMDF driver…
I guess it’s something I’ll have to follow up…

> How can that possibly happen??? Are you sure that you specify the
> right symbol path in WinDbg???

Yes, I tried many times, with different packs and specifying each path
singly, then all together. Yes, it’s very, very frustrating.

> Concerning the rest, it is hard to say anything without actually
> seeing your code - there is a good chance that it is just buggy…

Well part of the problem is that I’m trying to integrate existing code
from a non-WDF driver (not written by me) into a KMDF driver based on
the Toaster Bus Driver. The existing code has been around for quite a
while. My philosophy has been - if it ain’t broke, don’t fix it.
Naturally I’m not treating it like a complete black box - I made every
attempt to ensure that it would be compatible with the new framework -
only I missed a bit…

It’s amazing how things work - I’ve been hunting this problem for a few
weeks now (not quite full-time) without any success. Then as a last
resort I post to ntdev, only to suddenly get an idea about (quite
literally) 3 mins later - sitting on the crapper no less - what the
problem might be.

And I was right!

The problem was that I was using
attributes.SynchronizationScope = WdfSynchronizationScopeDevice;
which meant that the DPC callback was synchronised internally by the
framework with the DeviceIoctl callback. Since the Ioctl waited on an
event ultimately signalled by the DPC callback, things ground to a halt
quickly.
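
The dependency described above can be sketched with a toy model in plain C. This is not KMDF code and makes no WDK calls: the struct, the function names and the single "device lock" flag are all hypothetical stand-ins, and the assumption is simply that, under device scope, the framework holds one lock around both the IOCTL callback and the DPC callback.

```c
#include <stdbool.h>

/* Toy single-threaded model, not KMDF code. All names are hypothetical. */
typedef struct {
    bool device_lock_held;  /* stands in for the framework's device-scope lock */
    bool event_signalled;   /* stands in for the notification event */
} model_state;

/* DPC callback: under device scope it must get the lock before it may run. */
static bool dpc_try_run(model_state *s, bool device_scope)
{
    if (device_scope && s->device_lock_held)
        return false;           /* blocked behind the IOCTL callback */
    s->event_signalled = true;  /* in the real driver, via the system thread */
    return true;
}

/* Returns true if the IOCTL callback's wait is ever satisfied. */
static bool ioctl_wait_completes(bool device_scope)
{
    model_state s = { false, false };

    /* The framework enters the IOCTL callback, taking the lock if configured. */
    s.device_lock_held = device_scope;
    s.event_signalled = false;  /* the callback clears the event, then waits */

    /* The interrupt fires while the callback is still waiting. */
    (void)dpc_try_run(&s, device_scope);

    return s.event_signalled;
}
```

In this model ioctl_wait_completes(true) never sees the event signalled, because the only code that can signal it is stuck behind the lock the waiter holds; with the scope switched off the DPC runs and the wait is satisfied.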

Another trap for young players.

Regards,


Mark McDougall, Engineer

Maxim S. Shatskih wrote:

> You should not wait in dispatch routine for a long time. Pend the IRP
> instead.

IIUC the WDF framework can call the DeviceIoctl callback function at
IRQL=DISPATCH_LEVEL. This is where the code waits for an event to do
some synchronous I/O. There’s no ‘IRPs’ or ‘pending’ in this area of the
framework.

Regards,


Mark McDougall, Engineer

Assuming that your debugger has access to the internet, try .symfix. This
will set your symbol path to the public symbol server and end that particular
problem.

As others have pointed out you cannot wait at DISPATCH_LEVEL - just make
sure you are not at that level by testing it in your code. That is probably
not your problem as you say this code has been running for quite a while
(although ‘the code’ has been modified from WDM to KMDF so you of course
inserted new bugs into the old running code.)

Fixing the symbols ought to give you a clue. Then the usual admonitions
apply: run under Driver Verifier; build with PREfast and fix everything that it
legitimately complains about; consider a checked kernel.


Mark McDougall wrote:

> IIUC the WDF framework can call the DeviceIoctl callback
> function at IRQL=DISPATCH_LEVEL. This is where the code
> waits for an event to do some synchronous I/O. There’s no
> ‘IRPs’ or ‘pending’ in this area of the framework.

Yes, and if you’re at DISPATCH_LEVEL, your code cannot “wait” or “do some synchronous I/O”. If you want to do synchronous I/O, you’ll need to queue a work item. When Maxim said “pend the IRP”, in WDF-speak that would mean “put the request in a manual queue and complete it later”.

> IIUC the WDF framework can call the DeviceIoctl callback function at
> IRQL=DISPATCH_LEVEL. This is where the code waits for an event to do
> some synchronous I/O. There’s no ‘IRPs’ or ‘pending’ in this area of the
> framework.

First, you cannot wait at DISPATCH_LEVEL. This is a law. You can only “wait
with a zero timeout”, which merely tests the event object’s state and is not a wait at all.
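
The difference between a NULL Timeout pointer and a pointer to a zero timeout can be captured in a small stand-alone model. This is only a sketch of the rule as stated in this thread, not WDK code: the enum, the function names and the use of long long in place of LARGE_INTEGER are assumptions made for illustration, and only IRQLs up to DISPATCH_LEVEL are modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real definitions live in the WDK headers. */
enum model_irql {
    MODEL_PASSIVE_LEVEL  = 0,
    MODEL_APC_LEVEL      = 1,
    MODEL_DISPATCH_LEVEL = 2
};

/* True when a wait call with this timeout can put the thread to sleep. */
static bool wait_may_block(const long long *timeout)
{
    /* NULL means "wait forever"; a pointer to 0 means "just test the state". */
    return timeout == NULL || *timeout != 0;
}

/* True when the call is allowed at the given IRQL under the rule above. */
static bool wait_is_legal(enum model_irql irql, const long long *timeout)
{
    if (irql < MODEL_DISPATCH_LEVEL)
        return true;                  /* blocking is allowed below DISPATCH_LEVEL */
    return !wait_may_block(timeout);  /* at DISPATCH_LEVEL only a poll is allowed */
}
```

With a variable zero = 0, wait_is_legal(MODEL_DISPATCH_LEVEL, &zero) is true, while passing NULL or any non-zero timeout at the same level is false. That is the whole distinction between "Timeout = NULL" and "*Timeout = 0".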

So, if you really need to wait, use some other WDF callback like
InCallerContext which is called at PASSIVE_LEVEL.

Second, waiting in the dispatch path is a bad idea. This is not a law, but a good
recommendation. The reason is that waiting in the dispatch path makes overlapped I/O
impossible on your driver.

Pending the IRP is the correct way. KMDF has the notion of queues, which are the
“containers where the IRPs are pended”. Use a KMDF queue.


Maxim Shatskih, Windows DDK MVP

Maxim wrote:

> Second, waiting in dispatch path is a bad idea. This is not a law,
> but a good recommendation. The reason is that waiting in dispatch
> path makes overlapped IO impossible on your driver.

Also, don’t forget that InCallerContext() is not really the dispatch path – I believe the original IRP has already been pended by this point.

If you really want to wait in the dispatch path (and you shouldn’t), you will need EvtDeviceWdmIrpPreprocess().

Mark, if you want more info on a NULL vs 0 length wait, please read http://blogs.msdn.com/doronh/archive/2006/08/25/724741.aspx

Both EvtDeviceWdmIrpPreprocess and EvtIoInCallerContext are workarounds (that should not be used) for the same problem: your EvtIoDeviceControl routine is being called at DISPATCH_LEVEL. One of two things is causing this:

  1. The sender of the I/O is a kernel component sending the request at DISPATCH_LEVEL. In this case, both of the workaround callbacks will also be called at DISPATCH_LEVEL, so they are no help here.

  2. You configured the framework to have some type of locking on the queue or device. To provide this type of synchronization, KMDF will call your I/O callback at DISPATCH_LEVEL (so that it can synchronize against other callbacks, timers, etc). If this is the case, removing the KMDF automatic locking will drop the IRQL back to the caller’s IRQL (most likely PASSIVE_LEVEL) and you can then do your infinite wait (but as others have noted, it is better to put the request into a manual queue or mark the request as cancelable and return).

I would guess #2 is what is going on, and you should consider not using the KMDF locking functionality so that you get back to PASSIVE_LEVEL. The other two callbacks (preprocess, in-caller-context) are not meant to be used as a way to get a different IRQL.
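
The two cases above can be summarised as a small decision function. This is a stand-alone sketch rather than framework code: the enums and the function name are hypothetical, and it models only what is described in this message (automatic locking raises the callback to DISPATCH_LEVEL, no locking leaves it at the caller's IRQL). Other settings that may influence the IRQL, such as the execution level, are deliberately left out.

```c
/* Hypothetical stand-ins for illustration; not the WDK definitions. */
enum sketch_irql {
    SKETCH_PASSIVE_LEVEL  = 0,
    SKETCH_APC_LEVEL      = 1,
    SKETCH_DISPATCH_LEVEL = 2
};

enum sketch_sync_scope {
    SKETCH_SCOPE_NONE,
    SKETCH_SCOPE_QUEUE,
    SKETCH_SCOPE_DEVICE
};

/* IRQL at which the I/O callback runs, given the sender's IRQL and the
 * automatic locking configured on the queue or device. */
static enum sketch_irql io_callback_irql(enum sketch_irql caller_irql,
                                         enum sketch_sync_scope scope)
{
    /* Case 2: automatic locking raises the callback to DISPATCH_LEVEL. */
    if (scope != SKETCH_SCOPE_NONE && caller_irql < SKETCH_DISPATCH_LEVEL)
        return SKETCH_DISPATCH_LEVEL;

    /* Case 1, or no locking: the callback inherits the sender's IRQL. */
    return caller_irql;
}
```

A request sent at PASSIVE_LEVEL to a device-scoped queue comes out at DISPATCH_LEVEL in this model, and at PASSIVE_LEVEL once the scope is set to none. A request that already arrives at DISPATCH_LEVEL stays there whatever the scope, which is why neither workaround callback helps in case 1.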

d

Mark McDougall wrote:

> And I was right!
>
> The problem was that I was using
> attributes.SynchronizationScope = WdfSynchronizationScopeDevice;
> which meant that the DPC callback was synchronised internally by the
> framework with the DeviceIoctl callback. Since the Ioctl waited on an
> event ultimately signalled by the DPC callback, things ground to a halt
> quickly.
>
> Another trap for young players.

Yes, this is a very easy and very painful mistake to make. I know this,
because I made the exact same mistake on my first KMDF driver. When
reading the description, it seems like “gosh, more synchronization
should be better, right?” It’s only after you realize what this really
does behind the scenes that the danger becomes apparent.

This issue is discussed in detail in the upcoming WDF book from Microsoft.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Just use the symbol server. Unless you are using the most recent beta of
Longhorn, this should not be happening (even this may work now).

!sym noisy
.sympath srv*c:\symbols*http://msdl.microsoft.com/download/symbols
.reload -f -n
lml

The first turns on diagnostic output to help figure out what is wrong.

The second sets the symbol path, where ‘c:\symbols’ may be replaced
with any existing directory you wish to use as a local cache.

The third reloads all kernel symbols fully (not deferred, so you can
find out what the problem is immediately).

The last displays all modules with symbols loaded. There will be a
fair number of modules that do not have symbols loaded (probably all
third-party drivers, and a few from Microsoft), but if nt and hal, at a
minimum, show error messages or say that only exports have been
loaded, you do indeed have another problem.

mm


xxxxx@Microsoft.com wrote:

> Mark, if you want more info on a NULL vs 0 length wait, please read
> http://blogs.msdn.com/doronh/archive/2006/08/25/724741.aspx

Heh, after my last post yesterday I was searching for more information
and did indeed stumble across this very article. Thanks!

>   2. you configured the framework to have some type of locking on the
>     queue or device. to get this type of synchronization, KMDF will call
>     your io callback at dispatch (so that it synchronize against other
>     callbacks, timers, etc). If this is the case, removing the KMDF
>     automatic locking will drop the IRQL back to the caller’s IRQL (most
>     likely PASSIVE_LEVEL) and you can now do your infinite wait (but as
>     others have noted, it is better to put the request into a manual
>     queue or mark the request as cancelable and return).
>
> I would guess #2 is what is going on and you should consider not
> using the KMDF locking functionality to get back to passive. The
> other 2 callbacks (preprocess, in context) are not meant to be used
> as a way to get a different IRQL

Yes, I was using WdfSynchronizationScopeDevice, only because it was
left-over from the Toaster Bus example and I didn’t realise the effects
it had on the InterruptDPC for one. Actually I forgot all about it a few
days after I started on this driver and only had a revelation yesterday
after posting here as I mentioned.

I have removed the locking (WdfSynchronizationScopeNone) as there was no
real need for it in my driver. From what you and others have said, it
looks like it’ll solve all my problems. I’ll confirm that soon, but
early indications are that it’s working fine.

I understand that the current driver architecture isn’t exactly ideal,
but unfortunately that is also outside the scope of my charter atm. I’ve
inherited the driver from a customer only because they don’t have the
resources to port it to new hardware and enhanced functionality and time
is tight - at least with an operational driver the application writers
can get a start.

I’ll flag the issues and they can decide where to go from there.

Regards,


Mark McDougall, Engineer

Tim Roberts wrote:

> Yes, this is a very easy and very painful mistake to make. I know this,
> because I made the exact same mistake on my first KMDF driver. When
> reading the description, it seems like “gosh, more synchronization
> should be better, right?” It’s only after you realize what this really
> does behind the scenes that the danger becomes apparent.
>
> This issue is discussed in detail in the upcoming WDF book from Microsoft.

Yes, I only wish it was available in time for this project.

The next best thing is having Doron available to answer all your
questions!

Regards,


Mark McDougall, Engineer

Martin O’Brien wrote:

> Just use symbol server. Unless you are using the most recent beta of
> Longhorn, this should not be happening (even this may work now).
>
> !sym noisy
> .sympath srv*c:\symbols*http://msdl.microsoft.com/download/symbols
> .reload -f -n
> lml

Thanks, I’ll give it a shot.

And thanks to everyone else who chimed in on this issue!

Regards,


Mark McDougall, Engineer

no problem
