Debugging a System Hang

OK so, here is what I am guessing is going on.

1 - An IRP (for any device?) is getting processed, and the Dispatch Manager
Database spin lock is getting acquired.
2 - My ISR is firing and calling IoCompleteRequest. This in turn attempts to
acquire the Dispatch Manager Database spin lock and BAMM! I’m toast.

That leaves me with two questions (for now):
1 - Why did the !PCR command display an IRQL of 0?
2 - Is there only one Dispatch Manager Database, or is there effectively one
per driver?

Joe

my guess was that the dispatcher lock was already held by the thread
your ISR interrupted. Attempting to complete the I/O request would
result in an attempt to acquire this lock again - since it’s already
held you deadlock.

at interrupt level your driver is not really allowed to interact with
the system very much. Most of the locking mechanisms used by the kernel
APIs cannot be used at > DISPATCH_LEVEL because of the risk of deadlock.
At ISR level you should limit yourself to poking the hardware so it
continues doing useful work & posting a DPC to process anything which
has completed.

you should be queueing a DPC routine (IoRequestDpc or
KeInitializeDpc/KeInsertQueueDpc) and holding off the completion of the
IRP until the DPC runs. If there are multiple work items which should
be completed you would push them all onto a list in your device
extension and have the DPC pop the entire list (make sure you
synchronize properly with your ISR) and complete everything.

-p
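
The deferral pattern described above can be sketched as a small user-mode model in C. This is only an illustration under stated assumptions: Request, DeviceExt, model_isr and model_dpc are hypothetical names, not WDM types or routines, and the "synchronize with the ISR" step is reduced to a flag. In a real driver the ISR would queue the DPC with IoRequestDpc or KeInsertQueueDpc, and the DPC would detach the list under KeSynchronizeExecution (or the interrupt spin lock) before calling IoCompleteRequest.

```c
#include <stddef.h>

/* Hypothetical stand-in for an IRP; NOT the real WDM structure. */
typedef struct Request {
    int id;
    int completed;            /* set only by the deferred (DPC) routine */
    struct Request *next;
} Request;

/* Hypothetical stand-in for the device extension. */
typedef struct DeviceExt {
    Request *done_list;       /* requests the hardware has finished with */
    int dpc_queued;           /* models "a DPC has been requested" */
    int isr_locked_out;       /* models synchronizing with the ISR */
} DeviceExt;

/* Models the ISR: note the finished request and request a DPC.
 * It deliberately does NOT complete the request itself. */
void model_isr(DeviceExt *ext, Request *req)
{
    req->next = ext->done_list;   /* push onto the list (LIFO order) */
    ext->done_list = req;
    ext->dpc_queued = 1;
}

/* Models the DPC: detach the whole list while the ISR is locked out,
 * then complete every request on it. Returns the number completed. */
int model_dpc(DeviceExt *ext)
{
    Request *list;
    int count = 0;

    ext->isr_locked_out = 1;      /* assumption: stands in for real ISR sync */
    list = ext->done_list;
    ext->done_list = NULL;
    ext->dpc_queued = 0;
    ext->isr_locked_out = 0;

    while (list != NULL) {
        Request *next = list->next;
        list->next = NULL;
        list->completed = 1;      /* a real driver would complete the IRP here */
        count++;
        list = next;
    }
    return count;
}
```

The point of detaching the whole list in one step is that the time spent locked out of the ISR stays short, and all of the completion work happens at the lower level where the kernel's locks can be taken safely.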

-----Original Message-----
From: Joe D [mailto:xxxxx@voicenet.com]
Sent: Wednesday, June 26, 2002 12:55 PM
To: NT Developers Interest List
Subject: [ntdev] Re: Debugging a System Hang

Peter W,

Why yes I am. I haven’t quite figured out why that is bad yet. But
I am sure you are about to tell me? I will see if I can figure it out
on my own, but I certainly won’t refuse your advice.

Thanks,
Joe

Peter Wieland wrote in message
news:xxxxx@ntdev…

you’re not calling IoCompleteRequest inside your ISR are you?

-p

-----Original Message-----
From: Joe D [mailto:xxxxx@voicenet.com]
Sent: Wednesday, June 26, 2002 10:38 AM
To: NT Developers Interest List
Subject: [ntdev] Re: Debugging a System Hang

Peter,
Thanks for the info and links. I don’t mean to be asking you
specifically to answer my questions, so any input from anyone else is
certainly welcome.

I am somewhat able to reproduce the problem. It appears to be more
likely with a heavier load on the OS with regard to File IO, but it is
still somewhat random. I will try to install the checked build of the
OS to see what happens.

I clicked on the frame which appears to point to my driver, and all
I got were "???"s. Does that mean that the code is swapped out? That
indeed would be a problem.

What exactly is the dispatch manager database? It sounds like
something used by the IO manager to process IRPs and dispatch them to
device drivers. It sounds like it would also be used in calling StartIo
routines? When does this spinlock get acquired and released?

I also did a !PCR on this machine, and it says that the IRQL is
zero. Since I am spinning attempting to acquire a spin lock, would that
necessarily indicate that this thread/process already has the spin lock
acquired, or could I be in a deadly embrace? Is there any way to
determine what other process has the dispatch manager database spinlock
acquired?

Also, given the stack trace

    ChildEBP RetAddr  Args to Child
    f52e16a0 80119594 81df7128 820707c0 f52e16dc hal!KeAcquireSpinLockRaiseToSynch+0x34
    f52e16b0 80112b35 81df7128 ff443688 00000000 nt!KeInsertQueueApc+0x12
    f52e16dc eb0e2130 80b1200c 00000002 80b12000 nt!IofCompleteRequest+0x201
    WARNING: Stack unwind information not available. Following frames may be wrong.
    80b12034 0e1fb000 fdba7000 00000000 00000000 MyDriver+0x2130
    0690b000 00000000 00000000 00000000 00000000 +0xe1fb000

can I tell whether the address of the IRP that is being completed is
80b1200c, or do I have to do more digging? When I do a !irp 80b1200c, I
get a message that says the IRP signature does not match.

I am still in the process of reading the IOCompletion article. My
device driver has no completion routines involved. It is a relatively
simple driver that talks to a piece of custom hardware. Applications
talk to the device via custom IOCTLs. It basically simply sends
messages back and forth, or waits for a specific message to come back
from the device.

As always… Thanks,
Joe D

Peter Viscarola wrote in message news:xxxxx@ntdev…
>
> “Joe D” wrote in message news:xxxxx@ntdev…
> >
> > Peter,
> >
> > Thanks for the info. We are not using ERESOURCES in our driver. We
> > are running the MP version of NT, but we only have a single
> > processor installed.
> >
>
> Hmmm… That makes the problem even stranger, then. It’s not like
> there’s another processor that could be holding the dispatcher
> database lock, right? Weird…
>
> > What would be the benefit of
> > running the checked kernel and HAL?
> >
>
> Oh, one of my FAVORITE questions:
>
> If you’re not testing your driver with the checked kernel and HAL,
> you’re not testing your driver properly.
>
> The checked kernel and HAL have lots of “cross-checking” built into
> them. This ranges from parameter validation for various DDK functions
> to verification of internal state and structures. This differs from
> the free build of the system, which forgoes much of this checking,
> given that the O/S architecture is basically that kernel mode
> components implicitly “trust” each other. Testing on the checked
> build is extremely valuable to driver writers because of the checks
> it performs.
>
> This is all described in the (XP and later) DDK, in the DDK docs, see
> the section “Driver Development Tools”… “The Checked Build Of
> Windows” (just type The Checked Build of Windows (no quotes) into the
> box at the index tab).
>
> You don’t have to install the full checked build to get these
> benefits. You can install JUST the checked kernel and HAL. See the
> DDK or http://www.osr.com/ntinsider/2001/checking/checked.htm for
> instructions.
>
> > How do you know that the IoCompleteRequest is running in an
> > arbitrary thread? Is it because the Stack Unwind Information was
> > not available, or is it just due to the general nature of drivers
> > (completing IRPs in DPCs)?
> >
>
> Nah. It’s just a gift I was given. Ooops, sorry. No, actually, I can
> tell that somebody’s completing an I/O request asynchronously (calling
> IoCompleteRequest), because of the call to KeInsertQueueApc – This
> wouldn’t be done if the request was being completed synchronously (in
> thread context). See
> http://www.osr.com/ntinsider/1997/iocomp/iocomp.htm (not an article
> for beginners or the faint of heart, and sort of aimed at FS and FS
> Filter Driver writers).
>
> Can you look at the stack frame that’s in your driver (in WinDbg’s
> stack window, select that stack location) and see what your driver is
> doing? The IRP will definitely still be around at this point (it’s
> not returned until after the APC has run)…
>
> Peter
> OSR


You are currently subscribed to ntdev as: xxxxx@microsoft.com To
unsubscribe send a blank email to %%email.unsub%%



the Dispatcher is NT’s equivalent of a process/thread scheduler.

-p

-----Original Message-----
From: Joe D [mailto:xxxxx@voicenet.com]
Sent: Wednesday, June 26, 2002 1:33 PM
To: NT Developers Interest List
Subject: [ntdev] Re: Debugging a System Hang

Ok, well, as I read the docs on IoCompleteRequest, it says that it must be
called at IRQL <= DISPATCH_LEVEL. This is code that I didn’t (recently)
add, but I didn’t catch it either.

I do see how this could cause significant problems.

Was my guess on what the Dispatch Manager Database does correct?

I will correct it immediately. :)

Thanks everybody,
Joe


> What exactly is the dispatch manager database? It sounds like
> something used by the IO manager to process IRPs and dispatch them to
> device drivers.

No. It is the scheduler, which shares the CPU among threads.
It is behind the Kexxx functions (KeSetEvent and such), not Ioxxx.

> It sounds like it would also be used in calling StartIo routines?
> When does this spinlock get acquired and released?

It is the system-wide cancel spinlock.

> Since I am spinning attempting to acquire a spin lock, would that

Maybe you have called KeSetEvent and such from the ISR?

Max

> make sense on a single processor. If a single processor grabs a
> real live spinlock, doesn’t it lock out ALL other work, which might
> explain the hang…?

No. Even if you run the MP kernel on a UP machine, spinlocks are
effectively the same as KeRaiseIrql(DISPATCH_LEVEL), since one
DISPATCH_LEVEL function cannot interrupt another: thread switching is
suspended, and DPCs are executed serially (on the same CPU).

Interrupts are another issue.

> So, using an MP kernel on a single processor is not just a bad idea,

Why not? It works.

Max
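
The preemption rule in the reply above can be written down as a tiny C model. This is a sketch, not kernel code: the MODEL_ names and model_can_preempt are hypothetical, the first three level values mirror PASSIVE_LEVEL, APC_LEVEL and DISPATCH_LEVEL, and the device level of 10 is an arbitrary illustrative value for some device interrupt.

```c
#include <stdbool.h>

/* Model IRQL values. MODEL_DEVICE_LEVEL is illustrative only; a real
 * device IRQL depends on the platform and the interrupt. */
enum {
    MODEL_PASSIVE_LEVEL  = 0,
    MODEL_APC_LEVEL      = 1,
    MODEL_DISPATCH_LEVEL = 2,
    MODEL_DEVICE_LEVEL   = 10
};

/* On one CPU, work at incoming_irql can interrupt work running at
 * current_irql only if it is strictly higher. */
bool model_can_preempt(int current_irql, int incoming_irql)
{
    return incoming_irql > current_irql;
}
```

Under this model, code holding a spinlock at DISPATCH_LEVEL cannot be interrupted by a DPC or a thread switch, but it can still be interrupted by a device ISR, which is why "interrupts are another issue".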

> IoCompleteRequest() need to acquire a spin lock behind the scenes

Also, IoCompleteRequest (or completion routines called by it) usually
calls KeSetEvent.
This is a disaster, since the ISR could interrupt the dispatcher code
itself, and deadlock on KiDispatcherLock in KeSetEvent.

Max
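
The deadlock described above, and the reason it shows up only intermittently, can be illustrated with a small user-mode C model. Everything here is an assumption made for illustration: ModelLock, model_try_acquire, model_release and isr_would_deadlock are hypothetical names, and a try-acquire stands in for the real spin so that the model reports the hang instead of actually hanging.

```c
#include <stdbool.h>

/* Hypothetical model of a non-recursive spin lock on a single CPU. */
typedef struct {
    bool held;
} ModelLock;

/* Returns false where a real spin lock would spin forever. */
bool model_try_acquire(ModelLock *lock)
{
    if (lock->held) {
        return false;
    }
    lock->held = true;
    return true;
}

void model_release(ModelLock *lock)
{
    lock->held = false;
}

/* Models an ISR that (incorrectly) calls something needing the
 * dispatcher lock. Returns true if the ISR would spin forever because
 * the code it interrupted already holds that lock. */
bool isr_would_deadlock(ModelLock *dispatcher_lock)
{
    if (!model_try_acquire(dispatcher_lock)) {
        return true;    /* the holder cannot run again until the ISR returns */
    }
    model_release(dispatcher_lock);
    return false;
}
```

In this model the ISR gets away with the call whenever the lock happens to be free, and hangs only when the interrupt lands while the lock is held. That is consistent with a hang that is hard to reproduce and more likely under heavier load.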

> I used to have a list of the TOP Things an NT Device Driver Writer
> Should Never Do.

Never lower an IRQL that you did not raise yourself.
Never call KeSetEvent or IoCompleteRequest from the ISR. Or - more
generally - always obey the documented IRQL rules.
Never violate the IoMarkIrpPending/STATUS_PENDING rule.

…and so on.

Max

Hi All,

I just wanted to thank everyone for helping with this problem.
Especially Peter W and Peter V. I haven’t really been able to reliably
reproduce the problem in the first place, but so far no hangs with the
requisite changes.

Prior to a month or so ago, the driver was not calling IoCompleteRequest
in the ISR. New code was added by someone else. I caught some similar
mistakes, but this one got by me.

I am taking Peter V’s suggestion of running the checked build of NT
against my driver, but I am running into problems doing that (subject of a
message in a different thread if I can’t figure it out on my own). I will
insist that any future changes to the driver get some test time on a checked
build of NT.

Anyways, thanks again to all who replied.

Joe

you should also make sure that future changes to the driver are tested
under the driver verifier. The verifier can catch many things that even
a checked build won’t detect.

-p


P,
I didn’t think Driver Verifier worked on NT 4.0. If it does… I will.
Thanks,
Joe


good point


Thanks Max, but that wasn’t exactly what I was looking for. After digging
through my pile of reference materials, I found what I was looking for. It
is actually a KB article titled “INFO: Tips for Windows NT Driver
Developers – Things to Avoid (Q186775)”. Very helpful.

Joe D


Update: No new server hangs. I should be more careful when reviewing new
code.

Thanks again to everyone on the list.
Joe D.