hang on KeLowerIrql from HIGH_LEVEL

Can anyone please suggest the sorts of things I might look for when I
get a hang on a KeLowerIrql lowering the irql from HIGH_LEVEL to
DISPATCH_LEVEL?

The code in question is this:

KeRaiseIrql(HIGH_LEVEL, &old_irql);
DoWorkThatNeedsToBeDoneAtHighLevel(…);
KdPrint((__DRIVER_NAME " B\n"));
KeLowerIrql(old_irql);
KdPrint((__DRIVER_NAME " C\n"));

“B” gets printed but “C” doesn’t. KdPrint in this case is a macro that
outputs debugging info via io ports where the KdPrint messages are
captured in Xen log files, so the “<=DIRQL” rules don’t apply, and
removing the first KdPrint still doesn’t allow it to progress to the
second.

I’m guessing that something DoWorkThatNeedsToBeDoneAtHighLevel(…) is
doing is upsetting the system somewhere so that either KeLowerIrql
breaks, or the act of enabling interrupts again is in turn breaking the
system, but any suggestions as to how I might narrow my search would be
much appreciated. DoWorkThatNeedsToBeDoneAtHighLevel(…) is a series of
functions, too much code to post here. Unfortunately I have had little
luck with windbg when dealing with code that runs at HIGH_LEVEL.

Thanks

James

> Can anyone please suggest the sorts of things I might look for when I
> get a hang on a KeLowerIrql lowering the irql from HIGH_LEVEL to
> DISPATCH_LEVEL?

I haven’t found it yet, but I forgot I hadn’t enabled the verifier since
the last rebuild, and it has found a deadlock already!

James

Are you sure your original IRQL is elevated to DISPATCH_LEVEL??? If it is not, then the most likely scenario is that DoWorkThatNeedsToBeDoneAtHighLevel(…) queues a DPC, and either the DPC routine or DoWorkThatNeedsToBeDoneAtHighLevel(…) itself is buggy. For example, DoWorkThatNeedsToBeDoneAtHighLevel(…) may acquire a spinlock that the DPC routine also acquires, and forget to release it…

When you restore IRQL to a level below DISPATCH_LEVEL, the software interrupt 0x41 that was requested by KeInsertQueueDpc() fires immediately, so your DPC routine gets executed straight away. If the spinlock is still held by that CPU and your DPC routine tries to acquire it, you are bound to deadlock…
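
To make the timing concrete, here is a minimal user-space model in C of the behaviour described above: a DPC queued while IRQL is at or above DISPATCH_LEVEL stays pending, and is delivered as soon as IRQL drops below DISPATCH_LEVEL. It is only a sketch and says nothing about how the kernel is actually implemented. `cpu_model`, `model_queue_dpc`, `model_lower_irql` and `count_dpc` are made-up names, and the IRQL numbers are the x86 values.

```c
#include <stddef.h>

/* x86 IRQL values. This is a toy model, not kernel code. */
enum { MODEL_PASSIVE_LEVEL = 0, MODEL_DISPATCH_LEVEL = 2, MODEL_HIGH_LEVEL = 31 };

typedef void (*model_dpc_routine)(void *context);

typedef struct {
    int               irql;     /* current IRQL of this (single) CPU  */
    model_dpc_routine pending;  /* queued DPC routine, or NULL        */
    void             *context;  /* argument passed to the DPC routine */
} cpu_model;

/* Stand-in for KeInsertQueueDpc(): only records the request. */
void model_queue_dpc(cpu_model *cpu, model_dpc_routine routine, void *context)
{
    cpu->pending = routine;
    cpu->context = context;
}

/* Stand-in for KeLowerIrql(): a pending DPC is delivered only once the
 * new IRQL is below DISPATCH_LEVEL. */
void model_lower_irql(cpu_model *cpu, int new_irql)
{
    cpu->irql = new_irql;
    if (new_irql < MODEL_DISPATCH_LEVEL && cpu->pending != NULL) {
        model_dpc_routine routine = cpu->pending;
        cpu->pending = NULL;
        cpu->irql = MODEL_DISPATCH_LEVEL;  /* DPC routines run at DISPATCH_LEVEL */
        routine(cpu->context);
        cpu->irql = new_irql;              /* then finish lowering */
    }
}

/* Example DPC routine: counts how many times it has run. */
void count_dpc(void *context)
{
    ++*(int *)context;
}
```

In this model, lowering from HIGH_LEVEL to DISPATCH_LEVEL leaves a queued DPC pending; only a later drop to PASSIVE_LEVEL runs it.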

Anton Bassov

James, I used to hang at HIGH_LEVEL too. I guessed that it was because I was masking IPI for too long. Early on, record the value of DIRQL and then raise to that when you’re going to try to migrate.

IPI_LEVEL - 1 works.
Mark Roddy

On Sun, Feb 8, 2009 at 10:48 AM, wrote:

> James, I used to hang at HIGH_LEVEL too. I guessed that it was because I
> was masking IPI for too long. Early on, record the value of DIRQL and then
> raise to that when you’re going to try to migrate.

> -----Original Message-----

From: xxxxx@lists.osr.com [mailto:bounce-353920-
xxxxx@lists.osr.com] On Behalf Of xxxxx@hotmail.com
Sent: Monday, 9 February 2009 00:00
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] hang on KeLowerIrql from HIGH_LEVEL

> Are you sure your original IRQL is elevated to DISPATCH_LEVEL ??? If it is
> not, then the most likely scenario is that
> DoWorkThatNeedsToBeDoneAtHighLevel(…) queues a DPC, and DPC routine
> (or DoWorkThatNeedsToBeDoneAtHighLevel(…)) is buggy. For example,
> DoWorkThatNeedsToBeDoneAtHighLevel(…) may obtain a spinlock that is
> supposed to be acquired by DPC routine and forget to release it…

The code I posted is queued from a DPC, and does not queue any DPCs
itself. All it does is make some hypercalls to Xen and re-initialize
some memory structures (and no, it doesn’t allocate any memory :)

> When you restore IRQL to the level below DPC one, software interrupt 0x41
> that got requested by KeInsertQueueDpc() fires immediately, so that your
> DPC routine gets executed straight away. If spinlock is being held by a
> given CPU and your DPC routine tries to acquire it, you are just bound to
> deadlock…

The code does not acquire any spinlocks. The whole idea of running at
HIGH_LEVEL (and getting all other CPU’s to spin at HIGH_LEVEL) is that
no spinlocks are necessary, and I am guaranteed to have no
interruptions.

This code used to work, but I have done a fairly major refactoring and
have broken something.

The verifier error I alluded to earlier turned out to be nothing to do
with this, it was just a bug in a PreProcessIrp routine.

James

> James, I used to hang at HIGH_LEVEL too. I guessed that it was because I
> was masking IPI for too long. Early on, record the value of DIRQL and
> then raise to that when you’re going to try to migrate.

I wrote a ‘SyncAtHighLevel’ routine that also gets used to patch the
kernel. Raising to DIRQL isn’t going to be sufficient for that - I need
to go to HIGH_LEVEL. If I needed to patch the kernel on 2003 which
doesn’t use the TPR register at all I would also need to issue a cli.

Before I updated, the suspend/migrate stuff was working fine, so I’m
pretty sure I just have a bug somewhere…

I’m wondering if DIRQL is all I need in this case though… I’m doing
the suspend/resume, so once the resume starts I can’t make any
hypercalls to Xen until the hypercall page is set up again, and then I
need to set up the event channels etc before the IRQ’s will be delivered
again. Outside of my code, everything that could possibly involve a
hypercall or anything would have to be done at DISPATCH_LEVEL, so maybe
DIRQL would be sufficient after all.

Thanks

James

> The whole idea of running at HIGH_LEVEL (and getting all other CPU’s to spin at HIGH_LEVEL)
> is that no spinlocks are necessary, and I am guaranteed to have no interruptions.

There is a logical fallacy here - disabling interrupts _on_a_given_CPU_ does not eliminate the need for a spinlock for MP synchronization, does it???

BTW, what do you mean by “getting all other CPU’s to spin at HIGH_LEVEL”??? This phrase in itself sounds pretty much like a description of some custom spinlock implementation whose functionality is identical to that of spin_lock_irqsave() under Linux. If you made any mistake in this part, you are bound to face exactly the same problems you would encounter when misusing system-provided spinlocks, and there is a good chance this is exactly what happens here…

Anton Bassov

> > The whole idea of running at HIGH_LEVEL (and getting all other CPU’s to
> > spin at HIGH_LEVEL) is that no spinlocks are necessary, and I am
> > guaranteed to have no interruptions.
>
> There is a logical fallacy here - disabling interrupts _on_a_given_cpu
> does not eliminate a need for a spinlock for MP synchronization, does
> it???
>
> BTW, what do you mean by “getting all other CPU’s to spin at
> HIGH_LEVEL”??? This phrase in itself sounds pretty much like a description
> of some custom spinlock implementation with the functionality that is
> identical to that of spinlock_irqsave() under Linux. If you made any
> mistake in this part, you are just bound to face exactly the same problems
> you would encounter if misusing system-provided spinlocks, and there is a
> good chance this is exactly what happens here…

For patching the kernel, I don’t just need to protect my code with a
spinlock, I need to protect everything. For making the xen suspend
hypercall I think I need to do the same thing, although steve suggested
that DIRQL might be sufficient as I only need to protect anything that
might call my code.

The algorithm is basically this:

Setup:
    set spin_flag = TRUE
    set nr_spinning = 0
    set nr_to_spin = number_of_cpus - 1
    raise IRQL to DISPATCH_LEVEL
    schedule a DPC on every CPU
    lower IRQL

DPC on CPU# > 0:
    raise IRQL to HIGH_LEVEL
    interlocked increment nr_spinning
    spin while spin_flag is TRUE
    lower IRQL

DPC on CPU# == 0:
    raise IRQL to HIGH_LEVEL
    spin while nr_spinning < nr_to_spin
    // now we know that no other CPU is going to touch anything
    do whatever work needs to be done at HIGH_LEVEL
    set spin_flag = FALSE
    spin while nr_spinning > 0
    lower IRQL

For testing, I’m running a UP kernel so I don’t have to worry about
other CPU’s or anything, just going to HIGH_LEVEL is sufficient, but the
algorithm above should explain what I’m talking about.

The above implementation is fairly generic and is called with a
‘callback’ and ‘context’ parameter.

James

When these virtual machines migrate, we need to lock the “migrate” thread to CPU0, then send all other CPUs into a spin at high IRQL. CPU0 then needs to be able to execute some critical code without the possibility of being interrupted, until the machine resumes on a different node.

The operation takes an amount of time that would be considered a long time to disable interrupts for any other ‘normal’ things that you would do in a kernel driver. So when I ran into the same lockups that James is experiencing, I made the assumption that it’s just too long to lock out interrupts > DIRQL. Figuring out what DIRQL happens to be in the context of our SCSIPort driver and raising to that particular level fixed the problem for me.

I’ve always assumed that code would make these drivers ‘WHQL challenged’ as well.

Well, as I said, there is a logical fallacy in your code. Look at the following line:

spin while nr_spinning > 0

I do see a line that increments nr_spinning, but I don’t see any line so far that decrements or clears it.
Therefore, CPU 0 is bound to spin in an infinite loop without any chance of breaking out of it - you have no chance of ever reaching the line where CPU 0 actually lowers IRQL…

What are you trying to achieve by designing such a convoluted solution??? Why don’t you just use KeAcquireSpinLockAtDpcLevel()??? BTW, here is a tip for you - KeAcquireSpinLockAtDpcLevel() will work just fine in this context, but KeAcquireSpinLock() will not. Have you got any idea why it works this way? (If you don’t, I will explain it to you, but let’s see how you understand it)…

Anton Bassov

Anton, KeAcquireSpinlockAtDpcLevel isn’t going to stop the thread running on CPU0 from fielding an interrupt.

> Well, as I said, there is a logical fallacy in your code. Look at the
> following line:
>
> spin while nr_spinning > 0
>
> I do see a line that increments nr_spinning, but I don’t see any line in
> so far that decrements or clears it.
> Therefore, CPU 0 is just bound to spin in an infinite loop without any
> chance of breaking out of it - you just have no chance to ever reach the
> line where CPU 0 actually lowers IRQL…

That’s a bug in my pseudo code. The bug is not there in the real code.
And I’m just testing on a single CPU so this isn’t the problem.
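
For reference, with the missing decrement put back, the rendezvous from the pseudocode might look roughly like the following user-space model in C11. This is only a sketch of the counting logic: it does not touch IRQL, it is not the real driver code, and every name in it (`highsync_model`, `other_cpu_leave`, and so on) is invented for illustration. The spin loops are written as non-blocking predicates so that each step can be checked on its own.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* User-space model of the rendezvous; IRQL is not modelled. */
typedef struct {
    atomic_bool spin_flag;    /* true while the other CPUs must keep spinning */
    atomic_int  nr_spinning;  /* number of other CPUs currently parked        */
    int         nr_to_spin;   /* number_of_cpus - 1                           */
} highsync_model;

void highsync_setup(highsync_model *s, int number_of_cpus)
{
    atomic_init(&s->spin_flag, true);
    atomic_init(&s->nr_spinning, 0);
    s->nr_to_spin = number_of_cpus - 1;
}

/* CPU# > 0: announce arrival, poll other_cpu_must_spin(), then leave. */
void other_cpu_arrive(highsync_model *s)
{
    atomic_fetch_add(&s->nr_spinning, 1);
}

bool other_cpu_must_spin(highsync_model *s)
{
    return atomic_load(&s->spin_flag);
}

/* The step the pseudocode left out: decrement on the way out. */
void other_cpu_leave(highsync_model *s)
{
    atomic_fetch_sub(&s->nr_spinning, 1);
}

/* CPU 0: wait until everyone is parked, do the work, release, wait for exit. */
bool cpu0_all_parked(highsync_model *s)
{
    return atomic_load(&s->nr_spinning) >= s->nr_to_spin;
}

void cpu0_release(highsync_model *s)
{
    atomic_store(&s->spin_flag, false);
}

bool cpu0_all_left(highsync_model *s)
{
    return atomic_load(&s->nr_spinning) == 0;
}
```

With number_of_cpus set to 1, both `cpu0_all_parked` and `cpu0_all_left` are true immediately, so on a UP system neither of CPU 0's spin loops ever waits.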

> What are you trying to achieve by designing so convoluted solution??? Why
> don’t you want just to use KeAcquireSpinlockAtDpcLevel() ??? BTW, here is
> a tip for you - KeAcquireSpinlockAtDpcLevel() will work just fine in this
> context, but KeAcquireSpinlock() will not. Have you got any suggestion why
> it works this way (if you don’t I will explain it to you, but let’s see
> how you understand it)…

You are kidding aren’t you? I’ve never tried it, but surely the verifier
would throw a fit if I tried to call KeAcquireSpinlockAtDpcLevel at >
DISPATCH_LEVEL. And if not the current version then maybe a future
version. Okay, it would probably work as it doesn’t touch IRQL, but it
is clearly a broken solution.

James

> Anton, KeAcquireSpinlockAtDpcLevel isn’t going to stop the thread running
> on CPU0 from fielding an interrupt.

I think Anton was suggesting that I use KeAcquireSpinLockAtDpcLevel at
HIGH_LEVEL instead of using my custom spin routine.
KeAcquireSpinLockAtDpcLevel would actually work as it doesn’t touch the
IRQL - it assumes we are already at DISPATCH_LEVEL, but if the verifier
checks IRQL then all is lost.

To use KeAcquireSpinlockAtDpcLevel I would need to make sure that CPU0
was guaranteed to take the spinlock first, and that would add even more
complexity to my solution. The spin routine is 2 lines, so I don’t
really see the problem.

James

> And I’m just testing on a single CPU so this isn’t the problem.

Think about it carefully, and you will realize that if you run the above code on UP system you are going to deadlock as well…

> I think Anton was suggesting that I use KeAcquireSpinlockAtDpcLevel at HIGH_LEVEL instead
> of using my custom spin routine.

Correct…

> KeAcquireSpinlockAtDpcLevel would actually work as it doesn’t touch the IRQL - it assumes
> we are already at DISPATCH_LEVEL,

Correct - in actuality, KeAcquireSpinLockAtDpcLevel() and KeReleaseSpinLockFromDpcLevel() are the Windows equivalents of spin_lock() and spin_unlock(), respectively, under Linux, i.e. bare-bones spinlock functions that don’t deal with anything apart from the lock variable itself…
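
To illustrate what “bare-bones” means here, below is a small user-space sketch in C11 of a spinlock that touches nothing but its own lock word. It is not the Windows or the Linux implementation, and the `bare_spin_*` names are hypothetical; it only shows a lock of this shape, with no IRQL or interrupt handling around it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A lock that consists of a single flag and nothing else. */
typedef struct {
    atomic_flag locked;
} bare_spinlock;

void bare_spin_init(bare_spinlock *lock)
{
    atomic_flag_clear(&lock->locked);
}

/* One acquisition attempt: returns true if the lock was free and is now ours. */
bool bare_spin_try_lock(bare_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->locked,
                                              memory_order_acquire);
}

/* Spin until the lock is acquired. Nothing here masks interrupts or
 * changes any priority level; the caller is responsible for that. */
void bare_spin_lock(bare_spinlock *lock)
{
    while (!bare_spin_try_lock(lock)) {
        /* busy-wait */
    }
}

void bare_spin_unlock(bare_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

Note that a second `bare_spin_try_lock` on a lock the caller already holds fails, so calling `bare_spin_lock` in that situation would spin forever. That is the self-deadlock a CPU hits when it tries to take a spinlock it has not released.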

> And if not the current version then maybe a future version.

The logic in itself is reasonable, but it does not seem to apply in this particular case…

The very purpose of KeAcquireSpinLockAtDpcLevel() is to avoid doing unnecessary work, so they are very unlikely to make it work like KeAcquireSpinLock(). IIRC, KeAcquireSpinLock() is exported by HAL.DLL, whereas KeAcquireSpinLockAtDpcLevel() is exported by ntoskrnl.exe: unlike KeAcquireSpinLock(), it is not HAL-specific, so it would not make sense to implement it in HAL.DLL…

Anton Bassov

> > And if not the current version then maybe a future version.
>
> The logic in itself is reasonable, but it does not seem to apply in this
> particular case…
>
> The very purpose of KeAcquireSpinlockAtDpcLevel() is to avoid doing
> unnecessary work, so that they are very unlikely to make it work like
> KeAcquireSpinlock() . IIRC, unlike KeAcquireSpinlock() that is exported by
> HAL.DLL, KeAcquireSpinlockAtDpcLevel() is exported by ntoskrnl.exe,
> because, unlike KeAcquireSpinlock(), it is not HAL-specific, so that it
> does not make sense to implement it in HAL.DLL…

The docs say of KeAcquireSpinlockAtDpcLevel “IRQL: DISPATCH_LEVEL”.

There has been some discussion on the Xen mailing list to improve lock
contention by yielding the vcpu to Xen instead of spinning (eg spin for
a short time then yield). The idea being that the vcpu holding the
spinlock might not be currently scheduled by Xen, so by yielding the cpu
that is waiting for the spinlock the one holding the spinlock gets a
chance to finish and release it. I have considered patching the Windows
kernel to do the same sort of thing.
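
As a rough sketch of the spin-then-yield idea, here is a user-space C11 version. The lock is a plain `atomic_flag`, and “yield the vcpu to Xen” is represented by a caller-supplied callback, because the real mechanism is not shown in this thread. `spin_then_yield_lock`, `yield_fn` and `demo_holder_releases` are all made-up names, and the spin limit is arbitrary.

```c
#include <stdatomic.h>

/* Stand-in for "give the CPU away for a while"; supplied by the caller. */
typedef void (*yield_fn)(void *context);

/* Try the lock up to spin_limit times, then yield and start over.
 * Returns how many times it had to yield before acquiring the lock. */
unsigned spin_then_yield_lock(atomic_flag *lock, unsigned spin_limit,
                              yield_fn yield, void *context)
{
    unsigned yields = 0;

    if (spin_limit == 0)
        spin_limit = 1;  /* always make at least one attempt per round */

    for (;;) {
        for (unsigned i = 0; i < spin_limit; i++) {
            if (!atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
                return yields;  /* lock was free, now held by us */
        }
        yield(context);  /* let the current holder make progress */
        yields++;
    }
}

/* Demo callback: pretends the lock holder got scheduled and released
 * the lock while we were yielding. */
void demo_holder_releases(void *context)
{
    atomic_flag_clear_explicit((atomic_flag *)context, memory_order_release);
}
```

An uncontended acquisition returns without yielding at all. If the lock is held, the caller burns at most `spin_limit` attempts before handing the CPU back.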

I don’t know that Microsoft implementing something like that in the
Windows kernel would actually change the outcome of the situation under
discussion here, but it is an example of why obeying the documented
restrictions is a good idea.

In any case, I can’t see how a spinlock would work for me here. As far
as I can see a spinlock is designed to solve a different problem than
the one I am solving here.

James

> > And I’m just testing on a single CPU so this isn’t the problem.
>
> Think about it carefully, and you will realize that if you run the above
> code on UP system you are going to deadlock as well…

The code in question (despite any bugs in my pseudocode :) reduces to
the following on a UP system:

Dpc for CPU 0 ()
{
    // IRQL is DISPATCH_LEVEL
    KeRaiseIrql(HIGH_LEVEL, &old_irql);
    highsync_info->function0(highsync_info->context);
    KeLowerIrql(old_irql);
    KeSetEvent(&highsync_info->highsync_complete_event, IO_NO_INCREMENT,
               FALSE);
}

old_irql is DISPATCH_LEVEL, so KeLowerIrql(old_irql) is only going to
set IRQL back to DISPATCH_LEVEL, effectively only enabling interrupts
again.

A KdPrint before KeLowerIrql produces output, a KdPrint after does not.

James

> There has been some discussion on the Xen mailing list to improve lock contention by yielding
> the vcpu to Xen instead of spinning (eg spin for a short time then yield). The idea being
> that the vcpu holding the spinlock might not be currently scheduled by Xen, so by yielding
> the cpu that is waiting for the spinlock the one holding the spinlock gets a chance to finish and release it.

In fact, there is nothing particularly new here. AFAIK AIX UNIX uses this kind of synchronization construct, i.e. combination of spinlock and mutex concepts in a single construct - if you are allowed to block you can use it as a mutex and yield execution if you have failed to acquire it immediately; otherwise you have to use it as a spinlock and spin until successful acquisition. …

> I don’t know that Microsoft implementing something like that in the Windows kernel would
> actually change the outcome of the situation under discussion here,

Well, even if they decide to implement something like that, since you cannot yield execution at elevated IRQL, KeAcquireSpinLockAtDpcLevel() is not going to be affected anyway…

> In any case, I can’t see how a spinlock would work for me here. As far as I can see
> a spinlock is designed to solve a different problem than the one I am solving here.

Well, actually it is not that easy to understand what kind of problem you are trying to solve here. Please note that the same DPC cannot be queued to more than one CPU at a time. Therefore, the line like “schedule a DPC on every CPU” in setup implies that there must be a dedicated KDPC object for every CPU (i.e. every KDPC has only one certain CPU specified in its affinity mask) - otherwise, you have no chance to make your DPC run on all CPUs without re-queuing it…

At this point the only question that arises is: why don’t you queue just CPU 0’s DPC in setup, and queue the ones for all other CPUs from CPU 0’s DPC routine after it has finished its actual job, i.e. immediately before returning? (To make it easier, you can just define different service routines for CPU 0 and all other CPUs.) If the other DPCs have to synchronize something between themselves, they can use a spinlock (in fact, I don’t see any reason why they would have to use HIGH_LEVEL IRQL anyway). What is the point of all the trouble you are trying to go through???

Anton Bassov

> > In any case, I can’t see how a spinlock would work for me here. As far
> > as I can see a spinlock is designed to solve a different problem than
> > the one I am solving here.
>
> Well, actually it is not that easy to understand what kind of problem you
> are trying to solve here. Please note that the same DPC cannot be queued
> to more than one CPU at a time. Therefore, the line like “schedule a DPC
> on every CPU” in setup implies that there must be a dedicated KDPC object
> for every CPU (i.e. every KDPC has only one certain CPU specified in its
> affinity mask) - otherwise, you have no chance to make your DPC run on all
> CPUs without re-queuing it…

KeRaiseIrql(HIGH_LEVEL, &old_irql);
for (i = 0; i < ActiveProcessorCount; i++)
{
    if (i == 0)
        KeInitializeDpc(&highsync_info->dpcs[i],
                        XenPci_HighSyncCallFunction0, highsync_info);
    else
        KeInitializeDpc(&highsync_info->dpcs[i],
                        XenPci_HighSyncCallFunctionN, highsync_info);
    KeSetTargetProcessorDpc(&highsync_info->dpcs[i], (CCHAR)i);
    KeSetImportanceDpc(&highsync_info->dpcs[i], HighImportance);
    KeInsertQueueDpc(&highsync_info->dpcs[i], NULL, NULL);
}
KeLowerIrql(old_irql);

So yes, I create a DPC for each processor, target it to that processor,
set the Importance to High (maybe not really necessary?) and then make
it go. I do the scheduling at HIGH_LEVEL, although DISPATCH_LEVEL is
probably all that is required, and then drop back to PASSIVE_LEVEL to
let the DPC scheduled on the current processor start. All the routines
above are documented as “IRQL: Any level”.

> At this point the only question that arises is " well, why don’t you want
> to queue just CPU 0’s DPC in setup and the ones to all other CPUs right
> from CPU 0’s DPC routine after it has already finished its actual job,
> i.e. immediately before returning ( in order to make it easier, you can
> just define different service routines for CPU 0 and all other CPUs). If
> all other DPCs have to synchronize something between themselves, they can
> use a spinlock (in fact, I don’t see any reason why they would have to use
> HIGH_LEVEL IRQL anyway). What is the point of all the trouble you are
> trying to go through???

I’ve repeated this a few times. I patch the Windows kernel
(specifically, anything that touches the TPR) and so I have to be
absolutely sure that nothing calls the windows kernel at any time
until the patching is complete, lest it call the kernel in a ‘half
patched’ state.

In particular, because I’m patching all TPR access, I also can’t make
any calls to anything that raises or lowers IRQL, for obvious reasons.

The same synchronisation code is also used to ensure that all CPU’s are
doing nothing when I call Xen’s suspend code. The patching side of
things works fine, it’s just coming back from suspend that causes the
hang. I’m sure that the cause of the hang is a bug in my code which is
called from my synchronisation code, not a bug in the synchronisation
code itself, but being sure doesn’t necessarily make me right :)

Actually, my ‘spin’ code calls KeStallExecutionProcessor… I wonder how
safe that is…?

James

> I’ve repeated this a few times. I patch the Windows kernel (specifically, anything that touches the TPR)

This is what you should have started with. Please note that IRQL is changed by writing to the TPR. Therefore, the above line implies that, among other things, you patch KeLowerIrql(). In other words, your question may be restated simply as: “The very first time I call a function that I have patched, the system hangs, although it worked fine before I patched it. What’s wrong with my code?”…

Anton Bassov