Raising IRQL on one core blocks threads on other cores?

That is very curious, as it sounds to ME like there are no interrupts being serviced either.

Ah! I have to ask: VM or physical machine?

Peter

The driver receives this, raises the IRQL to DISPATCH_LEVEL, does the loop, then lowers the IRQL back to its original level.

What exactly does it do in the loop? Remember that many useful APIs cannot be called at DISPATCH.
If you call such APIs, the system won’t necessarily catch you (crash), but it may break.
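
To make that concrete, here is one example of the kind of quiet breakage meant above (a sketch; the size and pool tag are arbitrary):

// Illegal at DISPATCH_LEVEL: paged pool may have to take a page fault,
// which cannot be serviced at this IRQL. Without Driver Verifier this
// often "works" in testing, and only bugchecks (IRQL_NOT_LESS_OR_EQUAL)
// when the touched page happens not to be resident.
PVOID p = ExAllocatePoolWithTag(PagedPool, 64, 'tseT');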

– pa

@“Peter_Viscarola_(OSR)” said:
That is very curious, as it sounds to ME like there are no interrupts being serviced either.

Ah! I have to ask: VM or physical machine?

Peter

This is on a physical machine. It’s got a dual core. Kinda weird, I think, if interrupts aren’t being serviced.

@Pavel_A said:

The driver receives this, raises the IRQL to DISPATCH_LEVEL, does the loop, then lowers the IRQL back to its original level.

What exactly does it do in the loop? Remember that many useful APIs cannot be called at DISPATCH.
If you call such APIs, the system won’t necessarily catch you (crash), but it may break.

– pa

The loop looks like:

int i;
volatile int t = 0;   // initialized; volatile keeps the busy loop from being optimized away
for (i = 0; i < 1000000000; i++) t++;
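
For context, a minimal sketch of the whole sequence being described (raise, spin, lower) might look like this, assuming a plain KeRaiseIrql/KeLowerIrql pair; this is test code only, not something a production driver should do:

KIRQL oldIrql;
volatile int t = 0;
int i;

KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);   // this core stops dispatching threads and servicing its DPC list
for (i = 0; i < 1000000000; i++) {
    t++;                                 // busy-spin at DISPATCH_LEVEL
}
KeLowerIrql(oldIrql);                    // restore the caller's original IRQL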

Dual core, or one core with two hyperthreads? I believe interrupts are handled differently.

Allow me to introduce you to KeStallExecutionProcessor.

I entirely neglected to consider whether it had hyperthreads. Viewing it in CPU-Z, it reports 2 cores and 2 threads. Task Manager also shows 2 cores, so it seems it doesn’t use hyperthreading.

And as for KeStallExecutionProcessor, that would have made testing this a lot more elegant.
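
For the archives: KeStallExecutionProcessor busy-waits the current processor without changing IRQL, so a long test delay could be sketched like this (the routine is documented for very short stalls, so looping it is strictly a test hack; the counts are arbitrary):

ULONG j;
for (j = 0; j < 1000000; j++) {      // ~10 seconds total, give or take
    KeStallExecutionProcessor(10);   // stall this processor ~10 microseconds
}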

it sounds to ME like there are no interrupts being serviced either.

Well, taking into account that Windows interrupt processing is mostly done in DPCs rather than in ISRs, I am not really surprised
by this part. If interrupts are mostly routed to the core that spins at elevated IRQL, this delay in interrupt processing is perfectly understandable: an ISR queues a DPC to the target core’s DPC queue (a normal-priority DPC is queued to the same CPU that serviced the ISR, right?), and this DPC cannot start running until IRQL drops below DISPATCH_LEVEL. Taking into consideration that most threads are I/O-driven and wait for their inputs, the whole system eventually gets suspended, and freezes until the data from the hardware arrives.
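
As an illustration of the mechanism described above, a bare-bones WDM-style ISR might look like the sketch below (the device extension is hypothetical, and it assumes IoInitializeDpcRequest was called during device setup):

#include <wdm.h>

// Hypothetical device extension, just enough for the sketch.
typedef struct _MY_DEVICE_EXT {
    PDEVICE_OBJECT DeviceObject;
} MY_DEVICE_EXT, *PMY_DEVICE_EXT;

BOOLEAN MyIsr(PKINTERRUPT Interrupt, PVOID Context)
{
    PMY_DEVICE_EXT devExt = (PMY_DEVICE_EXT)Context;

    UNREFERENCED_PARAMETER(Interrupt);

    // (Hardware check and acknowledge elided; assume the interrupt is ours.)

    // Queue the device's DPC. By default it lands on THIS processor's DPC
    // list - so if this core is spinning at DISPATCH_LEVEL, the DPC sits
    // in the queue until IRQL drops below DISPATCH_LEVEL.
    IoRequestDpc(devExt->DeviceObject, NULL, devExt);

    return TRUE;
}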

Anton Bassov

Mr. Bassov’s observations are correct, but I’m not buying his conclusions. Unless the OP isn’t carefully describing his situation.

Most of us have seen cores get “lost” by a deadlock on a spin lock; I know I certainly have. That’s basically what the OP is describing. The system can seem OK for “a while”… then it eventually descends into a sort of weird, dead, state for the reasons Mr. Bassov lists. But I wouldn’t describe this as looking like “no threads were able to run on either core” or the system being “unresponsive.” This happens in pieces and gradually, yes. But only after some time.

Peter

Could it be that somewhere outside my code a spinlock is acquired, and a thread on the other core is waiting for it? I suppose that type of situation would cause the other core to keep spinning while it waits for the first core to release its lock. Please correct me if I’m misunderstanding something.

I don’t know, but if I had this problem I would use windbg to look at what the fork was going on with all the other ‘cpus’.

Mark Roddy
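
For anyone following along at home, a first pass in a kernel-mode WinDbg session might be something along these lines (standard kd commands; the exact output varies, of course):

!running -ti
!irql 0
!irql 1
~1s
k

!running shows what each processor is executing (-t adds stacks, -i includes idle processors), !irql N reports the IRQL a processor was at when the debugger broke in, and ~1s followed by k switches to processor 1 and dumps whatever it was running.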

Most of us have seen cores get “lost” by a deadlock on a spin lock;

Well, this is a slightly different story, albeit a similar one.

In case of a spinlock-related deadlock you are, indeed, just bound to deadlock completely at some point. Any CPU that tries
to acquire the target spinlock in this situation is simply bound to hang. Since spinlocks normally get acquired on code paths that happen to be generally useful, and hence get executed on a more or less regular basis, all cores are eventually going to get taken out of play at some point, trying to acquire a lock that will never, ever get released.

However, in our (HYPOTHETICAL, of course) case, we lose the ability to execute DPCs that get queued to the DPC queue of the spinning core.
The same DPC object cannot get enqueued more than once, can it? Therefore, every time an interrupt gets routed to that core, we are just bound to “lose” the DPC that its ISR queues - it will get placed in the spinning core’s queue and sit there.
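
And indeed, the “cannot be enqueued more than once” part is visible directly in the API. A tiny sketch, with myDpc assumed already initialized via KeInitializeDpc:

// Returns FALSE if myDpc is already in some processor's DPC queue; the
// repeat request is simply dropped, coalesced into the pending DPC.
if (!KeInsertQueueDpc(&myDpc, NULL, NULL)) {
    // Already queued: this interrupt's deferred work rides along with
    // the DPC that is still waiting to run.
}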

then it eventually descends into a sort of weird, dead, state for the reasons Mr. Bassov lists

Fair enough - in case of a spinlock-related deadlock you may start noticing some “funny” effects well before all CPUs are actually deadlocked, and that is going to happen because of the “DPC starvation” that I have described above.

Another point to consider is that, in order to get the observations that the OP reports (i.e., a frozen GUI), all we have to do is “lose” a SINGLE
DPC (i.e., the one that gets queued by the mouseclass driver) this way…

Anton Bassov

What a long reply. Concision is a virtue, Mr. Bassov.

In case of a spinlock- related deadlock you are, indeed, just bound to deadlock completely at some point

Well… yes and no. It depends on the paths in which the spin lock is acquired and what’s happening on the device.

In case of a spinlock-related deadlock you may start noticing some “funny” effects well before all CPUs are actually deadlocked, and it is going to happen because of “DPC starvation” that I have described above

Which, you know, is what I said…

the observations that the OP gets (i.e. of a frozen GUI), all we have to do is to “lose” a SINGLE DPC

Well… yes? More accurately, we need the USB HCD to queue a DPC on the core that is no longer servicing the DPC list (because it’s “otherwise engaged” at IRQL DISPATCH_LEVEL). It’d be bad luck for this to happen right away. And even in this case, you’d still see other UI elements working. The keyboard won’t immediately die.

In summary, as interrupts eventually get routed to the “busy” core, the DPCs that service the “bottom half” of those interrupts will be blocked, and no further DPC will be queued for those device instances for which a DPC is pending, as Mr. Bassov correctly notes.

So, again… the system “eventually descends into a sort of weird, dead, state” – “eventually” here might be a few seconds. But I can’t see this being immediate.

Mr. Roddy’s comment is really the most insightful: Just look at what’s happening with WinDbg, and we can all stop guessing.

Peter

Wow, thank you all for the very insightful replies. As it stands, I’ve been testing this on my local machine, which may not be the best idea. If I can get it working in a virtual machine, then I think I can take a look with WinDbg at what’s going on. Very new to all this, so I apologize for my ignorance.

Hmmmmm… I’m not sure I’d trust results on a VM.

Get a (physical machine) test system hooked up.

ETA: More cores would be better, as well. Get a test system with 4 or 8 cores and you should definitely see the slower degradation we’d expect. On a 2-core machine, you have a 50% chance of any given interrupt happening on the “locked” core. Also, get something with GUI elements running (they’ll run until, you know, the graphics card gets hung).

Peter

There are (mostly older) systems with chipsets that cause ALL interrupts to occur on CPU 0 instead of being evenly spread among CPUs. Regardless, if an interrupt occurs on the CPU on which you are spinning and there’s a DPC involved, it won’t run until the spinning DPC is done, so that may lock up the GUI / system.

//Daniel

um no?

If there is an interrupt on the CPU that is running a DPC, the interrupt service routine will run and then the DPC will resume. Also, a DPC is not a hardware interrupt; it cannot by itself block other CPUs.

Mark Roddy

That’s what I think I said. The spinning DPC will resume. Perhaps an exception to this is if a HighImportance DPC is queued; then it should run immediately.

//Daniel

Finding any machine in 2020 that has only 2 cores seems hard, but I think that many current desktop-grade systems do route all interrupts to CPU 0, even in 2020.

In any case, it will be highly platform-specific behavior that will certainly differ between physical machines, between VMs on different hypervisors, and with differences in the physical machines that those VMs run on, too.

Perhaps an exception to this is if a HighImportance DPC is queued, then it should run immediately.

Absolutely not. There is never DPC preemption.

All setting the Importance of a DPC does is influence (a) whether the DPC object is queued at the head or the tail of the DPC list, and (b) if the DPC is targeted to a processor other than the current processor, whether an IPI is sent to the remote processor to inform it that it now has something on its DPC list that needs to be processed.
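
In code terms, those two knobs are all there is; a sketch (myDpc, MyDpcRoutine, and SetupAndQueue are hypothetical names):

KDPC myDpc;

// A deferred routine with the standard KDEFERRED_ROUTINE shape (hypothetical).
VOID MyDpcRoutine(PKDPC Dpc, PVOID Ctx, PVOID Arg1, PVOID Arg2)
{
    UNREFERENCED_PARAMETER(Dpc);  UNREFERENCED_PARAMETER(Ctx);
    UNREFERENCED_PARAMETER(Arg1); UNREFERENCED_PARAMETER(Arg2);
}

VOID SetupAndQueue(VOID)   // hypothetical call site
{
    KeInitializeDpc(&myDpc, MyDpcRoutine, NULL);

    KeSetImportanceDpc(&myDpc, HighImportance);   // (a) queue at the HEAD of the DPC list
    KeSetTargetProcessorDpc(&myDpc, 1);           // (b) put it on processor 1's list

    // Queuing may interrupt (IPI) processor 1 so it notices the new entry,
    // but the DPC still only runs once that core's IRQL permits it - there
    // is no preemption of whatever is running at DISPATCH_LEVEL.
    KeInsertQueueDpc(&myDpc, NULL, NULL);
}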

Peter

Absolutely not. There is never DPC preemption.

That’s what I would expect too, but here’s what the docs say about HighImportance:
Place the DPC at the beginning of the DPC queue, and begin processing the queue immediately.

If KeInsertQueueDpc is executed by the ISR, can’t it process the DPC queue before the ISR returns?

//Daniel

If KeInsertQueueDpc is executed by the ISR, can’t it process the DPC queue before the ISR returns

No. How could this work? The ISR is running at DIRQL and holding the interrupt spin lock… we need to run the DPC at IRQL DISPATCH_LEVEL, and not holding the lock.

How could DPC preemption work, given everything we know about DPCs?

Peter