High IRQLs for filter developers

Hey,
I’m not a device driver developer - Mostly doing filter development - Is there any situation you think I will encounter:

  • running code in > DISPATCH_LEVEL?
  • Using KeInsertQueueDpc?
  • Using IPI?

As far as I know:

  1. Only ISRs run in > DISPATCH_LEVEL so it’s not relevant for filter developers unless some callback is called in an ISR - I don’t think any callback is called at this IRQL because it’s a high IRQL…
  2. Completion routines can only be called in <= DISPATCH_LEVEL (Because you cannot complete an IRP in > DISPATCH_LEVEL…)
  3. The only place I think a DPC can be useful for me is for timers… (KeSetTimer)

Also, I wonder:

  • Why can’t I queue a work item in > DISPATCH_LEVEL? Sounds very weird that to perform PASSIVE_LEVEL operations I have to go through a DPC… (Ignoring ThreadedDpc because it can be disabled globally and the DPC will be called in DISPATCH_LEVEL)
  • Why can’t I call KeInsertQueue in > DISPATCH_LEVEL? Is there any synchronization object that is allowed in DISPATCH_LEVEL and not allowed in DIRQL?

Most of these questions are probably not that relevant for normal filtering but I’m curious anyway…

Thanks!:slight_smile:

Your analysis is correct. In general, IRQLs > DISPATCH_LEVEL are restricted to device drivers. I can’t think of any case where a filter driver developer would encounter IRQLs > DISPATCH_LEVEL.

You can’t queue work items or DPCs at IRQLs > DISPATCH_LEVEL because the operations and/or queues that Windows uses for these features are protected by DISPATCH_LEVEL locks (or the equivalent).

Peter

I’m not a device driver developer - Mostly doing filter development - Is there any situation you think I will encounter:

running code in > DISPATCH_LEVEL?
Using KeInsertQueueDpc?
Using IPI?

Out of all the above, only KeInsertQueueDpc() may THEORETICALLY apply to filter drivers. In practical terms, I hardly imagine why
someone would want to use it in any “proper” driver that does not actually handle interrupts. However, in case if you are just desperate to use it in order to implement some"hack", you can do it in any context.

For example, IIRC, one of the infamous DRM schemes that was used around 15 years ago (I think it was called StarForce or something similar to that) was filtering a CD drive. What it was actually doing was re-queuing a DPC to the head of the queue over and over again
( in order to work around 100us limitation required for driver certification), effectively freezing the system ( UP systems were prevalent at the time) for indefinite periods of time so that the game could not get copied.

Running at DIRQL applies only to ISRs and to the code that synchronises with them, so that this part is reserved only for certain types of hardware devices (normally only PCI and PCIe ones). Concerning IPIs, Windows drivers are not supposed to send the ones.
This is done (or at lest was done at the time when I was writing Windows drivers 12 years ago) by, IIRC, HalRequestSoftwareInterrupt(), but this function was undocumented and was not supposed to be called by anyone, apart from the kernel. AFAIK, nothing had changed in this respect since.

Is there any synchronization object that is allowed in DISPATCH_LEVEL and not allowed in DIRQL?

Of course…

For example, you cannot use “regular” DISPATCH_LEVEL spinlocks at DIRQL. In order to understand why, consider what happens if a CPU acquires a spinlock at DISPATCH_LEVEL, then gets interrupted while still holding it, and ISR tries to acquire the same spinlock again. Once you cannot acquire spinlocks recursively, a deadlock is guaranteed in this situation.

This is why you cannot use “regular” DISPATCH_LEVEL spinlocks at DIRQL. In order to synchronise with ISR you need a special interrupt spinlock that elevates IRQL to DIRL, rather than to DISPATCH_LEVEL.

Why can’t I queue a work item in > DISPATCH_LEVEL?

Apparently, queuing a workitem involves a “regular” spinlock acquisition - after all, workitem queue must be a global one, and,hence,
must be accessible to multiple CPUs at once. Therefore, you need to get a lock before you can add new items to it . You already know why you cannot use “regular” spinlocks at IRQL > DISPATCH_LEVEL, right. This is the reason why you cannot queue a workitem
at IRQL > DISPATCH_LEVEL

Why can’t I call KeInsertQueue in > DISPATCH_LEVEL?

Well, I don’t really know what you mean by “KeInsertQueue”, but assuming that you mean KeInsertQueueDpc(), I can assure you that this
function may be used at IRQL > DISPATCH_LEVEL without a slightest problem. Otherwise, how would you queue a DPC, in the first place???

https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-keinsertqueuedpc

This function inserts a DPC into the queue that is local to the caller CPU, and, hence,does not require locking - simply disabling interrupts on the CPU may suffice here. Therefore, this function does not have IRQL-related requirements…

Anton Bassov

Thank you both for the comprehensive answer!
Anton - KeInsertQueue: https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-keinsertqueue - this function probably uses the “regular” spinlocks you spoke about to synchronize and that’s why it cannot be used in DIRQL.

about the work items - theoretically, they could implement lock-free queues with Interlocked functions but I guess it’s harder and the OS was not designed this way…

Queuing work items from the ISR: Yup… you always need to step-down from ISR to DPC, and then from DPC to a scheduled work item. In fact, if you look at WdfInterruptQueueWorkItemForIsr (which is designed to be a handy way to allow WDF devs to queue work items directly from the ISR, just as you describe) you’ll see that it queues a DPC and the DPC then queues the work item.

Peter

about the work items - theoretically, they could implement lock-free queues with Interlocked functions

Please note that a lock-free queue implemented with InterlockedCompareExchange() and friends is, in actuality, not necessarily
as “wonderful” solution as it may seem to be at the first glance. For example, consider a busy queue with insertions and removal getting performed by multiple CPUs all the time. By the virtue of bus locking that interlocked functions imply, only one of them may be successful
at any particular moment, right. Therefore, all other contenders will have to repeat their attempts, and do it in a loop until eventually succeeding.

In terms of bus pressure, this is, apparently, much worse than a “classical” test-and-set-based spinlock with its “thundering herd”
scenario, which may occur when a current owner releases a lock. At this point all the contenders break out of their inner loops,
and attempt an interlocked test-and-set all at once, although only one of them may succeed at the time. This is why it is called
a “thundering herd” scenario, which puts an unnecessary pressure on the system bus.

In case of test-and-set-based implementation of a spinlock, the above mentioned “thundering herd” scenario arises only in spikes,
i.e. may occur only when a current owner releases a lock. The system bus may “take a breath and relax” while the spinlock is being held.
However, in case of a lock-free queue there is no chance of a respite for a bus - what you are getting here is a tight polling loop
of interlocked operations, with all the pressure on the system bus that it implies.

In fact, if you look at the whole thing from the system bus’s perspective, what you are going to see in the place of this fancy " lock-free queue" is just the most naive implementation of spinlock acquisition ever known,

[enter ironical mode]

although Mr.Kyler believes this is the only proper spinlock implementation in existence.

https://community.osr.com/discussion/comment/122444#Comment_122444

                    [begin quote]

                     Everything on this thread has been worthless. Everyone knows the correct implementation of a spinlock is:

                     SPIN: BBSSI #0,LOCK,SPIN

                     - Dan.

                      [end quote]

[leave ironical mode]

Certainly, I am not saying that a lock-free queue is necessarily worse than a spinlock - surely it has its own advantages.
Unlike a spinlock, it does not have an owner, and, hence, does not need to get released. Therefore, even if you get preempted after a successful operation on a queue you are not going to block anything. Furthermore, once an operation on it is atomic, it has no chance of getting into an inconsistent state due to interrupt or preemption. As a result, there is no need to subject these operations to any IRQL-related constraints.

The only thing that I am saying is that these algorithms should be used judiciously, because they may do more harm than good
under some workloads, at least if implemented naively. Therefore, everything should be judged on case-by-case basis. One has to bear in mind that any particular interlocked operation per se is not THAT expensive., at least as long as it gets performed occasionally. The real overhead comes in only when you start performing them in a loop. Therefore, the main issue here is with a loop. As long as you can find a way to avoid it, everything is going to be just fine. For example, when you are in a position to do something useful in between your attempted interlocked ops, a lock-free queue seems to be, indeed, quite an attractive option

However, taking into consideration the specifics of a workitem queue, I would say that it is, apparently, better off with being protected by a spinlock.

Anton Bassov

better off with being protected by a spinlock.

Which is, incidentally, almost certainly an in-stack queued spinlock. But I really don’t recall, and absent a specific request or reason, haven’t looked.

Peter

Which is, incidentally, almost certainly an in-stack queued spinlock.

Most likely yes,especially if we recall that in-stack queued spinlock (i.e. XCHG-based one) is an optimisation that eliminates a “thundering herd” issue that a “classical” test-and-set-based spinlock is vulnerable to.

Again, it does not necessarily imply that a in-stack queued spinlock is always better than a “classical” one. Again, it depends on whether you have something useful to do at the moment. If you do, you may KeTryToAcquireSpinLock() . This approach may be possible only with a “classical” spinlock that is built around test-and-set. Both XCHG-based in-stack queued spinlock under Windows and XADD-based ticket lock under Linux disallow this possibility - once you have taken your place in a queue, you already cannot, for understandable reasons, leave it until you have actually acquired a spinlock. Otherwise, you are going to screw up the things for the other contenders, as well as for a lock owner.

Strange enough, the participants of the thread below mostly did not find any use to it - as it usually happened, Alberto was the only one who had the potential to see the various possibilities and to think “out of the box”

https://community.osr.com/discussion/72984.

To make it even more interesting, this function seems to have been removed from the newer OS releases, and is not mentioned anywhere in the documentation any longer. …

In any case, taking into consideration the specifics of a workitem queue, an in-stack queued spinlock seems to be more suitable for the purpose, compared to a “classical” one…

Anton Bassov

Certainly, I am not saying that a lock-free queue is necessarily worse than a spinlock - surely it has its own advantages.

Thank you for your answer, I’ll take it into consideration if I’ll do such a thing;

Moreover, I looked at KeInsertQueueDpc and I saw that it actually raises the IRQL to HIGH_LEVEL - Is it because DPCs can be queued at any IRQL?

So I just want to verify I got it correctly: you have to be at least the same IRQL as the code you want to synchronize with - for example if you have an interrupt handler that runs at IRQL 6 and you want to synchronize with it you have to raise the current IRQL to 6 to do so and interrupt spin locks in windows allows the caller to automatically increment to IRQL to the IRQL of the ISR
Also sounds like spin locks (or other locking mechanisms) are not handy when the the code that synchronizes always runs on the same CPU because when you raise the IRQL no other code on the same CPU can preempt your code.

If a device sends an interrupt while the IRQL on the CPU is high, does the interrupt gets into a “pending” state until the IRQL lowers?
I saw KeInsertQueueDpc calls HalSendSoftwareInterrupt(DISPATCH_LEVEL) - as far as I know this is used to run all the pending DPCs in the queue of the CPU - if the IRQL was high when KeInsertQueueDpc called HalSendSoftwareInterrupt(DISPATCH_LEVEL), the interrupt is pending until the IRQL lowers below DISPATCH_LEVEL then the DPC queue is flushed, right?

you have to be at least the same IRQL as the code you want to synchronize with

Hmmmm… yes. Spinlocks on Windows have an IRQL that is implicitly associated with them. This must be at least IRQL DISPATCH_LEVEL and is the highest IRQL from which they may ever be acquired. “Normal” Executive Spinlocks use IRQL DISPATCH_LEVEL. The Interrupt Spin Lock (one per KINTERRUPT) uses the Device IRQL (the IRQL at which the device interrupts and at which the ISR runs).

So, acquiring a Spin Lock in WIndows has two steps:

  1. The IRQL of the processor is raised to the Spin Lock’s IRQL. This, as I said before, is at least IRQL DISPATCH_LEVEL. This (as you noted) disables dispatching, causes the current thread to run until it lowers its IRQL back below DISPATCH_LEVEL, and is sufficient to lock THE CURRENT PROCESSOR.

  2. The thread attempts to acquire the Spin Lock (using an Interlocked Bit Test and Set operation… by implementation if not by architecture). The thread does not return from the attempt to acquire the Spin Lock until the lock is held.

Also sounds like spin locks (or other locking mechanisms) are not handy when the the code that synchronizes always runs on the same CPU

Hmmmm… Yes? I guess? But, that’s a pretty odd circumstance, isn’t it? I mean, I’ve got a system with N processors… why would my code be restricted to running on only one specific processor. Please don’t say “locality of reference” or “caching”… because (remember) Windows treats multiple threaded processors (“hyperthreading”) as discrete logical processors… but also understands the physical topology.

If a device sends an interrupt while the IRQL on the CPU is high, does the interrupt gets into a “pending” state until the IRQL lowers?

Yes.

as far as I know this is used to run all the pending DPCs in the queue of the CPU

Well, not necessarily. It can also be used to call the Dispatcher to perform a scheduling operation.

the interrupt is pending until the IRQL lowers below DISPATCH_LEVEL then the DPC queue is flushed, right?

Yes.

Peter

Also sounds like spin locks (or other locking mechanisms) are not handy when the the code that synchronizes
always runs on the same CPU because when you raise the IRQL no other code
on the same CPU can preempt your code.

Basically, raising IRQL is just a way of synchronising with the code that is guaranteed to run on the same logical CPU. As long as all your contenders are guaranteed to run on the same logical CPU, there is, indeed, no need for spinlocks - simply raising IRQL may suffice here.
This is the reason why UP HAL’s implementations of KeAcquireSpinlock() and KeAcquireSpinlockAtDpcLevel() boil down to respectively KeRaiseIrql (DISPATHCH_LEVEL) and NOP. However, this approach is very obviously insufficient if your contenders may be
running on the other logical CPUs, so that you need a “full-fledged” spinlock in this case.

This is why KeRaiseIrql() and friends are of very little use to general-purpose driver writers these days, and may only apply to some
“special-needs” drivers that limit the affinities of their threads and DPCs to specific CPUs, cores and threads
(like, for example, the ones that try to (mis)use Windows NT as a soft RTOS)

Moreover, I looked at KeInsertQueueDpc and I saw that it actually raises the IRQL to HIGH_LEVEL - Is it
because DPCs can be queued at any IRQL?

So I just want to verify I got it correctly: you have to be at least the same IRQL as the code you want to
synchronize with - for example if you have an interrupt handler that runs at IRQL 6 and you want to synchronize
with it you have to raise the current IRQL to 6 to do so and interrupt spin locks in windows allows the caller to
automatically increment to IRQL to the IRQL of the ISR

Basically, there are 2 reasons why you may want to raise IRQL

  1. You want to ensure the atomicity of the operation, in a sense of disallowing any external updates to
    the variables that it may access, until the target operation completes. Your question concerning KeInsertQueueDpc()
    is very obviously about this scenario. Consider what happens if an update to a DPC queue gets interrupted, and an interrupt handler
    decides to update it as well. It is obvious that a DPC queue may be left in inconsistent state in this case. Therefore, in order to avoid this “unfortunate scenario”, KeInsertQueueDpc() raises IRQL to HIGH_LEVEL, effectively disabling interrupts on a given CPU,
    until update is complete

  2. You are holding a synch construct that cannot get acquired recursively. This is why the system has to raise IRQL
    to the highest one of all the possible contenders, before attempting an interlocked op that a spinlock is built on top of. Otherwise,
    a recursive acquisition attempt may potentially be made if your code gets interrupted and interrupt handler tries to acquire the same spinlock, effectively deadlocking the CPU until the reboot.

If a device sends an interrupt while the IRQL on the CPU is high, does the interrupt gets into a “pending” state until the IRQL lowers?

Yes - the OS may rely upon the hardware-provided features in order to ensure this part…

I saw KeInsertQueueDpc calls HalSendSoftwareInterrupt(DISPATCH_LEVEL) - as far as I know this is used to run
all the pending DPCs in the queue of the CPU

Well, this is just a request for a scheduling interrupt on the target CPU.OTOH, in practical terms, a scheduling interrupt handler, indeed,
flushes a DPC queue in case if it not empty

  • if the IRQL was high when KeInsertQueueDpc called HalSendSoftwareInterrupt(DISPATCH_LEVEL), the interrupt is pending
    until the IRQL lowers below DISPATCH_LEVEL then the DPC queue is flushed, right?

Well, probably, it is better to say " is about to go below DISPATCH_LEVEL", at least as long as our target OS version provides a purely software-based implementation of the IRQL concept. OTOH, it really depends on the OS and HAL versions. For example, if our target OS/HAL version relies upon TPR for the purpose, a pending DISPATCH_LEVEL interrupt will, indeed, fire immediately after the OS lowers IRQL below DISPATCH_LEVEL by means of writing to TPR.

In any case, this is just a minor and insignificant detail - I am just trying to be pedantic
(and, apparently, to show off as well :slight_smile: )

Anton Bassov

anton’s comments are esentially obsolete as there have been no UP systems in a long time. he is not wronf, but answers about how a PDP8 works might be as useful

anton’s comments are esentially obsolete as there have been no UP systems in a long time.

Actually, I was just answering the OP’s question concerning synchronisation on the same CPU (namely, whether
KeAcquireSpinlock() and KeRaiseIrql (DISPATHCH_LEVEL) are equivalent in this case), so that the example of the now obsolete
UP HAL seemed (at least to me) right for the purpose. After all, this was a purely conceptual question and not the one about
the details of any particular implementation. Never mind…

PS. Once we are at it, are you 100% sure there are no more single-core CPUs produced? If my memory serves me well, I think I saw
a low-end Windows laptop with a single-core Celeron in a shop as recently as around a year ago

answers about how a PDP8 works might be as useful

You just made my day, Marion…

The funniest thing here is that, as long as we are speaking about the concept of IRQL, the examples of the ancient hardware like PDP11 may, indeed, be more than useful. The problem is that the very concept of the OS prioritising hardware interrupts to one another does not really make sense under any architecture, other than PDP11. AFAIK, one of the peculiar features specific to PDP11
was a non-uniform cost of accessing hardware devices, which depended on the particular device’s location on the bus.

Therefore, the very concepts of IRQL (under VMS) and spl() (under UNIX) were developed for the sole purpose of addressing
this peculiarity of PDP11. Windows NT just borrowed the concept of IRQL from VMS. In actuality, IRQL and spl() don’t really make sense anywhere, apart from PDP11. Although you may still see spl() under the most recent versions of NetBSD and OpenBSD occasionally, the only reason for this is that no one was really bothered to clean it up. FreeBSD had replaced it with the spin mutexes ages ago, and Solaris/Illumos does not seem to be using it for anything, apart from disabling and enabling interrupts on the CPU.

In other words, the very concept of IRQL was completely obsolete at the time Windows NT was designed
(OK, I shut up now - otherwise, “The Hanging Judge” may get upset, and I had used up my “lucky coin” more than a year ago)

Anton Bassov

one of the peculiar features specific to PDP11 was a non-uniform cost of accessing hardware devices

I am not aware of this “feature” and it doesn’t make sense relative to what I know about the hardware architecture and OS architectures… as someone who worked at Digital teaching people how to write device drivers for multiple PDP-11 operating systems, and who wrote a lot of PDP-11 drivers myself, I think I would have heard of this.

Therefore, the very concepts of IRQL (under VMS) and spl() (under UNIX) were developed for the sole purpose of addressing this peculiarity of PDP11. Windows NT just borrowed the concept of IRQL from VMS. In actuality, IRQL and spl() don’t really make sense anywhere, apart from PDP11

Yeah… no.

The idea of relative interrupt prioritization was present (In the PDP-11 and the IBM PC) to allow the interrupts from devices that require lower latencies to take priority over devices that can tolerate longer latencies. It’s as simple as that. Every bit cost, back in the day, and a common 8 line terminal interface (the DZ11) only had a 16 character FIFO that was shared by all 8 lines. Even at 9600 baud (the max speed supported) it was pretty easy to overrun the FIFO.

IRQL is a larger concept that builds on this, and (amusingly enough) very closely mirrors the PDP-11 Interrupt scheme, which includes a dedicated set of IRQs that are used only for software prioritization (IIRC, IRQs 1, 2, and 3 were not usable for devices on the PDP-11).

You are correct that, by the time of Windows NT, hardware IRQs were effectively useless… and (again IIRC) NT always randomly assigned IRQLs to hardware devices and thus ISRs… a fact that usually blows the mind of embedded and real-time device people.

IRQLs are primarily a mechanism for the OS to organize it’s work, prior to returning to its default state of “running the highest priority user thread.” They provide us a mechanism to do work that’s “more important than user stuff” but “less important than device interrupt service” that is designed to keep interrupt latency low. RSX111M did this using something Cutler called the Fork List, VMS likewise, Windows with the DPC list, Linux bottom half, etc. A side effect of this, of course, is that you get the capability for multiple levels of spin locking (which is sometimes cited as the reason for IRQLs)… but this is a happy bonus and not the primary architectural motivation for IRQLs.

Peter

As another person who worked on PDP-11 device drivers, in my case writing them as freelance consultant since 1972 through 1978, I like Peter never heard the claim. Most of my drivers were for DEC OSes but I did write some Unix drivers in that time, and helped port UNIX to a Honeywell architecture, I had contact with the Bell Lab’s team and never heard these claims. So Anton, where did you get this dreck?

So Anton, where did you get this dreck?

Actually, IIRC, somewhere in this NG. I’ve got to double-check it in order to provide some more info - otherwise, in addition to putting a foot in my mouth yet another time, I am going to make some libelous claim as well.

In fact, this claim in itself is not as nonsensical as it may seem to be at the first glance. For example,consider a NUMA system where every CPU core has its own RAM bank. Depending on the bank’s location on the bus relative to a given CPU core (i.e whether it happens to be a “local” bank or a “remote” one), the cost of memory access by a given CPU may be different for different banks. Therefore, I don’t see any reason why the same logic may not apply to memory-mapped devices on some " XYZ architecture"

Anton Bassov

this claim in itself is not as nonsensical as it may seem to be at the first glance

There’s nothing inherently nonsensical about the possibility that such an issue existed. It is merely untrue that it applies to the PDP-11. And to extend such an idea to the entire concept of why IRQs exist (when the simplest explanation – that some devices need “more urgent” service than others – is more likely to be the correct one)… you know, when you hear hoofbeats, think horses.

Peter

If you want to go with NUMA architectures then we are talking a different beast. PDP-11’s were not NUMA, but DEC or one of its spin-offs did have NUMA based technology that could potentially apply in the Alpha or later timeframe. Of course by the time that came out, Windows NT had shipped so I don’t think you can make the claim of that for either Unix or current Windows.

I should mention in helping port Unix to a Honeywell machine, we talked a lot about the machine architectural design decisions of Unix with the Bell Lab’s folks. This was because the machine in question had two types of addressing, one for words and one for bytes, the early Unix code did not handle this well.

Imagine the amount of Christmas spirit and good cheer that I must possess… I haven’t locked either this thread or the one (that eventually, after several posts the size of War and Peace, turned out to be) about I/O Completion Ports and NtLockFile. I’m clearly ending the year with some major thread drift. Perhaps even facilitating such drift. I shall strive and endeavor to be better in the New Year.

Peter

If my memory serves me well, I think I saw a low-end Windows laptop with a single-core Celeron

You know, I hate to be the one to defend Anton. But I recently (like, in the past 2 years) worked on an embedded systems project that used exactly such a processor to handle all the C3 between a UAV and the ground: A single-core Celeron (that, as far as I could tell, had been designed primarily for a laptop). In response to my (repeated) scoffing at the single-core CPU I was informed that the primary driver for using this processor was the heat budget for the aircraft.

A new design for the processor system was eventually created, along with new heat sinks and a new multi-processor capable driver… whether this ever actual got BUILT or deployed I have no idea. It was a super interesting/fun project and one I was proud to work on. My only regret, and it’s a biggie, is that I wasn’t allowed to take any pictures. For the avoidance of moral ambiguity, the UAV was designed for surveillance only.

Peter