I’m very, very new to NT kernel programming, though definitely not to NT in general, so there are many things I don’t know. For example, until this week I didn’t know that OBJ_KERNEL_HANDLE makes a handle valid in every process context, in addition to making it accessible only from kernel mode. My kernel mode log file I/O code was a lovely mess of ZwDuplicateObject calls as a result. =)
During an interprocessor interrupt triggered by KeIpiGenericCall, I’d like to synchronize the concurrent execution of all processors. Specifically, I’d like every processor except one to wait (at IPI_LEVEL) while a piece of code runs on that one blessed processor. Once the blessed processor reaches the barrier, all processors are released from it and run a second piece of code at IPI_LEVEL before returning. Then a second barrier, the implicit one from KeIpiGenericCall itself, takes effect before everything resumes where it left off at its previous IRQL.
How could I implement such a synchronization barrier at one of the highest IRQLs there are in NT? The best I could come up with from a naive reading of the documentation was this:
1. Initialize a spin lock.
2. Set a volatile variable A to 0.
3. Set a volatile variable B to -1.
4. KeRaiseIrqlToDpcLevel.
5. Acquire that spin lock with a KLOCK_QUEUE_HANDLE on the stack and KeAcquireInStackQueuedSpinLockAtDpcLevel (more in a bit on why I don’t just delete step 4 and use KeAcquireInStackQueuedSpinLock).
6. KeIpiGenericCall; the lettered steps below run inside the callback.
6a. If B is -1, set B to KeQueryActiveProcessorCountEx(ALL_PROCESSOR_GROUPS) using InterlockedCompareExchange. Further uses of B refer to the “winner”'s B, but under the assumption that a CPU can’t be added in the meantime, all B’s should be the same.
6b. C = InterlockedIncrement(A).
6c. If C < B, KeAcquireInStackQueuedSpinLockAtDpcLevel on the same spin lock using a new on-stack KLOCK_QUEUE_HANDLE. Flush the instruction cache for the patched memory range (from step 6d). Release the spin lock with KeReleaseInStackQueuedSpinLockFromDpcLevel.
6d. If C == B, patch code memory. KeReleaseInStackQueuedSpinLockFromDpcLevel, but pass the KLOCK_QUEUE_HANDLE from the original function (step 5)!!
6e. If C > B, bugcheck.
6f. Return.
7. KfLowerIrql.
The weird part above is that the CPU unlocking the spin lock is quite possibly a different CPU than the one that locked it. It’s even referencing another thread’s stack, although that other thread is guaranteed to exist and to not have returned from that function yet. Step 6d sets off a chain of releases that frees all the CPUs.
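To make the intent of the callback concrete, here is a rough user-mode model of steps 6b through 6d. It is only a sketch and not kernel code: pthreads stand in for processors, a plain int stands in for the patched code memory, and an atomic `released` flag stands in for the queued spin lock that the last arriver releases. All the names here (`ipi_model`, `ipi_callback`, `run_model`, `MAX_CPUS`) are made up for illustration, B is fixed up front rather than raced for as in 6a, and the 6e bugcheck is not modeled.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CPUS 64  /* hypothetical cap for this model only */

typedef struct {
    atomic_int  arrived;   /* variable A: processors that entered the callback */
    int         total;     /* variable B: fixed before any thread starts */
    atomic_bool released;  /* ASSUMPTION: flag standing in for the spin lock */
    int         patched;   /* stand-in for the patched code memory */
    atomic_int  patchers;  /* how many threads ran the patch step */
    atomic_int  saw_patch; /* how many threads saw the patch after release */
} ipi_model;

static void *ipi_callback(void *arg)
{
    ipi_model *m = arg;
    int c = atomic_fetch_add(&m->arrived, 1) + 1;        /* step 6b */

    if (c == m->total) {                                 /* step 6d: last one in */
        m->patched = 1;                                  /* "patch code memory" */
        atomic_fetch_add(&m->patchers, 1);
        /* Release ordering publishes the patch to every waiter. */
        atomic_store_explicit(&m->released, true, memory_order_release);
    } else {                                             /* step 6c: wait */
        while (!atomic_load_explicit(&m->released, memory_order_acquire))
            sched_yield();  /* user-mode courtesy; a real IPI callback would just spin */
    }

    /* Second phase, run by everyone: stand-in for the instruction cache flush. */
    if (m->patched == 1)
        atomic_fetch_add(&m->saw_patch, 1);
    return NULL;
}

/* Returns how many "processors" saw the patch, or -1 on error. */
int run_model(int ncpus)
{
    if (ncpus < 1 || ncpus > MAX_CPUS)
        return -1;

    ipi_model m;
    m.total = ncpus;
    m.patched = 0;
    atomic_init(&m.arrived, 0);
    atomic_init(&m.released, false);
    atomic_init(&m.patchers, 0);
    atomic_init(&m.saw_patch, 0);

    pthread_t t[MAX_CPUS];
    int started = 0;
    while (started < ncpus &&
           pthread_create(&t[started], NULL, ipi_callback, &m) == 0)
        started++;

    if (started < ncpus)                 /* could not start everyone: */
        atomic_store(&m.released, true); /* free the waiters so join returns */
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);

    if (started < ncpus || atomic_load(&m.patchers) != 1)
        return -1;
    return atomic_load(&m.saw_patch);
}
```

Calling `run_model(4)` should report that exactly one thread did the patch and all four saw it afterwards. What the model can’t tell me is whether handing a queued spin lock release over to a different CPU is legal, which is really the heart of my question.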
This all seems like a terrible hack, even more so than the runtime code generation in kernel mode that I’m already doing.
It’d be nice if, instead of that mess above, I could schedule a DPC on each processor while at IPI_LEVEL so that when exiting IPI_LEVEL a routine of mine could run at SYNCH_LEVEL, which is also quite a high level. Alas, I don’t see a way in the documentation to schedule a DPC to run at anything but DISPATCH_LEVEL (probably why it’s called “dispatch level”…). The routine I’d like to run is called KeSweepIcacheRange, an export of ntoskrnl.exe. Because I can’t run that routine (going IPI_LEVEL -> DISPATCH_LEVEL -> SYNCH_LEVEL leaves a hazard while at DISPATCH_LEVEL), I have to do the instruction cache flush using my own code running at IPI_LEVEL, and it needs to be done on every processor in order for the cache flush to be safe.
Yes, I know that x86 CPUs don’t need instruction cache flushing in the normal sense. But I never said that my driver is for x86-32/64. =^_^= This, of course, means that my driver only has to work on Windows 8.0 through 8.1 SP1 in terms of what kernel APIs I can use.
Unlike many other people who write hacky tools that modify systems in undocumented ways, I actually care about following standards, and preserving system stability and security. I might be doing undocumented, hacky things, but I don’t want to make machines unstable. If I wanted that, I’d go work for nProtect or Kaspersky.
Melissa