Synchronization barriers during an IPI

I’m very, very new to NT kernel programming, though definitely not to NT in general, so there are many things I don’t know. Like, until this week, I didn’t know that OBJ_KERNEL_HANDLE meant that a handle was also valid in every process context, in addition to being accessible only from kernel mode. My kernel-mode log file I/O code was a lovely mess of ZwDuplicateObject calls as a result. =)

During an interprocessor interrupt triggered by KeIpiGenericCall, I’d like to synchronize the concurrent execution of all processors. Specifically, I’d like all processors except one to wait (at IPI_LEVEL) while a piece of code runs on that one blessed processor. Once the blessed processor reaches the barrier, all processors are released from the barrier and run a second piece of code at IPI_LEVEL before returning. Then a second barrier, the implicit one from KeIpiGenericCall itself, takes effect before everything resumes where it left off at the respective previous IRQLs.

How could I implement such a synchronization barrier at one of the highest IRQLs there are in NT? The best I could come up with from a naive reading of the documentation was this:

  1. Initialize a spin lock.
  2. Set a volatile variable A to 0.
  3. Set a volatile variable B to -1.
  4. KeRaiseIrqlToDpcLevel.
  5. Acquire that spin lock with a KLOCK_QUEUE_HANDLE on the stack and KeAcquireInStackQueuedSpinLockAtDpcLevel (more in a bit on why I don’t delete step 4 and use KeAcquireInStackQueuedSpinLock instead).
  6. KeIpiGenericCall; the lettered steps run within the callback.
    6a. If B is -1, set B to KeQueryActiveProcessorCountEx(ALL_PROCESSOR_GROUPS) using InterlockedCompareExchange. Further uses of B refer to the “winner’s” B, but under the assumption that a CPU can’t be added, all B’s should be the same.
    6b. C = InterlockedIncrement(A).
    6c. If C < B, KeAcquireInStackQueuedSpinLockAtDpcLevel of the same spin lock using a new on-stack KLOCK_QUEUE_HANDLE. Flush the instruction cache for the patched memory range (from step 6d). Release the spin lock with KeReleaseInStackQueuedSpinLockFromDpcLevel.
    6d. If C == B, patch code memory. KeReleaseInStackQueuedSpinLockFromDpcLevel, but pass the KLOCK_QUEUE_HANDLE from the original function!!
    6e. If C > B, bugcheck.
    6f. Return.
  7. KfLowerIrql.

The weird part above is that the CPU unlocking the spin lock is quite potentially a different CPU than the one that locked the spin lock. It’s even referencing another thread’s stack, but it’s guaranteed that the other thread exists and has not returned from that function yet. Step 6d sets off a chain of releases that frees all the CPUs.

This all seems like a terrible hack, even more so than the runtime code generation in kernel mode that I’m already doing.

It’d be nice if instead of that mess above I could schedule a DPC on each processor while at IPI_LEVEL so that when exiting IPI_LEVEL a routine of mine could run at SYNCH_LEVEL, which is also quite a high level. Alas, I don’t see a way in the documentation to schedule a DPC to run at anything but DISPATCH_LEVEL (probably why it’s called “dispatch level”…). The routine I’d like to run is called KeSweepIcacheRange, an export of ntoskrnl.exe. Because I can’t run that routine (going IPI_LEVEL -> DISPATCH_LEVEL -> SYNCH_LEVEL leaves a hazard while at DISPATCH_LEVEL), I have to do the instruction cache flush using my own code running at IPI_LEVEL, and it needs to be done on every processor in order for the cache flush to be safe.

Yes, I know that x86 CPUs don’t need instruction cache flushing in the normal sense. But I never said that my driver is for x86-32/64. =^_^= This, of course, means that my driver only has to work on Windows 8.0 through 8.1 SP1 in terms of what kernel APIs I can use.

Unlike many other people who write hacky tools that modify systems in undocumented ways, I actually care about following standards, and preserving system stability and security. I might be doing undocumented, hacky things, but I don’t want to make machines unstable. If I wanted that, I’d go work for nProtect or Kaspersky.

Melissa

Are you trying to win the “Idiot of the decade” contest? I’m really sorry to disappoint you, but my vote is for a guy who flooded NTDEV with “…don’t send me any more replies. every patience has its limits” posts (believe it or not, he was doing it for a few weeks). However, I can assure you that you have every chance of getting into the top 10 or even the top 5. To be honest, I haven’t seen so much nonsense in a single post for quite a while…

Anton Bassov

Hello myriachan@

If I understand correctly, you need to execute a two-phase operation
in your IPI callback. Phase1 is to be run on a single processor only,
and Phase2 is to be run on all processors after Phase1 has finished.
If this is the case, then you basically don’t need to use spinlocks or
count the processors. Just use interlocked operations to simulate
a condition variable, as in the source below:

#include <ntddk.h>

LONG phase1_counter = 0;
LONG phase2_counter = 0;

void phase1()
{
    InterlockedIncrement(&phase1_counter);
}

void phase2()
{
    InterlockedIncrement(&phase2_counter);
}

LONG const phase1_barrier_is_free = 0;
LONG const phase1_barrier_is_acquired = 1;
LONG volatile phase1_barrier = phase1_barrier_is_free;
BOOLEAN volatile phase1_has_been_done = FALSE;

ULONG_PTR ipi_callback(ULONG_PTR arg)
{
    UNREFERENCED_PARAMETER(arg);

    // for IPI-callback
    ASSERT ( KeGetCurrentIrql() == IPI_LEVEL );

    // Acquire the phase1 barrier. InterlockedCompareExchange returns the
    // previous value, so the barrier is ours only when that value was "free".
    while
    (
        InterlockedCompareExchange
        (
            &phase1_barrier
        ,   phase1_barrier_is_acquired
        ,   phase1_barrier_is_free
        ) != phase1_barrier_is_free
    )
    {}

    // Only the first processor through the barrier runs phase1.
    if ( ! phase1_has_been_done )
    {
        phase1();
        phase1_has_been_done = TRUE;
    }

    InterlockedExchange // release the phase1 barrier
    (
        &phase1_barrier
    ,   phase1_barrier_is_free
    );

    // Every processor gets here only after phase1 has completed.
    phase2();
    return STATUS_SUCCESS;
}

void DriverUnload(DRIVER_OBJECT* pDriverObject)
{
    UNREFERENCED_PARAMETER(pDriverObject);
}

extern "C" NTSTATUS DriverEntry
(
    DRIVER_OBJECT* pDriverObject
,   UNICODE_STRING* pRegistryPath
)
{
    UNREFERENCED_PARAMETER(pRegistryPath);

    pDriverObject->DriverUnload = DriverUnload;
    KeIpiGenericCall(ipi_callback, 0);
    DbgPrint("ipi phase1_counter = %d\n", phase1_counter);
    DbgPrint("ipi phase2_counter = %d\n", phase2_counter);
    return STATUS_SUCCESS;
}
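
One detail worth checking in an acquire loop like the one above: InterlockedCompareExchange returns the value the destination held before the call, so the acquire has succeeded only when that returned value is the "free" one. The following is a minimal user-mode sketch of that rule, using C11 atomics as a stand-in for the kernel intrinsic. The helper names (compare_exchange_returning_old, barrier_try_acquire, barrier_release) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

enum { BARRIER_FREE = 0, BARRIER_ACQUIRED = 1 };

/* Hypothetical stand-in for InterlockedCompareExchange: stores `exchange`
 * only if *dest equals `comparand`, and always returns the value that
 * *dest held before the call. */
long compare_exchange_returning_old(atomic_long *dest, long exchange,
                                    long comparand)
{
    long expected = comparand;
    /* On failure, C11 writes the current value of *dest into `expected`;
     * on success, `expected` still equals the old value (the comparand). */
    atomic_compare_exchange_strong(dest, &expected, exchange);
    return expected;
}

/* One acquire attempt: it succeeded only if the old value was "free". */
int barrier_try_acquire(atomic_long *barrier)
{
    return compare_exchange_returning_old(barrier, BARRIER_ACQUIRED,
                                          BARRIER_FREE) == BARRIER_FREE;
}

/* Release: hand the barrier back by storing "free". */
void barrier_release(atomic_long *barrier)
{
    atomic_store(barrier, BARRIER_FREE);
}

/* Spin until the barrier is ours, as the IPI callback would. */
void barrier_acquire(atomic_long *barrier)
{
    while (!barrier_try_acquire(barrier)) {
        /* busy-wait */
    }
}
```

A loop that instead exits when the returned value equals "acquired" would let a processor through precisely when some other processor holds the barrier.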

Don’t be such an asshole, Anton.

I’ve been enjoying Ms. Myria’s posts over the past couple of weeks. They’re clear, respectful, and well thought out for her level of experience… And she’s even tried to help people with a few answers. All things your posts generally lack.

So, please. Be helpful or STFU.

Mr. Kt133a: looks good to me…

Peter
OSR
@OSRDrivers

You should avoid spinlocks. Lockless algorithms for synchronizing
processors (“processor corrals”) are more or less documented here and
elsewhere.
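
As a rough illustration of the corral idea, here is a user-mode sketch in which pthreads stand in for processors running an IPI callback. It is an assumption about what such a scheme could look like, not a documented kernel mechanism: every "processor" bumps an arrival counter, the last one to arrive runs phase 1 while the others spin on a release flag, and then all of them run phase 2. All names (corral_t, corral_callback, run_corral, MAX_CPUS) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_CPUS 64   /* arbitrary cap for this simulation */

typedef struct {
    int        cpu_count;     /* number of simulated processors            */
    atomic_int arrived;       /* how many have entered the corral          */
    atomic_int released;      /* set to 1 once phase 1 has finished        */
    atomic_int phase1_runs;   /* times phase 1 executed                    */
    atomic_int phase2_runs;   /* times phase 2 executed                    */
    atomic_int early_phase2;  /* phase 2 entries that did not see phase 1  */
} corral_t;

typedef struct {
    int phase1_runs;
    int phase2_runs;
    int early_phase2;
} corral_result_t;

static void *corral_callback(void *arg)
{
    corral_t *c = arg;

    if (atomic_fetch_add(&c->arrived, 1) + 1 == c->cpu_count) {
        /* Last to arrive: everyone else is already parked in the loop below. */
        atomic_fetch_add(&c->phase1_runs, 1);   /* phase 1, e.g. the patch */
        atomic_store(&c->released, 1);          /* open the corral */
    } else {
        while (!atomic_load(&c->released))
            sched_yield();   /* user-mode courtesy; kernel code would busy-wait */
    }

    /* Phase 2 runs on every simulated processor after the gate opens. */
    if (atomic_load(&c->phase1_runs) != 1)
        atomic_fetch_add(&c->early_phase2, 1);
    atomic_fetch_add(&c->phase2_runs, 1);
    return NULL;
}

corral_result_t run_corral(int cpu_count)
{
    pthread_t threads[MAX_CPUS];
    corral_t c;
    corral_result_t r;
    int i;

    assert(cpu_count >= 1 && cpu_count <= MAX_CPUS);
    c.cpu_count = cpu_count;
    atomic_init(&c.arrived, 0);
    atomic_init(&c.released, 0);
    atomic_init(&c.phase1_runs, 0);
    atomic_init(&c.phase2_runs, 0);
    atomic_init(&c.early_phase2, 0);

    for (i = 0; i < cpu_count; i++) {
        int rc = pthread_create(&threads[i], NULL, corral_callback, &c);
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < cpu_count; i++)
        pthread_join(threads[i], NULL);

    r.phase1_runs  = atomic_load(&c.phase1_runs);
    r.phase2_runs  = atomic_load(&c.phase2_runs);
    r.early_phase2 = atomic_load(&c.early_phase2);
    return r;
}
```

Choosing the last arrival as the one that runs phase 1 means that, by construction, every other participant is already spinning in the corral while phase 1 executes. Calling run_corral(4) should report one phase 1 run, four phase 2 runs, and no early phase 2 entries.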

Mark Roddy

On Thu, May 8, 2014 at 2:44 AM, wrote:

> —
> NTDEV is sponsored by OSR
>
> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
>
> OSR is HIRING!! See http://www.osr.com/careers
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer
>

What problem are you trying to solve by using an IPI and having all the other processors spin while the chosen one is doing something?

> Don’t be such an asshole, Anton.

Well, apparently, I am, indeed, an arsehole (actually, you can even refer to me with a C-word if you wish; I don’t really mind it), but I prefer to call things by their proper names, even if it sounds harsh and unpleasant to some. Let’s look at a few excerpts from the OP’s post, and you will (hopefully) realize that I had a point.

Let’s start from the most obvious one

Patching kernel code plus deliberately bugchecking whenever you encounter something that your hack is not ready to deal with seem to be just wonderful practices, don’t you think? Now look below:

When I saw the combination of the above 2 statements in a single post, I just could not help myself.

> The weird part above is that the CPU unlocking the spin lock is quite potentially a different CPU than
> the one that locked the spin lock.

What a “profound understanding” of the spinlock semantics, as well as of general kernel-level concepts
(i.e. the things that are OS-agnostic), don’t you think…

> I’ve been enjoying Ms. Myria’s posts over the past couple of weeks.

Me too…

However, the guy who flooded NTDEV with “…don’t send me any more replies. every patience has its limits” posts was even more entertaining, at least in my book…

> So, please. Be helpful or STFU.

Actually, I tried my best to be helpful: I helped the OP to realize that ABSOLUTELY everything that he/she/it had said in the original post was just utterly nonsensical. The OP’s objective is nonsensical in itself, and the means by which he/she/it tries to achieve it are even more nonsensical, because he/she/it obviously lacks a fundamental understanding of how OSes and kernels work. This is not a situation where you explain to posters what their errors are. Doing so would be equivalent to explaining differential calculus to someone who has obviously yet to master the concepts of multiplication and division…

Anton Bassov

Since you are new to kernel programming, it might be good for you to
explain the problem you think this will solve. There may be much easier
ways to accomplish your goal.

Once a spin lock is acquired, the thread is raised to DPC level. This
means that it cannot be rescheduled, and therefore cannot be run on
another core. So your concern is misplaced.

There is a reason that there is no primitive to do this easily: it
probably doesn’t need to be done. So you need to explain what goal you
are trying to achieve.
Joe

OK, Anton (and everybody else reading this) let me help you understand.

When somebody starts their post with something like the following:

It is neither helpful nor acceptable to answer that post with:

no matter how nonsensical the content of the post. I don’t care if their post contains an excerpt from Ulysses.

Rather, you can choose to say something to the effect of: “Wow, that algorithm you posted REALLY doesn’t make any sense… what are you trying to accomplish?” (See Dr. Newcomer’s post… helpful), provide an example algorithm that you think might meet the OP’s requirements like Mr. Kt133a (potentially very helpful), or… (here’s an idea) IGNORE THE POST (at least not negatively helpful).

Telling the OP that they’re an idiot doesn’t help. They started their post TELLING you they know that they don’t know anything about this topic. So saying “I haven’t seen so much nonsense in a single post for quite a while” isn’t doing anyone any good (well, beyond obliquely telling the OP that you think they’re seriously on the wrong track).

The point isn’t whether the OP’s algorithm was worthy of the Nobel Prize or the bin. It’s about at least TRYING to be helpful in a positive way. Or ignoring the post altogether.

Peter
OSR
@OSRDrivers

> Telling the OP that they’re an idiot doesn’t help.

Please note that I did not say the OP was an idiot. What I said was “Are you trying to win ‘Idiot of the decade’ contest”, because normally it is not so easy to imagine so many silly statements in a single post (particularly the one about deliberately bugchecking if you are not ready to deal with the situation), unless these statements were made on purpose…

> Rather, you can choose to say something to the effect of: “Wow, that algorithm you posted REALLY
> doesn’t make any sense… what are you trying to accomplish?”

Well, if it were only a question of the algorithm…

I don’t know about you, but I had a weird feeling when I saw this post that the OP had tried to pack as many absurd statements into a single post as possible…

> The point isn’t whether the OP’s algorithm was worthy of the Nobel Prize or the bin.

BTW, I would rephrase it as “it does not matter if the OP’s algorithm was worthy of the Nobel Prize or the Dan Kyler one” (you remember his “only proper implementation of a spinlock in existence”, don’t you?)…

Anton Bassov

Oh yes… Now, THERE’s a point on which we can agree.

Peter
OSR
@OSRDrivers

By the way, a slightly off-topic question has arisen:
which stack is used when IPI callbacks execute?
Is it a per-processor preallocated dedicated IPI stack (such as for DPCs), or is it the kernel stack of an arbitrary thread interrupted by the IPI? Or maybe something else?

When I last looked, it was the kernel stack of the arbitrary thread.

Given that Windows 8.1 apparently handles interrupts on a dedicated interrupt stack (see http://www.osr.com/2014/04/18/kx-headers-windows-8-1-wdk/) this may no longer be the case.

Peter
OSR
@OSRDrivers

> Is it a per-processor preallocated dedicated IPI stack (such as for DPCs), or is it the kernel stack
> of an arbitrary thread interrupted by IPI?

I think the latter applies to all hardware interrupts, including IPIs. The reason why DPCs now have their own stack
(IIRC, it did not always work this way) is that a DPC call chain may consume quite a bit of stack space.
For example, consider the situation where a NIC miniport indicates incoming data to NDIS, which calls the bound protocol’s Receive handler, which in turn calls NdisSendPackets(), which in turn calls MiniportSend().

In contrast, interrupt handlers are supposed to queue a DPC and return as quickly as possible; all actual work is supposed to be done in the DPC routine. Therefore, you can allow them to hijack the kernel stack of the currently running thread (or the DPC stack)…

Anton Bassov

> Oh yes… Now, THERE’s a point on which we can agree.

OK, then I would like to amend my reply to the OP. It should read as follows: “Are you trying to win the Dan Kyler prize?”

Anton Bassov

On the processor on which KeIpiGenericCall is executed, the callback is called on the same stack.

0: kd> k
Child-SP RetAddr Call Site
ffffd000224af898 fffff8032f39d12e foo!IPICallback
ffffd000224af8a0 fffff80056f38dd8 nt!KeIpiGenericCall+0xb6
ffffd000224af8f0 fffff80056f39068 foo!GenerateAnIpi+0x58

1: kd> k
Child-SP RetAddr Call Site
ffffd001c0165f48 fffff8032f3714a2 foo!IPICallback
ffffd001c0165f50 fffff8032f3e152c nt!KiIpiProcessRequests+0x1b2
ffffd001c0165fb0 fffff8032f3e12ff nt!KiIpiInterruptSubDispatch+0x7c
ffffd001c0157ad0 fffff8032f3deac2 nt!KiIpiInterrupt+0xff
ffffd001c0157c60 0000000000000000 nt!KiIdleLoop+0x32

(Windows 8.1, x64)

>KeSweepIcacheRange

Looking at the header of this function, it offers an AllProcessors
parameter which, when enabled, already does the IPI work by itself.
However, it is not documented, which makes it a bad idea to use.

It looks like the OP wants to flush the instruction cache, which is a
necessary thing to do, for instance, when one dynamically updates code, as
a compiler does. The normal way to do this is to call FlushInstructionCache
from user mode.
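
For ordinary user-mode code generation, a minimal sketch of that call might look like the following. The wrapper names flush_icache_range and write_code_and_flush are hypothetical. The non-Windows branch uses the GCC/Clang builtin __builtin___clear_cache purely so the sketch builds elsewhere; it is a substitute, not something from this thread. None of this applies to kernel addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical wrapper: flush the instruction cache for [start, start+length)
 * in the current process. Returns nonzero on success. */
int flush_icache_range(void *start, size_t length)
{
#ifdef _WIN32
    /* Documented Win32 call; operates on user-mode addresses of a process. */
    return FlushInstructionCache(GetCurrentProcess(), start, length) != 0;
#else
    /* GCC/Clang builtin used here only as a portable stand-in. */
    __builtin___clear_cache((char *)start, (char *)start + length);
    return 1;
#endif
}

/* Hypothetical helper: copy freshly generated code bytes into place, then
 * flush so that no stale instructions are fetched from the old contents. */
int write_code_and_flush(void *dest, const void *code, size_t length)
{
    assert(dest != NULL && code != NULL);
    memcpy(dest, code, length);
    return flush_icache_range(dest, length);
}
```

The ordering is the point of the sketch: the write happens first and the flush second, over the same address range.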

//Daniel

Thank you for the replies about the IPI stack. Things are clear enough now.

And it seems very good that Windows has finally been given a dedicated interrupt stack. That should eliminate a number of sudden, horrible bugchecks caused by stack overflows at interrupt time.

I don’t have a good way to do proper quoting right now, I’m sorry.

The reason for all this is that I want to ensure that no other core is executing the code I’m patching while I patch it. As for why not FlushInstructionCache, I think it wouldn’t appreciate kernel addresses very much.

Another idea I had is to schedule DPCs for every processor and run the barriers at dispatch level. Once that’s done, I could call KeSweepIcacheRange with all processors so I don’t have to write the ARM assembly code myself.

I’ve already dealt with the sleeping giant whose initials are P and G.

Melissa

> The reason for all this is that I want to ensure that no other core is executing the code
> I’m patching while I patch it.

And why do you think you need to patch the kernel code in the first place??? Are you writing a piece of malware, or just trying to win the Dan Kyler prize?

> Another idea I had is to schedule DPCs for every processor and run the barriers at dispatch level.

I somehow presume you never had the idea of simply giving it up, right…

Anton Bassov