I'm making a logging function that writes into a circular buffer in non-paged memory. The catch is that this function may run at any IRQL, from any code, including (theoretically) ISRs. I need a spinlock to protect my shared circular buffer. I couldn't find an OS-provided spinlock that could run at any IRQL, so I needed to write my own.
Can anyone check my logic here?
(I need this for an ARM64 CPU, but I guess it should be similar for x64 as well - except for the types of memory barriers needed.)
The key flaw in your plan is that YieldProcessor is a user-mode API. There are no kernel APIs that yield the processor at arbitrary IRQLs. If there were, there would already be spinlocks.
There is, however, an InterlockedAdd intrinsic that works in kernel. It's certainly possible to write zero-lock circular buffer code with that.
Hmm, YieldProcessor actually translates to the yield instruction on ARM, or the pause instruction on x64. Both should work fine at any EL (on ARM), or any ring on x64.
Other than that, Tim, can you show what you mean about writing the circular buffer with the InterlockedAdd intrinsic?
YieldProcessor is an alias for the _mm_pause or __yield intrinsic on most platforms. On x64 that's the PAUSE instruction (which is encoded as REP NOP). It tells the CPU that it is in a busy wait, so that it doesn't try to escalate power levels etc. because of the busy looping.
But I'm not sure how that translates to other platforms. ARM in particular is one I have not worked on and am therefore much less confident in.
I have not attempted to check the OP's logic. Common wisdom is not to try to implement a lock like this. I have done that myself multiple times, but I don't think I would attempt to do so again.
All you need for a circular buffer log is to be able to advance the pointer atomically. Once you have the advanced pointer value, you can compute where your newly reserved region started by subtracting the length, even if someone else has advanced it after you.
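The reservation step described above can be sketched in portable C11 as follows. This is only an illustration under stated assumptions: the names g_log_buf, g_log_head, LOG_BUF_SIZE and log_write are hypothetical, the buffer size is assumed to be a power of two, and atomic_fetch_add_explicit stands in for the kernel's InterlockedAdd. It does nothing about a writer being lapped by later writers before it finishes copying.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define LOG_BUF_SIZE 4096u            /* hypothetical size; must be a power of two */

static uint8_t g_log_buf[LOG_BUF_SIZE];
static _Atomic uint32_t g_log_head;   /* running count of bytes ever reserved */

/* Reserve len bytes with one atomic add, then copy msg into the reserved
   region, wrapping at the end of the buffer. Returns the start offset of
   the record, or UINT32_MAX if len is 0 or larger than the buffer. */
static uint32_t log_write(const void *msg, uint32_t len)
{
    if (len == 0 || len > LOG_BUF_SIZE)
        return UINT32_MAX;

    /* "end" is the advanced pointer value, as InterlockedAdd would return it. */
    uint32_t end = atomic_fetch_add_explicit(&g_log_head, len,
                                             memory_order_relaxed) + len;

    /* Subtract the length to find where this writer's region started,
       no matter how far other writers have advanced the head since. */
    uint32_t start = (end - len) & (LOG_BUF_SIZE - 1u);

    uint32_t first = LOG_BUF_SIZE - start;   /* bytes available before the wrap */
    if (first > len)
        first = len;
    memcpy(&g_log_buf[start], msg, first);
    memcpy(&g_log_buf[0], (const uint8_t *)msg + first, len - first);
    return start;
}
```

Two writers that call log_write at the same time get disjoint regions because each atomic add returns a different head value; nobody spins or waits.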
@Tim_Roberts, but what if there are too many writers that overflow the circular buffer? That will corrupt the data in it w/o a lock. Am I missing something?
Why not use KeAcquireSpinLockAtDpcLevel? Despite the name, it doesn't require the calling IRQL to be DISPATCH_LEVEL. So you raise to HIGH_LEVEL, call the API, and leverage a tried and true OS spin lock.
If the circular buffer overflows, you're screwed in any case. You just have to decide what level of screwage you will tolerate. You absolutely cannot delay at a raised IRQL, so you're either going to ignore/discard the new messages, or overwrite the oldest messages. Remember that the kernel debug log is a circular buffer. They handle this.
But be realistic about it. If the circular buffer is large enough, what's the real likelihood of that happening? Just do something sensible and make sure you don't crash.
@Mark_Roddy thanks. Those lockless ring buffers seem too complicated. There shouldn't be any high concurrency with logging functions, so my spinlock won't waste too many CPU cycles.
I need to rethink how I save my IRQL in lck though, as that will interfere with the load-link/store-conditional sequence that the ARM64 atomic instructions use for the lock itself. Ideally I shouldn't be writing anything near the cache line that holds the spinlock bit (the &lck address) used by the _interlockedbittestandset_acq and _interlockedbittestandreset_rel intrinsics.
This is exactly the reason why I'm posting this question here - to ask people if I did something wrong in my code above. I appreciate everyone's suggestions, but at this point I want to learn how to write those spinlocks. Is it too much to ask?
It is easy to write an unoptimized version. A production quality version is quite hard. The first thing to research is memory barriers (fences), because locks protect access to other memory. There is also no reason to touch the lock variable twice in the acquire code (just use another variable), and the single bit version is usually costlier than exchanging the whole value. You will also want some estimate of the CPU frequency, so you can decide how many loop iterations indicate a deadlock, at which point you should crash. And in at least a debug build you will want to track the thread that acquires the lock and check for recursive acquisition (and crash) or release from the wrong context (and crash).
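To make those points concrete, here is a minimal sketch of the unoptimized version in portable C11: a whole-word exchange with acquire ordering, a store with release ordering, a spin limit, and an owner field for the debug checks. Everything named here (simple_lock, SPIN_LIMIT, the self id) is hypothetical, the limit is an arbitrary number rather than one derived from CPU frequency, and the functions return false where a real driver would bugcheck.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPIN_LIMIT 1000000u    /* hypothetical; real code would scale this by CPU frequency */

typedef struct {
    _Atomic uint32_t  locked;  /* 0 = free, 1 = held; exchanged as a whole word */
    _Atomic uintptr_t owner;   /* debug aid: nonzero id of the holder, 0 if none */
} simple_lock;

/* Returns true once the lock is held. Returns false on recursive
   acquisition or when the spin limit is hit (a driver would crash here). */
static bool simple_lock_acquire(simple_lock *l, uintptr_t self)
{
    if (atomic_load_explicit(&l->owner, memory_order_relaxed) == self)
        return false;                       /* recursive acquisition */

    for (uint32_t spins = 0; spins < SPIN_LIMIT; spins++) {
        /* Acquire ordering keeps the protected accesses after the exchange. */
        if (atomic_exchange_explicit(&l->locked, 1u, memory_order_acquire) == 0u) {
            atomic_store_explicit(&l->owner, self, memory_order_relaxed);
            return true;
        }
        /* a kernel build would issue yield / pause here */
    }
    return false;                           /* probable deadlock */
}

/* Returns false if the caller is not the recorded owner. */
static bool simple_lock_release(simple_lock *l, uintptr_t self)
{
    if (atomic_load_explicit(&l->owner, memory_order_relaxed) != self)
        return false;                       /* release from the wrong context */

    atomic_store_explicit(&l->owner, 0, memory_order_relaxed);
    /* Release ordering keeps the protected accesses before the unlock. */
    atomic_store_explicit(&l->locked, 0u, memory_order_release);
    return true;
}
```

The self id must be nonzero and unique per context (a thread pointer or processor number plus one would do in this sketch). The owner field sits next to the lock word here for brevity; keeping it on a separate cache line would avoid extra traffic on the line being spun on.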
From your original post I was under the impression that you were trying to solve a real problem that you thought was not possible to solve using the existing Windows kernel APIs. Obviously I was mistaken.
@Mark_Roddy yes, I was originally not sure if there's an OS provided function for that. Now I learned about KeAcquireSpinLockAtDpcLevel. Which is a badly named API, IMHO, that made me initially believe that it wouldn't work at any IRQL, like I originally asked about.
Btw, I looked into what MSFT had done in KeAcquireSpinLockAtDpcLevel and it seems like they are doing something very similar - except that they also communicate to the HyperV (EL2) code layer from within this spinlock if we go into a long wait (via HvlNotifyLongSpinWait function.) Otherwise they also have a similar loop with a yield instruction.
Here's a quick excerpt in assembly:
; x20 = pointer passed in, or PKSPIN_LOCK
mov x21, #1 ; bit that is set in the spinlock
mov w19, #0 ; counter
lbl_repeat:
ldr w8, [HvlLongSpinCountMask]
add w19, w19, #1 ; counter++
tst w8, w19
beq lbl_call_hv ; if ((counter & HvlLongSpinCountMask) == 0) then notify hv
dmb ISHST ; memory barrier
yield ; stall CPU
; This is where the actual spinlock happens
ldr x8, [x20]
cbnz x8, lbl_repeat ; repeat if spinlock is not 0
swpa x21, x8, [x20] ; swap spinlock with 1
cbnz x8, lbl_repeat ; repeat if the old value was not 0 (someone else took the lock first)
; At this point we acquired the spinlock
The only part that is different is calling Hyper-V. They do some checks via a call to KiCheckVpBackingLongSpinWaitHypercall and then invoke HvlNotifyLongSpinWait.
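For readers who don't read ARM64 assembly, the loop in the excerpt can be rendered roughly as the C11 sketch below. It is not the actual kernel source: ttas_acquire, ttas_release and long_spin_hook are hypothetical names, the hook is just a stand-in for the hypervisor notification, and the dmb/yield pair is only marked by a comment.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the long-spin notification to the hypervisor. */
typedef void (*long_spin_hook)(uint32_t spins);

static void ttas_acquire(_Atomic uint64_t *lock, uint32_t long_spin_mask,
                         long_spin_hook hook)
{
    uint32_t spins = 0;                               /* mov w19, #0 */

    for (;;) {
        spins++;                                      /* add w19, w19, #1 */
        if ((spins & long_spin_mask) == 0 && hook != NULL)
            hook(spins);                              /* beq lbl_call_hv */

        /* dmb ISHST + yield would go here in the kernel loop */

        /* ldr + cbnz: plain read first, so waiters spin on a shared cache line */
        if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            continue;

        /* swpa + cbnz: swap in 1 with acquire ordering, retry if we lost the race */
        if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return;                                   /* lock acquired */
    }
}

static void ttas_release(_Atomic uint64_t *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The plain read before the swap is the usual test-and-test-and-set shape: the atomic write is attempted only when the lock looks free.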
@Doron_Holan, I wonder if you have access to anyone who owns that specific implementation and can ask them about the purpose of those HyperV calls?
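@MBond2 btw, if you notice, in the assembly code above they actually "touch" the spinlock twice: a plain load first, then the atomic swap only if the lock looked free. I first guessed this was because ARM64 doesn't have a compare-and-exchange instruction, but it does (CAS comes with the same ARMv8.1 atomics extension as swpa), so this looks more like the usual test-and-test-and-set pattern.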
I don't work at Microsoft anymore. My educated guess is that this is part of the HyperV enlightened kernel functionality to allow Windows as a VM to play nicely with the hypervisor.