InStackQueuedSpinLock vs. reader/writer spin lock: which is faster?

tanda996 · November 20, 2020, 4:22am

I’m developing a WFP callout driver, and most of the work happens at DISPATCH_LEVEL (in the callout routine). There are many spin lock variants, and I’m confused about which one to use.
I know they are designed for different purposes, but if both are used under mostly READ contention, which one will run faster? Thanks.

Tim_Roberts · November 20, 2020, 4:49am

Speed is not really the issue for locking; locks are very rarely a bottleneck. You choose the lock that fits your need and your restrictions.

Jan_Bottorff · November 20, 2020, 6:24am

Something not so apparent about read/write spin locks is that if you have a lot of cores, you can get lock contention on RELEASING the read lock, because the release does an interlocked decrement of the reader count. With a simple spin lock, if you own the lock, releasing it is a simple write, so you can never have release contention. I had to do some head scratching the first time I saw a performance profile that showed many cores spending significant time in the read/write lock release function.
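To make the difference between the two release paths concrete, here is a minimal user-mode sketch in C11 atomics. It is not the Windows implementation: the types `simple_lock` and `rw_lock` and all function names are hypothetical, and the reader/writer encoding (-1 for a writer, n for n readers) is just an assumption chosen to keep the example short.

```c
#include <stdatomic.h>

/* Hypothetical exclusive spin lock: 0 = free, 1 = held. */
typedef struct { atomic_int held; } simple_lock;

/* Hypothetical reader/writer spin lock: -1 = writer, n >= 0 = n readers. */
typedef struct { atomic_int state; } rw_lock;

static void simple_acquire(simple_lock *l) {
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(&l->held, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;  /* a failed CAS stores the observed value; reset and retry */
    }
}

static void simple_release(simple_lock *l) {
    /* Only the owner gets here, so a plain store is enough: no read-modify-write. */
    atomic_store_explicit(&l->held, 0, memory_order_release);
}

static void read_acquire(rw_lock *l) {
    int s = atomic_load_explicit(&l->state, memory_order_relaxed);
    for (;;) {
        if (s < 0) {  /* a writer holds the lock: reload and keep spinning */
            s = atomic_load_explicit(&l->state, memory_order_relaxed);
            continue;
        }
        if (atomic_compare_exchange_weak_explicit(&l->state, &s, s + 1,
                memory_order_acquire, memory_order_relaxed)) {
            return;
        }
    }
}

static void read_release(rw_lock *l) {
    /* Every reader must do an atomic read-modify-write on the shared count,
       so readers on different cores fight over the cache line even on release. */
    atomic_fetch_sub_explicit(&l->state, 1, memory_order_release);
}

static void write_acquire(rw_lock *l) {
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(&l->state, &expected, -1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;
    }
}

static void write_release(rw_lock *l) {
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

The point to notice is `read_release`: it is the only release function here that needs an atomic read-modify-write, which is where the release-side contention comes from.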

Lock performance degrades badly as the cores get further away from each other. A few years ago on a Xeon I measured the time required to do an interlocked increment at a dozen or two clock cycles when on the same core. That degraded 10x, to a hundred or two clocks, when competing with cores on the same chip (L1/L2 cache thrashes), and degraded another 10x when competing with cores on a different socket/NUMA node (inter-core cache line ownership transfer has to go across an inter-processor bus, which is way slower than on-chip buses). If it takes on average 1000 cycles, that’s about 330 ns, or only 3M operations/sec (at 3 GHz). This becomes really problematic if you want to get 5M IOPS through a single queue. Data structures that are NUMA aware can perform better on typical servers. Some many-core processors look like multiple NUMA nodes in a single socket, so the latency can degrade more than expected for a single physical socket.
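The arithmetic behind those numbers can be written out as a small sketch. The function names are hypothetical, and the model assumes every operation is serialized through one contended cache line, so one operation costs the full cycle count.

```c
/* Time for one operation, in nanoseconds, given its cost in cycles. */
static double ns_per_op(double cycles_per_op, double clock_ghz) {
    return cycles_per_op / clock_ghz;            /* 1000 cycles at 3 GHz: ~333 ns */
}

/* Upper bound on operations per second through one serialized location. */
static double ops_per_sec(double cycles_per_op, double clock_ghz) {
    return clock_ghz * 1e9 / cycles_per_op;      /* 1000 cycles at 3 GHz: 3M ops/sec */
}

/* Cycles available per operation if a target rate must be sustained. */
static double cycle_budget(double target_ops_per_sec, double clock_ghz) {
    return clock_ghz * 1e9 / target_ops_per_sec; /* 5M IOPS at 3 GHz: 600 cycles */
}
```

Under that assumption, 5M IOPS through a single queue leaves a budget of 600 cycles per operation, which a single 1000-cycle cross-socket interlocked operation already exceeds.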

Also note that the application architecture matters a lot too. For example, if you have 64 threads running on 64 cores doing I/Os in the same process, there is an OS process address space lock that has to be acquired/released when an I/O locks memory pages to do DMA. Having 64 processes with 1 thread/core each spreads this address space locking across 64 locks. In a disk performance test I did last year using diskspd, 64 threads in a single process could only achieve about 1M IOPS. When I made the test run 64 diskspd processes, each with 1 thread/core, I/O performance went up to over 3M IOPS, with no difference in kernel code or hardware configuration.

On ARM64 systems, not all the read/write locks use the ARMv8.1-A atomic instructions, and many currently still use a load exclusive/store exclusive pair of instructions, which can have even more contention on release because of the longer time window during which the memory location needs to stay unchanged. Load exclusive does not prevent another core from writing the location; it just provides a way for the store on the current core to succeed only if no other core has changed the location, and fail otherwise. Load exclusive should really be called load with awareness of change for a specified memory location. Intel has interlocked compare exchange, which can’t detect if some other core has updated a value between the current core doing a read and then a compare exchange. It can only detect if the value is currently the same, not if it has temporarily had other values between the read and the compare exchange. The plus on Intel is that it has had interlocked memory instructions for a long time, so OS code can assume they are present. If you really need a read and then a compare exchange, I like the ARM64 design better, as you can tell if the value has ever changed after the read.
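The compare exchange limitation described above is commonly called the ABA problem. The sketch below simulates it deterministically on one thread, with the "other core" played by two plain stores. The function names are hypothetical, and the second half shows a version counter packed next to the value, which is one common workaround and not something the post prescribes.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns whether the final compare exchange succeeded. */
static bool plain_cas_after_aba(void) {
    atomic_int value;
    atomic_init(&value, 5);

    int seen = atomic_load(&value);   /* this core reads 5 */

    atomic_store(&value, 7);          /* simulated other core: 5 -> 7 */
    atomic_store(&value, 5);          /* simulated other core: 7 -> 5 */

    /* Succeeds: the value compares equal, so the intermediate change is invisible. */
    return atomic_compare_exchange_strong(&value, &seen, 6);
}

/* Workaround sketch: high 32 bits hold a version that every write bumps. */
static uint64_t pack(uint32_t version, uint32_t value) {
    return ((uint64_t)version << 32) | value;
}

static void versioned_store(_Atomic uint64_t *cell, uint32_t value) {
    uint64_t old = atomic_load(cell);
    /* On failure the CAS refreshes 'old', and the new word is recomputed. */
    while (!atomic_compare_exchange_weak(cell, &old,
               pack((uint32_t)(old >> 32) + 1, value))) {
    }
}

static bool versioned_cas_after_aba(void) {
    _Atomic uint64_t cell;
    atomic_init(&cell, pack(0, 5));

    uint64_t seen = atomic_load(&cell);   /* version 0, value 5 */

    versioned_store(&cell, 7);            /* version 1, value 7 */
    versioned_store(&cell, 5);            /* version 2, value 5 */

    /* Fails: the value is 5 again, but the version moved from 0 to 2. */
    return atomic_compare_exchange_strong(&cell, &seen,
               pack((uint32_t)(seen >> 32) + 1, 6));
}
```

Here `plain_cas_after_aba()` returns true and `versioned_cas_after_aba()` returns false. A load exclusive/store exclusive pair gives the second behavior without needing the extra version field.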

For optimal performance, avoid any shared writable memory. RCU (see https://lwn.net/Articles/262464/) is often a good strategy. Locks often work OK if you have a few cores, but if you have 256 cores, locks become a lot less attractive. The problem of getting high performance on many cores is way more complex than which lock will run faster.
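As one small illustration of avoiding shared writable memory (this is not RCU, just a simpler technique in the same spirit), a counter can be split into one slot per core so that writers never touch the same cache line. Everything here is an assumption for the sake of the sketch: the 64-byte line size, the `MAX_CORES` limit, and the names `sharded_counter`, `sharded_add` and `sharded_read`.

```c
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */
#define MAX_CORES  64   /* hypothetical upper bound on cores */

/* One counter per core, each aligned to its own cache line. */
typedef struct {
    _Alignas(CACHE_LINE) atomic_uint_fast64_t value;
} padded_counter;

typedef struct {
    padded_counter slot[MAX_CORES];
} sharded_counter;

/* Hot path: each core only ever writes its own slot. */
static void sharded_add(sharded_counter *c, unsigned core, uint64_t n) {
    atomic_fetch_add_explicit(&c->slot[core % MAX_CORES].value, n,
                              memory_order_relaxed);
}

/* Cold path: a reader sums all slots; the total is only approximate
   while writers are still running. */
static uint64_t sharded_read(sharded_counter *c) {
    uint64_t total = 0;
    for (unsigned i = 0; i < MAX_CORES; i++) {
        total += atomic_load_explicit(&c->slot[i].value, memory_order_relaxed);
    }
    return total;
}
```

The trade-off is that reads become more expensive and slightly stale, which is usually acceptable for statistics but not for data that needs a consistent snapshot.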

Peter_Viscarola_OSR · November 20, 2020, 1:18pm

As usual, Mr. Bottorff has taken the time to provide us with some great wisdom. There are many interesting and important points in his post… any one of which could be expanded to article length.

First, let me point you to what I at least consider the definitive article on synchronization in Windows drivers, written by Mr. Ionescu for The NT Insider some years ago. This should answer any questions you have.

Next, let me try to make things simpler for you: if you need a spin lock, and you need read/write (that is, shared/exclusive) types of access… you need reader/writer spin locks. If you don’t need read/write, in a modern driver you probably want in-stack queued spin locks, because they scale better across multiple CPUs.
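For a rough idea of why a queued lock scales better, here is a user-mode sketch of a classic MCS-style queue lock in C11. It is not the Windows implementation, and all names (`mcs_lock`, `mcs_node`, `mcs_acquire`, `mcs_release`) are hypothetical. The per-caller node, typically a local variable on the caller's stack, plays a role similar to the lock queue handle a driver passes to the in-stack queued spin lock routines.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-waiter node; each waiter spins only on its own 'locked' flag. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

/* The lock itself is just a pointer to the tail of the waiter queue. */
typedef struct {
    _Atomic(mcs_node *) tail;
} mcs_lock;

static void mcs_acquire(mcs_lock *lock, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* Join the queue with a single atomic exchange on the shared tail. */
    mcs_node *prev = atomic_exchange_explicit(&lock->tail, me,
                                              memory_order_acq_rel);
    if (prev == NULL) {
        return;  /* queue was empty: we own the lock */
    }

    /* Link behind the previous waiter, then spin on our own node only. */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    while (atomic_load_explicit(&me->locked, memory_order_acquire)) {
    }
}

static void mcs_release(mcs_lock *lock, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to mark the queue empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(&lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire)) {
            return;
        }
        /* A waiter swapped the tail but has not linked itself yet: wait for it. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) {
        }
    }
    /* Hand the lock directly to the next waiter by clearing its flag. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Waiters are served in FIFO order, and each one spins on a flag in its own node rather than on the shared lock word, so a release disturbs only one waiting core instead of all of them.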

Peter