How many spinlocks can I create?

Is there any limit on the number of spinlocks created for my driver? Or am I not restricted? Or is it dependent on the kind of spinlock?

There is no limit to spin locks. Just be sure you have a locking hierarchy so that you don’t get deadlocks of thread 1 holding lock A and trying to get lock B, while thread 2 holds lock B waiting to get lock A.
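
To make the idea of a hierarchy concrete, here is a purely illustrative sketch (the lock names and the ordering rule are invented for this example; the routines are the standard WDM spin lock APIs):

#include <ntddk.h>

/* Rule for this sketch: LockA is always acquired before LockB, in every code path.     */
/* Both locks are assumed to have been initialised with KeInitializeSpinLock elsewhere. */
KSPIN_LOCK LockA;   /* level 1 in the hierarchy */
KSPIN_LOCK LockB;   /* level 2 - only ever taken while LockA is already held */

VOID TouchDataProtectedByBothLocks(VOID)
{
    KIRQL oldIrql;

    KeAcquireSpinLock(&LockA, &oldIrql);    /* raises to DISPATCH_LEVEL */
    KeAcquireSpinLockAtDpcLevel(&LockB);    /* already at DISPATCH_LEVEL, so no IRQL change */

    /* ... touch the data protected by both locks ... */

    KeReleaseSpinLockFromDpcLevel(&LockB);  /* release in reverse order */
    KeReleaseSpinLock(&LockA, oldIrql);
}

As long as every path that needs both locks takes LockA first, the A-then-B / B-then-A deadlock described above cannot occur.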

A spin-lock is just an int variable, so the limit is the amount of memory available, literally.
It is also the worst way of synchronizing stuff, and should be used ONLY if no other way exists.

Is there any limit on the number of spinlocks created for my driver? Or am I
not restricted? Or is it dependent on the kind of spinlock?

It is also the worst way of synchronizing stuff, and should be used ONLY if no other way exists.

Nonsense.

Please tell me which sync primitive has lower overhead in the uncontested acquisition path.

Peter

How often are locks uncontested?
But you have a point… when contention is unlikely, the cost of most other sync mechanisms is higher.

When possible, I would still go for atomics vs. spinlock, or RW locks in terms of performance.

Please tell me which sync primitive has lower overhead in the uncontested acquisition path.

It is also the worst way of synchronizing stuff, and should be used ONLY if no other way exists.

Well, it is the “worst” in the sense that a CPU cannot do anything (apart from processing interrupts, of course) while spinning in a do-nothing loop trying to acquire a lock. Therefore, it should, indeed, be used judiciously. OTOH, as long as you don’t hold it for too long (no pun intended), it is the most efficient synch construct in existence, especially if we take into account that synchronisation on any dispatcher-level synch construct is invariably going to involve spinlock acquisition behind the scenes.

When possible, I would still go for atomics vs. spinlock, or RW locks in terms of performance.

Well, when you go for atomics, what you actually do is provide your own implementation of a spinlock. At the risk of invoking Mr. Burn’s wrath, I have to point out that there are some (admittedly rare) situations when providing a custom spinlock implementation may, indeed, be beneficial.

Anton Bassov

(physical memory)/(sizeof pointer)

Mark Roddy

Note that queued spin locks often have a higher cost to release than to acquire, but can perform much better under contention

Note that queued spin locks often have a higher cost to release than to acquire,

How come??? After all, both getting into a queue and releasing an “improved” lock like a queued one on Windows or a ticket one on Linux involves an atomic operation (i.e. respectively CMPXCHG and XADD). Therefore, the costs are equal…

but can perform much better under contention

Sure - implementing spinlock acquisition in a queue-like fashion allows you to avoid a “thundering herd” scenario that occurs when a spinlock gets released and all waiters attempt to acquire it at once by means of an atomic test-and-set. This may be quite costly on a system-wide basis because of the increased pressure on the bus on any architecture that maintains cache coherence between the cores by means of the MESI protocol and its variants. This may not be a big deal on a low-end system with 2-4 cores, but consider a higher-end one with, say, 64-128 cores…

Anton Bassov

Ordinary spin lock release is a single atomic instruction (usually a decrement or a set). Queued spin locks have a race between releasers and would-be acquirers who are trying to queue at the moment of release. This means a loop in the release phase.

The main purpose of the queued design is to avoid such issues on systems with many cores, because under contention those threads (cores) all poll on different addresses.

Ordinary spin lock release is a single atomic instruction (usually a decrement or a set). Queued spin locks have a race between
releasers and would-be acquirers who are trying to queue at the moment of release. This means a loop in the release phase.

The main purpose of the queued design is to avoid such issues on systems with many cores, because under contention
those threads (cores) all poll on different addresses.

You seem to be totally confused here, Marion - I would argue with practically every statement above. Instead, let’s simply look at what the “simple” and “advanced” spinlocks conceptually look like, so that you can see everything with your own eyes.

Conventional spinlock - acquire path

while (1)
{
    while (target) {}                                   /* spin with plain reads while the lock looks held */

    if (!_interlockedbittestandset(&target, …)) break;  /* one atomic test-and-set attempt */
}

Conventional spinlock - release path

target=0;

When the lock gets released (BTW, it does not even need to be an interlocked op - a simple MOV may suffice), all waiters break out of their inner loops and try an interlocked test-and-set at once. The drawback of this approach is multiple atomic operations being attempted at once (i.e. a “thundering herd” scenario), which puts unnecessary pressure on the bus.
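
For anyone who wants to experiment with this outside a schematic, roughly the same test-and-test-and-set construct can be written as a self-contained C11 sketch (the names conventional_acquire and conventional_release are mine, not any kernel’s API):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool target;   /* false = free, true = held */

static void conventional_acquire(void)
{
    for (;;) {
        /* inner loop: plain reads only; spin while the lock looks held */
        while (atomic_load_explicit(&target, memory_order_relaxed))
            ;
        /* one atomic test-and-set attempt; we own the lock if the previous value was false */
        if (!atomic_exchange_explicit(&target, true, memory_order_acquire))
            break;
    }
}

static void conventional_release(void)
{
    /* as noted above, a plain store is enough to release */
    atomic_store_explicit(&target, false, memory_order_release);
}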

The “advanced” lock eliminates this issue by introducing two separate variables that are originally initialised to the same value.

Advanced spinlock - acquire path

var1 = InterlockedIncrement(ptr);        /* the only interlocked op: take a ticket */

tmp = var1 - 1;                          /* my ticket number */

while (1) { if (tmp == var2) break; }    /* poll the “now serving” variable with plain reads */

Advanced spinlock - release path

var2=var1;

The latter approach eliminates the “thundering herd” scenario and ensures that all requests are granted in the order of arrival.
I made it depend on InterlockedIncrement() in this example, but it could be InterlockedCompareExchange() as well.
The key point here is that the atomic operation and the polling are done on separate variables, and the one that gets modified by the atomic operation is not touched by the owner upon release. Therefore, the release is lightning fast.
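
The same ticket-style scheme, written out as a self-contained C11 sketch for clarity (next_ticket and now_serving are names I have made up; next_ticket plays the role of the counter behind ptr, and now_serving plays the role of var2):

#include <stdatomic.h>

static atomic_uint next_ticket;   /* the counter behind ptr: bumped once per acquisition */
static atomic_uint now_serving;   /* the polled variable (var2): written only by the owner */

static unsigned ticket_acquire(void)
{
    /* one interlocked op per acquisition; fetch_add returns my ticket number directly */
    unsigned my_ticket = atomic_fetch_add_explicit(&next_ticket, 1, memory_order_relaxed);

    /* polling is done with plain loads on a variable that only the current owner writes */
    while (atomic_load_explicit(&now_serving, memory_order_acquire) != my_ticket)
        ;
    return my_ticket;   /* the owner keeps its ticket so it can hand off on release */
}

static void ticket_release(unsigned my_ticket)
{
    /* release needs no interlocked op: only the owner ever writes now_serving */
    atomic_store_explicit(&now_serving, my_ticket + 1, memory_order_release);
}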

Anton Bassov

Sorry, I made a typo - since var1 is a globally visible variable that gets modified by an interlocked operation, the release path should be ‘var2=tmp+1’ rather than ‘var2=var1’, because ‘var2’ has to be set to the value that ‘var1’ had at the moment the lock was acquired.

Anton Bassov

That is not how queued spin locks are implemented. And you know it; or you should. And before you extol the virtues of this ‘simple’ design further, remember that it does not address the ‘thundering herd’ problem at all since all waiters still poll in tight loops using interlocked operations on the same memory addresses. The level of memory coherency traffic that the silicon has to process does not vary with respect to the specific value of the comparand that InterlockedCompareExchange is passed. These are still multiple cores spinning on the same memory address, each invalidating the cached copies of the others on every update, whereupon they all must thunder away - even though these terms usually apply to UM callers more than KM callers.

The benefit of queued spin locks is that the waiters don’t contend in the same way. This only has any value under a heavily contended lock (or a highly loaded system), and it is quite different from the ‘advanced’ lock you describe.

before you extol the virtues of this ‘simple’ design further, remember that it does not address the ‘thundering herd’ problem
at all since all waiters still poll in tight loops using interlocked operations on the same memory addresses.

OMG…

You are so confused on this topic, Marion…

An interlocked operation is one that is done with the LOCK prefix. It may apply only to those instructions that involve a read->modify->write sequence, i.e. ADD, INC, DEC, CMPXCHG et al. Simple reads and writes are done with the MOV instruction, which cannot be used with the LOCK prefix, for the (hopefully) obvious reasons…

As you can see with your own eyes, there is only one interlocked operation per acquisition, and it is made before the polling loop starts. Polling is done with a MOV instruction, so it does not involve interlocked ops of any description.

BTW, please ignore my most recent post with a “bug fix” - since var1/tmp are local variables, var2 is the global one, and ptr is a pointer to var2’s corresponding global “twin” variable that holds the target value, the original example was 100% correct. Duh!!! I just wonder which particular body part I was thinking with…

These are still multiple cores spinning on the same memory address and each invalidating the cache validity of the others on every update.

Don’t forget that an update of the variable that is being polled may be done only by the owner, and not by the contenders. Therefore, each CPU/core invalidates and reloads the cache line only once upon the release, so there is no “thundering herd” issue here whatsoever.

However, in the case of an “ordinary” lock, all contenders attempt to modify the target variable with an interlocked op when the spinlock gets released. Therefore, every CPU has to reload the cache line N times, where N is the number of contenders, because N interlocked ops are attempted. Taking into consideration that N-1 of these reloads are made for no reason whatsoever, I hope you understand where the “thundering herd” issue with its associated overhead comes from - it does not come from simply reading the variable.

In any case, if you think that the cache-invalidation part may be eliminated altogether somehow… well, just try to think logically. Let’s assume that CPUs A and B poll, respectively, X and Y, i.e. two different variables. In such a case, how do you think they may ever come to a conclusion about the lock ownership???

That is not how queued spin locks are implemented.

Could you please provide your schematic implementation of it? I just want to see how you think about the whole problem that we are discussing.

Anton Bassov

Children, children… Can we, maybe, just agree to shelve this discussion… because I don’t THINK it’s helping anyone.

For the record: With In-Stack Queued Spin Locks, each waiter waits on his own lock queue entry. On release, the lock is “handed off” to the next waiter. The function that matters is KxWaitForLockOwnerShip… if that’s of any help to you.

Geoff Chappell has a very nice, clear, and accurate description of how In-Stack Queued Spin Locks work, if you care to read it.
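
For illustration only - this is not the actual Windows code, which Geoff Chappell documents - a generic MCS-style queued lock, in which each waiter spins on its own queue entry and the releaser hands the lock directly to the next waiter, looks roughly like this in C11 (all names below are mine):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Each waiter brings its own (typically stack-allocated) queue entry. */
struct qnode {
    _Atomic(struct qnode *) next;
    atomic_bool             locked;   /* true while this waiter must keep spinning */
};

/* The lock itself is just a pointer to the tail of the waiter queue. */
typedef _Atomic(struct qnode *) qlock_t;

static void queued_acquire(qlock_t *lock, struct qnode *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* Atomically append myself to the queue; the previous tail is my predecessor. */
    struct qnode *prev = atomic_exchange_explicit(lock, me, memory_order_acq_rel);
    if (prev == NULL)
        return;                                   /* queue was empty: I own the lock */

    /* Tell the predecessor where I am, then spin on MY OWN entry only. */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ;
}

static void queued_release(qlock_t *lock, struct qnode *me)
{
    struct qnode *successor = atomic_load_explicit(&me->next, memory_order_acquire);
    if (successor == NULL) {
        /* Nobody visibly queued: try to swing the tail back to empty. */
        struct qnode *expected = me;
        if (atomic_compare_exchange_strong_explicit(lock, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* Lost the race: a new waiter swapped the tail but has not linked itself in yet. */
        while ((successor = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
            ;
    }
    /* Hand the lock directly to the next waiter. */
    atomic_store_explicit(&successor->locked, false, memory_order_release);
}

Note the loop in the release path when a new waiter has already swapped the tail but has not yet linked itself in - that is the release-side race mentioned earlier in the thread.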

Peter

For the record: With In-Stack Queued Spin Locks, each waiter waits on his own lock queue entry. On release,
the lock is “handed off” to the next waiter. The function that matters is KxWaitForLockOwnerShip… if that’s of any help to you.

Actually, I think that the Linux “ticket lock” term describes the whole concept pretty well, although Windows had implemented this concept almost a decade before Linux did.

For example, consider getting into a queue in a bank. First of all, you “take your ticket”. In the context of this discussion, you do it by means of performing an interlocked op, effectively getting your own lock queue entry that is meaningful only for you. In my example, this is the ‘var1/tmp’ local variable. Then the only thing left for you to do is to watch the advertising board, waiting for your number of interest to appear on it. At that point you know that it is your turn to be served. In the context of this discussion, you do it by means of polling the global variable (var2 in my example). When you are done, you hand your ticket to a teller, who displays the next waiter’s number (i.e. your number +1) on the board (the “var2=var1” line in my example). Therefore, the whole thing works in a civilised and orderly fashion.

Now imagine what would happen if there was no queue - every time the last served client left the desk, all the waiters would rush forward, each of them expecting to be the first one at the desk. Ridiculous, don’t you think? In actuality, this is how the “conventional” spinlock works, which fully explains the origins of the “thundering herd” term.

Now imagine there was no advertising board anywhere in sight. How on Earth would you discover that it is actually your turn to go to the counter??? In the context of this discussion, this is exactly what would happen if the contenders were not polling the global variable.

Anton Bassov

Peter is correct - this is an unhelpful conversation. This will be my last post on this thread.

Please don’t try to correct Peter’s factual statements about how queued spin locks are implemented by comparing them to a totally different construct in Linux - which could also be implemented in Windows in the same way that queued locks could be implemented in Linux.

Yes, ticket locks are a thing. Yes they do a job. But they are not queued spin locks.

If you want to discuss, in appropriate technical language, the relative merits of these two different types of locks, of course I am eager to join the conversation. If you want to tell me that I am stupid, then that is your prerogative.

Marion,

If you want to tell me that I am stupid, then that is your prerogative

I am most definitely not trying to do anything like that, am I…

If you want to discuss, in appropriate technical language, the relative merits of these two different types of locks, of
course I am eager to join the conversation

This is exactly what I want. More on it below

Yes, ticket locks are a thing. Yes they do a job. But they are not queued spin locks.

To be honest, up to this point I don’t see any conceptual difference between a queued lock and a ticket lock. The only differences that I see are, first, the name, and, second, the particular interlocked ops that they are built around, i.e. respectively XCHG or CMPXCHG and XADD. Neither of these “differences” changes anything, because the concept of this type of lock remains the same - first you do an interlocked op on variable X and then poll variable Y in order to find out that you are the one who currently owns the lock, effectively eliminating the “thundering herd” issue.

Earlier in this thread I provided yet another schematic implementation of this concept, which I decided to build around the INC instruction. We can even give it its own name if you wish (for example, an “ordered lock”), but neither the name nor the particular instruction changes anything, because the concept of this “improved” lock always remains the same.

This is how I see things at the moment. However, you seem to be saying that there is some other concept possible (i.e. one that does not involve polling of the global variable by the contenders), and claiming that a queued spinlock is, in actuality, an implementation of this concept. To be honest, I just cannot see how that may be possible solely from the logical standpoint. Therefore, I am just asking you to provide some schematic code showing me how this concept may actually work. Could you please do that?

Peter is correct - this is an unhelpful conversation.

Actually, this conversation may be VERY helpful to someone who is not happy with the system-provided spinlock and believes that a custom implementation may do better in his particular situation. As one of the posts on this thread has shown, such guys are not that exceptional. This conversation may educate such a guy on the topic, and ensure that he does not do something as stupid as a tight polling loop of interlocked ops in his quest for “improvement”.

Anton Bassov

You are trying my patience, Anton.

To help you avoid getting banned, I’ve locked this thread.

Peter

Marion,

After having thought about it a bit, I have to admit that you were absolutely correct here - in-stack queued locks and ticket locks are, indeed, two different types of spinlock, with their own relative merits and disadvantages. Therefore, I have no option other than to retract my statements on the subject.

Anton Bassov