How many spinlocks can I create?

Alex_Funky Member - All Emails Posts: 171

Is there any limit on the number of spinlocks created for my driver, or am I unrestricted? Or does it depend on the kind of spinlock?

Comments

  • Don_Burn Member - All Emails Posts: 1,715

    There is no limit to spin locks. Just be sure you have a locking hierarchy, so that you don't get deadlocks where thread 1 holds lock A and tries to get lock B, while thread 2 holds lock B and waits to get lock A.
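
    Something like this, purely as a sketch (the lock names and the routine are made up for illustration; the point is simply that every path needing both locks takes them in the same order):

    // Two hypothetical driver-global locks, initialised in DriverEntry with
    // KeInitializeSpinLock(). Rule: LockA is always acquired before LockB.
    KSPIN_LOCK LockA;
    KSPIN_LOCK LockB;

    VOID TouchBothStructures(VOID)
    {
        KIRQL oldIrql;

        KeAcquireSpinLock(&LockA, &oldIrql);     // level 1 first; raises to DISPATCH_LEVEL
        KeAcquireSpinLockAtDpcLevel(&LockB);     // level 2 second; never in the reverse order

        /* ... touch the data protected by both locks ... */

        KeReleaseSpinLockFromDpcLevel(&LockB);   // release in the opposite order
        KeReleaseSpinLock(&LockA, oldIrql);
    }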

  • Dejan_Maksimovic Member - All Emails Posts: 354
    via Email
    A spin-lock is just an int variable, so the limit is the amount of
    memory available, literally.
    It is also the worst way of synchronizing stuff, and should be used
    ONLY if no other way exists.

    > Is there any limit on the number of spinlocks created for my driver, or am I
    > unrestricted? Or does it depend on the kind of spinlock?
  • Peter_Viscarola_(OSR) Administrator Posts: 8,157
    edited October 18

    It is also the worst way of synchronizing stuff, and should be used ONLY if no other way exists.

    Nonsense.

    Please tell me which sync primitive has lower overhead in the uncontested acquisition path.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • Dejan_Maksimovic Member - All Emails Posts: 354
    via Email
    How often are locks uncontested?
    But you have a point... when contention is unlikely, the cost of most
    other sync mechanisms is higher.


    When possible, I would still go for atomics vs. spinlock, or RW locks in
    terms of performance.



    Please tell me which sync primitive has lower overhead in the uncontested acquisition path.
  • anton_bassov Member MODERATED Posts: 5,190

    It is also the worst way of synchronizing stuff, and should be used ONLY if no other way exists.

    Well, it is the "worst" in the sense that a CPU cannot do anything (apart from servicing interrupts, of course) while it spins in a do-nothing loop trying to acquire the lock. Therefore, it should indeed be used judiciously. OTOH, as long as you don't hold it for too long (no pun intended), it is the most efficient synch construct in existence, especially if we take into account that synchronisation on any dispatcher-level synch construct is invariably going to involve spinlock acquisition behind the scenes.

    When possible, I would still go for atomics vs. spinlock, or RW locks in terms of performance.

    Well, when you go for atomics, what you actually do is provide your own implementation of a spinlock. At the risk of invoking Mr. Burn's wrath, I have to point out that there are some (admittedly rare) situations where providing a custom spinlock implementation may indeed be beneficial.

    Anton Bassov

  • Mark_Roddy Member - All Emails Posts: 4,375
    via Email
    (physical memory)/(sizeof pointer)

    Mark Roddy
  • MBond2 Member Posts: 210

    Note that queued spin locks often have a higher cost to release than to acquire, but they can perform much better under contention.

  • anton_bassov Member MODERATED Posts: 5,190

    Note that queued spin locks often have a higher cost to release than to acquire,

    How come??? After all, both getting into a queue and releasing an "improved" lock like a queued one on Windows or a ticket one on Linux involve an atomic operation (i.e. CMPXCHG and XADD respectively). Therefore, the costs are equal...

    but can perform much better under contention

    Sure - implementing spinlock acquisition in a queue-like fashion allows you to avoid the "thundering herd" scenario that occurs when a spinlock gets released and all waiters attempt to acquire it at once by means of an atomic test-and-set. This may be quite costly on a system-wide basis because of the increased pressure on the bus on any architecture that maintains cache coherence between the cores by means of the MESI protocol and its variants. This may not be a big deal on a low-end system with 2-4 cores, but consider a higher-end one with, say, 64-128 of them...

    Anton Bassov

  • MBond2 Member Posts: 210

    Ordinary spin lock release is a single atomic instruction (usually a decrement or a store). Queued spin locks have a race between the releaser and would-be acquirers who are trying to queue at the moment of release; this means a loop in the release phase.

    The main purpose of the queued design is to avoid such issues on systems with many cores, because under contention those threads (cores) all poll on different addresses.
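
    Conceptually, something like the following MCS-style sketch (this is just an illustration of the idea, not the actual Windows implementation; the type and routine names are made up, and memory-ordering details are omitted). Each waiter spins on a flag in its own queue node, and the release path has to cope with an acquirer that is still in the middle of linking itself in - hence the loop:

    typedef struct _QNODE {
        struct _QNODE * volatile Next;
        volatile LONG Locked;                  // 1 = still waiting, 0 = lock handed to us
    } QNODE;

    typedef struct _QLOCK {
        QNODE * volatile Tail;                 // last node in the queue, NULL if the lock is free
    } QLOCK;

    VOID QAcquire(QLOCK *Lock, QNODE *Me)
    {
        QNODE *prev;

        Me->Next = NULL;
        Me->Locked = 1;

        // One atomic op: append ourselves to the tail of the queue.
        prev = (QNODE *)InterlockedExchangePointer((PVOID volatile *)&Lock->Tail, Me);
        if (prev != NULL) {
            prev->Next = Me;                   // link in behind the previous waiter
            while (Me->Locked) { }             // spin on OUR OWN node, not on a shared variable
        }
    }

    VOID QRelease(QLOCK *Lock, QNODE *Me)
    {
        if (Me->Next == NULL) {
            // Nobody visibly queued: try to swing Tail back to NULL.
            if (InterlockedCompareExchangePointer((PVOID volatile *)&Lock->Tail, NULL, Me) == (PVOID)Me)
                return;
            // An acquirer grabbed Tail but hasn't linked itself in yet: wait for the link to appear.
            while (Me->Next == NULL) { }
        }
        Me->Next->Locked = 0;                  // hand the lock directly to the next waiter
    }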

  • anton_bassov Member MODERATED Posts: 5,190

    Ordinary spin lock release is a single atomic instruction (usually a decrement or a store). Queued spin locks have a race between
    the releaser and would-be acquirers who are trying to queue at the moment of release; this means a loop in the release phase.

    The main purpose of the queued design is to avoid such issues on systems with many cores, because under contention
    those threads (cores) all poll on different addresses.

    You seem to be totally confused here, Marion - I would argue with practically every statement above. Instead, let's simply look at what the "simple" and "advanced" spinlocks conceptually look like, so that you can see everything with your own eyes.

    Conventional spinlock - acquire path

    while (1)
    {
        while (target) { }                           /* spin with plain reads until the lock looks free */

        if (!_interlockedbittestandset(&target, 0))  /* one atomic test-and-set attempt (bit 0) */
            break;                                   /* the bit was previously clear: we now own the lock */
    }

    Conventional spinlock - release path

    target = 0;                                      /* a plain store releases the lock */

    When the lock gets released (BTW, it does not even need to be an interlocked op - a simple MOV may suffice), all waiters break out of their inner loops and attempt an interlocked test-and-set at once. The drawback of this approach is multiple atomic operations being attempted at once (i.e. a "thundering herd" scenario), which puts unnecessary pressure on the bus.

    The "advanced" lock eliminates this issue by introducing two separate variables that are originally initialised to the same value

    Advanced spinlock - acquire path

    var1 = InterlockedIncrement(ptr);        /* ptr points to the global "next ticket" counter */

    tmp = var1 - 1;                          /* tmp is this acquirer's own ticket number */

    while (1) { if (tmp == var2) break; }    /* poll the global "now serving" variable (var2) with plain reads */

    Advanced spinlock - release path

    var2 = var1;                             /* advance "now serving" to the next ticket; only the owner writes var2 */

    The latter approach eliminates the "thundering herd" scenario, and ensures that all requests are granted in the order of arrival.
    I made it depend on InterlockedIncrement() in this example, but it could be InterlockedCompareExchange() as well.
    The key point here is that the atomic operation and the polling are done on separate variables, and the one that gets modified by the atomic operation is not touched by the owner upon release. Therefore, the release is just lightning fast.
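
    Putting the same thing together as a self-contained sketch (the struct and routine names here are just for illustration, and memory-ordering details are omitted):

    typedef struct _TICKET_LOCK {
        volatile LONG NextTicket;            /* bumped atomically by every acquirer */
        volatile LONG NowServing;            /* advanced only by the current owner on release */
    } TICKET_LOCK;                           /* both fields start at 0 */

    VOID TicketAcquire(TICKET_LOCK *Lock)
    {
        LONG myTicket = InterlockedIncrement(&Lock->NextTicket) - 1;   /* the only interlocked op */
        while (Lock->NowServing != myTicket) { }                       /* polling uses plain reads */
    }

    VOID TicketRelease(TICKET_LOCK *Lock)
    {
        Lock->NowServing = Lock->NowServing + 1;   /* plain store; only the owner ever writes this */
    }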

    Anton Bassov

  • anton_bassov Member MODERATED Posts: 5,190

    Sorry, I made a typo - since var1 is a globally visible variable that gets modified by the interlocked operation, the release path should be 'var2 = tmp + 1' rather than 'var2 = var1', because var2 has to be set to the value that var1 had at the moment the lock was acquired.

    Anton Bassov

  • MBond2 Member Posts: 210

    That is not how queued spin locks are implemented. And you know it; or you should. And before you extol the virtues of this 'simple' design further, remember that it does not address the 'thundering herd' problem at all since all waiters still poll in tight loops using interlocked operations on the same memory addresses. The level of memory-coherency traffic that the silicon has to process does not vary with respect to the specific value of the comparand that InterlockedCompareExchange is passed. These are still multiple cores spinning on the same memory address, each invalidating the others' cached copy on every update, whereupon they all must thunder away - even though these terms usually apply to UM callers more than KM callers.

    The benefit of queued spin locks is that the waiters don't contend in the same way. This only has value under a heavily contended lock (or on a highly loaded system), and it is quite different from the 'advanced' lock you describe.

  • anton_bassov Member MODERATED Posts: 5,190

    before you extol the virtues of this 'simple' design further, remember that it does not address the 'thundering herd' problem
    at all since all waiters still poll in tight loops using interlocked operations on the same memory addresses.

    OMG....

    You are so confused on this topic, Marion....

    An interlocked operation is one that is done with a LOCK prefix. It may apply only to those instructions that involve a read-modify-write sequence, i.e. ADD, INC, DEC, CMPXCHG et al. Simple reads and writes are done with the MOV instruction, which cannot be used with a LOCK prefix, for (hopefully) obvious reasons...

    As you can see with your own eyes, there is only one interlocked operation per acquisition, and it is made before the polling loop starts. Polling is done with a MOV instruction, so it does not involve interlocked ops of any description.

    BTW, please ignore my most recent post with a "bug fix" - since var1/tmp are local variables, var2 is the global one, and ptr is a pointer to var2's corresponding global "twin" variable that holds the target value, the previous post was 100% correct. Duh! I just wonder which particular body part I was thinking with...

    These are still multiple cores spinning on the same memory address, each invalidating the others' cached copy on every update.

    Don't forget that an update of the variable that is being polled may be done only by the owner and not by the contenders. Therefore, each CPU/core invalidates and reloads the cache line only once upon the release, so there is no "thundering herd" issue here whatsoever.

    However, in the case of the "ordinary" lock, all contenders attempt to modify the target variable with an interlocked op when the spinlock gets released. Therefore, every CPU has to reload the cache line N times, where N is the number of contenders, because N interlocked ops are attempted. Taking into consideration that N-1 of these reloads are made for no reason whatsoever, I hope you understand where the "thundering herd" issue and its associated overhead come from - they do not come from simply reading the variable.

    In any case, if you think that the cache-invalidation part may somehow be eliminated altogether... well, just try to think logically. Let's assume that CPUs A and B poll, respectively, the variables X and Y, i.e. two different variables. In such a case, how do you think they could ever come to any conclusion about the lock ownership?

    That is not how queued spin locks are implemented.

    Could you please provide your schematic implementation of it? I just want to see how you think about the whole problem that we are discussing.

    Anton Bassov

  • Peter_Viscarola_(OSR) Administrator Posts: 8,157
    edited October 25

    Children, children... Can we, maybe, just agree to shelve this discussion... because I don't THINK it's helping anyone.

    For the record: With In-Stack Queued Spin Locks, each waiter waits on his own lock queue entry. On release, the lock is "handed off" to the next waiter. The function that matters is KxWaitForLockOwnerShip... if that's of any help to you.

    Geoff Chappell has a very nice, clear, and accurate description of how In-Stack Queued Spin Locks work, if you care to read it.
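
    For anyone who just wants to use them from a driver, the documented DDIs are KeAcquireInStackQueuedSpinLock and KeReleaseInStackQueuedSpinLock. A minimal usage sketch (the lock and routine names are, obviously, made up):

    KSPIN_LOCK MyLock;                         // shared; initialised once with KeInitializeSpinLock(&MyLock)

    VOID UpdateSharedData(VOID)
    {
        KLOCK_QUEUE_HANDLE lockHandle;         // this caller's queue entry lives on its own stack

        KeAcquireInStackQueuedSpinLock(&MyLock, &lockHandle);

        /* ... touch the shared data ... */

        KeReleaseInStackQueuedSpinLock(&lockHandle);
    }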

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • anton_bassov Member MODERATED Posts: 5,190

    For the record: With In-Stack Queued Spin Locks, each waiter waits on his own lock queue entry. On release,
    the lock is "handed off" to the next waiter. The function that matters is KxWaitForLockOwnerShip... if that's of any help to you.

    Actually, I think that the Linux "ticket lock" term describes the whole concept pretty well, although Windows had implemented this concept almost a decade before Linux did.

    For example, consider getting into a queue at a bank. First of all, you "take your ticket". In the context of this discussion, you do that by performing an interlocked op, effectively getting your own lock queue entry that is meaningful only to you. In my example, this is the 'var1/tmp' local variable. Then the only thing left for you to do is to watch the display board, waiting for your number to appear on it. At that point you know it is your turn to be served. In the context of this discussion, you do that by polling the global variable (var2 in my example). When you are done, you hand your ticket to the teller, who displays the next waiter's number (i.e. your number + 1) on the board (the "var2 = var1" line in my example). Therefore, the whole thing works in a civilised and orderly fashion.

    Now imagine what would happen if there were no queue - every time the last served client left the desk, all the waiters would rush forward, each of them expecting to be the first one at the desk. Ridiculous, don't you think? In actuality, this is how the "conventional" spinlock works, which fully explains the origins of the "thundering herd" term.

    Now imagine there was no display board anywhere in sight. How on Earth would you discover that it is actually your turn to go to the counter? In the context of this discussion, this is exactly what would happen if the contenders were not all polling the global variable.

    Anton Bassov

  • MBond2 Member Posts: 210

    Peter is correct - this is an unhelpful conversation. This will be my last post on this thread.

    Please don't try to correct Peter's factual statements about how queued spin locks are implemented by comparing them to a totally different construct in Linux - which could also be implemented in Windows in the same way that queued locks could be implemented in Linux.

    Yes, ticket locks are a thing. Yes they do a job. But they are not queued spin locks.

    If you want to discuss, in appropriate technical language, the relative merits of these two different types of locks, then of course I am eager to join that conversation. If you want to tell me that I am stupid, then that is your prerogative.

  • anton_bassov Member MODERATED Posts: 5,190

    Marion,

    If you want to tell me that I am stupid, then that is your prerogative.

    I am most definitely not trying to do anything like that, am I...

    If you want to discuss, in appropriate technical language, the relative merits of these two different types of locks, then of
    course I am eager to join that conversation.

    This is exactly what I want. More on it below

    Yes, ticket locks are a thing. Yes they do a job. But they are not queued spin locks.

    To be honest, up to this point I don't see any conceptual difference between a queued lock and a ticket lock.
    The only differences that I see are, first, the name, and, second, the particular interlocked ops that they are built around,
    i.e. XCHG or CMPXCHG versus XADD respectively. Neither of these "differences" changes anything, because the concept of this type of
    lock remains the same - first you do an interlocked op on variable X and then poll variable Y in order to find out when you are
    the one who currently owns the lock, effectively eliminating the "thundering herd" issue.

    Earlier in this thread I provided yet another schematic implementation of this concept, which I decided to build around the INC instruction. We can even give it its own name if you wish (for example, an "ordered lock"), but neither the name nor the particular
    instruction changes anything, because the concept of this "improved" lock always remains the same.

    This is how I see things at the moment. However, you seem to be saying that there is some other concept possible
    (i.e. one that does not involve the contenders polling a global variable), and claiming that a queued spinlock is, in actuality, an implementation of that concept. To be honest, I just cannot see how that could be possible, solely from the logical standpoint. Therefore, I am asking you to provide some schematic code showing me how this concept might actually work. Could you please do that?

    Peter is correct - this is an unhelpful conversation.

    Actually, this conversation may be VERY helpful to someone who is not happy with the system-provided spinlock and believes that a custom implementation may do better in his particular situation. As one of the posts in this thread has shown, such guys are not that exceptional. This conversation may educate such a guy on the topic, and ensure that he does not do something as stupid as a tight polling loop of interlocked ops in his quest for the "improvement".

    Anton Bassov

  • Peter_Viscarola_(OSR) Administrator Posts: 8,157

    You are trying my patience, Anton.

    To help you avoid getting banned, I’ve locked this thread.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • anton_bassov Member MODERATED Posts: 5,190

    Marion,

    After having thought about it a bit, I have to admit that you were absolutely correct here - in-stack queued and ticket locks are, indeed, two different types of spinlock, each with its own relative merits and disadvantages. Therefore, I have no option other than to retract my statements on the subject.

    Anton Bassov

This discussion has been closed.
