Baffling issue with Queued spin-locks

In my disk upper-filter driver, I was using normal spin-locks. Things were working fine (including the WHQL tests). Then I stumbled upon "queued spin locks". Greed had me in its grip on reading that they are more efficient.
I changed all the ordinary spin-locks to queued spin locks. Things shattered.
After enabling Driver Verifier, I get to see an assert which declares a DPC watchdog timeout.
I wonder what went wrong.

The output of "!analyze -v" and "!running -it" is below. Appreciate the help.

0: kd> !analyze -v

Unknown bugcheck code (0)
Unknown bugcheck description
Arguments:
Arg1: 00000000
Arg2: 00000000
Arg3: 00000000
Arg4: 00000000

Debugging Details:

PROCESS_NAME: System

DPC_TIMEOUT_TYPE: DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED

DPC_TIME_LIMIT: 784

FAULTING_IP:
nt!KeAccumulateTicks+3c5
82a80bf4 cd2c int 2Ch

EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 82a80bf4 (nt!KeAccumulateTicks+0x000003c5)
ExceptionCode: c0000420 (Assertion failure)
ExceptionFlags: 00000000
NumberParameters: 0
Assertion: *** DPC watchdog timeout
This is NOT a break in update time
This is most likely a BUG in an ISR
Perform a stack trace to find the culprit
The period will be doubled on continuation
Use gh to continue!!

ERROR_CODE: (NTSTATUS) 0xc0000420 - An assertion failure has occurred.

EXCEPTION_CODE: (NTSTATUS) 0xc0000420 - An assertion failure has occurred.

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

BUGCHECK_STR: 0x0

CURRENT_IRQL: 1c

LAST_CONTROL_TRANSFER: from 82a800be to 82a80bf4

STACK_TEXT:
99f52a38 82a800be 00002711 00000000 00001600 nt!KeAccumulateTicks+0x3c5
99f52a78 82a7ff6b 82e1a960 73a66d18 00000000 nt!KeUpdateRunTime+0x145
99f52ad4 82a84c17 82b46c02 82b46c02 000000d1 nt!KeUpdateSystemTime+0x613
99f52ad4 82e1a960 82b46c02 82b46c02 000000d1 nt!KeUpdateSystemTimeAssist+0x13
99f52b5c 82d42976 00000000 91755960 99f52b02 hal!KeReleaseQueuedSpinLock+0x60
99f52b70 86bec474 00000000 91755960 91755960 nt!VerifierKeReleaseInStackQueuedSpinLock+0x68
99f52c24 86be9e7a 8c98ee60 ba588ff0 00000000 myfltr!processItem+0x144
99f52c50 82c0ef5e 00000000 83158a51 00000000 myfltr!WorkerThreadRoutine+0xba
99f52c90 82ab6219 86be9dc0 00000000 00000000 nt!PspSystemThreadStartup+0x9e
00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x19

STACK_COMMAND: kb

FOLLOWUP_IP:
myfltr!processItem+144
86bec474 8b45d0 mov eax,dword ptr [ebp-30h]

SYMBOL_STACK_INDEX: 6

SYMBOL_NAME: myfltr!processItem+144

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: myfltr

IMAGE_NAME: myfltr.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 50e5531f

FAILURE_BUCKET_ID: 0x0_VRF_myfltr!processItem+144

BUCKET_ID: 0x0_VRF_myfltr!processItem+144

Followup: MachineOwner

0: kd> !running -it

System Processors: (0000000f)
Idle Processors: (0000000e)

Prcbs Current Next
0 82b30d20 91755960 859d3798 ................

ChildEBP RetAddr
99f52a38 82a800be nt!KeAccumulateTicks+0x3c5
99f52a78 82a7ff6b nt!KeUpdateRunTime+0x145
99f52ad4 82a84c17 nt!KeUpdateSystemTime+0x613
99f52ad4 82e1a960 nt!KeUpdateSystemTimeAssist+0x13
99f52b5c 82d42976 hal!KeReleaseQueuedSpinLock+0x60
99f52b70 86bec474 nt!VerifierKeReleaseInStackQueuedSpinLock+0x68
99f52c24 86be9e7a myfltr!processItem+0x144
99f52c50 82c0ef5e myfltr!WorkerThreadRoutine+0xba
99f52c90 82ab6219 nt!PspSystemThreadStartup+0x9e
00000000 00000000 nt!KiThreadStartup+0x19

1 807c6120 807cb800 ................

ChildEBP RetAddr
807e2c24 00000000 nt!KiIdleLoop+0x1a

2 88b00120 88b05800 ................

ChildEBP RetAddr
88b1cc24 00000000 nt!KiIdleLoop+0x1a

3 88b37120 88b3c800 ................

ChildEBP RetAddr
88b53c24 00000000 nt!KiIdleLoop+0x1a


> In my disk upper-filter driver, I was using normal spin-locks. Things were
> working fine (including WHQL tests). Then I stumbled upon "queued spin
> locks". Greed had me in its grip on reading that they are more efficient.
> I changed all the ordinary spin-locks to queued spin locks. Things shattered.
> After enabling Driver Verifier, I get to see an assert which declares a DPC
> watchdog timeout.
> I wonder what went wrong.

Are you sure you changed all of the calls to be queued spinlocks?

Is your PKLOCK_QUEUE_HANDLE allocated on the stack? Have a read of Doron’s blog entry on this http://blogs.msdn.com/b/doronh/archive/2006/03/08/546934.aspx and make sure you aren’t doing any of the stuff he says not to do.

James

The statement above is a perfect demonstration of Knuth's famous observation that "premature optimization is the root of all evil (or at least most of it) in programming".
Everything was fine, but then you decided to "improve" things and, obviously, accidentally introduced some bug. The same story has happened to me quite a few times.

Check if you have replaced all calls to the “regular” spinlocks with the ones to the queued ones…

Anton Bassov

> > In my disk upper-filter driver, I was using normal spin-locks. Things were
> > working fine (including WHQL tests). Then I stumbled upon "queued spin
> > locks". Greed had me in its grip on reading that they are more efficient. I
> > changed all the ordinary spin-locks to queued spin locks. Things shattered.
>
> The statement above is a perfect demonstration of Knuth's famous observation
> that "premature optimization is the root of all evil (or at least most of it)
> in programming".
> Everything was fine, but then you decided to "improve" things and, obviously,
> accidentally introduced some bug. The same story has happened to me quite a
> few times.

To be fair, the docs do say “Drivers for Windows XP and later versions of Windows should use queued spin locks instead of ordinary spin locks.”, which could imply that a driver using ordinary spin locks should be changed.

That said, queued spinlocks can perform worse in a virtual environment: the next owner in the queue may be on a virtual hardware thread that is not currently executing, which means more spinning, whereas a non-queued spinlock would let any available virtual hardware thread take the lock. I think Hyper-V has some smarts to improve this. I can't remember how the situation changes on a hyperthreaded system.

I assume KMDF just chooses the appropriate type of lock.

James

KMDF internally uses normal spin locks; we measured queued spin locks and they didn't improve anything.

d



NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Queued spinlocks give a performance gain on multi-socket systems, in high-contention scenarios.

Normal spinlock acquisition issues an interlocked compare-exchange, which also invalidates the corresponding cache line on all other processors. Other contenders also issue the interlocked operation, which moves cache-line ownership to the new owner. Unsuccessful contenders spin waiting for the spinlock to become zero. The problem is that the modified cache line cannot be shared, so every spinning processor has to read it from the owning processor's cache.

Queued spinlock acquisition involves one interlocked compare exchange on the shared ULONG_PTR. Unsuccessful contenders will spin on their own location, without hitting the shared location.

This reduces socket-to-socket traffic a lot.

My opinion is that the documentation on queued spin locks is misleading. I’ve been trying to get it corrected for years, and the old documentation team just kept wanting to publish more whitepapers and such, rather than put the right guidance directly in the WDK. Their take on it was that the WDK was mostly reference, and guidance wasn’t reference. We went around on that one a bunch of times. Perhaps the current team will eventually see things differently.

Any design choice is a set of tradeoffs. In order to understand those tradeoffs, it’s often useful to understand the implementations.

Traditional spinlock:

The acquiring thread raises to the IRQL of the lock (usually DISPATCH_LEVEL), disabling the dispatcher (scheduler). No more thread preemptions are possible. The thread then spins using an atomic compare-and-exchange instruction. When the right value is returned, it owns the lock. Both before and after acquiring the lock, the thread can be preempted by interrupts.

This strategy is simple and relatively cheap when lock contention is low. When lock contention is high, it tends to tie up the bus in cache management traffic, as all the waiters are doing atomic compare-and-exchange operations on the same cache line. In a NUMA machine, waiters in the same NUMA node as the owner are much more likely to successfully acquire the lock.

A waiter that is handling an interrupt will never acquire the lock.

Queued spinlock:

The acquiring thread raises to the IRQL of the lock. The thread then attempts to insert itself at the end of the list of waiters, using a compare-and-exchange operation, making the end of the list point to a local structure. Then it spins reading the local structure to see if it owns the lock. The owner then releases the lock by following the list and assigning ownership to the next waiter by writing into that waiter’s local structure.

This strategy is a little more complex, and no more expensive in the uncontended case. In the heavily contended case, it's much more efficient, as all the waiters are spinning on different cache lines, and thus very little cache management traffic flows across the bus. Waiters in remote NUMA nodes are no less likely to acquire the lock.

Note, though, that if a waiter is interrupted or if the hypervisor schedules some other virtual processor, it may have ownership of the lock assigned to it while it is still off handling the ISR or while the virtual processor isn't even running. So all the other waiters will wait longer. (I'm assuming here that these sorts of performance issues are usually more interesting in servers and that servers, to a close approximation, are now virtualized.)

For this reason, we initially only used queued locks within the kernel itself, and only when the lock IRQL was higher than device ISRs (something not typically available in drivers).

Eventually, we exposed queued lock primitives to drivers, and at DISPATCH_LEVEL. The documentation that got written made a lot of people think that queued locks are always a win. They’re not. They’re useful if the lock tends to have a lot of waiters, and if there’s no good way to avoid that situation.

Note also that if there is a whole bunch of threads simultaneously attempting to insert themselves into the list of waiters, the contention looks a lot like a traditional spinlock, since those compare-exchange operations have to happen on the list itself, which is shared between all the processors.

As a last thought, I want to point out that the problem of having a virtual processor unscheduled while waiting for queued lock acquisition is important enough that Windows will cede virtual processor time to the hypervisor when a thread has failed to insert itself into the lock queue more than a couple of times. Similarly, it will cede time to the hypervisor when failing to acquire a traditional spinlock. This makes the two lock forms perform more similarly to each other when virtualized.

  • Jake Oshins
    Windows Kernel Team


It's a VERY difficult set of trade-offs. And the *real* problem is that it's impossible in most cases to know in advance the environment in which one's driver will be running.

I won't disagree with anything in Mr. Oshins' analysis. But you have to weigh the analysis for fairness as well. If your driver will be running on a NUMA system, and you expect even moderate lock contention, it seems to me that Queued Spin Locks are going to be a win.

I don’t understand the subtleties inherent in virtualization issues particularly well, so I can’t comment on those.

But for the case of acquiring a Queued Spin Lock on a physical processor while in an ISR, you're talking about a few mics of "lost" time in the case where the winner HAPPENS to be granted the lock while within an interrupt. This strikes me as no different from what happens with either type of spin lock when an interrupt occurs AFTER the spin lock has been gained. In both cases, the processor that has the lock isn't doing useful work and the throughput of the work being done in that "locking domain" suffers. In the Queued Spin Lock case, you're simply SLIGHTLY expanding the window in which this loss can occur, no? That just doesn't seem like it's a big deal to me. Is there some subtlety that I'm missing?

Peter
OSR

I interpreted the impact of ISR pre-emption on a CPU attempting to acquire a
hot lock as a shift in fairness away from the ISR-servicing CPU. If ISRs were
evenly distributed across CPUs then I suppose this would be a statistical
wash, but I think that [evenly distributing interrupts across CPUs] is
generally not true.

So queued spinlocks assist in restoring fairness in this case. But I can
completely see Peter's point about the throughput of the lock domain being
only slightly (if not negatively) impacted. But the assumption is a hot lock,
so there are many waiters willing to move that throughput forward for the
domain. The key takeaway I took from Mr. Oshins' explanation was that the CPU
getting interrupted might not ever get the lock and thus becomes useless
except as an ISR servicer.

Perhaps we can have the DOC team take Mr. Oshins' post and link it to the
spinlock pages :-)

Cheers,
Dave Cattley

Well, THAT certainly is an interesting take, Mr. Cattley. Thanks for that.

My problem with THAT is that unless you assume a more or less even distribution of interrupts across CPUs, things like the scheduling algorithm (which generally prefers to run a thread on the last processor on which it ran) are also fundamentally unfair (in the case where the last processor happens to be one that takes more interrupts).

Down this path lies madness: Consider that threads that run at higher priorities are more likely to have time “stolen” from them (yes, I know, no longer in the quantum accounting sense… thank you Mr. Russinovich) than those running at lower priorities. It makes my head hurt.

Well, in the case of traditional Spin Locks yes. While you’re servicing an interrupt you’re not checking to see if you’re the “winner.” You’d have to *really* be pounded with interrupts for that to be the case, though… or at least be extraordinarily unlucky. Just another case where Queued Spin Locks “win” over the traditional variant in my book.

Peter
OSR

Well, I would say this is already an exaggeration…

The problem with multiple contenders for a "regular" spinlock is the so-called "thundering herd". Say we have N contenders at the moment. Only one of them will be able to acquire the spinlock when its current owner releases it. Therefore, even assuming no new arrivals, the best we can get under such a scenario is N interlocked ops by the unluckiest contender, N-1 by the second unluckiest, N-2 by the third, and so on. If the spinlock is held only for a very short time and its previous owner often tries to acquire it once more… well, at this point we are already getting, in terms of bus pressure, pretty close to performing interlocked ops in a tight loop.

Queued locks and ticket locks ensure that an interlocked op is done only once per acquisition. Therefore, with these constructs you simply have no chance of putting pressure on the bus comparable to a tight loop of locked ops…

Are you saying that queued locks are more efficient than ticket ones, which involve a locked XADD on a shared variable X, followed by simple MOV reads of a shared variable Y that may be updated only by the lock owner upon releasing the lock?

Anton Bassov

Jake,

Recently, I had to diagnose some NDIS bottleneck (it was a common TCPIP.SYS queued spinlock that protects the tuple or interface table/tree).

From what I saw in the disassembly, queued spinlocks:

  1. Don't need to spin when inserting into the waiter list;
  2. Are fair - ownership order is FIFO;
  3. Spin on a bit in the waiter-allocated structure (KLOCK_QUEUE_HANDLE);
  4. A current owner may need to spin a while to release the next waiter, if the next waiter hasn't finalized its wait-block link yet.

KSPIN_LOCK contains a pointer to the last contender’s KLOCK_QUEUE_HANDLE (or the owner, if there are no contenders).

If there are contenders, the spinlock release (ownership transfer) doesn’t require an interlocked operation.

The subtlety that comes in with virtualization is that the time period that the virtual processor is suspended is likely to be some multiple of the hypervisor’s scheduling quantum, not just a few mics, and without paravirtualization in the lock code, it turns out badly. This is, by the way, the specific reason that we support many fewer virtual processors for Server 2003 than we do for Server 2008R2 and Server 2012 under Hyper-V.

People sometimes try Server 2003 and observe that it boots and runs just fine with 8 or even 64 virtual processors. Then they ask us why this isn’t supported. We invite them to benchmark it. Most workloads scale negatively when you add virtual processors with Server 2003 on top of Hyper-V.

And while this is getting far from the original topic, just because somebody will point it out, I’ll mention that yes, we considered doing gang scheduling in Hyper-V to avoid this problem entirely, but we understood the downsides of that approach and decided to go with guest OS paravirtualization instead.

  • Jake Oshins
    Hyper-V Team
    Microsoft


Even distribution of interrupts across CPUs is becoming statistically more true over time, though you’re right that it’s never really even.

Technologies like RSS and NUMA I/O in NICs and HBAs are starting to trickle down into the client space as formerly server-only IP blocks are being incorporated into South Bridges. In the server space, these technologies are near ubiquitous, and the vast majority of interrupts are from the high-speed devices that implement them.

  • Jake


Anton, I made no statement comparing anything to ticket locks.

  • Jake


Yes, you’re right. I just reread the code and either it or my memory has changed since my last reading. The yield doesn’t come until after the waiter has been inserted into the list.

  • Jake


Also, when the later OSes run under Hyper-V, the behavior of the spinlock functions changes. For example, the spin loop will signal the host after spinning for a predefined number of iterations.

Also, since NUMA with many nodes suffers from excessive cache-coherency overhead (and also page-table invalidation traffic), in the future it makes sense to partition the OS into smaller sub-sections (one or two sockets), so that most drivers and execution paths don't need to access far memory or communicate with other sockets.

Jake,

> Anton, I made no statement comparing anything to ticket locks.

I know, but both you and Alex emphasize the use of a local structure to poll as the main source of improvement…

Ticket locks don't use one and, instead, poll the same location. Therefore I am asking whether a queued lock may be more efficient than a ticket one. IMHO these two are fairly equivalent, because the main source of improvement, IMHO, lies in eliminating the "thundering herd" scenario - as I said in my earlier post, under certain circumstances it can get pretty close to performing interlocked ops in a tight loop, i.e. the most naive implementation of a spinlock in existence.

In any case, these factors come into play only when contention for the lock is high. If the target lock is in fairly low demand, any difference in bus pressure between a "regular" spinlock on one hand and optimizations such as ticket and queued locks on the other simply disappears.

This is how it works on physical machines. When it comes to virtual ones, things change (unless the spinlock implementation is paravirtualized, of course). Consider what happens if a VCPU gets preempted while holding a "regular" lock - all other contenders have to wait until the CPU is given back to the VM by the hypervisor. With queued and ticket locks the same applies not only to lock holders but to contenders as well. Therefore, as James said already, unless locks are paravirtualized, a "regular" spinlock will be a better option in a VM, compared to queued and ticket ones…

Anton Bassov

So, let me try to summarize for those following along at home. Please correct me if this does not reflect the current state of the discussion, because I'm not trying to skew the results here… I *am* in fact trying to capture the net of what we've all said:

  1. On physical machines, Queued Spin Locks are likely to be as efficient as, or more efficient than, Classical Spin Locks.

  2. On paravirtualized VMs (all the later implementations of Hyper-V), Queued Spin Locks perform reasonably well (due to the enlightenments), even if not necessarily QUITE as efficiently as Classical Spin Locks - the cost being the VM being suspended for some multiple of the hypervisor's scheduling quantum.

  3. Mr. Bassov once again insists on introducing a Linux-specific topic in a Windows-specific discussion, to nobody’s interest but his own.

Peter
OSR