Baffling issue with Queued spin-locks

That’s a reasonable summary. I’d quibble a little and say that if the entire VM were suspended, then the discussion would be moot. The problem comes when one virtual processor from a VM is suspended.

  • Jake Oshins
    Hyper-V guy
    Windows Kernel Team

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@osr.com
Sent: Monday, January 7, 2013 6:40 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Baffling issue with Queued spin-locks

So, let me try to summarize for those following along at home. Please correct me if this does not reflect the current state of the discussion, because I’m not trying to skew the results here… I *am* in fact trying to capture the net of what we’ve all said:

  1. On physical machines, Queued Spin Locks are likely to be either as efficient as or more efficient than Classical Spin Locks.

  2. On paravirtualized VMs (all the later implementations of Hyper-V), Queued Spin Locks perform reasonably well (due to the enlightenments), even if not necessarily QUITE as efficiently as Classical Spin Locks – this is due to the VM being suspended for some multiple of the hypervisor’s scheduling quantum.

  3. Mr. Bassov once again insists on introducing a Linux-specific topic in a Windows-specific discussion, to nobody’s interest but his own.

Peter
OSR



> 3) Mr. Bassov once again insists on introducing a Linux-specific topic in a Windows-specific discussion, to nobody’s interest but his own.

Actually, when it comes to spinlocks, the discussion is normally pretty much OS-agnostic. Therefore, on this particular occasion I am not going too far off-topic. In fact, I am just interested in cache-related issues here, particularly when it comes to NUMA systems, which have absolutely nothing to do with the particular OS…

Anton Bassov

> 3. Mr. Bassov once again insists on introducing a Linux-specific topic in a Windows-specific discussion, to nobody’s interest but his own.

Has the list been a bit too well behaved for you lately, Peter?

James

Oh, so not. You must have skipped those threads.

Ticket locks != Windows… they’re a Linux thing as you know.

As am I.

Peter
OSR

>Ticket locks != Windows… they’re a Linux thing as you know.

Actually, they had been discussed in academic papers long before they made their way into the Linux kernel. You can check the long, rambling thread at http://www.osronline.com/showThread.cfm?link=122194 - you will see that I provided an example of how a ticket spinlock may be implemented back in 2007, although ticket locks were only introduced into the Linux kernel in 2008…

Anton Bassov

That is, indeed, an EPIC thread. Everyone is on that thread. Your friend Mr. Kyler, the late David Craig (damn, I miss him), Pro, Mr. Burn, Mr. Roddy… and, of course, Alberto!

In discussing Windows Classic Spin Locks, that thread also has you asserting:

Sorry, I couldn’t resist. And no… I would not like every statement I’ve ever made on this forum to come back and haunt me five years later either.

ANYhow, I didn’t see the word “ticket” in that thread. Not that it matters; these locking constructs (ticket locks, queued spin locks, etc.) frequently derive from academic work. Though there ARE certainly unique lock inventions in Windows, the spin lock is not one of them :wink:

Peter
OSR

Total bummer. I hadn’t heard, but I suspected, since I hadn’t seen him active in a long time.

RIP, David

Phil
Not speaking for LogRhythm, Inc.

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@osr.com
Sent: Tuesday, January 08, 2013 7:06 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Baffling issue with Queued spin-locks

… the late David Craig (damn, I miss him), …

>Total bummer. I hadn’t heard, but I suspected, since I hadn’t seen him active in a long time.

It’s been over a year. Mr. Guan and I had the pleasure of working with Mr. Craig until his illness.

>In discussing Windows Classic Spin Locks, that thread also has you asserting:

You just took my statement out of context…

Alberto was claiming that a spinlock could somehow be implemented on a system that does not support bus locking, asserting that “A standard spinlock requires one bit.” Therefore, I was trying to explain to him that, at the very minimum, the target hardware has to be able to change the state of a target bit and of a flag in EFLAGS as an atomic operation, which, in turn, requires bus locking…
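For anyone following along at home, here is roughly what that minimum requirement looks like in portable C11. This is a sketch added purely for illustration (it is not the example from the old thread); the whole point is that the test-and-set has to be a single bus-locked read-modify-write.

    #include <stdatomic.h>

    /* Illustrative sketch: even a "one bit" spinlock needs the hardware to
     * read the old value and set the new one as ONE indivisible operation;
     * on x86 that means a LOCK-prefixed read-modify-write (or XCHG, which
     * is implicitly locked).  Without that guarantee, two CPUs could both
     * observe the bit clear and both "acquire" the lock. */
    typedef struct {
        atomic_flag bit;                    /* the single bit of lock state */
    } one_bit_lock;

    #define ONE_BIT_LOCK_INIT { ATOMIC_FLAG_INIT }

    static void one_bit_acquire(one_bit_lock *l)
    {
        /* test-and-set returns the previous value; spin until WE set it */
        while (atomic_flag_test_and_set_explicit(&l->bit, memory_order_acquire))
            ;                               /* spin */
    }

    static void one_bit_release(one_bit_lock *l)
    {
        atomic_flag_clear_explicit(&l->bit, memory_order_release);
    }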

> And no… I would not like every statement I’ve ever made on this forum to come back and haunt me five years later either.

Oh, shit, how many times have I put a foot into my mouth in this NG…

> ANYhow, I didn’t see the word “ticket” in that thread.

I guess this term just had not been coined at the time - there is a good chance that the term “ticket lock” is,
indeed, Linux-specific…

Anton Bassov

In any case, that IS a thread for the ages.

My favorite quote from that whole thread:

> Everything on this thread has been worthless. Everyone knows the correct implementation of a spinlock is:
>
> SPIN: BBSSI #0,LOCK,SPIN

We should print that thread out and put it on a memorial plaque somewhere. Totally epic.

Peter
OSR

>Everything on this thread has been worthless. Everyone knows the correct implementation of a spinlock is:

SPIN: BBSSI #0,LOCK,SPIN

Yes, Mr. Kyler is just incredible in absolutely all respects…

Anton Bassov

> We should print that thread out and put it on a memorial plaque somewhere. Totally epic.
There must be a dark corner in a bar in Maynard that would be appropriate …
Cheers,
Dave Cattley

Oh, shit… what WAS the name of that bar across from the mill?

Peter
OSR

Cutlah’s Place?

Mm

Sent from my Verizon Wireless 4G LTE DROID


>Oh, shit… what WAS the name of that bar across from the mill?

Sadly gone, but it was the “Sit’n Bull”.

Many a brain-cell was sent to heaven from it.

Dave Cattley

The current WDK documentation team DEFINITELY sees things differently. In fact, less than an hour ago I was asked to ask this list how they’d feel if, instead of publishing a bunch of new whitepapers this time around, we took that same guidance and integrated it into the online documentation!

One reason for doing this is that, as you say, the information really belongs directly in the WDK docs. The other reason is that keeping a bunch of separate whitepapers around as downloadable Word documents creates a lot of extra maintenance work for us over time.

Thoughts?

–Diane

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Jake Oshins
Sent: Saturday, January 05, 2013 10:53 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Baffling issue with Queued spin-locks

My opinion is that the documentation on queued spin locks is misleading. I’ve been trying to get it corrected for years, and the old documentation team just kept wanting to publish more whitepapers and such, rather than put the right guidance directly in the WDK. Their take on it was that the WDK was mostly reference, and guidance wasn’t reference. We went around on that one a bunch of times. Perhaps the current team will eventually see things differently.

Any design choice is a set of tradeoffs. In order to understand those tradeoffs, it’s often useful to understand the implementations.

Traditional spinlock:

The acquiring thread raises to the IRQL of the lock (usually DISPATCH_LEVEL), disabling the dispatcher (scheduler). No more thread preemptions are possible. The thread then spins using an atomic compare-and-exchange instruction. When the right value is returned, it owns the lock. Both before and after acquiring the lock, the thread can be preempted by interrupts.

This strategy is simple and relatively cheap when lock contention is low. When lock contention is high, it tends to tie up the bus in cache management traffic, as all the waiters are doing atomic compare-and-exchange operations on the same cache line. In a NUMA machine, waiters in the same NUMA node as the owner are much more likely to successfully acquire the lock.

A waiter that is handling an interrupt will never acquire the lock.
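To make that concrete, here is a minimal sketch of the scheme in portable C11. It is illustrative only, not the kernel’s actual code; it omits the IRQL raise described above, and the names are invented for the example.

    #include <stdatomic.h>

    /* Classic spinlock sketch: every waiter performs an atomic RMW on the
     * SAME word, so under contention they all fight over one cache line. */
    typedef struct {
        atomic_uint owned;                  /* 0 = free, 1 = held */
    } classic_lock;

    static void classic_acquire(classic_lock *l)
    {
        for (;;) {
            unsigned expected = 0;
            if (atomic_compare_exchange_weak_explicit(&l->owned, &expected, 1,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
                return;                     /* we flipped 0 -> 1: lock is ours */

            /* back off to plain reads until the lock looks free again */
            while (atomic_load_explicit(&l->owned, memory_order_relaxed))
                ;
        }
    }

    static void classic_release(classic_lock *l)
    {
        atomic_store_explicit(&l->owned, 0, memory_order_release);
    }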

Queued spinlock:

The acquiring thread raises to the IRQL of the lock. The thread then attempts to insert itself at the end of the list of waiters, using a compare-and-exchange operation, making the end of the list point to a local structure. Then it spins reading the local structure to see if it owns the lock. The owner then releases the lock by following the list and assigning ownership to the next waiter by writing into that waiter’s local structure.

This strategy is a little more complex, and no more expensive in the uncontended case. In the heavily contended case, it’s much more efficient, as all the waiters are spinning on different cache lines, and thus very little cache management traffic flows across the bus. Waiters in remote NUMA nodes are no less likely to acquire the lock.
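The list-of-waiters scheme above is essentially what the literature calls an MCS lock. Here is a minimal C11 sketch of the idea; it is illustrative only, the names are invented, and the kernel’s real in-stack queued spinlock differs in detail (and, again, also handles the IRQL raise).

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Queued (MCS-style) lock sketch: each waiter spins on its OWN node,
     * so contended waiters do not share a cache line. */
    typedef struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool             locked;     /* true while we must keep waiting */
    } qnode;

    typedef struct {
        _Atomic(qnode *) tail;              /* NULL when the lock is free */
    } queued_lock;

    static void queued_acquire(queued_lock *l, qnode *me)
    {
        atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);

        /* append our node at the end of the list of waiters */
        qnode *prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
        if (prev == NULL)
            return;                         /* list was empty: we own the lock */

        atomic_store_explicit(&prev->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;                               /* spin reading OUR local structure */
    }

    static void queued_release(queued_lock *l, qnode *me)
    {
        qnode *succ = atomic_load_explicit(&me->next, memory_order_acquire);
        if (succ == NULL) {
            /* no visible successor: try to swing the tail back to "free" */
            qnode *expected = me;
            if (atomic_compare_exchange_strong_explicit(&l->tail, &expected,
                                                        (qnode *)NULL,
                                                        memory_order_acq_rel,
                                                        memory_order_acquire))
                return;
            /* a new waiter is mid-insert; wait for it to link itself in */
            while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
                ;
        }
        /* hand ownership to the next waiter by writing into ITS structure */
        atomic_store_explicit(&succ->locked, false, memory_order_release);
    }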

Note, though, that if a waiter is interrupted or if the hypervisor schedules some other virtual processor, it may have ownership of the lock assigned to it while it is still off handling the ISR or while the virtual processor isn’t even running. So all the other waiters will wait longer. (I’m assuming here that these sorts of performance issues are usually more interesting in servers and that servers, to a close approximation, are now virtualized.)

For this reason, we initially only used queued locks within the kernel itself, and only when the lock IRQL was higher than device ISRs (something not typically available in drivers).

Eventually, we exposed queued lock primitives to drivers, and at DISPATCH_LEVEL. The documentation that got written made a lot of people think that queued locks are always a win. They’re not. They’re useful if the lock tends to have a lot of waiters, and if there’s no good way to avoid that situation.

Note also that if there is a whole bunch of threads simultaneously attempting to insert themselves into the list of waiters, the contention looks a lot like a traditional spinlock, since those compare-exchange operations have to happen on the list itself, which is shared between all the processors.

As a last thought, I want to point out that the problem of having a virtual processor unscheduled while waiting for queued lock acquisition is important enough that Windows will cede virtual processor time to the hypervisor when a thread has failed to insert itself into the lock queue more than a couple of times. Similarly, it will cede time to the hypervisor when failing to acquire a traditional spinlock. This makes the two lock forms perform more similarly to each other when virtualized.
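In shape, that behavior is just a bounded spin followed by a yield to the hypervisor. A rough sketch follows, with the caveat that SPIN_LIMIT and HvYieldToHypervisor are names made up for illustration; they are not the actual Windows or Hyper-V enlightenment interfaces.

    #include <stdatomic.h>

    #define SPIN_LIMIT 64                   /* illustrative threshold, not the real one */

    extern void HvYieldToHypervisor(void);  /* hypothetical yield hypercall */

    /* Spin a bounded number of times; if the lock still cannot be acquired,
     * assume the owner (or our place in the queue) may be a descheduled
     * virtual processor and cede the physical CPU to the hypervisor. */
    static void enlightened_acquire(atomic_uint *lock_word)   /* 0 = free */
    {
        for (;;) {
            for (unsigned tries = 0; tries < SPIN_LIMIT; tries++) {
                unsigned expected = 0;
                if (atomic_compare_exchange_weak_explicit(lock_word, &expected, 1,
                                                          memory_order_acquire,
                                                          memory_order_relaxed))
                    return;                 /* acquired */
            }
            HvYieldToHypervisor();          /* let someone else make progress */
        }
    }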

  • Jake Oshins
    Windows Kernel Team

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of James Harper
Sent: Saturday, January 5, 2013 4:33 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Baffling issue with Queued spin-locks

> In my disk upper-filter driver, I was using normal spin-locks. Things
> were working fine (including WHQL tests). Then I stumbled upon
> “queued spin locks”. Greed had me in its grip, on reading that they are more efficient.
> I changed all the ordinary spin-locks to queued spin locks. Things shattered.
> After enabling driver verifier, I get to see an assert which declares a
> timeout for the DPC watchdog.
> I wonder what went wrong.

Are you sure you changed all of the calls to be queued spinlocks?

Is your PKLOCK_QUEUE_HANDLE allocated on the stack? Have a read of Doron’s blog entry on this http://blogs.msdn.com/b/doronh/archive/2006/03/08/546934.aspx and make sure you aren’t doing any of the stuff he says not to do.
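For completeness, the usual pattern looks something like the sketch below. DEVICE_EXTENSION and its members are made-up names for illustration; the point is simply that the KLOCK_QUEUE_HANDLE is a local variable on the stack of the acquiring routine and that the very same handle is passed to the release call.

    #include <ntddk.h>

    typedef struct _DEVICE_EXTENSION {
        KSPIN_LOCK Lock;        /* initialized once with KeInitializeSpinLock */
        ULONG      Count;       /* example of data the lock protects          */
    } DEVICE_EXTENSION, *PDEVICE_EXTENSION;

    VOID
    IncrementCount(
        _Inout_ PDEVICE_EXTENSION DevExt
        )
    {
        KLOCK_QUEUE_HANDLE lockHandle;      /* per-acquire queue node, on the stack */

        KeAcquireInStackQueuedSpinLock(&DevExt->Lock, &lockHandle);
        DevExt->Count++;
        KeReleaseInStackQueuedSpinLock(&lockHandle);
    }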

James



Fabulous!

Mm

Sent from my Verizon Wireless 4G LTE DROID


Hi Diane,

When I want to get a general understanding of some new concept, I look for whitepapers. When I want to know what the MSFT-recommended API for a specific task is, I look to the WDK docs. If there is guidance such as “this XXX API is more efficient in the ABC environment than YYY API”, I want to see that in the docs. If I look up a deprecated API in the docs, I expect it to state that it’s deprecated and point me to the current API (yeah, I know this is pretty standard in the docs now, but I’m just making my point).

Thanks!
~kenny


I think putting the “guidance” into the “reference” is a great idea! +1!

Whitepapers have their place, but they should live inside the doc tree, so they are easily referenced by the reference pages, not as a bunch of independently downloadable entities.

That’s my take on it, anyway.

Phil B

Phil Barila | Senior Software Engineer
720.881.5364 (w)
LogRhythm, Inc.
A LEADER 2012 SIEM Magic Quadrant
WINNER of SC Magazine’s 2012 SIEM Best Buy


There’s a place for whitepapers. That is, a detailed technical exploration of a topic for the community. This is useful, for example, when introducing a new technology (such as when MSI support was added to Windows) or when trying to illustrate, in a cogent end-to-end way, some complex concept that the community has, or will likely have, trouble with. For example, “porting” 32-bit drivers to 64-bit. A whitepaper gives you the opportunity to download a big chunk of information (50 or 80 pages) on one topic that reads like a story.

The key is, ALL the information from the whitepapers NEEDS to be rewritten and integrated back into the WDK. It’s just too hard to find otherwise.

As a rule, I’d like to see the Remarks section limited to comments and guidance about a given DDI. I’d like to see the Remarks section POINT to conceptual sections in the WDK that explain overall concepts, and issues that apply to multiple DDIs. This appears to be the current approach, and I think it’s a good one. It’s not followed 100% consistently, but … you know… what is, right?

This may be a controversial idea, but I also don’t think the WDK needs to delve into ultimate depth on every issue. It’s fine to say “In most cases, kernel developers should…”. For Queued Spin Locks, for example, it wouldn’t be unreasonable to provide guidance that Queued Spin Locks generally scale better in multiprocessor environments and provide equal or better performance than Executive Spin Locks. You can then note that in certain virtualized systems, particularly those running on Windows Server 2003 or earlier, Queued Spin Locks are likely to be less performant than Executive Spin Locks.

I’ll also take this opportunity to note that I think it’s *really* important that the WDK docs work harder to avoid providing guidance by making absolute pronouncements that are not literally correct. These sweeping generalizations appear all too frequently, for my taste, in the WDK docs of the early 2000’s. In these cases, what is actually guidance is presented in the WDK as global truth. This is VERY confusing, and winds up diluting the WDK’s value. SOME things the WDK docs say you shouldn’t do are actually things you should NEVER do… other things the WDK says you shouldn’t do are things you shouldn’t do unless you’re clever, know what you’re doing, and follow the appropriate procedures. All this makes it difficult for less experienced kernel engineers who read the WDK, which says “don’t do X”, and we more experienced folks have to explain “Well, they don’t really MEAN don’t do X… they mean that you shouldn’t do X unless you know what you’re doing and you’ve previously done A, B and C.”

Given the above, I think the docs shouldn’t shy away from qualifying statements such as “Most developers should…” or “In most drivers, the best approach will typically be…” – I know there are often “writing” issues with these qualifiers (WHICH developers should, how does the reader know if this applies to them, if MOST should then should we say who SHOULDN’T, if it’s good for MOST developers shouldn’t we just say EVERYONE should). But I think they are *really* important from an engineering standpoint. Guidance is merely guidance, not a declaration of an architectural precept. Those things need to be kept separate.

Of course, this ALSO takes us to the whole discussion about whether the WDK’s role is to present overarching architectural concepts and the DDI details and let people figure it out (the way the old DDK docs did back in the mid-90’s), or to present more specific, task-based information and, in so doing, steer developers toward mainstream solutions and the associated DDIs (as has been the approach more recently). THAT, however, is a FAR longer discussion than can be had in a forum.

Peter
OSR