Spinlocks vs RW Locks

Hi, in an NDIS miniport, I was wondering what is the best locking
strategy in the following scenarios, given multi-proc systems:

  1. I have some global configuration in the driver which can be
    configured from usermode via some IOCTLs. These settings are accessed
    by the adapters in their Tx/Rx paths. Seems like NDIS_RW_LOCK_EX APIs
    would be ideal in this case since config changes should be extremely
    infrequent in the global config but the datapath requires these
    parameters on all reads and writes. Seems like the netvmini sample
    uses this. If this is correct, why do various other (non-network - say
    storage) drivers often use SpinLocks? Isn’t that inefficient?

2a) For Tx/Rx lists for queuing pkts that require processing, it
appears SpinLocks + NdisInterlocked[Insert/Remove][Head/Tail]List is
the recommended mechanism?

2b) Aside from locking, I was curious if it is common to have Rx/Tx
queues per processor in miniports to avoid even the above locking of
queues?

Thanks!

> Hi, in an NDIS miniport, I was wondering what is the best locking
> strategy in the following scenarios, given multi-proc systems:

  1. I have some global configuration in the driver which can be
    configured from usermode via some IOCTLs. These settings are accessed
    by the adapters in their Tx/Rx paths. Seems like NDIS_RW_LOCK_EX APIs
    would be ideal in this case since config changes should be extremely
    infrequent in the global config but the datapath requires these
    parameters on all reads and writes. Seems like the netvmini sample
    uses this. If this is correct, why do various other (non-network - say
    storage) drivers often use SpinLocks? Isn’t that inefficient?

Define “inefficient”. Propose an alternative that is “more efficient”.
Note that suspending a thread and starting another, then resuming your
original thread on the unlock, will cost at the very least several hundred
thousand instructions. Explain how you would make this work at
DISPATCH_LEVEL. Reinvent spinlocks as the only possible choice.

2a) For Tx/Rx lists for queuing pkts that require processing, it
appears SpinLocks + NdisInterlocked[Insert/Remove][Head/Tail]List is
the recommended mechanism?

See answer to question 1.

2b) Aside from locking, I was curious if it is common to have Rx/Tx
queues per processor in miniports to avoid even the above locking of
queues?

If you manipulate them at DISPATCH_LEVEL, you might consider per-processor
queues, but note that this requires that all threads that are accessing the
queue are going to run on that core only. If PASSIVE_LEVEL threads are
involved, it is impossible to avoid the locking. Can you really trust
that all DISPATCH_LEVEL threads that could access the queue are restricted
to a single core? In general, spin locks that have very low contention
have practically no overhead, and if they have high contention you should
reconsider your overall architecture so there is no high-contention
locking. If the high-contention locking is intrinsic to the solution,
consider queued spin locks to reduce bus contention, particularly in NUMA
systems.

You are probably overly concerned about a non-issue.
joe

Thanks!


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

The most important rule of performance is to not do anything unless you can measure it. Suppose you decide that Lock_ABC is better – how will you know? You have to have a benchmark, and it needs to give a nice score at the end.

The second most important rule of performance is to have a goal. If you don’t have a goal, how will you know when you’re ready to ship? Since, from rule #1, you have a benchmark with a score, you should express your goal in terms of that score.

Now that you’ve gotten this far, perf work is easy. Write the simplest, most maintainable code. Then benchmark and see if your goal is met. If the goals are met, ship (hooray!). Else, profile to identify and fix the hotspots.

So in short, just use a spinlock, until somebody is looking at an ETL file in WPA (formerly XPERF), and telling you that your spinlock is too hot.

But I suppose you weren’t asking for a discussion on engineering. You wanted to know some particulars on comparing NDIS RW locks to spinlocks:
* Spinlocks are easier to use and use less memory
* Allocating a spinlock cannot fail; allocating an NDIS RW lock can fail
* NDIS RW locks are recursive for read-acquires; spinlocks are not recursive
* Neither is fair
* Both synchronize at DISPATCH_LEVEL
* Hot spinlocks acquired for write scale poorly when more processors are added
* Hot NDIS RW locks acquired for write scale even worse (just use spinlocks instead)
* Hot NDIS RW locks acquired for read scale better than spinlocks
* The debugger can query the owner of an NDIS RW lock (e.g., !ndiskd.rwlock); spinlocks are a mystery
* PREfast and DriverVerifier are better at understanding spinlocks

BTW, do not use NdisAcquireReadWriteLock in code written for Windows 7 or later. NdisAcquireReadWriteLock falls apart when you have more than 64 CPUs, and even for smaller systems, we don’t spend time tuning it. Here, I’m only talking about its successor, NdisAcquireRWLockXxx.

In addition to very insightful replies of Jeffrey Tippet
and Dr. Newcomer, two comments:

* NdisInterlocked[Insert/Remove][Head/Tail]List may use
the spinlock in some weird way incompatible with normal use.
So, do not call NdisAcquireSpinLock etc. on the same spinlocks
that you use with NdisInterlocked…
If you need to protect a more complex data structure than a list,
use yet another “outer” spinlock for the whole.

* AFAIK only NDIS has RW spinlocks; other subsystems don’t
(at least, not publicly documented) and use Ke… locks.

Regards,
– pa

On 17-Nov-2012 01:17, Puchu Pachok wrote:

Hi, in an NDIS miniport, I was wondering what is the best locking
strategy in the following scenarios, given multi-proc systems:

  1. I have some global configuration in the driver which can be
    configured from usermode via some IOCTLs. These settings are accessed
    by the adapters in their Tx/Rx paths. Seems like NDIS_RW_LOCK_EX APIs
    would be ideal in this case since config changes should be extremely
    infrequent in the global config but the datapath requires these
    parameters on all reads and writes. Seems like the netvmini sample
    uses this. If this is correct, why do various other (non-network - say
    storage) drivers often use SpinLocks? Isn’t that inefficient?

2a) For Tx/Rx lists for queuing pkts that require processing, it
appears SpinLocks + NdisInterlocked[Insert/Remove][Head/Tail]List is
the recommended mechanism?

2b) Aside from locking, I was curious if it is common to have Rx/Tx
queues per processor in miniports to avoid even the above locking of
queues?

Thanks!

wrote in message news:xxxxx@ntdev…
>> 2b) Aside from locking, I was curious if it is common to have Rx/Tx
>> queues per processor in miniports to avoid even the above locking of
>> queues?
>
> If you manipulate them at DISPATCH_LEVEL, you might consider per-processor
> queues, but note that this requires that all threads that are accessing the
> queue are going to run on that core only. If PASSIVE_LEVEL threads are
> involved, it is impossible to avoid the locking.

It’s not; you just need to set affinity on the thread, which will always be
honored by the OS. Always, that is, if processor-group-aware functions for
setting affinities are being used and not the legacy ones, because the
system may be booted with special parameters that activate a CPU
redirection scheme which can mess up the logic for you.

> Can you really trust that all DISPATCH_LEVEL threads that could access
> the queue are restricted to a single core?

AFAIK there is no such thing as DISPATCH_LEVEL threads. Threads run at
PASSIVE_LEVEL unless they get elevated. Perhaps you mean other units of
execution, such as DPCs? They are normally not referred to as threads. In
any case the story remains the same: even though no context swap can take
place, you still need to set the target processor (affinity) for the DPC.

I agree that just because you can, it doesn’t mean you should. I would not
consider it unless you know you have a real contention problem that cannot be
solved by ordinary means.

//Daniel


wrote in message news:xxxxx@ntdev…
>>> 2b) Aside from locking, I was curious if it is common to have Rx/Tx
>>> queues per processor in miniports to avoid even the above locking of
>>> queues?
>>
>> If you manipulate them at DISPATCH_LEVEL, you might consider
>> per-processor
>> queues, but note that this requires that all threads that are accessing
>> the
>> queue are going to run on that core only. If PASSIVE_LEVEL threads are
>> involved, it is impossible to avoid the locking.
>
> It’s not; you just need to set affinity on the thread, which will always be
> honored by the OS. Always, that is, if processor-group-aware functions for
> setting affinities are being used and not the legacy ones, because the
> system may be booted with special parameters that activate a CPU
> redirection scheme which can mess up the logic for you.
>
And you set the thread affinity how? By telling the user to do it? Lots
of luck on that one! Note that a top-level driver is always called at
PASSIVE_LEVEL in the context of the user thread. System worker threads
should not have their affinities messed with. Dedicated driver threads
can do whatever they want. But how will you know what that is? There is
nothing, as far as I know, that requires that an interrupt be fielded by a
particular core, and it appears that a current accident of implementation
on the current hardware platforms makes this true this week, but will it
be true on next week’s chipset?

>> Can you really trust that all DISPATCH_LEVEL threads that could access
>> the queue are restricted to a single core?
>
> AFAIK there is no such thing as DISPATCH_LEVEL threads. Threads run at
> PASSIVE_LEVEL unless they get elevated. Perhaps you mean other units of
> execution, such as DPCs? They are normally not referred to as threads. In
> any case the story remains the same, even though no context swap can take
> place you still need to set the target processor (affinity) for the DPC.

It depends on your definition of “thread”. If you mean “schedulable
thread”, then yes, there is no such thing as a DISPATCH_LEVEL “thread”.
But if you mean “current execution context”, then both ISRs and DPCs
represent “threads of control”, and they can be running concurrently on
multiple cores, or pseudo-concurrently on a single core, with or without
hyperthreading. And, from the viewpoint of concurrency, you have to think
of each of these as a thread of execution, albeit a thread of execution
not under the control of the PASSIVE_LEVEL scheduler. I generally say,
“Every interrupt-driven device will have a driver running a MINIMUM of
three threads of control: ISR, DPC, and one or more PASSIVE_LEVEL threads.
Any interaction between these threads must be mediated by a locking
mechanism appropriate for the most restrictive level involved.”

You have fixated on one class of threads and ignored the fact that a DPC
or ISR can preempt a PASSIVE_LEVEL thread, a DPC can be preempted by an
ISR, and an ISR can be preempted by the ISR of a higher-priority device.
The fact that the scheduling mechanism for these threads is not the
PASSIVE_LEVEL thread scheduler does not change the fact they look like
threads.

>
> I agree that just because you can, it doesn’t mean you should. I would not
> consider it unless you know you have a real contention problem that cannot
> be solved by ordinary means.

It sounds like premature optimization based on a dataless guess about the
possibility that there might be a problem. So the correct solution
appears to be to build a gratuitously complex solution to what is very
likely a nonexistent problem. What’s Wrong With This Picture?

>
> //Daniel

This has GOT to be the post of the week. Or maybe post of the MONTH.

Let’s hear it one more time:

Mr. Tippet’s guidelines above should be engraved on a plaque and hung in every software engineer’s office… or at least that of their lead. Anytime somebody asks you “why did you use that spin lock” or tells you “I think you could make this faster by doing xyz” without a specific reason or goal, remove the plaque from the wall and hit them over the head with it.

Peter
OSR

The Executive Subsystem has RW Spinlocks, documented and available for general use, starting in Vista SP1. Vide ExAcquireSpinLockExclusive, for example.

Peter
OSR

wrote in message news:xxxxx@ntdev…
> And you set the thread affinity how? By telling the user to do it? Lots
> of luck on that one! Note that a top-level driver is always called at
> PASSIVE_LEVEL in the context of the user thread.

SetThreadGroupAffinity is the interface. If the thread does not run on the
specified core, it is rescheduled immediately.

>System worker threads should not have their affinities messed with.

That’s interesting and new to me. Do you care to elaborate or share any
documentation on that one?

> There is nothing, as far as I know, that requires that an interrupt be
> fielded by a particular core, and it appears that a current accident of
> implementation on the current hardware platforms makes this true this
> week, but will it be true on next week’s chipset?

You are mostly right about that one. There are several ways to request that
interrupts occur on a particular core, but there is no way to get a
guarantee. The work that requires the lock is normally delegated to a DPC,
though, so I would suppose this is not an issue.

> You have fixated on one class of threads and ignored the fact that a DPC
> or ISR can preempt a PASSIVE_LEVEL thread, a DPC can be preempted by an
> ISR, and an ISR can be preempted by the ISR of a higher-priority device.
> The fact that the scheduling mechanism for these threads is not the
> PASSIVE_LEVEL thread scheduler does not change the fact they look like
> threads.
>

No, I haven’t ignored anything. A PASSIVE_LEVEL thread cannot protect a
resource that is also accessed by a DPC without taking precautions. If both
execute on arbitrary CPUs, a lock is required. If both execute on the same
CPU, the PASSIVE_LEVEL thread must elevate IRQL before accessing the
resource.

//Daniel

Back in the early 1970s, I developed the definitive performance
measurement tool, one that was still in use in 1983. A friend came to me
and explained that he needed to get performance data. “I’ve rewritten the
key function, and by counting instructions, its execution path is slightly
less than half of the old execution path. But my program runs no faster,
and I need to know why.”

I showed him how to use the tool, and how to interpret the results. We
measured his “key function” as using 0.25% of the total time. “Just
think,” I told him, “A week ago that function consumed 0.5% of your time.
See what spending a week optimizing code in the absence of data has
gained?”

My reaction to this post was, “At least one other person understands the
problem!”

I love the idea of hitting people over the head with the bronze plaque. I
know one company that nearly went under because of constant “code
improvements” based on “customer demand for better performance”. My
Inside Source said that the programmers just guessed at where the problems
might be, and “optimized” the code, but each “optimization” introduced at
least one bug, some of them quite subtle and complex. He was the
equivalent of a Team Leader, and what he imposed was that no source module
could be checked out for modification unless the programmer doing so had a
sound technical reason for doing so, and “performance improvement” in the
absence of data was not such a reason. Every check-in had to be signed
off by two other programmers who, if the claim was a bug fix, verified
that the ONLY changes were to fix that bug. Pretty extreme, but the
product went from 13,000 outstanding bugs, some very serious, to a few
hundred, “largely cosmetic”, in about 18 months. Many “bugs” were fixed by
rolling back the source to the “poorer performing” but correct code. He
reported that in nearly all cases, these reversions produced no measurable
decrease in performance.
joe

This has GOT to be the post of the week. Or maybe post of the MONTH.

Let’s hear it one more time:

Mr. Tippet’s guidelines above should be engraved on a plaque and hung in
every software engineer’s office… or at least that of their lead.
Anytime somebody asks you “why did you use that spin lock” or tells you
“I think you could make this faster by doing xyz” without a specific
reason or goal, remove the plaque from the wall and hit them over the head
with it.

Peter
OSR



On 18-Nov-2012 17:07, xxxxx@osr.com wrote:

The Executive Subsystem has RW Spinlocks, documented and available for general use, starting in Vista SP1. Vide ExAcquireSpinLockExclusive, for example.

Peter
OSR

Ah, thanks, at last these functions are defined in WDK 8.
But they weren’t defined in previous public WDKs? Cannot find them in 7600.
– pa


This has GOT to be the post of the week. Or maybe post of the MONTH.

Let’s hear it one more time:

Mr. Tippet’s guidelines above should be engraved on a plaque and hung in
every software engineer’s office… or at least that of their lead. Anytime
somebody asks you “why did you use that spin lock” or tells you “I think you
could make this faster by doing xyz” without a specific reason or goal,
remove the plaque from the wall and hit them over the head with it.

I wonder if this guy has finished his invention yet… http://www.bash.org/?4281

James

You must have missed something… AFAIK, they were exposed, as Peter
said earlier :)

-pro

On Sun, Nov 18, 2012 at 4:57 PM, Pavel A wrote:
> On 18-Nov-2012 17:07, xxxxx@osr.com wrote:
>> The Executive Subsystem has RW Spinlocks, documented and available for
>> general use, starting in Vista SP1. Vide ExAcquireSpinLockExclusive, for
>> example.
>>
>> Peter
>> OSR
>
>
> Ah, thanks, at last these functions are defined in WDK 8.
> But they weren’t defined in previous public WDKs? Cannot find them in 7600.
> – pa

Correct. They were missing from WDM.H in both Vista and Win7. This appears to have been an oversight, as the functions are present in the OS and marked as supported in the Win8 WDK from Vista SP1 and forward.

Given that they’re supported and VERY useful, we’ve been using them in some very mainstream Win7 projects (we just cooked our own function prototypes) without any problem.
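Presumably the hand-cooked prototypes amounted to declarations along these lines. This is a sketch only: EX_SPIN_LOCK is a plain LONG, and these signatures match what later WDKs expose, but verify them against a current WDK header before relying on them.

```c
/* Hand-rolled declarations for the Vista SP1+ executive RW spinlocks,
 * for building against a WDK (e.g. 7600) whose wdm.h omits them.
 * Sketch only -- verify against a current WDK before use. */
typedef LONG EX_SPIN_LOCK, *PEX_SPIN_LOCK;

NTKERNELAPI KIRQL ExAcquireSpinLockShared(PEX_SPIN_LOCK SpinLock);
NTKERNELAPI VOID  ExReleaseSpinLockShared(PEX_SPIN_LOCK SpinLock, KIRQL OldIrql);
NTKERNELAPI KIRQL ExAcquireSpinLockExclusive(PEX_SPIN_LOCK SpinLock);
NTKERNELAPI VOID  ExReleaseSpinLockExclusive(PEX_SPIN_LOCK SpinLock, KIRQL OldIrql);
```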

Peter
OSR

Well, if you don’t mind, could you please explain what you mean by the term “measure” in this context. Let’s face it - in order to be able to measure the performance of something that deals with MP issues you have to test it in some situation that resembles the real-life one, which means you need a comparable piece of hardware. If you test on a piece of hardware of a different class… well, don’t be surprised if your algorithm that had passed all tests on a PC and low-end server class of machines perfectly well shows unsatisfactory results on a high-end machine.

For example, look at BFS (Brainfuck Scheduler). This algorithm has been designed specifically for relatively low-end machines with the number of CPUs not exceeding 16 - it works perfectly well on them, and shows
both good performance and fast response. Therefore, the results that you get if you test it on your desktop will be just excellent, so that you will say “my goal has been met” and ship a product. However, your customer that runs your product on a high-end server will not be happy about it at all, because it relies upon an algorithm that is simply unsuitable for high-end machines, and was never meant to be …

Anton Bassov

xxxxx@flounder.com wrote:

I showed him how to use the tool, and how to interpret the results. We
measured his “key function” as using 0.25% of the total time. “Just
think,” I told him, “A week ago that function consumed 0.5% of your time.
See what spending a week optimizing code in the absence of data has
gained?”

This is “Amdahl’s Law”, for those of us who like to be name-droppers
when we scold our peers.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I don’t get it: They’re running your stuff in an environment for which it was not created. To me, this equals either “not supported” or “don’t complain about the performance.”

So, “measure” clearly means “measure in a suitable analog of the target environment, using a suitable analog of the target workload.”

Peter
OSR

The problem frequently arises because “minimum configuration” does not
generalize to “all configurations”, and because “performance in a test lab
situation with virtually no load” does not scale to “performance in a
real-life situation with SQL Server, IIS, and six other heavy-duty server
apps running”.

Generally, a product is specified in terms of the minimum required to make
it work: memory, CPU, disk, etc. But everyone seems to ignore the fact
that a 64-processor system is not just a uniprocessor system replicated 64
times. “Hot” spinlocks, particularly if they are not queued spin locks,
impact OVERALL system performance. Your driver may not work if there is a
higher-priority (in the PCI sense) device interrupting at a high rate,
screwing over any timings you may have thought you had. And so on.

When my wife was upgrading her library to their first online library
system, she received a box of 17 thick looseleaf manuals. A week later,
they asked her if she had finished the “configuration sheet”, and she said
she hadn’t found it. They said, “Oh, that’s Volume 3”.

Nobody wants to read a book delimiting the properties of the device, and
it may not even be possible to write one (the library system was a mature
product, with a ten or fifteen year history), so the box says, in 3-point
type, “Requires Windows Vista or later, 2GB RAM, 250MB of disk space” and
unstated is the fact that the processor is one that is at least new enough
to run Vista. But the nasty reality is that the real world is not
particularly friendly to toy tests.
joe

I don’t get it: They’re running your stuff in an environment for which it
was not created. To me, this equals either “not supported” or “don’t
complain about the performance.”

So, “measure” clearly means “measure in a suitable analog of the target
environment, using a suitable analog of the target workload.”

Peter
OSR



> I don’t get it: They’re running your stuff in an environment for which it
> was not created. To me, this equals either “not supported” or “don’t
> complain about the performance.” So, “measure” clearly means “measure in a
> suitable analog of the target environment, using a suitable analog of the
> target workload.”

Yes, but how are you going to define the criterion “designed for a certain
type of environment” using Mr. Tippet’s guideline? Can you afford a testing
environment with a number of CPUs well exceeding the range described by the
‘unsigned short’ type? I guess Mr. Tippet’s post gives us yet another
explanation of why someone looking at the Top 500 list practically never
encounters the word “Windows” on it…

Anton Bassov

Hmmm… let me think… “that well exceeds the one described by unsigned short type”… UCHAR… USHORT… what WOULD that number be? USHORT’s half an ULONG… and we’re talking C in kernel mode here… Hmmmm… 65536? Yes, I think that’s what you meant. 65536! More than 65535 CPUs? Why would I do that? Windows only supports 256 CPUs per system.

But… I’ll play along for fun, because well, it’s still morning here and I haven’t left for the office yet. So, here’s my answer: If that’s the environment that my software is targeting… why, yes. If not in-house, then certainly at a test site with which I’m partnering.

I guess I don’t understand your entire point (hey, THAT’s a first, huh?).

The way WE write software is we create a set of functional requirements, define a design that we expect will meet those requirements, write the software, and then test our software to determine – as far as practical – that our implementation and design in fact DO meet the stated functional requirements. The requirements, of necessity, include supported target environments and performance metrics, if performance is part of the goal.

One place where I *will* differ, if only slightly, from Mr. Tippet is that in some cases, or for some releases, or for some operations, performance is completely secondary to functionality. The product has to work properly, even if it exhibits crappy performance. In these cases, I’ve found that it actually works to have a performance goal that’s entirely subjective. For example, the goal might be “doing xyz in environment abc must work, and performance can’t totally suck” – My experience is that three or four engineers locked in a room can generally come to a consensus on when performance of an operation does and does not “totally suck.” Yes, this is a VERY LOW performance goal… but it’s still a valid goal. And YES, you might get customers who disagree when it comes to whether the exhibited performance in their environment “totally sucks” or not. In that case, you explain to them that functionality and not performance was the goal for the product/release/feature and whether you ever plan to enhance the performance in this area.

In general, when we have customers that report our software behaving in ways other than those we’ve specified, performance or otherwise, we first ask: “Is this product running in a way we intended and in a supported environment?” If it’s not, we explain that to the customer.

If you try to dig a hole with an axe, and you break that axe handle, it’s REALLY not a problem the axe manufacturer should have to deal with. You can complain all you want about how the axe is inefficient, and how the handle should not have broken, but the bottom line is the problem lies with YOU not with the axe – you’re using the wrong tool for the job.

It’s no different in the world of software. In the real world, there’s zero need to complicate things further. You just make your job harder, and you gain nothing.

Peter
OSR