Generic Spin Lock Question

I recently had some code that was structured like

-Claim SpinLock A
-Do about 10 lines of code where claiming the spin lock is NECESSARY
-Do another 10 lines of code where holding the spin lock is NOT NECESSARY
-Do another 10 lines of code where claiming the spin lock is NECESSARY
-Release SpinLock A

The obvious solution is to reorganize so the middle 10 lines of code are
outside the spin lock. But for argument’s sake, let’s say that is not possible.

Is the above code structure more efficient, or is the following structure
more efficient?

-Claim SpinLock A
-Do about 10 lines of code where claiming the spin lock is NECESSARY
-Release SpinLock A
-Do another 10 lines of code where holding the spin lock is NOT NECESSARY
-Claim SpinLock A
-Do another 10 lines of code where claiming the spin lock is NECESSARY
-Release SpinLock A
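
For concreteness, the two structures look roughly like this in a WDM driver
(a sketch only; MyLock and the three Do* routines are made-up placeholders
for the real code):

#include <wdm.h>

VOID DoLockedWork1(VOID);    /* the first 10 lines - needs the lock      */
VOID DoUnlockedWork(VOID);   /* the middle 10 lines - does not need it   */
VOID DoLockedWork2(VOID);    /* the last 10 lines - needs the lock       */

KSPIN_LOCK MyLock;           /* initialized elsewhere with KeInitializeSpinLock */

/* Structure 1: hold the lock across all 30 lines. */
VOID StructureOne(VOID)
{
    KIRQL oldIrql;

    KeAcquireSpinLock(&MyLock, &oldIrql);
    DoLockedWork1();
    DoUnlockedWork();
    DoLockedWork2();
    KeReleaseSpinLock(&MyLock, oldIrql);
}

/* Structure 2: drop and reacquire around the middle section. */
VOID StructureTwo(VOID)
{
    KIRQL oldIrql;

    KeAcquireSpinLock(&MyLock, &oldIrql);
    DoLockedWork1();
    KeReleaseSpinLock(&MyLock, oldIrql);

    DoUnlockedWork();

    KeAcquireSpinLock(&MyLock, &oldIrql);
    DoLockedWork2();
    KeReleaseSpinLock(&MyLock, oldIrql);
}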

I know there is probably no good answer for this because there are so many
variables coming into play, but let’s say this code is executed very
frequently. I guess the fundamental question is: what is the real overhead of
claiming a spin lock on an MP machine? And where do you draw the line
between claiming a spin lock too many times and holding it for too long?

Thanks for any input.
PJ

Acquiring and releasing spinlocks are expensive. Therefore it is in
your best interest, performance-wise, to limit the acquiring and releasing.
However, in every project there are engineering decisions that must be made.
Both Method 1 and Method 2 work, so the question comes down to what those 10
lines of code where the lock is not necessary are, and why they have to be
done after a piece of code that requires the spin lock… I think this
is one of those “it depends” answers…


Mark Cariddi
Open Systems Resources, Inc.
www.osr.com


This absolutely depends. What are those 10 lines doing, i.e., how long will they take to execute? You stated this code is called frequently, so how many other places in the driver is this spinlock taken? If it’s used throughout and frequently, you might not want to hold it as long, as you might be holding up other threads on other processors. You would probably have to profile the driver to find the sweet spot as to when it’s more beneficial to hold the lock than to drop it, and this would likely vary from single- to multiple-processor environments. And then, you would have to profile under load or some such. There is no general answer to this question, although you could certainly determine how many cycles it takes to grab the lock and make a rough estimate that way.
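
One rough way to get that cycle estimate is to time a batch of uncontended
acquire/release pairs, something like this sketch (the iteration count and
names are made up, and this only measures the uncontended cost):

#include <wdm.h>

#define ITERATIONS 100000

/* Sketch: time ITERATIONS uncontended acquire/release pairs and print the
   average cost in performance-counter ticks. Contention and cache-line
   migration will make the real-world cost higher than this. */
VOID MeasureSpinLockCost(VOID)
{
    KSPIN_LOCK    testLock;
    KIRQL         oldIrql;
    LARGE_INTEGER freq, start, end;
    ULONG         i;

    KeInitializeSpinLock(&testLock);

    start = KeQueryPerformanceCounter(&freq);
    for (i = 0; i < ITERATIONS; i++) {
        KeAcquireSpinLock(&testLock, &oldIrql);
        KeReleaseSpinLock(&testLock, oldIrql);
    }
    end = KeQueryPerformanceCounter(NULL);

    /* freq is ticks per second; divide to convert ticks to time. */
    DbgPrint("avg ticks per acquire/release pair: %I64d\n",
             (end.QuadPart - start.QuadPart) / ITERATIONS);
}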


Bill McKenzie


XP’s queued spinlocks are cheaper (each CPU spins on its own variable, so there is no
cache snoop thrashing), and they can also be ported back to previous NT versions for
extremely performance-critical code paths.
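
For reference, the XP in-stack queued spinlock calls look like this (a sketch;
MyLock and the protected section are placeholders):

#include <wdm.h>

KSPIN_LOCK MyLock;   /* initialized elsewhere with KeInitializeSpinLock */

/* Sketch: in-stack queued spin lock. Each contending CPU spins on its own
   queue entry (on its own stack) instead of on the shared lock word, so
   waiters do not ping-pong the lock's cache line between CPUs. */
VOID DoProtectedWork(VOID)
{
    KLOCK_QUEUE_HANDLE lockHandle;

    KeAcquireInStackQueuedSpinLock(&MyLock, &lockHandle);
    /* ... code that needs the lock ... */
    KeReleaseInStackQueuedSpinLock(&lockHandle);
}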

Max


I knew it was going to be one of those “it depends” answers, but could
someone please clarify exactly how expensive a spin lock is? For example,
what does the hardware actually have to do to determine that the spin lock
can be entered? Also, please quantify the number of cycles lost when trying
to claim a spin lock, or at least provide a lower bound.

Let’s assume CPU0 claimed and released the spin lock, and some small number
of cycles later CPU1 wants to claim the spin lock. What is the penalty? I
understand this is going to get into hardware specifics, but can someone
please enlighten me about a few of the details.

Thanks,
PJ
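
To make the mechanics concrete: a classic test-and-set spin lock boils down to
roughly the following (an illustrative sketch only, not the actual HAL code).
The interlocked exchange is a locked operation that needs exclusive ownership
of the lock's cache line; that is where the serialization and the cross-CPU
cache traffic discussed in the replies below come from.

#include <wdm.h>

/* Illustrative sketch of a test-and-set spin lock - not the real HAL code. */
typedef volatile LONG MY_SPIN_LOCK;

VOID MySpinAcquire(MY_SPIN_LOCK *Lock)
{
    for (;;) {
        /* Locked exchange: the CPU must gain exclusive ownership of the
           lock's cache line, which is the expensive part if another CPU
           touched that line last. */
        if (InterlockedExchange(Lock, 1) == 0) {
            return;                     /* got it */
        }
        /* Spin with plain reads until the lock looks free, then retry.
           (Real implementations add a PAUSE hint inside this loop.) */
        while (*Lock != 0) {
        }
    }
}

VOID MySpinRelease(MY_SPIN_LOCK *Lock)
{
    *Lock = 0;   /* a plain store suffices on x86's strongly ordered model */
}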


As someone who has spent a good portion of his life on these issues (e.g., getting a 256-CPU machine to stay up for more than 4 hours), it almost absolutely does not “depend”. Unless those ten lines do something extremely expensive, there is NEVER any point in the extra release/acquire. Here is why:

1) Each acquire/release has one (…AtDpc) or three (not at DPC) serializing instructions (at least on x86). So a normal acquire/release with IRQL manipulation will cost as much as your 10 lines, without contention.
2) If there is contention, and another CPU gets in between the two lock acquisitions, then you pay through the nose shifting cache lines back and forth. Your average 10 lines of code (without calls) will be cheaper than a single cache-line miss. Cache misses cost hundreds of instruction times on P4-class machines.
In more detail:

CPU 1 gets the lock A dirty in cache.
CPU 2 tries to get the lock, gets cache miss, invalidates the line in CPU 1
CPU 1 writes a zero to the lock to free it, getting a cache miss, invalidating the line in CPU 2
CPU 2 gets a read only copy of the cache line (another miss), sees the lock free.
CPU 2 invalidates the line on CPU 1 and sets the lock.
CPU 1 tries to get the lock, gets a cache miss, invalidating the line on CPU 2, spins
CPU 2 writes a zero to the lock, getting a cache miss, invalidating the line on CPU 1
CPU 1 gets a read only copy of the cache line, sees the lock free.
CPU 1 invalidates the line on CPU 2 and sets the lock.

In the above, there are 6 misses and 2 transitions from shared to exclusive (invalidate other copies).

Holding the lock the entire time, you get:
CPU 1 gets the lock A dirty in cache.
CPU 2 tries to get the lock, gets cache miss, invalidates the line in CPU 1, spins
CPU 1 writes a zero to the lock to free it, getting a cache miss, invalidating the line in CPU 2
CPU 2 gets a read only copy of the cache line (another miss), sees the lock free.
CPU 2 invalidates the line on CPU 1 and sets the lock.

In this case, there are 3 misses and 1 shared=>exclusive bus transaction.

So in the latter case, I stall CPU 2 for 10 lines. These 10 lines would have to be an awfully large number of
instructions to be worse than 3 extra cache misses.

3) Fewer lock gets/releases => simpler code => more likely to work in all cases.

4) Cache line contention occurs at a much lower duty cycle than lock contention, and you pay plenty for it. I.e., if CPU 1 gets lock A, releases it, then later CPU 2 gets lock A, CPU 2 gets a cache miss. Instead of worrying about how to reduce the lock-held time, it’s much more cost-effective to worry about avoiding having locks change CPUs. For example, instead of one free list, have one per CPU (sketched below). Free to the one corresponding to the current CPU. Allocate from that one unless it’s empty, then search other CPUs. Under most loads, the locks on each list will not move CPUs, nor will the data structures themselves (and make the free lists LIFO, not FIFO). And by the way, no lock contention, either.
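
A minimal sketch of that per-CPU free list idea (MAX_CPUS, the structures,
and the names are made up; the fallback search of the other CPUs' lists is
only indicated, not implemented):

#include <wdm.h>

#define MAX_CPUS 32            /* sized for the machines of interest */

typedef struct _FREE_ENTRY {
    struct _FREE_ENTRY *Next;
} FREE_ENTRY;

typedef struct _PER_CPU_LIST {
    KSPIN_LOCK  Lock;
    FREE_ENTRY *Head;
    UCHAR       Pad[64 - sizeof(KSPIN_LOCK) - sizeof(FREE_ENTRY *)];
                            /* pad to 64 bytes (assumed line size) so
                               different CPUs' lists don't share a line;
                               the array should start line-aligned too */
} PER_CPU_LIST;

PER_CPU_LIST FreeLists[MAX_CPUS];

/* Free to the current CPU's list, LIFO, so recently used (cache-warm)
   entries are reused first. */
VOID FreeEntry(FREE_ENTRY *Entry)
{
    ULONG cpu = KeGetCurrentProcessorNumber() % MAX_CPUS;
    KIRQL oldIrql;

    KeAcquireSpinLock(&FreeLists[cpu].Lock, &oldIrql);
    Entry->Next = FreeLists[cpu].Head;
    FreeLists[cpu].Head = Entry;
    KeReleaseSpinLock(&FreeLists[cpu].Lock, oldIrql);
}

/* Allocate from the current CPU's list; under most loads neither the lock
   nor the entries move between CPUs, so there are no cross-CPU cache misses
   and essentially no lock contention. */
FREE_ENTRY *AllocateEntry(VOID)
{
    ULONG       cpu = KeGetCurrentProcessorNumber() % MAX_CPUS;
    KIRQL       oldIrql;
    FREE_ENTRY *entry;

    KeAcquireSpinLock(&FreeLists[cpu].Lock, &oldIrql);
    entry = FreeLists[cpu].Head;
    if (entry != NULL) {
        FreeLists[cpu].Head = entry->Next;
    }
    KeReleaseSpinLock(&FreeLists[cpu].Lock, oldIrql);

    /* If entry is NULL, fall back to searching the other CPUs' lists
       (omitted here); that path should be rare under most loads. */
    return entry;
}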

Another way to think about this is that the cache line transfer between CPUs to acquire the lock is part of the lock-held time, since even though the CPU that is about to get the lock has not yet marked it set, no other CPU can get the lock. So CPU 2 does not acquire the lock at the time the test-and-set succeeds; rather, it acquired the lock when it shot down the cache line in the other CPUs. Just due to this fact, the lock-held time increases with contention, and two CPUs can take substantially longer than one to do the same work. Then you can add in the fact that the data being protected by that lock is often in the same cache line, and the other CPU is spinning reading that cache line…
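
One cheap mitigation for that last point is to keep the lock and the data it
protects in separate cache lines, for example (a sketch, assuming a 64-byte
line and a cache-line-aligned allocation):

#include <wdm.h>

#define CACHE_LINE_SIZE 64   /* assumed; check the target processor */

typedef struct _PROTECTED_QUEUE {
    KSPIN_LOCK Lock;
    UCHAR      Pad[CACHE_LINE_SIZE - sizeof(KSPIN_LOCK)];
    /* With the padding (and a line-aligned allocation), a CPU spinning on
       Lock does not also fight over the line holding the data below. */
    LIST_ENTRY Head;
    ULONG      Count;
} PROTECTED_QUEUE;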

-DH


Easy case where the 10 lines could be more expensive: they grab another spinlock. It depends.
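
For example (a sketch; LockA, LockB, and the Do* routines are hypothetical):
with structure 1, any contention on LockB inflates the time LockA is held,
and you also pick up a lock-ordering constraint between the two.

#include <wdm.h>

VOID DoLockedWork1(VOID);
VOID DoMiddleWork(VOID);
VOID DoLockedWork2(VOID);

KSPIN_LOCK LockA, LockB;   /* hypothetical locks, initialized elsewhere */

VOID BothLocksHeld(VOID)
{
    KIRQL oldIrql;

    KeAcquireSpinLock(&LockA, &oldIrql);
    DoLockedWork1();                       /* needs A          */

    /* Already at DISPATCH_LEVEL because A is held, so use the AtDpcLevel
       variants. If LockB is contended, A stays held the whole time we wait. */
    KeAcquireSpinLockAtDpcLevel(&LockB);
    DoMiddleWork();                        /* needs B, not A   */
    KeReleaseSpinLockFromDpcLevel(&LockB);

    DoLockedWork2();                       /* needs A          */
    KeReleaseSpinLock(&LockA, oldIrql);
}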

Bill M.

Couple of questions on this…

First: Does the P4 support only MESI bus protocols? Thus, if dirty
data on CPU 1 is required by CPU 2, must the data go through to
memory? Or has the protocol advanced to what we were calling a MERCY
protocol, where CPU 2 could get the dirty data directly from CPU 1
prior to the push to memory? This would possibly reduce the bus
transaction overhead significantly.

Second: Doesn’t your concept of keeping separate lists per processor
require adjustments to the processor affinity within the driver? This,
in my experience, would make the driver significantly more difficult to
maintain and support. Plus, it does not allow the NT executive to
handle the SMP load balancing. I could understand that in a massively
parallel environment this might be worthwhile, since so many chips are
contending for the bus. But with only (on average) 2-16 CPUs and the
frequency of the lock acquire/release, isn’t Bill still correct in
stating that it depends upon the nature of the code?

I can’t disagree that cache and TLB flushes are bad, but when do they
necessarily occur on the x86 (P3 vs. P4)? I used to work with one of
the major microprocessor architectures at another company that ran NT
(long ago in a galaxy far, far away). I know there were the beginnings
of an industry move to avoid the CPU1->memory->CPU2 transfer of data;
I don’t know if it actually made it in. Interesting discussion… have to
look into this further.

thanks! - jb

============================================
Jonathan Borden
L5 Software Group


I don’t understand why Intel doesn’t let CPU 2 read the cache line that CPU 1
writes to the bus when the line transitions from exclusive to shared state.
If it were so, CPU 1 would cause the cache line to go to E state, CPU 2
would read it, causing the line to move from E back to S state, CPU 1 would
post the cache line to be written onto the bus, and CPU 2 would read it. That
would save bus time, but I don’t know if it is possible. Or do I misunderstand
the protocol?

Alberto.


I know the Pentium Pro supports a version of the MESI protocol; I don’t know
how much it changed in the P4. There’s a very decent description of the
Pentium Pro system bus operation in Shanley’s MindShare book. The way I
understand the protocol, and I may be wrong, that cache line, after being
written by CPU 1, must go to main memory before CPU 2 can read it. I also
agree with Dave that using MP-friendly data structures goes a long way
towards helping performance issues, but then, I’m also into massively
parallel processing, so it may be my professional bias speaking now.

Alberto.


That’s what queued spinlocks are for - to avoid spinning on the lock variable.
The overhead of migrating a cache line from CPU to CPU (due to snoop hits) is comparable to an uncached memory access.

Max

IIRC, assigning a NIC to a particular CPU (possible in NDIS) improves NDIS performance greatly.

Max
----- Original Message -----
From: Jonathan Borden
To: NT Developers Interest List
Sent: Wednesday, May 15, 2002 5:04 PM
Subject: [ntdev] Re: Generic Spin Lock Question

Couple of questions on this…

First: Does the P4 support only MESI bus protocols? Thus, if dirty data on CPU 1 is required in CPU 2 then the data must go through to memory? Or has the protocol advanced to what we were calling a MERCY protocol where the CPU2 could get the dirty data directly from CPU1 prior to the push to memory? This would possibly eliminate the issue of bus transaction overhead significantly.

Second: Doesn’t your concept of keeping separate lists per processor require adjustments to the processor affinity within the driver? This, in my experience, would make the driver significantly more difficult to to maintain and support. Plus it does not allow the nt executive to handle the smp load balancing. I could understand in a massively parallel environment this might be conducive since so many chips are contending for the bus. But with only (on average) 2-16 cpus and the frequency of the lock acquire/release, isn’t Bill still correct in stating that it depends upon the nature of the code?

I can’t disagree that cache and tlb flushes are bad, but when do they necessarily occur on the x86 (P3 vs P4)?. I used to work with one of the major uProcessor architectures with another company that ran NT (long ago in a galaxy far, far away). I know there was the beginnings of industry move to avoid cpu1->memory->cpu2 transfer of data. Don’t know if it actually made it in. Interesting discussion…have to look into this further.

thanks! - jb

Jonathan Borden
L5 Software Group

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Dave Harvey
Sent: Tuesday, May 14, 2002 10:37 PM
To: NT Developers Interest List
Subject: [ntdev] Re: Generic Spin Lock Question

As someone who has spent a good portion of his life on these issues (e.g., getting 256 CPU machine to stay up for more than 4 hours), it almost absolutely does not “depend”. Unless those ten lines do something extremely expensive, there is NEVER any point in the extra release acquire. Here is why:

  1. Each acquire/release has one (…AtDpc) or three (not at Dpc) serializing instructions (at least on an X86). So a normal acquire/release with IRQL manipulation will cost as much as your 10 lines, without contention.
  2. If there is contention, and another CPU gets in between the two lock gets, then you pay though the nose shifting cache lines back and forth. Your average 10 lines of code (without calls) will be cheaper than a single cache line miss. Cache misses cost 100s of instruction times an P4 class machines.
    In more detail:

CPU 1 gets the lock A dirty in cache.
CPU 2 tries to get the lock, gets cache miss, invalidates the line in CPU 1
CPU 1 writes a zero to the lock to free it, getting a cache miss, invalidating the line in CPU 2
CPU 2 gets a read only copy of the cache line (another miss), sees the lock free.
CPU 2 invalidates the line on CPU 1 and sets the lock.
CPU 1 tries to get the lock, gets a cache miss, invalidating the line on CPU 2, spins
CPU 2 writes a zero the lock, getting a cache miss, invalidating the line on CPU 1
CPU 1 gets a read only copy of the cache line, sees the lock free.
CPU 1 invalidates the line on CPU 2 and sets the lock.

In the above, there are 6 misses and 2 transitions from shared to exclusive (invalidate other copies).

Holding the lock the entire time, you get:
CPU 1 gets the lock A dirty in cache.
CPU 2 tries to get the lock, gets cache miss, invalidates the line in CPU 1, spins
CPU 1 writes a zero to the lock to free it, getting a cache miss, invalidating the line in CPU 2
CPU 2 gets a read only copy of the cache line (another miss), sees the lock free.
CPU 2 invalidates the line on CPU 1 and sets the lock.

In this case, there are 3 misses and 1 shared=>exclusive bus transaction.

So in the latter case, I stall CPU 2 for 10 lines. These 10 lines would have to be an awfully large number of
instructions to be worse than 3 extra cache misses.

  1. Less lock gets/releases => simpler code => more likely to work in all cases.

  2. Cache line contention occurs at a much lower duty cycle than lock contention, and you pay plenty for it. i.e., If CPU 1 gets lock A, releases it, then later CPU B gets lock A, CPU B gets a cache miss. Instead of worrying about how to reduce the lock held time, its much more cost effective to worry about avoiding having locks change CPUs. For example, instead of one free list, have one per CPU. Free to the one corresponding to the current CPU. Allocate from that one unless its empty, then search other CPUs. Under most loads, the locks on each list will not move CPUs, nor will the data structures themselves (and make the free lists LIFO not FIFO). And by the way, no lock contention, either.

Another way to think about this is that the cache line transfer between CPUs needed to acquire the lock is part of the lock-held time: even though the CPU that is about to get the lock has not yet marked it set, no other CPU can get the lock. So CPU 2 does not acquire the lock at the moment its test-and-set succeeds; it effectively acquired it when it shot down the cache line in the other CPUs. Due to this alone, the lock-held time increases with contention, and two CPUs can take substantially longer than one to do the same work. Then add in the fact that the data protected by the lock is often in the same cache line, and the other CPU is spinning reading that cache line…
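One cheap way to blunt that last effect is to keep the lock in a cache line of its own, away from the data it protects, so a spinner hammering the lock does not keep yanking the data away from the lock holder. A sketch; the layout and the 64-byte figure are illustrative assumptions, not anything from the original code:

/* Keep the spin lock and the data it guards on separate cache lines so a
   CPU spinning on the lock does not keep stealing the line that holds the
   protected data.  64 is an assumed line size; use the target's actual one. */
typedef struct _GUARDED_QUEUE {
    DECLSPEC_ALIGN(64) KSPIN_LOCK Lock;    /* lock gets its own line      */
    DECLSPEC_ALIGN(64) LIST_ENTRY Head;    /* data starts on the next one */
    ULONG                         Count;
} GUARDED_QUEUE, *PGUARDED_QUEUE;

The cost is a little wasted space per structure, which is usually trivial next to the misses it avoids.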

-DH

----- Original Message -----
From: Jonathan Borden
Newsgroups: ntdev
To: ‘NT Developers Interest List’
Sent: Wednesday, May 15, 2002 9:04 AM
Subject: Re: Generic Spin Lock Question

Couple of questions on this…

First: Does the P4 support only MESI bus protocols? That is, if dirty data on CPU 1 is needed by CPU 2, must the data go through memory? Or has the protocol advanced to what we were calling a MERCY protocol, where CPU 2 could get the dirty data directly from CPU 1 before the push to memory? That could significantly reduce the bus-transaction overhead.
It’s my understanding that the P4 will transfer data directly between processors. However it is done, it takes *forever*. I spent a couple of months recently on a dual 2 GHz P4 improving performance by 2.5 times, primarily by eliminating cache-line movement between CPUs (plus some PCI read reductions, using LIFO rather than FIFO free lists, etc.).

A simple way to see this is to get a CPU-time profile of the kernel with VTune on an SMP machine, then get one for L2 cache misses. The two graphs typically look quite similar.

Second: Doesn’t your concept of keeping separate lists per processor require adjusting processor affinity within the driver? In my experience that would make the driver significantly more difficult to maintain and support, and it does not let the NT executive handle the SMP load balancing.

If we want to talk about maintainability, then you always want one lock acquire instead of two: less code, fewer possible race conditions. If you want to talk about performance, then why use more complicated code that at best may get 50% more concurrency, instead of something that actually scales?

The right way to handle this is to start with the most maintainable implementation that has a hope of performing adequately, then measure the performance and improve the cases that are a problem.

I could understand that this might be worthwhile in a massively parallel environment, since so many chips are contending for the bus. But with only (on average) 2-16 CPUs, and given the frequency of the lock acquire/release, isn’t Bill still correct in stating that it depends upon the nature of the code?

Yes, but when a reasonably coherent performance question refers to “10 lines of C”, I’m assuming the writer did not mean “10 lines of C that call 370 lines in other functions”.
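To make that concrete, the single-acquire shape being argued for might look roughly like this. The device-extension layout and the Do* helpers are placeholders for illustration, not anything from the original code:

#include <ntddk.h>

/* Placeholder types and helpers for the sketch. */
typedef struct _DEVICE_EXTENSION {
    KSPIN_LOCK StateLock;
    /* ... state shared across CPUs ... */
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

static VOID DoStepOne(PDEVICE_EXTENSION d)       { UNREFERENCED_PARAMETER(d); } /* needs the lock */
static VOID DoUnrelatedWork(PDEVICE_EXTENSION d) { UNREFERENCED_PARAMETER(d); } /* lock not strictly needed */
static VOID DoStepTwo(PDEVICE_EXTENSION d)       { UNREFERENCED_PARAMETER(d); } /* needs the lock */

VOID UpdateSharedState(PDEVICE_EXTENSION devExt)
{
    KIRQL oldIrql;

    /* One acquire/release pair around all three steps.  Dropping and
       re-taking the lock around DoUnrelatedWork adds serializing
       instructions and, under contention, extra cache-line traffic. */
    KeAcquireSpinLock(&devExt->StateLock, &oldIrql);

    DoStepOne(devExt);
    DoUnrelatedWork(devExt);
    DoStepTwo(devExt);

    KeReleaseSpinLock(&devExt->StateLock, oldIrql);

    /* If the caller is already at DISPATCH_LEVEL (e.g. inside a DPC),
       KeAcquireSpinLockAtDpcLevel / KeReleaseSpinLockFromDpcLevel skip
       the IRQL manipulation mentioned earlier in the thread. */
}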
-DH

“Maxim S. Shatskih” wrote in message news:xxxxx@ntdev…
That’s what queued spin locks are for - to avoid spinning on the lock variable.
The overhead of migrating the cache line from CPU to CPU (due to snoop hits) is comparable to just an uncached memory access.
I haven’t looked at this particular implementation, but while there are algorithms that can avoid slowing down the lock holder due to a spinning requestor, I don’t see how the spin lock implementation helps the case where there is no long contention, but where the cache line shuttles between CPUs on each acquire.
-DH

Max
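For reference, queued spin locks are exposed to drivers on Windows XP and later as in-stack queued spin locks; each waiter spins on its own queue entry instead of on the shared lock word. A minimal usage sketch, with placeholder names:

#include <ntddk.h>

static KSPIN_LOCK g_StateLock;     /* KeInitializeSpinLock(&g_StateLock) at init */

VOID TouchSharedState(VOID)
{
    KLOCK_QUEUE_HANDLE lockHandle;

    /* Waiters spin on their own per-acquirer queue entries, so they do not
       hammer the shared lock cache line while someone else holds it.  The
       line holding the protected data still moves once per handoff. */
    KeAcquireInStackQueuedSpinLock(&g_StateLock, &lockHandle);

    /* ... touch the protected state ... */

    KeReleaseInStackQueuedSpinLock(&lockHandle);
}

That addresses spinning-waiter traffic, which is a different problem from the per-acquire line migration discussed above.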