----- Original Message -----
From: Jonathan Borden
Newsgroups: ntdev
To: ‘NT Developers Interest List’
Sent: Wednesday, May 15, 2002 9:04 AM
Subject: Re: Generic Spin Lock Question
Couple of questions on this…
First: Does the P4 support only MESI bus protocols? Thus, if dirty data on CPU 1 is required by CPU 2, must the data go through memory? Or has the protocol advanced to what we were calling a MERCY protocol, where CPU 2 could get the dirty data directly from CPU 1 prior to the push to memory? That would significantly reduce the bus transaction overhead.
It's my understanding that the P4 will transfer data directly between processors. However it is done, it takes *forever*. I spent a couple of months recently on a dual 2 GHz P4 improving performance by a factor of 2.5, primarily by eliminating cache line movement between CPUs (plus some PCI read reductions, using LIFO rather than FIFO free lists, etc.).
A simple way to see this is to get a CPU time profile of the kernel with VTune on an SMP machine, then get one for L2 cache misses. The graphs typically look quite similar.
Second: Doesn’t your concept of keeping separate lists per processor require adjusting the processor affinity within the driver? That, in my experience, would make the driver significantly more difficult to maintain and support. Plus it does not allow the NT executive to handle the SMP load balancing. I could understand that in a massively parallel environment this might be worthwhile
If we want to talk about maintainability, then you always want one lock acquire instead of two. Less code, fewer possible race conditions. If you want to talk about performance, then why use more complicated code that at best may get 50% more concurrency, instead of something that actually scales?
The right way to handle this is to start with the most maintainable implementation that has a hope of performing adequately, then measure the performance, and improve the cases that are a problem.
since so many chips are contending for the bus. But with only 2-16 CPUs (on average), and given the frequency of the lock acquire/release, isn’t Bill still correct in stating that it depends upon the nature of the code?
Yes, but when a reasonably coherent performance question is asked referring to “10 lines of C”, I’m assuming the writer did not mean “10 lines of C that call 370 lines in other functions”.
-DH
I can’t disagree that cache and TLB flushes are bad, but when do they necessarily occur on the x86 (P3 vs. P4)? I used to work on one of the major microprocessor architectures at another company that ran NT (long ago in a galaxy far, far away). I know there were the beginnings of an industry move to avoid CPU1->memory->CPU2 transfers of data; I don’t know if it actually made it in. Interesting discussion… I’ll have to look into this further.
thanks! - jb
Jonathan Borden
L5 Software Group
-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Dave Harvey
Sent: Tuesday, May 14, 2002 10:37 PM
To: NT Developers Interest List
Subject: [ntdev] Re: Generic Spin Lock Question
As someone who has spent a good portion of his life on these issues (e.g., getting a 256-CPU machine to stay up for more than 4 hours), it almost absolutely does not “depend”. Unless those ten lines do something extremely expensive, there is NEVER any point in the extra release/acquire. Here is why:
- Each acquire/release has one (…AtDpc) or three (not at DPC) serializing instructions (at least on x86). So a normal acquire/release with IRQL manipulation will cost as much as your 10 lines, without contention.
- If there is contention, and another CPU gets in between the two lock acquires, then you pay through the nose shifting cache lines back and forth. Your average 10 lines of code (without calls) will be cheaper than a single cache line miss. Cache misses cost hundreds of instruction times on P4-class machines.
In more detail:
CPU 1 gets lock A; the line is now dirty in its cache.
CPU 2 tries to get the lock, gets a cache miss, invalidates the line on CPU 1.
CPU 1 writes a zero to the lock to free it, getting a cache miss and invalidating the line on CPU 2.
CPU 2 gets a read-only copy of the cache line (another miss), sees the lock free.
CPU 2 invalidates the line on CPU 1 and sets the lock.
CPU 1 tries to get the lock, gets a cache miss, invalidates the line on CPU 2, and spins.
CPU 2 writes a zero to the lock, getting a cache miss and invalidating the line on CPU 1.
CPU 1 gets a read-only copy of the cache line, sees the lock free.
CPU 1 invalidates the line on CPU 2 and sets the lock.
In the above, there are 6 misses and 2 transitions from shared to exclusive (invalidate other copies).
Holding the lock the entire time, you get:
CPU 1 gets lock A; the line is now dirty in its cache.
CPU 2 tries to get the lock, gets a cache miss, invalidates the line on CPU 1, and spins.
CPU 1 writes a zero to the lock to free it, getting a cache miss and invalidating the line on CPU 2.
CPU 2 gets a read-only copy of the cache line (another miss), sees the lock free.
CPU 2 invalidates the line on CPU 1 and sets the lock.
In this case, there are 3 misses and 1 shared=>exclusive bus transaction.
So in the latter case, I stall CPU 2 for 10 lines. Those 10 lines would have to be an awfully large number of instructions to be worse than 3 extra cache misses.
- Less lock gets/releases => simpler code => more likely to work in all cases.
- Cache line contention occurs at a much lower duty cycle than lock contention, and you pay plenty for it. I.e., if CPU 1 gets lock A and releases it, then later CPU 2 gets lock A, CPU 2 takes a cache miss. Instead of worrying about how to reduce the lock held time, it's much more cost effective to worry about avoiding having locks change CPUs. For example, instead of one free list, have one per CPU (a rough sketch follows below). Free to the one corresponding to the current CPU. Allocate from that one unless it's empty, then search the other CPUs' lists. Under most loads, the locks on each list will not move between CPUs, nor will the data structures themselves (and make the free lists LIFO, not FIFO). And by the way, no lock contention, either.
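Here is a minimal sketch of that per-CPU free list idea. The names (MY_CPU_FREE_LIST, MyFreeBlock, MyAllocBlock), the fixed MY_MAX_CPUS array size, and the assumption that callers run at DISPATCH_LEVEL are all mine, not from any shipping driver; KeInitializeSpinLock on each slot at driver init is left out for brevity.

#include <ntddk.h>

/* Per-CPU LIFO free lists (sketch).  One slot per CPU, each padded to
 * its own cache line so the locks themselves never share a line. */
#define MY_MAX_CPUS 32

typedef struct _MY_CPU_FREE_LIST {
    KSPIN_LOCK        Lock;
    SINGLE_LIST_ENTRY Head;    /* LIFO: most recently freed block on top */
    UCHAR             Pad[64 - sizeof(KSPIN_LOCK) - sizeof(SINGLE_LIST_ENTRY)];
} MY_CPU_FREE_LIST;

static MY_CPU_FREE_LIST g_FreeLists[MY_MAX_CPUS]; /* KeInitializeSpinLock each Lock at init */

/* Free to the current CPU's list: the lock and the block stay warm in
 * this CPU's cache.  Caller is assumed to be at DISPATCH_LEVEL. */
VOID MyFreeBlock(SINGLE_LIST_ENTRY *Block)
{
    MY_CPU_FREE_LIST *List =
        &g_FreeLists[KeGetCurrentProcessorNumber() % MY_MAX_CPUS];

    KeAcquireSpinLockAtDpcLevel(&List->Lock);
    PushEntryList(&List->Head, Block);
    KeReleaseSpinLockFromDpcLevel(&List->Lock);
}

/* Allocate from our own CPU's list first; only raid the other CPUs'
 * lists if ours is empty. */
SINGLE_LIST_ENTRY *MyAllocBlock(VOID)
{
    ULONG Current = KeGetCurrentProcessorNumber() % MY_MAX_CPUS;
    ULONG i;

    for (i = 0; i < MY_MAX_CPUS; i++) {
        MY_CPU_FREE_LIST *List = &g_FreeLists[(Current + i) % MY_MAX_CPUS];
        SINGLE_LIST_ENTRY *Block;

        KeAcquireSpinLockAtDpcLevel(&List->Lock);
        Block = PopEntryList(&List->Head);
        KeReleaseSpinLockFromDpcLevel(&List->Lock);

        if (Block != NULL)
            return Block;
    }
    return NULL;    /* everything empty; fall back to the pool */
}

In the usual case (allocate and free happen on the same CPU), neither the lock nor the list head ever changes hands, which is exactly the effect described above.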
Another way to think about this is that the cache line transfer between CPUs to acquire the lock is part of the lock held time, since even though the CPU that is about to get the lock has not yet marked it set, no other CPU can get the lock. So CPU 2 does not acquire the lock at the time the test-and-set succeeds; rather, it effectively acquired the lock when it shot down the cache line in the other CPUs. Just due to this fact, the lock held time increases with contention, and two CPUs can take substantially longer than one to do the same work. Then you can add in the fact that the data being protected by that lock is also often in the same cache line, and the other CPU is spinning reading that cache line…
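To make the bus traffic above concrete, here is a minimal test-and-test-and-set spin lock sketch in C. This is not the HAL's actual implementation, and MY_SPIN_LOCK / MyLockAcquire / MyLockRelease are made-up names; the comments mark where the misses and invalidations from the walkthrough occur.

#include <intrin.h>   /* _InterlockedExchange, _mm_pause */

typedef struct _MY_SPIN_LOCK {
    volatile long Locked;       /* 0 = free, 1 = held; lives in one cache line */
} MY_SPIN_LOCK;

void MyLockAcquire(MY_SPIN_LOCK *Lock)
{
    for (;;) {
        /* The interlocked exchange needs the line exclusive, so it
         * invalidates the line on whichever CPU owned it last: this is
         * the "invalidates the line on CPU 1" step above. */
        if (_InterlockedExchange(&Lock->Locked, 1) == 0)
            return;             /* lock was free; we now hold it */

        /* Spin on a plain read so we only hold a shared (read-only)
         * copy of the line; the owner's release invalidates it once
         * instead of ping-ponging on every iteration. */
        while (Lock->Locked != 0)
            _mm_pause();        /* hint to the CPU that this is a spin-wait */
    }
}

void MyLockRelease(MY_SPIN_LOCK *Lock)
{
    /* The store takes the line exclusive again, which is another miss
     * on any CPU that was spinning on a shared copy. */
    _InterlockedExchange(&Lock->Locked, 0);
}

The spin on a plain read (rather than hammering the interlocked exchange) is what keeps the spinner on a shared copy of the line, so the owner pays for one invalidation at release instead of one per loop iteration.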
-DH
“Bill McKenzie” wrote in message news:xxxxx@ntdev…
This absolutely depends. What are those 10 lines doing, as in how long will they take to execute? You stated this code is called frequently, so how many other places in the driver is this spinlock taken? If it's used throughout and frequently, you might not want to hold it as long, since you might be holding up other threads on other processors. You would probably have to profile the driver to find the sweet spot as to when it's more beneficial to hold the lock than to drop it, and this would likely vary from single- to multiple-processor environments. And then, you would have to profile under load or some such. There is no general answer to this question, although you could certainly determine how many cycles it takes to grab the lock and make a rough estimation that way.
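For that rough estimation, something like the following works (a sketch: x86 only, MyMeasureLockCost is a made-up name, and the result is only an order-of-magnitude figure since interrupts and the serializing instructions themselves add noise):

#include <ntddk.h>
#include <intrin.h>   /* __rdtsc */

/* Rough cycle count for an uncontended acquire/release pair, averaged
 * over many iterations. */
ULONG64 MyMeasureLockCost(PKSPIN_LOCK Lock)
{
    const ULONG Iterations = 10000;
    KIRQL OldIrql;
    ULONG64 Start, End;
    ULONG i;

    Start = __rdtsc();
    for (i = 0; i < Iterations; i++) {
        KeAcquireSpinLock(Lock, &OldIrql);
        KeReleaseSpinLock(Lock, OldIrql);
    }
    End = __rdtsc();

    return (End - Start) / Iterations;   /* ~cycles per acquire/release pair */
}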
–
Bill McKenzie
“PJ Kirner” wrote in message news:xxxxx@ntdev…
I recently had some code that was structured like this:
-Claim SpinLock A
-Do about 10 lines of code where claiming the spin lock is NECESSARY
-Do another 10 lines of code where holding the spin lock is NOT NECESSARY
-Do another 10 lines of code where claiming the spin lock is NECESSARY
-Release SpinLock A
The obvious solution is to reorganize so the middle 10 lines of code are outside the spin lock. But for argument's sake, let's say that is not possible.
Is the above code structure more efficient, or is the following structure more efficient?
-Claim SpinLock A
-Do about 10 lines of code where claiming the spin lock is NECESSARY
-Release SpinLock A
-Do another 10 lines of code where holding the spin lock is NOT NECESSARY
-Claim SpinLock A
-Do another 10 lines of code where claiming the spin lock is NECESSARY
-Release SpinLock A
I know there is probably no good answer for this because there are so many variables coming into play, but let's say this code is executed very frequently. I guess the fundamental question is: what is the real overhead of claiming a spin lock on an MP machine? And where do you draw the line between claiming a spin lock too many times and holding it for too long?
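For concreteness, the two shapes above look roughly like this in driver terms (a sketch; DoProtectedWork1, DoUnprotectedWork, and DoProtectedWork2 are stand-ins for the three 10-line blocks):

#include <ntddk.h>

/* Stand-ins for the three 10-line blocks described above. */
extern VOID DoProtectedWork1(VOID);
extern VOID DoUnprotectedWork(VOID);
extern VOID DoProtectedWork2(VOID);

/* Structure 1: hold SpinLock A across all thirty lines. */
VOID Structure1(PKSPIN_LOCK LockA)
{
    KIRQL OldIrql;

    KeAcquireSpinLock(LockA, &OldIrql);
    DoProtectedWork1();      /* needs the lock */
    DoUnprotectedWork();     /* does not need the lock, but we keep holding it */
    DoProtectedWork2();      /* needs the lock */
    KeReleaseSpinLock(LockA, OldIrql);
}

/* Structure 2: drop and re-take the lock around the middle block. */
VOID Structure2(PKSPIN_LOCK LockA)
{
    KIRQL OldIrql;

    KeAcquireSpinLock(LockA, &OldIrql);
    DoProtectedWork1();
    KeReleaseSpinLock(LockA, OldIrql);

    DoUnprotectedWork();     /* another CPU may grab LockA here */

    KeAcquireSpinLock(LockA, &OldIrql);
    DoProtectedWork2();
    KeReleaseSpinLock(LockA, OldIrql);
}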
Thanks for any input.
PJ