Thanks Alberto,
That answers what I thought was the burning question of this thread, though I might be wrong since I did not follow the thread from the beginning…
-
Correctness: Let’s suppose that one sets out to use assembler instructions (instead of using the DDI, for whatever reason) to implement some locking to protect shared resources. The question would be: do we need explicit use of those fencing instructions before we go on implementing the example given in the reference material you supplied as a link (or whatever is given in Vol. 3 of Intel’s reference manual)? It surely looks like we don’t need to (even if the caching policy is write-back).
-
Performance: Some of the points you alluded to are very helpful, and surely microscopic. Also, for everyone, there is a good chance of “missing the whole 9 yards” if one does not follow the 90-10 rule of performance witchcraft: usually 10% of the code uses 90% of the CPU(s). And for that, Peter W’s advice is a long-standing rule followed by a lot of performance people, so it is hard (if not impossible) to ignore the value of that advice too…
-pro
----- Original Message -----
From: Alberto Moreira
To: Windows System Software Devs Interest List
Sent: Sunday, October 28, 2007 6:46 AM
Subject: Re: Re:[ntdev] RE:interlocked instructions perform better with more processors ?
Hi, Pro,
The IA32 Systems Programming Guide (Vol. 3 of the Intel manuals), Chapter 7, has a good description of how the LOCK prefix works. You can download it from
http://www.intel.com/products/processor/manuals/index.htm
The book states, for example,
“Locked operations are atomic with respect to all other memory operations and all externally visible events.”
“Locked operations can be used to synchronize data written by one processor and read by another processor.”
"For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized. "
“Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to insure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 7.1.2, “Bus Locking”).”
So, there you go! Straight from the horse’s mouth. Hope this helps!
Alberto.
----- Original Message -----
From: Prokash Sinha
To: Windows System Software Devs Interest List
Sent: Saturday, October 27, 2007 2:45 PM
Subject: Re: Re:[ntdev] RE:interlocked instructions perform better with more processors ?
Ah, it’s been a while since I looked at…
Now for the speculative part: IIRC the processor has the capability to invalidate such an out-of-order read. That is why they have a separate scratch pad, that is why deep speculation is a nightmare, and that is why IA64 has the hinting technique, but again I’m rusty on this… Alberto, do you have any insight on this, purely out of curiosity?
-pro
----- Original Message -----
From: Daniel Terhell
Newsgroups: ntdev
To: Windows System Software Devs Interest List
Sent: Saturday, October 27, 2007 10:59 AM
Subject: Re:[ntdev] RE:interlocked instructions perform better with more processors ?
From what I understand, this is not a code-generation issue, but because “processors begin to fetch values from memory before they are requested by the application. This can happen as part of speculative execution or due to cache line issues. CPU caches can be partitioned into banks that can be accessed in parallel. This means that memory operations can be completed out of order.”
http://msdn2.microsoft.com/en-us/library/ms686355.aspx
/Daniel
“Prokash Sinha” wrote in message news:xxxxx@ntdev…
So the basic question is: at whatever stage of linking/code generation the out-of-order optimization is applied, it would not apply this optimization as and when it sees such an implicit or explicit LOCK# assertion…
-pro
On 10/27/07, Daniel Terhell wrote:
Thanks Alberto, it’s great to have you around. But these slides are giving
me some new questions.
It says: “In X86, XCHG can be used to implement a semaphore, XCHG has an
implicit lock which makes it an atomic operation Alternative: use explicit
LOCK prefix on the instruction … XCHG EAX, x_guard / atomic - locked”
If it is true that XCHG is always atomic, does this imply it is safe in a
multiprocessor environment? Does that mean that a lock prefix is
superfluous? Then what about memory ordering? According to this MSDN
article, it needs a memory-barrier instruction
to prevent out-of-order execution.
/Daniel
“Alberto Moreira” wrote in message news:xxxxx@ntdev…
>I found an easily reachable reference online, which isn’t that hard to read
>although it’s kind of terse. If you’re interested, you can get this:
>
> http://www.ee.technion.ac.il/courses/044800/lectures/MESI.pdf
>
> Hope it helps. And yes, this isn’t your father’s Z80!
>
>
> Alberto.
>
>
> ----- Original Message -----
> From: “David R. Cattley”
> To: “Windows System Software Devs Interest List”
> Sent: Saturday, October 27, 2007 11:16 AM
> Subject: RE: RE:[ntdev] interlocked instructions perform better with more
> processors ?
>
>
>> Thanks. I appreciate the explanation (and the complexity!) I was not
>> taking into account a ‘write miss’ in a write-back cache and it sure now
>> makes sense that snooping something that has not yet happened could not
>> possibly solve that issue.
>>
>> One thing is for sure, this is not ‘your father’s Z80’ (to shamelessly
>> rip off Oldsmobile) we are talking about here.
>>
>> -dave
>>
>> -----Original Message-----
>> From: xxxxx@lists.osr.com
>> [mailto:xxxxx@lists.osr.com] On Behalf Of Alberto Moreira
>> Sent: Friday, October 26, 2007 10:42 PM
>> To: Windows System Software Devs Interest List
>> Subject: Re: RE:[ntdev] interlocked instructions perform better with more
>> processors ?
>>
>> If you have a writeback cache, a write miss may not propagate into the
>> bus.
>> That happens if the cache line is marked Exclusive. If it is marked
>> Shared,
>> a “Read and Invalidate” transaction indeed goes to the bus, and those
>> processors that have the cache line marked as Shared will invalidate
>> their
>> cache lines. I don’t know if the Pentium updates cache lines
>> directly from snoops; you may be right. What I do know is that if a
>> processor has a write miss and another processor has that line marked
>> Modified, that processor will put the data out onto the bus. But the
>> protocol is complicated, and I’m not too sure I grasp all of its ins and
>> outs!
>>
>> Alberto.
>>
>>
>> —
>> NTDEV is sponsored by OSR
>>
>> For our schedule of WDF, WDM, debugging and other seminars visit:
>> http://www.osr.com/seminars
>>
>> To unsubscribe, visit the List Server section of OSR Online at
>> http://www.osronline.com/page.cfm?name=ListServer
>
>