interlocked instructions perform better with more processors ?

> Well just to quibble, xchg does an effective locked read/write operation, and
> you can do an effective locked read as well,

Actually, a so-called “locked read” is not a read operation from the bus perspective. Even though the value of the target variable does not change, the operation still involves both a read cycle and a write cycle, so it is still an interlocked RW…

> although why one would want to do a locked read is indeed a mystery.

IIRC, a few months ago there was a thread in the MSFT NG where the OP was just desperate to do
an interlocked read…

Anton Bassov
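
To make the “locked read” point concrete, here is a minimal sketch in portable C11 rather than x86 assembly. The function names are made up for illustration. Adding zero is a read-modify-write that leaves the value alone; on x86 a compiler will often emit a LOCK-prefixed instruction for it, although it is free to substitute a fence plus a plain load, so treat the code generation as an assumption. In Windows code the usual idiom for the same thing is `InterlockedCompareExchange(&x, 0, 0)`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* "Locked read" sketch (hypothetical helper): fetch-and-add of zero is a
 * read-modify-write that returns the current value and writes the same
 * value back, so the variable is unchanged but the operation is still RW. */
static long locked_read(atomic_long *target)
{
    assert(target != NULL);
    return atomic_fetch_add(target, 0);
}

/* For comparison (hypothetical helper): an ordinary atomic load, which
 * is a pure read and does not write the variable back. */
static long plain_read(atomic_long *target)
{
    assert(target != NULL);
    return atomic_load(target);
}
```

Both helpers return the same value; the difference is only in what the hardware has to do to get it.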

> In Intel’s manuals I found an explicit statement that memory reads are atomic

…if they do not cross a cache line.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

If you have a writeback cache, a write miss may not propagate into the bus.
That happens if the cache line is marked Exclusive. If it is marked Shared,
a “Read and Invalidate” transaction indeed goes to the bus, and those
processors that have the cache line marked as Shared will invalidate their
cache lines. I don’t know if the Pentium updates cache lines
directly from snoops; you may be right. What I do know is that if a
processor has a write miss and another processor has that line marked
Modified, that processor will put the data out onto the bus. But the
protocol is complicated, and I’m not too sure I grasp all of its ins and
outs!

Alberto.
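
The state changes described in this post can be sketched as a tiny state machine. This is a simplified textbook MESI model that follows the description above, not the behavior of any particular Pentium; all type and function names are hypothetical, and real processors may use a lighter invalidate-only transaction for Shared lines.

```c
#include <assert.h>
#include <stddef.h>

/* Textbook MESI states for a single cache line (simplified sketch). */
typedef enum { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED } mesi_state;

/* Hypothetical labels for what a local write puts on the bus. */
typedef enum { BUS_NONE, BUS_READ_AND_INVALIDATE } bus_action;

/* Local processor writes to the line: update *state and report the bus
 * transaction, if any, that the write requires. */
static bus_action local_write(mesi_state *state)
{
    bus_action action;
    assert(state != NULL);
    switch (*state) {
    case MESI_MODIFIED:
    case MESI_EXCLUSIVE:
        action = BUS_NONE;                 /* sole owner: nothing goes out */
        break;
    default:                               /* MESI_SHARED or MESI_INVALID */
        action = BUS_READ_AND_INVALIDATE;  /* other copies must be dropped */
        break;
    }
    *state = MESI_MODIFIED;
    return action;
}

/* Another processor's write miss is snooped: returns 1 if this cache held
 * the line Modified and therefore has to put the data out on the bus. */
static int snoop_write_miss(mesi_state *state)
{
    int must_supply;
    assert(state != NULL);
    must_supply = (*state == MESI_MODIFIED);
    *state = MESI_INVALID;
    return must_supply;
}
```

Calling `local_write` on an Exclusive line returns `BUS_NONE`, which is the case where the write never reaches the bus.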

----- Original Message -----
From: “David R. Cattley”
To: “Windows System Software Devs Interest List”
Sent: Thursday, October 25, 2007 9:31 PM
Subject: RE: RE:[ntdev] interlocked instructions perform better with more
processors ?

> Alberto,
>
>> … ship a modified cache line directly to a read requestor …
>
> Is that another way to say “Snooping” where the cache simply watches for
> write operations on the (shared) memory bus and proactively updates cache
> lines? I thought that was a common feature of the current platforms
> (perhaps excluding NUMA and other exotics).
>
> I sure don’t need to know this but I am just trying to keep up playing the
> home game and was curious.
>
> -dave

Thanks. I appreciate the explanation (and the complexity!). I was not
taking into account a ‘write miss’ in a write-back cache, and it sure now
makes sense that snooping something that has not yet happened could not
possibly solve that issue.

One thing is for sure, this is not ‘your father’s Z80’ (to shamelessly
rip off Oldsmobile) we are talking about here.

-dave

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Alberto Moreira
Sent: Friday, October 26, 2007 10:42 PM
To: Windows System Software Devs Interest List
Subject: Re: RE:[ntdev] interlocked instructions perform better with more
processors ?

If you have a writeback cache, a write miss may not propagate into the bus. [...]

I found an easily reachable reference online, which isn’t that hard to read
although it’s kind of terse. If you’re interested, you can get this:

http://www.ee.technion.ac.il/courses/044800/lectures/MESI.pdf

Hope it helps. And yes, this isn’t your father’s Z80!

Alberto.

----- Original Message -----
From: “David R. Cattley”
To: “Windows System Software Devs Interest List”
Sent: Saturday, October 27, 2007 11:16 AM
Subject: RE: RE:[ntdev] interlocked instructions perform better with more
processors ?

> Thanks. I appreciate the explanation (and the complexity!) [...]
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer

Thanks Alberto, it’s great to have you around. But these slides raise
some new questions for me.

It says: “In X86, XCHG can be used to implement a semaphore, XCHG has an
implicit lock which makes it an atomic operation Alternative: use explicit
LOCK prefix on the instruction … XCHG EAX, x_guard / atomic - locked”

If it is true that XCHG is always atomic, does this imply it is safe in a
multiprocessor environment? Does that mean that a LOCK prefix is
superfluous? Then what about memory ordering? According to this MSDN
article, it needs a memory-barrier instruction
to prevent out-of-order execution.

/Daniel
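
For readers who have not seen the XCHG-as-semaphore idea from the slides, a minimal test-and-set lock can be sketched in C11. The `xchg_lock` type and its functions are hypothetical names; `atomic_exchange` is the portable spelling, and x86 compilers typically turn it into an XCHG with a memory operand, though that is an assumption about code generation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical guard word: 0 means free, 1 means held. */
typedef struct { atomic_int guard; } xchg_lock;

static void xchg_lock_init(xchg_lock *l)
{
    assert(l != NULL);
    atomic_init(&l->guard, 0);
}

/* One attempt: swap 1 into the guard; we own the lock only if the value
 * we swapped out was 0. Returns 1 on success, 0 if someone else holds it. */
static int xchg_try_acquire(xchg_lock *l)
{
    assert(l != NULL);
    return atomic_exchange(&l->guard, 1) == 0;
}

/* Spin until the lock is obtained. A production lock would pause or back
 * off inside the loop; this sketch just retries. */
static void xchg_acquire(xchg_lock *l)
{
    while (!xchg_try_acquire(l)) {
        /* busy-wait */
    }
}

static void xchg_release(xchg_lock *l)
{
    assert(l != NULL);
    atomic_store(&l->guard, 0);
}
```

A second `xchg_try_acquire` on a held lock returns 0 and leaves the guard at 1, which is why a single exchange is enough and no separate test step is needed.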


Correct me if I’m wrong!

But the XCHG instruction implicitly asserts the LOCK# pin, and that causes
the bus to be exclusively owned by the processor issuing it. So it is
multiprocessor safe.

Also, as far as I understand, out-of-(program)-order execution would not
happen if there is such an instruction. So the basic point is that, at
whatever stage of linking/code generation the out-of-order optimization is
applied, it would not apply this optimization when it
sees such an implicit or explicit LOCK# assertion…

-pro

On 10/27/07, Daniel Terhell wrote:
>
> Thanks Alberto, it’s great to have you around. [...]

From what I understand, this is not a code-generation issue; it happens because “processors begin to fetch values from memory before they are requested by the application. This can happen as part of speculative execution or due to cache line issues. CPU caches can be partitioned into banks that can be accessed in parallel. This means that memory operations can be completed out of order.”
http://msdn2.microsoft.com/en-us/library/ms686355.aspx

/Daniel
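
As a rough illustration of what a memory barrier buys you here, the sketch below publishes a value through a flag using C11 release/acquire ordering. The `mailbox` type and function names are invented for the example; it shows where the ordering constraints sit, not how any Windows API implements them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical mailbox: payload is ordinary data, ready is the flag. */
typedef struct {
    int payload;
    atomic_int ready;
} mailbox;

/* Producer side: write the data, then set the flag. The release store
 * keeps the payload write from becoming visible after the flag write. */
static void mailbox_publish(mailbox *m, int value)
{
    assert(m != NULL);
    m->payload = value;
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Consumer side: returns 1 and fills *out if a value has been published,
 * 0 otherwise. The acquire load keeps the payload read from being
 * performed ahead of the flag read. */
static int mailbox_try_consume(mailbox *m, int *out)
{
    assert(m != NULL && out != NULL);
    if (atomic_load_explicit(&m->ready, memory_order_acquire) == 0)
        return 0;
    *out = m->payload;
    return 1;
}
```

Without the release/acquire pair, a consumer on another processor could in principle observe the flag set and still read a stale payload.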


Never mind, I found the answer eventually. “If a memory operand is
referenced, the processor’s locking protocol is automatically implemented
for the duration of the exchange operation, regardless of the presence or
absence of the LOCK prefix or of the value of the IOPL.”

As it appears, this applies to XCHG only, not to CMPXCHG or other
instructions. So we are not putting our Interlocks on for nothing.

/Daniel
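
Since CMPXCHG is not implicitly locked, interlocked compare-and-swap code has to request atomicity explicitly. A minimal sketch in C11 follows; `bounded_add` is a hypothetical helper, and the claim that an x86 compiler emits LOCK CMPXCHG for it is a typical outcome rather than a guarantee.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: atomically add delta to *counter unless the result
 * would exceed limit. Returns 1 if the add happened, 0 if it was refused. */
static int bounded_add(atomic_int *counter, int delta, int limit)
{
    int seen;
    assert(counter != NULL);
    seen = atomic_load(counter);
    do {
        if (seen + delta > limit)
            return 0;
        /* On failure the compare-exchange refreshes 'seen' with the
         * current value, so the limit check is redone before retrying. */
    } while (!atomic_compare_exchange_weak(counter, &seen, seen + delta));
    return 1;
}
```

The loop is the usual shape for anything built on compare-and-swap: read, compute, attempt the swap, and retry if another processor got there first.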


My understanding is that the lock prefix, when executed, causes the LOCK#
line to be asserted on the front-side bus. It also asserts BREQ# to prevent
other processors from arbitrating for the bus. Some read-modify-write
instructions, such as XCHG, assert the LOCK# line automatically. The
Mindshare book states that locked instructions are automatically serializing
and synchronizing operations: serializing means that all outstanding
instructions are executed before the locked instruction is allowed to
execute, and synchronizing means that all posted writes are flushed to
memory before the locked instruction is executed.

So, watch out when you use an XCHG instruction! Looks like it pretty much
stops the bus and the processor in their tracks. Pretty strong stuff.

Incidentally, I believe that the bus is also locked between loading the SS
segment register and the next instruction, which is usually a load of the
ESP register.

Alberto.

----- Original Message -----
From: “Daniel Terhell”
Newsgroups: ntdev
To: “Windows System Software Devs Interest List”
Sent: Saturday, October 27, 2007 2:15 PM
Subject: Re:[ntdev] RE:interlocked instructions perform better with more
processors ?

> Never mind I found the answer eventually. [...]

Ah, it’s been a while since I looked at…

Now for the speculative part: IIRC the processor has the capability to invalidate such an out-of-order read. That is why they have a separate scratch pad, that is why deep speculation is a nightmare, and that is why IA64 has the hinting technique, but again I’m rusty on this… Alberto, do you have any insight on this, as a pure curiosity?

-pro
----- Original Message -----
From: Daniel Terhell
Newsgroups: ntdev
To: Windows System Software Devs Interest List
Sent: Saturday, October 27, 2007 10:59 AM
Subject: Re:[ntdev] RE:interlocked instructions perform better with more processors ?

From what I understand this is not a code generation issue [...]

> If it is true that XCHG is always atomic, does this imply it is safe in a
> multiprocessor environment?

According to the Intel manuals, the LOCK# signal is asserted automatically when you issue an XCHG instruction, so there is no need for a LOCK prefix to make it atomic…

> Then what about memory ordering?

This is a good question…

Whenever you use the LOCK prefix, it automatically acts as an implicit memory barrier. However, I think that an XCHG issued without the LOCK prefix does not act as a memory barrier, even though the instruction asserts the LOCK# signal the way the LOCK prefix does. After all, the Intel manual does not mention XCHG when it discusses serializing instructions…

Anton Bassov

Hi, Pro,

The IA-32 System Programming Guide, Part 1 (Volume 3A of the Software Developer’s Manual), Chapter 7, has a good description of how the LOCK prefix works. You can download it from

http://www.intel.com/products/processor/manuals/index.htm

The book states, for example,

“Locked operations are atomic with respect to all other memory operations and all externally visible events.”

“Locked operations can be used to synchronize data written by one processor and read by another processor.”

“For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.”

“Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to insure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 7.1.2, “Bus Locking”).”

So, there you go! Straight from the horse’s mouth. Hope this helps!

Alberto.
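
One place where the “locked operations serialize outstanding loads and stores” rule matters is the classic flag handshake, where each side sets its own flag and then reads the other side’s. A minimal C11 sketch is below; `raise_and_check` is a hypothetical name, and mapping the sequentially consistent exchange to an x86 XCHG is an assumption about the compiler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: raise our own flag with an atomic exchange, then
 * look at the other side's flag. Returns 1 if the other side had not yet
 * raised its flag, 0 if it had. */
static int raise_and_check(atomic_int *mine, atomic_int *theirs)
{
    assert(mine != NULL && theirs != NULL);
    /* Sequentially consistent exchange: the store to *mine is globally
     * visible before the following load of *theirs is performed. */
    atomic_exchange_explicit(mine, 1, memory_order_seq_cst);
    return atomic_load_explicit(theirs, memory_order_seq_cst) == 0;
}
```

With a plain store in place of the exchange, x86 may let the later load complete before the store is visible, so two processors running this at the same time could both see the other flag as 0. With the exchange, at most one of them can.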

----- Original Message -----
From: Prokash Sinha
To: Windows System Software Devs Interest List
Sent: Saturday, October 27, 2007 2:45 PM
Subject: Re: Re:[ntdev] RE:interlocked instructions perform better with more processors ?

Ah, it’s been a while since I looked at…

Now for the speculative part, IIRC the processor has the capability to invalidate such an out of order read. That is why they have separate scratch pad, that is why deep level speculation is a night-mare, and that is why IA64 has the hinting technique, but again I’m rusty on this … Aberto, you have any insight on this, as a pure curiosity !

-pro
----- Original Message -----
From: Daniel Terhell
Newsgroups: ntdev
To: Windows System Software Devs Interest List
Sent: Saturday, October 27, 2007 10:59 AM
Subject: Re:[ntdev] RE:interlocked instructions perform better with more processors ?

From what I understand this is not a code generation issue but because “processors begin to fetch values from memory before they are requested by the application. This can happen as part of speculative execution or due to cache line issues. CPU caches can be partitioned into banks that can be accessed in parallel. This means that memory operations can be completed out of order.”
http://msdn2.microsoft.com/en-us/library/ms686355.aspx

/Daniel
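
To make the ordering concern concrete, here is a minimal C++ sketch of the usual "write the data, then raise a flag" pattern. The names `g_payload`, `g_ready`, `publish` and `try_consume` are hypothetical and chosen only for illustration. It assumes a C++11 compiler; the release/acquire pair constrains both the compiler and the processor, so it does not settle which of the two is responsible for any particular reordering.

```cpp
#include <atomic>

// Hypothetical one-slot mailbox shared between a producer and a consumer.
static int g_payload = 0;                 // ordinary, non-atomic data
static std::atomic<bool> g_ready{false};  // flag that publishes the payload

// Producer: write the data first, then set the flag with release ordering,
// so the payload store cannot be reordered after the flag store.
void publish(int value) {
    g_payload = value;
    g_ready.store(true, std::memory_order_release);
}

// Consumer: read the flag with acquire ordering. Only when the flag is set
// is the payload guaranteed to be visible. Returns false if nothing has
// been published yet.
bool try_consume(int* out) {
    if (!g_ready.load(std::memory_order_acquire)) {
        return false;
    }
    *out = g_payload;
    return true;
}
```

Without the release/acquire pair (for example with two plain `int` variables), nothing would stop the flag from becoming visible before the data it is supposed to guard.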

“Prokash Sinha” wrote in message news:xxxxx@ntdev…
So the basic question is whether, at whatever stage of linking/code generation the out-of-order optimization is applied, it would refrain from applying that optimization whenever it sees such an implicit or explicit LOCK# assertion…

-pro

On 10/27/07, Daniel Terhell wrote:
Thanks Alberto, it’s great to have you around. But these slides are giving
me some new questions.

It says: “In X86, XCHG can be used to implement a semaphore, XCHG has an
implicit lock which makes it an atomic operation. Alternative: use explicit
LOCK prefix on the instruction … XCHG EAX, x_guard / atomic - locked”

If it is true that XCHG is always atomic, does this imply it is safe in a
multiprocessor environment ? Does that mean that a lock prefix is
superfluous ? Then what about memory ordering ? According to this MSDN
article, it needs a memory-barrier instruction
to prevent out-of-order execution.

/Daniel
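
For reference, the XCHG-based guard described on the slides can be sketched in portable C++ roughly as follows. The variable name `x_guard` is borrowed from the slide text; the functions `try_acquire`, `acquire` and `release` are hypothetical names for this example. The comment about XCHG being emitted is an assumption about typical x86 compilers, and the acquire/release orderings are what give the lock its memory-ordering guarantees at the language level.

```cpp
#include <atomic>

// Guard word in the role of x_guard from the slides: 0 = free, 1 = held.
static std::atomic<int> x_guard{0};

// One acquisition attempt: atomically swap 1 into the guard and inspect the
// previous value. On x86 an atomic exchange is normally compiled to XCHG
// with a memory operand (assumed codegen, not mandated by the standard).
bool try_acquire() {
    return x_guard.exchange(1, std::memory_order_acquire) == 0;
}

// Spin until the guard is obtained.
void acquire() {
    while (!try_acquire()) {
        // Wait on plain loads until the guard looks free, then retry the
        // exchange; this avoids issuing back-to-back locked operations.
        while (x_guard.load(std::memory_order_relaxed) != 0) {
        }
    }
}

// Release ordering makes writes done inside the critical section visible
// to the next thread that acquires the guard.
void release() {
    x_guard.store(0, std::memory_order_release);
}
```

The inner loop that spins on a plain load is a design choice rather than a requirement: it keeps waiting processors from repeatedly issuing the exchange while the guard is held.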

“Alberto Moreira” wrote in message news:xxxxx@ntdev…
>I found an easily reachable reference online, which isn’t that hard to read
>although it’s kind of terse. If you’re interested, you can get this:
>
> http://www.ee.technion.ac.il/courses/044800/lectures/MESI.pdf
>
> Hope it helps. And yes, this isn’t your father’s Z80!
>
>
> Alberto.
>
>
> ----- Original Message -----
> From: “David R. Cattley”
> To: “Windows System Software Devs Interest List”
> Sent: Saturday, October 27, 2007 11:16 AM
> Subject: RE: RE:[ntdev] interlocked instructions perform better with more
> processors ?
>
>
>> Thanks. I appreciate the explanation (and the complexity!) I was not
>> taking into account a ‘write miss’ in a write-back cache and it sure now
>> makes sense that snooping something that has not yet happened could not
>> possibly solve that issue.
>>
>> One thing is for sure, this is not ‘your father’s Z80’ (to shamelessly
>> rip-off Oldsmobile) we are talking about here.
>>
>> -dave
>>
>> -----Original Message-----
>> From: xxxxx@lists.osr.com
>> [mailto:xxxxx@lists.osr.com] On Behalf Of Alberto Moreira
>> Sent: Friday, October 26, 2007 10:42 PM
>> To: Windows System Software Devs Interest List
>> Subject: Re: RE:[ntdev] interlocked instructions perform better with more
>> processors ?
>>
>> If you have a writeback cache, a write miss may not propagate into the
>> bus.
>> That happens if the cache line is marked Exclusive. If it is marked
>> Shared,
>> a “Read and Invalidate” transaction indeed goes to the bus, and those
>> processors that have the cache line marked as Shared will invalidate
>> their
>> cache lines. I don’t know if the Pentium directly updates cache lines
>> directly from snoops, you may be right. What I do know is that if a
>> processor has a write miss and another processor has that line marked
>> Modified, that processor will put the data out onto the bus. But the
>> protocol is complicated, and I’m not too sure I grasp all of its ins and
>> outs!
>>
>> Alberto.
>>
>>
>> —
>> NTDEV is sponsored by OSR
>>
>> For our schedule of WDF, WDM, debugging and other seminars visit:
>> http://www.osr.com/seminars
>>
>> To unsubscribe, visit the List Server section of OSR Online at
>> http://www.osronline.com/page.cfm?name=ListServer
>
>



I may be wrong, but I suspect that Intel doesn’t call xchg “serializing”
because although it serializes most memory accesses, it does not serialize
Write-Combine and other “weakly ordered” cases. This shouldn’t be the case
in normal operation, or am I wrong ? I wonder if someone from Intel is
reading this and could shed some light on it!

Alberto.

----- Original Message -----
From:
To: “Windows System Software Devs Interest List”
Sent: Saturday, October 27, 2007 9:31 PM
Subject: RE:[ntdev] interlocked instructions perform better with more
processors ?

>> If it is true that XCHG is always atomic, does this imply it is safe in a
>> multiprocessor environment ?
>
> According to Intel Manuals, LOCK# signal gets asserted automatically when
> you issue XCHG instruction, so that there is no need for LOCK prefix in
> order to make it atomic…
>
>> Then what about memory ordering ?
>
> This is a good question…
>
> Whenever you use LOCK prefix, it automatically acts as an implicit memory
> barrier. However, I think that XCHG that you have issued without LOCK
> prefix does not act as a memory barrier, although this instruction asserts
> LOCK# signal the way LOCK prefix does - after all, Intel Manual does not
> mention XCHG when it discusses serializing instructions…
>
>
> Anton Bassov

Thanks Alberto,

That explains what I thought was the burning question of this thread, though I might be wrong since I did not follow the thread from the beginning…

  1. Correctness: Let’s suppose that one sets out to use assembler instructions (instead of using the DDI, for whatever reason) to implement some locking to protect shared resources. The question would be: do we need to use those fencing instructions explicitly before we go on to implement the example given in the reference material you supplied as a link (or whatever is given in Vol. 3 of Intel’s reference manual)? It surely looks like we don’t need to (even if the policy is write-back).

  2. Performance: Some of the points you alluded to are very helpful, though surely microscopic. Also, for everyone, there is a good chance to “miss the whole nine yards” if one does not follow the 90-10 rule of performance witchcraft, that is, “usually 10% of the code uses 90% of the CPU(s)”. On that, Peter W’s advice is a long-standing rule followed by a lot of performance people, so it is hard (if not impossible) to ignore the value of that advice too…

-pro
----- Original Message -----
From: Alberto Moreira
To: Windows System Software Devs Interest List
Sent: Sunday, October 28, 2007 6:46 AM
Subject: Re: Re:[ntdev] RE:interlocked instructions perform better with more processors ?
