How does InterlockedExchange work

> > But the discussion was, are there “lock free” algorithms for architectures

> that don’t have
> hardware interlocking.

OK…

Therefore, let me try to solve the problem that your professor has presented
to you - let me try to implement a spinlock on such an arch, and do it is plain
C…

Our lock variable has to be initialized to zero. Now let’s acquire a lock

Could the following happen for two hardware threads trying to acquire the lock at the same time…

void get_lock( unsigned long *lock)
{

unsigned long temp1=0,temp2=0;

while(1)

{

while(1)
{

temp1=*lock;

Both threads read value 0 into temp1

*lock++;

We are assuming this isn’t an atomic operation on all architectures right? So ++ is the same as:

tmp = *lock; (both threads read value 0 from *lock into a register)
tmp = tmp + 1; (both threads increment the register to 1)
*lock = tmp; (both threads write the value 1 back to *lock)

temp2=*lock;

Both threads read value 1 into temp2

if(temp2==1+temp1)break;

Both threads exit the inner loop

*lock–;

}

if(temp2==1)break;

temp2 == 1 is true, so both threads exit the other loop

*lock–;

}

}

Both threads have acquired the lock.

Maybe I’m missing something? (sarcasm/irony perhaps?)

James

>So ++ is the same as:

tmp = *lock; (both threads read value 0 from *lock into a register)
tmp = tmp + 1; (both threads increment the register to 1)
*lock = tmp; (both threads write the value 1 back to *lock)

I have realized it after having already made my post already - indeed, it may happen that two hardware cores/threads that have their own registers (both external and internal ones) may get the identical views of data, and reloading registers is not going to help in any possible way if there is no hardware -assisted serialization provided…

Therefore, the whole thing is crap…

Anton Bassov

On Fri, Oct 11, 2013 at 9:57 PM, wrote:

> But the discussion was, are there “lock free” algorithms for architectures
> that don’t have hardware interlocking. If you use hardware interlocking,
> then you can’t call it “lock free”

In an earlier post you wrote that there’s no such thing as a “lock
free” algorithm, and now you write this. “Lock freedom” has NOTHING
to do with locks per se. It’s simply a way of expressing the notion
that, given sufficient time, at least one of the many threads
executing a lock-free algorithm will make some progress. That is, that
the threads won’t block each other forever, with none of them making
any forward progress.

The following algorithm is dated 1981 but seems to satisfy the requirements without any hardware assisted synchronization (memory reordering notwithstanding). 

http://en.m.wikipedia.org/wiki/Peterson’s_algorithm

On Sat, Oct 12, 2013 at 6:37 AM, null wrote:

> In the past, many architectures supported what was called
> “read-pause-write”, the equivalent of LOCK, on certain instructions. For
> example, on the PDP-11 (and, consequently, almost certainly on the VAX),
> INC and DEC were always done with read-pause-write. The problem with this
> it that it wastes memory bandwidth, hence the optional LOCK prefix on the
> x86 family. It is also nice that the LOCK prefix can apply to many
> different instructions.
> There is no such thing as a “lock free” algorithm, although some limited
> cases (a circular buffer accessed by at most two threads comes to mind)
> can be made to work without explicit locking. However, like the “locked
> access” example we had to do in Algol, an algorithm that requires
> simultaneous access must have some known invariant. The circular buffer
> case usually works because the “head” is accessed by exactly one thread
> and the “tail” is accessed by exactly one thread, meaning that locking was
> not needed. But the number of “lock-free” algorithms I’ve seen always
> seem to depend at some level on instruction atomicity, or are so
> incredibly fragile that attempts to adopt them usually lead to some error
> that invalidates the complex set of preconditions that make them work.
> And all too often, it is an attempt to generalize a two-thread-safe model
> to an N-thread-safe model. I’ve seen a few of these spectacular failures,
> caused by people who felt that those synchronization primitives like spin
> locks, mutexes, and semaphores were “too complicated” and they had a
> “simple solution”. Yeah, right.
> joe
>>> IMHO this paradigm is erroneously called ‘lock free’, but really is is
>>> hardware lock based instead
>>> of software lock based.
>>
>> In order to realize how erroneous calling these functions lock-free is you
>> just have to keep in mind that
>> not all architectures that support MP provide the exhaustive sets of
>> operations that may use bus locking.
>> Don’t forget that on many architectures (they are known as load-and-store
>> ones) you cannot perform arithmetic and logic operations directly on
>> memory operands - instead, you have to load operands to registers,
>> perform an operation in registers, and then store the results in memory.
>>
>>
>> It is (hopefully) obvious that you cannot implement functions like
>> InterlockedIncrement() on such an architecture like that in one go,
>> because ‘LOCK INC [.memory]’ -like operation is obviously impossible on
>> it. Instead, you have to, first, obtain a test-and-set-based lock, then
>> update the target variable, and then release the lock…
>>
>>
>>
>> Anton Bassov
>>
>>
>> —
>> NTDEV is sponsored by OSR
>>
>> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
>>
>> OSR is HIRING!! See http://www.osr.com/careers
>>
>> For our schedule of WDF, WDM, debugging and other seminars visit:
>> http://www.osr.com/seminars
>>
>> To unsubscribe, visit the List Server section of OSR Online at
>> http://www.osronline.com/page.cfm?name=ListServer
>>
> —
> NTDEV is sponsored by OSR
> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
> OSR is HIRING!! See http://www.osr.com/careers
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
> To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

The first such algorithm w/o hardware assist was by Dekker…

It reminds us, how the things shaped up over the years and there were defining moments to define the requirements, and then amendment(s) to it…

For a lock, initially : Mutual exclusion, deadlock free, and free of starvation. Then came the fairness which is a part of starvation, but somewhat more assertive. Hence came the queued locks.

Another one is like: Overlay, segmentation, sharing, paging… Lot of it started in software, then lands in hardware !

-pro

Date: Sat, 12 Oct 2013 02:10:44 -0700
From: xxxxx@gmail.com
To: xxxxx@lists.osr.com
CC: xxxxx@lists.osr.com
Subject: RE:[ntdev] How does InterlockedExchange work

The following algorithm is dated 1981 but seems to satisfy the requirements without any hardware assisted synchronization (memory reordering notwithstanding).

http://en.m.wikipedia.org/wiki/Peterson’s_algorithm

On Sat, Oct 12, 2013 at 6:37 AM, xxxxx@flounder.com wrote:

In the past, many architectures supported what was called

“read-pause-write”, the equivalent of LOCK, on certain instructions. For

example, on the PDP-11 (and, consequently, almost certainly on the VAX),

INC and DEC were always done with read-pause-write. The problem with this

it that it wastes memory bandwidth, hence the optional LOCK prefix on the

x86 family. It is also nice that the LOCK prefix can apply to many

different instructions.

There is no such thing as a “lock free” algorithm, although some limited

cases (a circular buffer accessed by at most two threads comes to mind)

can be made to work without explicit locking. However, like the “locked

access” example we had to do in Algol, an algorithm that requires

simultaneous access must have some known invariant. The circular buffer

case usually works because the “head” is accessed by exactly one thread

and the “tail” is accessed by exactly one thread, meaning that locking was

not needed. But the number of “lock-free” algorithms I’ve seen always

seem to depend at some level on instruction atomicity, or are so

incredibly fragile that attempts to adopt them usually lead to some error

that invalidates the complex set of preconditions that make them work.

And all too often, it is an attempt to generalize a two-thread-safe model

to an N-thread-safe model. I’ve seen a few of these spectacular failures,

caused by people who felt that those synchronization primitives like spin

locks, mutexes, and semaphores were “too complicated” and they had a

“simple solution”. Yeah, right.

joe

>> IMHO this paradigm is erroneously called ‘lock free’, but really is is

>> hardware lock based instead

>> of software lock based.

>

> In order to realize how erroneous calling these functions lock-free is you

> just have to keep in mind that

> not all architectures that support MP provide the exhaustive sets of

> operations that may use bus locking.

> Don’t forget that on many architectures (they are known as load-and-store

> ones) you cannot perform arithmetic and logic operations directly on

> memory operands - instead, you have to load operands to registers,

> perform an operation in registers, and then store the results in memory.

>

>

> It is (hopefully) obvious that you cannot implement functions like

> InterlockedIncrement() on such an architecture like that in one go,

> because ‘LOCK INC [.memory]’ -like operation is obviously impossible on

> it. Instead, you have to, first, obtain a test-and-set-based lock, then

> update the target variable, and then release the lock…

>

>

>

> Anton Bassov

>

>

> —

> NTDEV is sponsored by OSR

>

> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

>

> OSR is HIRING!! See http://www.osr.com/careers

>

> For our schedule of WDF, WDM, debugging and other seminars visit:

> http://www.osr.com/seminars

>

> To unsubscribe, visit the List Server section of OSR Online at

> http://www.osronline.com/page.cfm?name=ListServer

>



NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:

http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Regarding IA32 (including x64), the interlocked operations don’t actually lock the bus (whatever it is). QPI or other inter-processor bus is still free to perform other transactions.

The lock is only applied to the cache line. read-modify-write cycle is performed atomically on a cache line. The CPU brings the cache line in. If it’s not modified, it’s marked as modified, which (in case of a shared unmodified line) will broadcast cache line invalidation message to other cores. Then the read-modify write is performed. This sequence guarantees no other processor has the cache line loaded, so no other processor could have modified the variable in between read and write.

If multiple processors are trying to issue an interlocked operation on the same cache line, this will cause a lot of cache traffic between cores and has to be avoided.

Cache line locking can be used as an optimization under certain circumstances, but the processor will still resort to LOCK# in various degenerate cases (e.g. uncacheable memory, locked operations that straddle multiple cache lines, etc.). It’s not a safe assumption to make that other accesses can continue to be made during a locked operation (though this is transparent to most, well-behaved use cases & system designs).

  • S (Msft)

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@broadcom.com
Sent: Saturday, October 12, 2013 12:18 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] How does InterlockedExchange work

Regarding IA32 (including x64), the interlocked operations don’t actually lock the bus (whatever it is). QPI or other inter-processor bus is still free to perform other transactions.

The lock is only applied to the cache line. read-modify-write cycle is performed atomically on a cache line. The CPU brings the cache line in. If it’s not modified, it’s marked as modified, which (in case of a shared unmodified line) will broadcast cache line invalidation message to other cores. Then the read-modify write is performed. This sequence guarantees no other processor has the cache line loaded, so no other processor could have modified the variable in between read and write.

If multiple processors are trying to issue an interlocked operation on the same cache line, this will cause a lot of cache traffic between cores and has to be avoided.


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

I am not going to even pretend I have a clue about the details something I
learned 46 years ago. What I do remember was that it resembles the
“two-phase commit” of transacted file systems; it only used Boolean
variables; and the proof techniques strained my relatively unsophisticated
grasp of algorithmic proof that I had in those days. So I can’t be of
much help, except that I know it existed once.
joe

> > But the discussion was, are there “lock free” algorithms for
> architectures
> > that don’t have
> > hardware interlocking.
>
>
> OK…
>
>
> Therefore, let me try to solve the problem that your professor has
> presented
> to you - let me try to implement a spinlock on such an arch, and do it
> is plain
> C…
>
>
> Our lock variable has to be initialized to zero. Now let’s acquire a
> lock
>

Could the following happen for two hardware threads trying to acquire the
lock at the same time…

>
> void get_lock( unsigned long *lock)
> {
>
> unsigned long temp1=0,temp2=0;
>
> while(1)
>
> {
>
>
> while(1)
> {
>
> temp1=*lock;

Both threads read value 0 into temp1

>
> *lock++;

We are assuming this isn’t an atomic operation on all architectures right?
So ++ is the same as:

tmp = *lock; (both threads read value 0 from *lock into a register)
tmp = tmp + 1; (both threads increment the register to 1)
*lock = tmp; (both threads write the value 1 back to *lock)

>
> temp2=*lock;
>

Both threads read value 1 into temp2

> if(temp2==1+temp1)break;

Both threads exit the inner loop

>
> *lock–;
>
> }
>
>
> if(temp2==1)break;

temp2 == 1 is true, so both threads exit the other loop

>
> *lock–;
>
> }
>
> }
>

Both threads have acquired the lock.

Maybe I’m missing something? (sarcasm/irony perhaps?)

James


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

From Intel manual —

For the Intel486? and Pentium? processors, the LOCK# signal is always asserted on the bus

during a LOCK operation, even if the area of memory being locked is cached in the processor.

For the P6 family processors, if the area of memory being locked during a LOCK operation is
cached in the processor that is performing the LOCK operation as write-back memory and is
completely contained in a cache line, the processor may not assert the LOCK# signal on the bus.
Instead, it will modify the memory location internally and allow it?s cache coherency mechanism
to insure that the operation is carried out atomically. This operation is called ?cache
locking.? The cache coherency mechanism automatically prevents two or more processors that
have cached the same area of memory from simultaneously modifying data in that area.

16: old = _InterlockedOr(&cur, 0xF0);
0:000> u
test!main+0x10 [d:\work\size\main.cpp @ 16]:
00e611d0 b9f0000000 mov ecx,0F0h
00e611d5 8d55f8 lea edx,[ebp-8] <— it tells it is intrinsic!!
00e611d8 8b02 mov eax,dword ptr [edx]
00e611da 8bf0 mov esi,eax
00e611dc 0bf1 or esi,ecx <— esi hold the OR’d value
00e611de f00fb132 lock cmpxchg dword ptr [edx],esi <- read/modify/write
00e611e2 75f6 jne test!main+0x1a (00e611da) <- only way to know that or’d value is committed !!

00e611e4 8945fc mov dword ptr [ebp-4],eax
0:000> p

17: _InterlockedOr(&cur, 0xF0);
0:000> u
test!main+0x27 [d:\work\size\main.cpp @ 17]:
00e611e7 b9f0000000 mov ecx,0F0h
00e611ec 8d55f8 lea edx,[ebp-8]
00e611ef f0090a lock or dword ptr [edx],ecx <- read/modify/write, but no guarantee !!!

From: xxxxx@valhallalegends.com
To: xxxxx@lists.osr.com
Subject: RE: RE:[ntdev] How does InterlockedExchange work
Date: Sat, 12 Oct 2013 19:38:05 +0000

Cache line locking can be used as an optimization under certain circumstances, but the processor will still resort to LOCK# in various degenerate cases (e.g. uncacheable memory, locked operations that straddle multiple cache lines, etc.). It’s not a safe assumption to make that other accesses can continue to be made during a locked operation (though this is transparent to most, well-behaved use cases & system designs).

  • S (Msft)

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@broadcom.com
Sent: Saturday, October 12, 2013 12:18 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] How does InterlockedExchange work

Regarding IA32 (including x64), the interlocked operations don’t actually lock the bus (whatever it is). QPI or other inter-processor bus is still free to perform other transactions.

The lock is only applied to the cache line. read-modify-write cycle is performed atomically on a cache line. The CPU brings the cache line in. If it’s not modified, it’s marked as modified, which (in case of a shared unmodified line) will broadcast cache line invalidation message to other cores. Then the read-modify write is performed. This sequence guarantees no other processor has the cache line loaded, so no other processor could have modified the variable in between read and write.

If multiple processors are trying to issue an interlocked operation on the same cache line, this will cause a lot of cache traffic between cores and has to be avoided.


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Not sure what you mean below by “no guarantee”. Both of these are
guaranteed to OR in a value properly; in the first case, the old value has
to be retained, so the lock operation on the exchange means that if the
value changed while the computation was going on, the value is discarded,
and the operation is performed with the new value. I first encountered
this pattern when looking at how an atomic “*=” was implemented in OpenMP.
If you look at both sequences, they do exactly the same thing to the
target location, that is, guarantee that it has been altered atomically,
without any error caused by a concurrent modify. The need to preserve the
old value is what makes the first one more complex.
joe

From Intel manual —

For the Intel486™ and Pentium® processors, the LOCK# signal is always
asserted on the bus

during a LOCK operation, even if the area of memory being locked is cached
in the processor.

For the P6 family processors, if the area of memory being locked during a
LOCK operation is
cached in the processor that is performing the LOCK operation as
write-back memory and is
completely contained in a cache line, the processor may not assert the
LOCK# signal on the bus.
Instead, it will modify the memory location internally and allow it’s
cache coherency mechanism
to insure that the operation is carried out atomically. This operation is
called “cache
locking.” The cache coherency mechanism automatically prevents two or more
processors that
have cached the same area of memory from simultaneously modifying data in
that area.

> 16: old = _InterlockedOr(&cur, 0xF0);
0:000> u
test!main+0x10 [d:\work\size\main.cpp @ 16]:
00e611d0 b9f0000000 mov ecx,0F0h
00e611d5 8d55f8 lea edx,[ebp-8] <— it tells it is intrinsic!!
00e611d8 8b02 mov eax,dword ptr [edx]
00e611da 8bf0 mov esi,eax
00e611dc 0bf1 or esi,ecx <— esi hold the OR’d value
00e611de f00fb132 lock cmpxchg dword ptr [edx],esi <- read/modify/write
00e611e2 75f6 jne test!main+0x1a (00e611da) <- only way to know that or’d
value is committed !!

00e611e4 8945fc mov dword ptr [ebp-4],eax
0:000> p

> 17: _InterlockedOr(&cur, 0xF0);
0:000> u
test!main+0x27 [d:\work\size\main.cpp @ 17]:
00e611e7 b9f0000000 mov ecx,0F0h
00e611ec 8d55f8 lea edx,[ebp-8]
00e611ef f0090a lock or dword ptr [edx],ecx <- read/modify/write, but no
guarantee !!!

> From: xxxxx@valhallalegends.com
> To: xxxxx@lists.osr.com
> Subject: RE: RE:[ntdev] How does InterlockedExchange work
> Date: Sat, 12 Oct 2013 19:38:05 +0000
>
> Cache line locking can be used as an optimization under certain
> circumstances, but the processor will still resort to LOCK# in various
> degenerate cases (e.g. uncacheable memory, locked operations that
> straddle multiple cache lines, etc.). It’s not a safe assumption to
> make that other accesses can continue to be made during a locked
> operation (though this is transparent to most, well-behaved use cases &
> system designs).
>
> - S (Msft)
>
>
> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com] On Behalf Of
> xxxxx@broadcom.com
> Sent: Saturday, October 12, 2013 12:18 PM
> To: Windows System Software Devs Interest List
> Subject: RE:[ntdev] How does InterlockedExchange work
>
> Regarding IA32 (including x64), the interlocked operations don’t
> actually lock the bus (whatever it is). QPI or other inter-processor bus
> is still free to perform other transactions.
>
> The lock is only applied to the cache line. read-modify-write cycle is
> performed atomically on a cache line. The CPU brings the cache line in.
> If it’s not modified, it’s marked as modified, which (in case of a
> shared unmodified line) will broadcast cache line invalidation message
> to other cores. Then the read-modify write is performed. This sequence
> guarantees no other processor has the cache line loaded, so no other
> processor could have modified the variable in between read and write.
>
> If multiple processors are trying to issue an interlocked operation on
> the same cache line, this will cause a lot of cache traffic between
> cores and has to be avoided.
>
> —
> NTDEV is sponsored by OSR
>
> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
>
> OSR is HIRING!! See http://www.osr.com/careers
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer
>
> —
> NTDEV is sponsored by OSR
>
> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
>
> OSR is HIRING!! See http://www.osr.com/careers
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

What I meant to say by - “not guranteed” is that it is either committed or not committed, right after that instruction. The state of the physical field in the memory is unknown. It is the responsibility of the underlying interconnect to make sure that next time it is being used ( specially on a different processor), it will push thru to backed phy pages… This is the part of the optimization…

Basically, they are semantically equivalent…

In fact, it is better to think at the semantic level of the programmers model, and intentionally avoid the H/W or interconnect’s business, since this to begin with (1) should not clutter our thinking process, IMO, and (2) interconnect could be different for different memory structure(s). For example, NUMA could very well have a different interconnect. As long as the cache coherency protocol is consistent we should be fine —

-pro

Date: Sun, 13 Oct 2013 17:30:12 -0400
Subject: RE: [ntdev] How does InterlockedExchange work
From: xxxxx@flounder.com
To: xxxxx@lists.osr.com

Not sure what you mean below by “no guarantee”. Both of these are
guaranteed to OR in a value properly; in the first case, the old value has
to be retained, so the lock operation on the exchange means that if the
value changed while the computation was going on, the value is discarded,
and the operation is performed with the new value. I first encountered
this pattern when looking at how an atomic “*=” was implemented in OpenMP.
If you look at both sequences, they do exactly the same thing to the
target location, that is, guarantee that it has been altered atomically,
without any error caused by a concurrent modify. The need to preserve the
old value is what makes the first one more complex.
joe

> From Intel manual —
>
>
> For the Intel486? and Pentium? processors, the LOCK# signal is always
> asserted on the bus
>
> during a LOCK operation, even if the area of memory being locked is cached
> in the processor.
>
> For the P6 family processors, if the area of memory being locked during a
> LOCK operation is
> cached in the processor that is performing the LOCK operation as
> write-back memory and is
> completely contained in a cache line, the processor may not assert the
> LOCK# signal on the bus.
> Instead, it will modify the memory location internally and allow it?s
> cache coherency mechanism
> to insure that the operation is carried out atomically. This operation is
> called ?cache
> locking.? The cache coherency mechanism automatically prevents two or more
> processors that
> have cached the same area of memory from simultaneously modifying data in
> that area.
>
>
>> 16: old = _InterlockedOr(&cur, 0xF0);
> 0:000> u
> test!main+0x10 [d:\work\size\main.cpp @ 16]:
> 00e611d0 b9f0000000 mov ecx,0F0h
> 00e611d5 8d55f8 lea edx,[ebp-8] <— it tells it is intrinsic!!
> 00e611d8 8b02 mov eax,dword ptr [edx]
> 00e611da 8bf0 mov esi,eax
> 00e611dc 0bf1 or esi,ecx <— esi hold the OR’d value
> 00e611de f00fb132 lock cmpxchg dword ptr [edx],esi <- read/modify/write
> 00e611e2 75f6 jne test!main+0x1a (00e611da) <- only way to know that or’d
> value is committed !!
>
> 00e611e4 8945fc mov dword ptr [ebp-4],eax
> 0:000> p
>
>> 17: _InterlockedOr(&cur, 0xF0);
> 0:000> u
> test!main+0x27 [d:\work\size\main.cpp @ 17]:
> 00e611e7 b9f0000000 mov ecx,0F0h
> 00e611ec 8d55f8 lea edx,[ebp-8]
> 00e611ef f0090a lock or dword ptr [edx],ecx <- read/modify/write, but no
> guarantee !!!
>
>
>
>
>
>> From: xxxxx@valhallalegends.com
>> To: xxxxx@lists.osr.com
>> Subject: RE: RE:[ntdev] How does InterlockedExchange work
>> Date: Sat, 12 Oct 2013 19:38:05 +0000
>>
>> Cache line locking can be used as an optimization under certain
>> circumstances, but the processor will still resort to LOCK# in various
>> degenerate cases (e.g. uncacheable memory, locked operations that
>> straddle multiple cache lines, etc.). It’s not a safe assumption to
>> make that other accesses can continue to be made during a locked
>> operation (though this is transparent to most, well-behaved use cases &
>> system designs).
>>
>> - S (Msft)
>>
>>
>> -----Original Message-----
>> From: xxxxx@lists.osr.com
>> [mailto:xxxxx@lists.osr.com] On Behalf Of
>> xxxxx@broadcom.com
>> Sent: Saturday, October 12, 2013 12:18 PM
>> To: Windows System Software Devs Interest List
>> Subject: RE:[ntdev] How does InterlockedExchange work
>>
>> Regarding IA32 (including x64), the interlocked operations don’t
>> actually lock the bus (whatever it is). QPI or other inter-processor bus
>> is still free to perform other transactions.
>>
>> The lock is only applied to the cache line. read-modify-write cycle is
>> performed atomically on a cache line. The CPU brings the cache line in.
>> If it’s not modified, it’s marked as modified, which (in case of a
>> shared unmodified line) will broadcast cache line invalidation message
>> to other cores. Then the read-modify write is performed. This sequence
>> guarantees no other processor has the cache line loaded, so no other
>> processor could have modified the variable in between read and write.
>>
>> If multiple processors are trying to issue an interlocked operation on
>> the same cache line, this will cause a lot of cache traffic between
>> cores and has to be avoided.
>>
>> —
>> NTDEV is sponsored by OSR
>>
>> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
>>
>> OSR is HIRING!! See http://www.osr.com/careers
>>
>> For our schedule of WDF, WDM, debugging and other seminars visit:
>> http://www.osr.com/seminars
>>
>> To unsubscribe, visit the List Server section of OSR Online at
>> http://www.osronline.com/page.cfm?name=ListServer
>>
>> —
>> NTDEV is sponsored by OSR
>>
>> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
>>
>> OSR is HIRING!! See http://www.osr.com/careers
>>
>> For our schedule of WDF, WDM, debugging and other seminars visit:
>> http://www.osr.com/seminars
>>
>> To unsubscribe, visit the List Server section of OSR Online at
>> http://www.osronline.com/page.cfm?name=ListServer
>
> —
> NTDEV is sponsored by OSR
>
> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
>
> OSR is HIRING!! See http://www.osr.com/careers
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Yes. The “logical guarantee”, the contract between the x86 hardware
designer, and the x86 programmer, is to guarantee certain behaviors, such
as apparent sequentiality of instructions, and the promise of atomic
actions as specified in tbe architecture documents. Just like I don’t
have to worry about the motility of “holes” and “electrons” in silicon
(electrons are more motile than holes, which is part of the reason for
2-phase clocks), or the dielectric properties of the printed circuit
board, I don’t have to worry that the x86 is an asynchronous opportunistic
RISC machine with serious pipelinig, or the memory system is implemented
by an asynchronous multi-level caching mechanism. How the “lock”
abstraction is implemented outside the chip doesn’t matter, as long as it
meets the specs of the x86 hardware documentation of how the instructions
work.

You are right that there is no guarantee the data is in the memory, as
long as the memory system maintains the ILLUSION that it is there. Just
like the onboard chip logic maintains the illusion of instruction
sequentiality and that there is actually an {R,E,}AX register (and
friends), which is really handled by having a large register set and using
dynamic register renaming to get a significant performance edge.

LOCK works because it /must/ work, or our contract with the hardware has
been violated.
joe

What I meant to say by - “not guranteed” is that it is either committed
or not committed, right after that instruction. The state of the physical
field in the memory is unknown. It is the responsibility of the underlying
interconnect to make sure that next time it is being used ( specially on a
different processor), it will push thru to backed phy pages… This is the
part of the optimization…

Basically, they are semantically equivalent…

In fact, it is better to think at the semantic level of the programmers
model, and intentionally avoid the H/W or interconnect’s business, since
this to begin with (1) should not clutter our thinking process, IMO, and
(2) interconnect could be different for different memory structure(s). For
example, NUMA could very well have a different interconnect. As long as
the cache coherency protocol is consistent we should be fine —

-pro

> Date: Sun, 13 Oct 2013 17:30:12 -0400
> Subject: RE: [ntdev] How does InterlockedExchange work
> From: xxxxx@flounder.com
> To: xxxxx@lists.osr.com
>
> Not sure what you mean below by “no guarantee”. Both of these are
> guaranteed to OR in a value properly; in the first case, the old value
> has
> to be retained, so the lock operation on the exchange means that if the
> value changed while the computation was going on, the value is
> discarded,
> and the operation is performed with the new value. I first encountered
> this pattern when looking at how an atomic “*=” was implemented in
> OpenMP.
> If you look at both sequences, they do exactly the same thing to the
> target location, that is, guarantee that it has been altered atomically,
> without any error caused by a concurrent modify. The need to preserve
> the
> old value is what makes the first one more complex.
> joe
>
>
> > From Intel manual —
> >
> >
> > For the Intel486™ and Pentium® processors, the LOCK# signal is always
> > asserted on the bus
> >
> > during a LOCK operation, even if the area of memory being locked is
> cached
> > in the processor.
> >
> > For the P6 family processors, if the area of memory being locked
> during a
> > LOCK operation is
> > cached in the processor that is performing the LOCK operation as
> > write-back memory and is
> > completely contained in a cache line, the processor may not assert the
> > LOCK# signal on the bus.
> > Instead, it will modify the memory location internally and allow it’s
> > cache coherency mechanism
> > to insure that the operation is carried out atomically. This operation
> is
> > called “cache
> > locking.” The cache coherency mechanism automatically prevents two or
> more
> > processors that
> > have cached the same area of memory from simultaneously modifying data
> in
> > that area.
> >
> >
> >> 16: old = _InterlockedOr(&cur, 0xF0);
> > 0:000> u
> > test!main+0x10 [d:\work\size\main.cpp @ 16]:
> > 00e611d0 b9f0000000 mov ecx,0F0h
> > 00e611d5 8d55f8 lea edx,[ebp-8] <— it tells it is intrinsic!!
> > 00e611d8 8b02 mov eax,dword ptr [edx]
> > 00e611da 8bf0 mov esi,eax
> > 00e611dc 0bf1 or esi,ecx <— esi hold the OR’d value
> > 00e611de f00fb132 lock cmpxchg dword ptr [edx],esi <-
> read/modify/write
> > 00e611e2 75f6 jne test!main+0x1a (00e611da) <- only way to know that
> or’d
> > value is committed !!
> >
> > 00e611e4 8945fc mov dword ptr [ebp-4],eax
> > 0:000> p
> >
> >> 17: _InterlockedOr(&cur, 0xF0);
> > 0:000> u
> > test!main+0x27 [d:\work\size\main.cpp @ 17]:
> > 00e611e7 b9f0000000 mov ecx,0F0h
> > 00e611ec 8d55f8 lea edx,[ebp-8]
> > 00e611ef f0090a lock or dword ptr [edx],ecx <- read/modify/write, but
> no
> > guarantee !!!
> >
> >
> >
> >
> >
> >> From: xxxxx@valhallalegends.com
> >> To: xxxxx@lists.osr.com
> >> Subject: RE: RE:[ntdev] How does InterlockedExchange work
> >> Date: Sat, 12 Oct 2013 19:38:05 +0000
> >>
> >> Cache line locking can be used as an optimization under certain
> >> circumstances, but the processor will still resort to LOCK# in
> various
> >> degenerate cases (e.g. uncacheable memory, locked operations that
> >> straddle multiple cache lines, etc.). It’s not a safe assumption to
> >> make that other accesses can continue to be made during a locked
> >> operation (though this is transparent to most, well-behaved use cases
> &
> >> system designs).
> >>
> >> - S (Msft)
> >>
> >>
> >> -----Original Message-----
> >> From: xxxxx@lists.osr.com
> >> [mailto:xxxxx@lists.osr.com] On Behalf Of
> >> xxxxx@broadcom.com
> >> Sent: Saturday, October 12, 2013 12:18 PM
> >> To: Windows System Software Devs Interest List
> >> Subject: RE:[ntdev] How does InterlockedExchange work
> >>
> >> Regarding IA32 (including x64), the interlocked operations don’t
> >> actually lock the bus (whatever it is). QPI or other inter-processor
> bus
> >> is still free to perform other transactions.
> >>
> >> The lock is only applied to the cache line. read-modify-write cycle
> is
> >> performed atomically on a cache line. The CPU brings the cache line
> in.
> >> If it’s not modified, it’s marked as modified, which (in case of a
> >> shared unmodified line) will broadcast cache line invalidation
> message
> >> to other cores. Then the read-modify write is performed. This
> sequence
> >> guarantees no other processor has the cache line loaded, so no other
> >> processor could have modified the variable in between read and write.
> >>
> >> If multiple processors are trying to issue an interlocked operation
> on
> >> the same cache line, this will cause a lot of cache traffic between
> >> cores and has to be avoided.
> >>
> >> —
> >> NTDEV is sponsored by OSR
> >>
> >> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
> >>
> >> OSR is HIRING!! See http://www.osr.com/careers
> >>
> >> For our schedule of WDF, WDM, debugging and other seminars visit:
> >> http://www.osr.com/seminars
> >>
> >> To unsubscribe, visit the List Server section of OSR Online at
> >> http://www.osronline.com/page.cfm?name=ListServer
> >>
> >> —
> >> NTDEV is sponsored by OSR
> >>
> >> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
> >>
> >> OSR is HIRING!! See http://www.osr.com/careers
> >>
> >> For our schedule of WDF, WDM, debugging and other seminars visit:
> >> http://www.osr.com/seminars
> >>
> >> To unsubscribe, visit the List Server section of OSR Online at
> >> http://www.osronline.com/page.cfm?name=ListServer
> >
> > —
> > NTDEV is sponsored by OSR
> >
> > Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
> >
> > OSR is HIRING!! See http://www.osr.com/careers
> >
> > For our schedule of WDF, WDM, debugging and other seminars visit:
> > http://www.osr.com/seminars
> >
> > To unsubscribe, visit the List Server section of OSR Online at
> > http://www.osronline.com/page.cfm?name=ListServer
>
>
>
> —
> NTDEV is sponsored by OSR
>
> Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
>
> OSR is HIRING!! See http://www.osr.com/careers
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer