How does InterlockedExchange work

Hi,

I have a Windows 7 x86/amd64 driver where I want to synchronize access to a variable. Can I use InterlockedExchange for it?

My current understanding is that InterlockedExchange is implemented via compiler intrinsics. That means the read (InterlockedExchange returns the old value) and the write are done in one clock cycle. The interlocked functions are atomic only when the variable is always accessed via an interlocked function.

But what happens in this case:

CPU1: InterlockedExchange(&Adapter->StatusVariable, 5);
CPU2: InterlockedExchange(&Adapter->StatusVariable, 3);

StatusVariable is written in the same clock cycle on two CPU cores. Does the function notice that the variable is accessed and defer the write to a different clock cycle? Or is it undefined which value the variable has after the write? Is it also possible that the variable contains garbage?

This is a cross-post from stackoverflow, because I didn’t get satisfying answers there.
http://stackoverflow.com/questions/18076416/interlockedexchange-on-two-cpu-cores

InterlockedExchange is implemented with the LOCK prefix, which is part of the x86 instruction set. You can read about the prefix in the Intel or AMD manuals.

Execution of a 'locked' instruction guarantees that no other entity (another CPU or core, and probably DMA controllers) can access the memory address during execution of the instruction.

So, with the instruction 'inc dword [SomeAddress]' (which is a read-modify-write instruction) executing in parallel on two cores, the following sequence may occur:

  1. CPU0 reads memory content to internal register.
  2. CPU1 reads memory content to internal register.
  3. CPU0 increments internal register.
  4. CPU1 increments internal register.
  5. CPU0 writes memory content from internal register.
  6. CPU1 writes memory content from internal register.

So we have the variable incremented only once.

Having instruction ‘lock inc dword [SomeAddress]’ the sequence will look like this:

  1. CPU0 reads memory content to internal register.
  2. CPU0 increments internal register.
  3. CPU0 writes memory content from internal register.
  4. CPU1 reads memory content to internal register.
  5. CPU1 increments internal register.
  6. CPU1 writes memory content from internal register.

Which is what you wanted to do.

Not all instructions are allowed to have the prefix, and some instructions have the prefix implicitly. I also don't remember the granularity of the lock. The manuals should contain all the details.
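The lost-update interleaving described above is easy to demonstrate. The sketch below is not from the original posts: it uses C11 atomics and pthreads in place of the kernel-mode Interlocked* APIs, since those are Windows-specific, but `atomic_fetch_add` compiles to the same lock-prefixed read-modify-write on x86.

```c
#include <pthread.h>
#include <stdatomic.h>

#define ITERS 100000

static atomic_int shared_atomic;   /* incremented with a lock-prefixed RMW */
static volatile int shared_plain;  /* incremented with separate read, add, write */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        /* compiles to a locked add on x86, like InterlockedIncrement */
        atomic_fetch_add(&shared_atomic, 1);
        /* two threads can interleave between the read and the write here,
         * losing increments exactly as in the six-step sequence above */
        shared_plain = shared_plain + 1;
    }
    return NULL;
}

int run_demo(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* the atomic counter always ends at exactly 2 * ITERS;
     * the plain counter may come up short */
    return atomic_load(&shared_atomic);
}
```

The atomic counter is deterministic; the plain one is not, which is the whole point of the LOCK prefix.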

Your notion of how it works in some fixed number of “clock cycles” is what’s causing your confusion. The instruction in question may take many (for a very loose definition of many) cycles, depending on how much contention there is on the bus. If you had 16 cores trying to do the same thing on the same address simultaneously, the last one to get the bus is going to take many more cycles than the first one.

The explanation of the LOCK prefix elsewhere in this thread should help you comprehend this.

Phil

Not speaking for LogRhythm
Phil Barila | Senior Software Engineer


InterlockedExchange uses the Intel XCHG instruction. XCHG returns the previous value of the location, so it is a read/write sequence which is locked to be atomic.

The XCHG instruction guarantees that one of the InterlockedExchange operations in your example will complete before the other executes. We do not know whether CPU1 or CPU2 will be the first to complete, but we do know that each exchange will be atomic; the second cannot start until the first is complete. That is, the second to execute will return the value set by the first to execute (5 or 3).
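Those exchange semantics can be sketched portably. This is not the Windows implementation itself: `atomic_exchange` from C11 is used here as a stand-in for InterlockedExchange, and `status_variable` models the `Adapter->StatusVariable` from the question.

```c
#include <stdatomic.h>

/* Models Adapter->StatusVariable from the question; starts at 0. */
static atomic_long status_variable;

/* One atomic step: store the new value and return whatever value was
 * there immediately before.  Same contract as InterlockedExchange. */
long exchange_status(long new_value)
{
    return atomic_exchange(&status_variable, new_value);
}
```

Whichever of the two calls in the question happens to execute second returns the value stored by the first (5 or 3); the variable is never a mix of the two writes, so it cannot contain garbage.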

The CPUs have a hardware line for interprocessor synchronization, the LOCK
signal. The LOCK signal is asserted during the execution of a few
instructions by default and can be selected by the LOCK instruction Prefix
on a few more instructions. This allows the synchronization.

Some instructions are atomic by default:

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

  - Reading or writing a byte
  - Reading or writing a word aligned on a 16-bit boundary
  - Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

  - Reading or writing a quadword aligned on a 64-bit boundary
  - 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

  - Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

Other operations and collections of operations are not atomic by default.
The LOCK prefix is allowed for some instructions to form LOCKED
instructions.

We can use LOCKED instructions to synchronize the other operations and
groups of operations. We also use locked instructions to implement locking
for much longer sequences of instructions. (For example the simplest form
of spinlock can be constructed with the XCHG instruction.)

The reference in the stackoverflow reply is the definitive reference for multiprocessor operations:

"See 'Bus Locking' in Chapter 8, 'Multiple-Processor Management', of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, for more information on bus locking."

http://download.intel.com/design/processor/manuals/253668.pdf


The LOCK prefix (or an implicitly-locked instruction) covers a single instruction. So I'm not sure what you meant by "groups of operations" in the earlier explanation: if you execute three LOCKed instructions in a row, the combination of accidental sequencing and the built-in lock logic means they may be interleaved in every possible order between two or more cores. Even if the first instruction is granted to CPU0, the lock does not persist across the next instruction, even if that one is LOCKed as well. So the only "groups" of operations that can be done are those that fit in a single instruction (e.g., InterlockedIncrement, which reads, increments, and stores in one operation, or InterlockedCompareExchange, which does a lot more, but again, only within one instruction).

joe


I was referring to the LOCK mechanism as the basis for multiprocessor
synchronization.

One uses a LOCKED instruction to form a lock to protect the sequence of
instructions (a simple spinlock).

To elaborate on the simple spinlock: we could create a lock variable and initialize it to 0. To acquire the lock we do an XCHG with an argument of 1. If the previous value returned is 0 we have the lock; if it is 1 we do not, and should delay and try again until we get the lock. To release the lock we do the XCHG with an argument of 0. The longer sequence of code sits between the acquire and release and is protected by the (implicit) LOCK on the XCHG instruction.

It is of course better to use the Windows SpinLock.

Ed
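Ed's XCHG spinlock can be sketched as below. This is not the Windows KSPIN_LOCK implementation: it is a minimal illustration using C11 atomics, where `atomic_exchange_explicit` plays the role of the (implicitly locked) XCHG.

```c
#include <stdatomic.h>

/* 0 = free, 1 = held */
static atomic_int lock_word;

void spin_acquire(void)
{
    /* XCHG 1 into the lock word; if the old value was 0 we now hold the
     * lock, otherwise someone else does, so retry until we get it */
    while (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire) != 0)
        ;  /* spin */
}

void spin_release(void)
{
    /* Ed's description releases with another XCHG of 0;
     * the returned old value (1) is simply ignored */
    (void)atomic_exchange_explicit(&lock_word, 0, memory_order_release);
}
```

The protected code goes between `spin_acquire()` and `spin_release()`. As Ed says, in a real driver you should use the Windows spinlock (KeAcquireSpinLock and friends) rather than rolling your own.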


Note also the suggested usage of the PAUSE instruction in a spin-lock wait.
joe
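A spin-wait using PAUSE usually takes the test-and-test-and-set shape sketched below. This is an illustrative variant, not code from the thread: it spins on a plain read (with a PAUSE hint) and only retries the locked exchange when the lock looks free, which keeps locked bus traffic down while waiting. The `cpu_relax` macro is a name introduced here for the hint.

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits the x86 PAUSE instruction */
#else
#define cpu_relax() ((void)0)     /* no-op hint elsewhere */
#endif

static atomic_int pause_lock;

void spin_acquire_with_pause(void)
{
    for (;;) {
        /* try to grab the lock with one atomic exchange */
        if (atomic_exchange_explicit(&pause_lock, 1, memory_order_acquire) == 0)
            return;
        /* wait with plain reads plus PAUSE until the lock looks free;
         * no locked operation is issued while spinning here */
        while (atomic_load_explicit(&pause_lock, memory_order_relaxed) != 0)
            cpu_relax();
    }
}

void spin_release_with_pause(void)
{
    atomic_store_explicit(&pause_lock, 0, memory_order_release);
}
```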


> Some instructions are atomic by default: The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically: Reading or writing a byte

You are mixing up two different types of atomicity: the atomicity of reads and writes on one hand (i.e. of operations performed with the MOV instruction), and the atomicity of interlocked operations on the other. Atomic operations of the former type involve only one cycle to memory (either a read or a write). They cannot be used with the LOCK prefix, for understandable reasons: bus locking in the context of such an operation is totally pointless. Operations of the latter type involve at least two separate cycles to memory (read a memory location into some internal register of the CPU, update that register, then update the target memory location). Now consider what happens if CPU B sneaks in while CPU A's operation is still in progress. To make sure that does not happen, one can use the LOCK prefix with the target instruction; the prefix ensures that the bus is locked for the duration of the instruction. It can be used only with certain instructions that work the way explained above (i.e. read-update-write).

The XCHG instruction always asserts the #LOCK signal behind the scenes, even if you use it without an explicit LOCK prefix.

To release the lock we do the XCHG with an argument of 0

Actually, in spinlock implementation that you have described this part can be done with a simple MOV - under such an implementation bus locking is needed only for lock acquisition…

Anton Bassov
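Anton's spinlock shape - a locked exchange to acquire, a plain store to release - can be sketched with C11 atomics as a portable stand-in for InterlockedExchange/XCHG (the names here are illustrative, not a real kernel API):

```c
#include <stdatomic.h>

/* Minimal sketch of the spinlock described above, using C11 atomics
   as a stand-in for InterlockedExchange/XCHG. Illustrative names. */
typedef struct { atomic_int locked; } spinlock_t;

static void spin_acquire(spinlock_t *l)
{
    /* atomic_exchange maps to XCHG on x86, which locks the bus/cache
       line for the duration of the read-modify-write. */
    while (atomic_exchange(&l->locked, 1) != 0)
        ;  /* spin until the old value was 0, i.e. we took the lock */
}

static void spin_release(spinlock_t *l)
{
    /* Release needs no bus lock: a plain store with release ordering
       is enough, exactly as noted above. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

On x86 the release store compiles to an ordinary MOV; only the acquisition path pays for the locked transaction.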

The Interlocked… functions are very special, as they are a machine
architecture feature and an OS API at the same time.

The intent for these functions is to implement them inline; however,
this often requires more than a single CPU instruction (with an
obvious prologue).

Doron wrote a great blog on how these intrinsics
are composed:

http://blogs.msdn.com/b/doronh/archive/2006/11/10/the-case-of-amazingly-morphing-intrinsic-function.aspx

The GCC compiler has a similar set of intrinsics, but they aren't
documented as completely as in MSDN.

– pa

In this case, as in every other case, a compiler intrinsic is merely a way for
your code to be compiled to a binary that does not include the instruction
sequence for a function call to an external library. The code may be
condensed to an inline operation, or converted to a local function call
instead, depending on the compiler and the intrinsic.

In the specific case of the various interlocked operations, as others have
intimated, the API relies on features of the hardware to ensure atomicity of
the operation and the compiler intrinsic does the same for a specific
platform. IMHO this paradigm is erroneously called 'lock free', but really
it is hardware lock based instead of software lock based. Note that nearly
all software locks rely on hardware locks of some description, so what this
paradigm really involves is using the constructs usually used to implement a
software lock to directly manipulate a logical construct (i.e. a linked list
or other data structure).
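The "directly manipulate the data structure with the hardware lock" idea can be sketched as a compare-and-swap push onto a singly linked list - the same shape as the Windows SList primitives, here written with C11 atomics (hypothetical names, not the real API):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sketch: push a node onto a shared stack using CAS instead of a
   software lock. The hardware-locked CMPXCHG retries the link step
   if another CPU pushed in between. Illustrative names only. */
struct node { struct node *next; int value; };

static void push(struct node *_Atomic *head, struct node *n)
{
    struct node *old = atomic_load(head);
    do {
        n->next = old;  /* link to the head we observed */
        /* CAS: install n only if head is still 'old'; otherwise
           'old' is refreshed and we retry. */
    } while (!atomic_compare_exchange_weak(head, &old, n));
}
```

Pop is harder (the well-known ABA problem), which is why the real SList implementation carries a sequence number alongside the pointer.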

Note that cache-aware versions of these algorithms are the subject of
conjectural patents in jurisdictions where such patents are possible, but
are mostly treated as either prior art or not infringing in all sane
countries (i.e. those not currently having a problem with a debt ceiling).

If you wish to investigate further, there is a significant amount of
publicized research on the topic and there won’t be any lack of pointers to
good resources from this group


It is interesting, and subtle. But important in the face of multi-processor systems and caches.

In the latter case …

For the first part, it is better to return the exact old value …

-pro


Particularly when the underlying system is NUMA…


Interlocked instructions are not required to do locking. As a homework
exercise in my operating systems course in 1967, we had to implement
interprocessor/interthread locks in an ordinary programming language (in
those days it was Algol; today it would be C). None of us got the right
answer, but the professor (a little-known new faculty member, one David L.
Parnas) presented the correct answer, complete with a proof of
correctness.

By the way, the solution is sufficiently complex and inefficient that it
makes it obvious why you need interlocked instructions.
joe


Sure Joe …

Even in the late 70s and early 80s, it was an exercise to do the atomic ops for machines that don't have test/set ops…

Also, you reminded me of those old heroes: O.J. Dahl, Parnas, Hoare, Dijkstra, to name a few, for the advancement of CS & Soft Engg…

-pro


> IMHO this paradigm is erroneously called 'lock free', but really it is
> hardware lock based instead of software lock based.

In order to realize how erroneous calling these functions lock-free is, you just have to keep in mind that not all architectures that support MP provide exhaustive sets of operations that may use bus locking.
Don't forget that on many architectures (known as load-and-store ones) you cannot perform arithmetic and logic operations directly on memory operands - instead, you have to load the operands into registers, perform the operation in registers, and then store the results to memory.

It is (hopefully) obvious that you cannot implement functions like InterlockedIncrement() on such an architecture in one go, because a 'LOCK INC [memory]'-like operation is simply impossible on it. Instead, you have to first obtain a test-and-set-based lock, then update the target variable, and then release the lock…

Anton Bassov
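Anton's lock-then-update-then-unlock sequence for a load/store architecture can be sketched in C - here `atomic_flag_test_and_set` plays the role of the architecture's one primitive RMW operation, and everything else is plain loads and stores (illustrative names; not how Windows actually implements it):

```c
#include <stdatomic.h>

/* Sketch: emulating InterlockedIncrement on an architecture that
   only offers test-and-set. The increment itself is an ordinary
   load/modify/store, made safe by the surrounding lock. */
static atomic_flag guard = ATOMIC_FLAG_INIT;

static long interlocked_increment_emulated(long *p)
{
    long result;
    while (atomic_flag_test_and_set(&guard))  /* acquire test-and-set lock */
        ;                                     /* spin */
    result = ++*p;                            /* plain RMW under the lock */
    atomic_flag_clear(&guard);                /* release */
    return result;
}
```

Like InterlockedIncrement, this returns the incremented value, but it clearly rests on a lock rather than a single atomic instruction.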

In the past, many architectures supported what was called
“read-pause-write”, the equivalent of LOCK, on certain instructions. For
example, on the PDP-11 (and, consequently, almost certainly on the VAX),
INC and DEC were always done with read-pause-write. The problem with this
is that it wastes memory bandwidth, hence the optional LOCK prefix on the
x86 family. It is also nice that the LOCK prefix can apply to many
different instructions.

There is no such thing as a “lock free” algorithm, although some limited
cases (a circular buffer accessed by at most two threads comes to mind)
can be made to work without explicit locking. However, like the “locked
access” example we had to do in Algol, an algorithm that requires
simultaneous access must have some known invariant. The circular buffer
case usually works because the “head” is accessed by exactly one thread
and the “tail” is accessed by exactly one thread, meaning that locking was
not needed. But the “lock-free” algorithms I’ve seen always
seem to depend at some level on instruction atomicity, or are so
incredibly fragile that attempts to adopt them usually lead to some error
that invalidates the complex set of preconditions that make them work.
And all too often, it is an attempt to generalize a two-thread-safe model
to an N-thread-safe model. I’ve seen a few of these spectacular failures,
caused by people who felt that those synchronization primitives like spin
locks, mutexes, and semaphores were “too complicated” and they had a
“simple solution”. Yeah, right.
joe
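The two-thread circular buffer Joe describes can be sketched as a single-producer/single-consumer ring: only the producer writes `head`, only the consumer writes `tail`, so no interlocked RMW is needed, just ordered loads and stores. (A sketch under those invariants; the names and the size of 8 are arbitrary.)

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 8u   /* power of two, so free-running indices wrap cleanly */

struct ring {
    int buf[RING_SIZE];
    atomic_uint head;  /* written by the producer only */
    atomic_uint tail;  /* written by the consumer only */
};

static bool ring_put(struct ring *r, int v)
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == RING_SIZE)
        return false;                       /* full */
    r->buf[h % RING_SIZE] = v;
    /* release: the data store above becomes visible before the index */
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}

static bool ring_get(struct ring *r, int *v)
{
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (h == t)
        return false;                       /* empty */
    *v = r->buf[t % RING_SIZE];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}
```

The invariant (exactly one writer per index) is precisely what makes this work; generalizing it to N producers or N consumers breaks it, which is Joe's point.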


This problem with high contention on busses is one of the reasons that
queued spin locks were invented. And it is worth studying why a “spin
lock” should include a PAUSE instruction to reduce power consumption and
bus contention.
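The spin-with-PAUSE pattern mentioned above is typically a "test, then test-and-set" loop: re-test with a plain load (no bus lock) and only retry the locked exchange when the lock looks free, with PAUSE throttling the inner loop. A sketch, assuming an x86 target (`_mm_pause`) and illustrative names:

```c
#include <stdatomic.h>
#include <immintrin.h>   /* _mm_pause on x86 */

/* Sketch of a test-and-test-and-set spin: the locked exchange is
   attempted only when a plain read says the lock is free, so the
   bus-locked transaction is not hammered while spinning. */
static void spin_acquire_ttas(atomic_int *lock)
{
    for (;;) {
        if (atomic_exchange(lock, 1) == 0)
            return;                         /* acquired */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            _mm_pause();                    /* cut power use and bus traffic */
    }
}
```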

The memory system runs asynchronously with the CPU, and is mediated by the
multilevel caching system and cache interlock logic as well. The “cost”
of accessing a memory location will depend on a huge number of conditions.
If the lock is in a cached location, it might take as little as 0 clock
cycles to read it. If the lock is in a cached location that has been
modified in another cache, it can take hundreds of clock cycles, depending
on the relative speeds of the CPU clock and the memory system. After
that, if there’s a cache miss, or some cache has a modified copy of the
location being accessed, the costs can only go up. The bus is locked for
the entire cycle of the instruction, however long that takes.
joe

Your notion of how it works in some fixed number of “clock cycles” is
what’s causing your confusion. The instruction in question may take many
(for a very loose definition of many) cycles, depending on how much
contention there is on the bus. If you had 16 cores trying to do the same
thing on the same address simultaneously, the last one to get the bus is
going to take many more cycles than the first one.

The information below should help you comprehend this.

Phil


InterlockedExchange is implemented with lock prefix, which is part of x86
instruction set. You can read about the prefix in Intel or AMD manuals.

Execution of ‘locked’ instruction guarantees that no other entity (another
CPU, core and probably DMA controllers) can access the memory address
during execution of the instruction.

So, having instruction ‘inc dword [SomeAddress]’ (which is of
‘read-modify-write’) executing in parallel on two cores, the following
sequence may occur:

  1. CPU0 reads memory content to internal register.
  2. CPU1 reads memory content to internal register.
  3. CPU0 increments internal register.
  4. CPU1 increments internal register.
  5. CPU0 writes memory content from internal register.
  6. CPU1 writes memory content from internal register.

So we have the variable incremented only once.

Having instruction ‘lock inc dword [SomeAddress]’ the sequence will look
like this:

  1. CPU0 reads memory content to internal register.
  2. CPU0 increments internal register.
  3. CPU0 writes memory content from internal register.
  4. CPU1 reads memory content to internal register.
  5. CPU1 increments internal register.
  6. CPU1 writes memory content from internal register.

Which is what you wanted to do.

Not all instructions are allowed to have the prefix, and some instructions
have it implicitly. I also don’t remember the granularity of the lock.
The manuals should contain all the details.
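At the C level, the interleaving the two sequences above contrast is exactly what an interlocked increment rules out: the read, add, and write become one indivisible bus/cache transaction (LOCK INC / LOCK XADD on x86). A sketch using C11 atomics as a stand-in for InterlockedIncrement - not the Windows implementation itself:

```c
#include <stdatomic.h>

static atomic_int counter;

/* Returns the value after the increment, like InterlockedIncrement.
   atomic_fetch_add compiles to a locked XADD on x86, so two CPUs
   executing this concurrently can never lose an increment. */
static int locked_inc(void)
{
    return atomic_fetch_add(&counter, 1) + 1;
}
```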



> There is no such thing as a “lock free” algorithm, although some limited
> cases (a circular buffer accessed by at most two threads comes to mind)
> can be made to work without explicit locking.

Well, assuming that your target arch supports interlocked increment atomic instruction, a circular buffer can be made to work with any number of threads/CPUs without any additional explicit locking - just make sure that its size is some power of 2, and that’s it…

Anton Bassov
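Anton's power-of-two remark can be illustrated: if the buffer size divides 2^32, a ticket taken with an interlocked increment reduces to a slot index with a mask, and this stays consistent even when the 32-bit counter wraps. (A sketch with hypothetical names; a complete multi-consumer ring needs more machinery than this.)

```c
#include <stdatomic.h>

#define SLOTS 16u   /* must be a power of two */

static atomic_uint next_ticket;

/* Each caller atomically claims a unique ticket and maps it to a slot.
   The mask works across the 2^32 wrap of the counter only because
   SLOTS divides 2^32 - which is why the size must be a power of two. */
static unsigned claim_slot(void)
{
    unsigned t = atomic_fetch_add(&next_ticket, 1);  /* interlocked increment */
    return t & (SLOTS - 1u);
}
```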

>> There is no such thing as a “lock free” algorithm, although some limited
>> cases (a circular buffer accessed by at most two threads comes to mind)
>> can be made to work without explicit locking.
>
> Well, assuming that your target arch supports an interlocked increment
> atomic instruction, a circular buffer can be made to work with any number
> of threads/CPUs without any additional explicit locking - just make sure
> that its size is some power of 2, and that’s it…
>
> Anton Bassov

But the discussion was, are there “lock free” algorithms for architectures
that don’t have hardware interlocking? If you use hardware interlocking,
then you can’t call it “lock free”.



> But the discussion was, are there “lock free” algorithms for architectures
> that don’t have hardware interlocking.

OK…

Therefore, let me try to solve the problem that your professor presented to you - let me try to implement a spinlock on such an arch, and do it in plain C…

Our lock variable has to be initialized to zero. Now let’s acquire a lock:

void get_lock(unsigned long *lock)
{
    unsigned long temp1 = 0, temp2 = 0;

    while (1)
    {
        while (1)
        {
            temp1 = *lock;
            (*lock)++;                    /* increment the value, not the pointer */
            temp2 = *lock;
            if (temp2 == 1 + temp1) break;
            (*lock)--;                    /* someone else bumped it too - back off */
        }
        if (temp2 == 1) break;            /* we were the only incrementer */
        (*lock)--;
    }
}

void unlock(unsigned long *lock)
{
    *lock = 0;
}

What do you think about this code (apart from the obvious efficiency issues, of course)?

Anton Bassov