Reading of a variable managed atomically for write


I have come across the following post, which concludes that reading a variable that is being managed atomically for write need not be atomic:

[1] https://community.osr.com/discussion/comment/292626e

But suppose there is a variable in a structure which is n-byte aligned (n < x/8 on an x-bit architecture machine), and the memory access might take more than one cycle…

Am I missing something that is already understood, such as that no such alignment is possible with these APIs? …

I believe this is trivial stuff for experienced developers, but not for me, as I see no mention of alignment in the following links…:

[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/nf-wdm-interlockedcompareexchange

[3] https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/nf-wdm-interlockedincrement

I have come across some relevant information here:

[4] https://stackoverflow.com/questions/1415256/alignment-requirements-for-atomic-x86-instructions

But it apparently does not cover the case where the value of n is not an even number either.

Note: Even though URL [4] is almost a decade old, it was still the most relevant discussion I could find in my limited search. Also, I am not sure whether there were any further updated discussions in the context of the main URL [1].

Thanks
Sudhakar

Contrary to certain statements in the MSFT documentation, alignment of any kind is not required for Interlocked intrinsics. Not only can you mis-align the parameters by byte, the operands can even cross a cache line. I won’t repeat the same cut/paste from the SO thread you cited, and my access to the Internet at the moment is very slow… but you can find the relevant section on your own in the Intel processor manual.

It would be a pretty lame processor that DID require this alignment, but let the instruction(s) execute regardless.

Having said all that, that does not mean that unaligned operations perform well. And that’s (one reason) why fields are aligned by default on their “natural” boundaries in structures on Windows.
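As a rough illustration of the "natural boundary" point, the following sketch compares a default-layout structure with a packed one. The struct and function names are hypothetical, and the offsets noted in the comments assume a common ABI where `int32_t` has 4-byte alignment and the compiler honors `#pragma pack`.

```c
#include <stddef.h>
#include <stdint.h>

/* Default layout: the compiler inserts padding after `flag`, so
 * `counter` lands on its natural 4-byte boundary (assumed ABI). */
struct natural_layout {
    uint8_t flag;
    int32_t counter;
};

/* Packed layout: padding is removed, so `counter` starts at offset 1
 * and is no longer naturally aligned. */
#pragma pack(push, 1)
struct packed_layout {
    uint8_t flag;
    int32_t counter;
};
#pragma pack(pop)

/* Returns nonzero when a field of `size` bytes placed at `offset`
 * sits on its natural boundary (offset is a multiple of size). */
static int offset_is_natural(size_t offset, size_t size)
{
    return offset % size == 0;
}
```

A check such as `offset_is_natural(offsetof(struct packed_layout, counter), sizeof(int32_t))` is one way to notice, in a debug build, that a field intended for interlocked access has lost its default alignment.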

Peter

Note that while Windows does allocate pool on an 8-byte alignment, if
you use any headers within structures, or your own allocation helpers,
you might misalign the fields!

I did come across this, when I forgot about the above, and my
allocation helpers added sizeof(size_t) on x86 (and returned the
ExAllocatePool-ed address + 4 bytes!). __int64
(LARGE_INTEGER->QuadPart for example) is misaligned as a result.
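One way to avoid the pitfall described above is to round the helper's header up to the allocator's alignment before handing out the payload address. The sketch below is a user-mode stand-in, not driver code: it uses `malloc` in place of the pool allocator, and `alloc_header`, `PAYLOAD_ALIGN`, `helper_alloc`, and `helper_free` are hypothetical names. It assumes the underlying allocator returns memory aligned to at least 8 bytes.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment of the underlying allocator (8 bytes, matching
 * the pool alignment mentioned in the post). */
#define PAYLOAD_ALIGN 8u

/* Hypothetical bookkeeping header placed in front of each block. */
typedef struct {
    size_t size;
} alloc_header;

/* Header size rounded up to PAYLOAD_ALIGN, so the payload that
 * follows keeps the allocator's alignment even when
 * sizeof(size_t) is 4. */
static size_t padded_header_size(void)
{
    return (sizeof(alloc_header) + PAYLOAD_ALIGN - 1u)
           & ~(size_t)(PAYLOAD_ALIGN - 1u);
}

/* Allocates `size` payload bytes behind a padded header. */
static void *helper_alloc(size_t size)
{
    unsigned char *raw = malloc(padded_header_size() + size);
    if (raw == NULL) {
        return NULL;
    }
    ((alloc_header *)raw)->size = size;
    return raw + padded_header_size();
}

/* Frees a block previously returned by helper_alloc. */
static void helper_free(void *payload)
{
    if (payload != NULL) {
        free((unsigned char *)payload - padded_header_size());
    }
}

/* Returns nonzero when `p` is a multiple of `alignment`. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

With the unpadded variant (returning the raw address plus `sizeof(size_t)` on a 32-bit build), a 64-bit field at the start of the payload would sit on a 4-byte boundary, which is the misalignment the post describes.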

I never checked if x64 aligns on a 16-byte boundary, but that might be
a good thing to check if you wish to use non-interlocked 128-bits ops!

Having said all that, that does not mean that unaligned operations perform
well. And that’s (one reason) why fields are aligned by default on their
“natural” boundaries in structures on Windows.

Contrary to certain statements in the MSFT documentation, alignment of any kind is not required for Interlocked intrinsics.

This boils down to the difference between Windows’s abstract memory model, and the memory model of all the processors that Windows currently runs on. The abstract memory model is necessarily at least as restrictive as the union of all the processors restrictions.

In this case, Windows currently ships on processors that require alignment for atomic integer access: ARM32 and ARM64 have atomic access only to aligned memory locations. (See, e.g., armarm A3.5.3 for v7, or B2.2.1 for v8).

It’s rare that you’re actually writing code against the memory model of a specific Intel processor: you’re more likely writing code against the memory model of the software platform below you. Likewise, if you write Android apps, you don’t write to ARM’s memory model; you write to the Java memory model.

Contrary to certain statements in the MSFT documentation, alignment of any kind is not required for Interlocked intrinsics. Not only can you mis-align the parameters by byte, the operands can even cross a cache line.

Although there is, indeed, no requirement for interlocked OPs to be properly aligned because they are guaranteed to be atomic anyway, I believe you may still potentially get into trouble if you read the misaligned target variable with a simple MOV instruction. The problem is that, according to Intel manuals, only properly-aligned “simple” reads and writes (which means the ones that don’t cross the cache line, at least as far as reasonably modern processors are concerned) are guaranteed to be atomic. This is what makes them different from interlocked operations, which are guaranteed to be atomic no matter what and no matter how.

Now consider the following scenario. CPU A starts reading a variable that crosses the cache line. This operation cannot be done atomically, right? Therefore, it requires at least two “half-reads” with a certain time slot between them. If CPU B does an interlocked operation on the target variable and gets in between these “half-reads”, CPU A is more than likely to get inconsistent results that don’t correspond to the state of the target variable either before or after the update. This problem cannot arise as long as a “simple” read is atomic (i.e. is properly aligned).
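The scenario above can be modelled with a small, deterministic, single-threaded sketch. It does not exercise real hardware or real concurrency; it only shows the arithmetic of a read that takes its first half before an update and its second half after it. The function name and the split into 32-bit halves of a 64-bit value are assumptions made for illustration.

```c
#include <stdint.h>

/* Models a non-atomic 64-bit read performed as two 32-bit
 * "half-reads": the low half is taken from the value before another
 * CPU's update, and the high half from the value after it. */
static uint64_t torn_read_simulation(uint64_t before, uint64_t after)
{
    uint32_t low  = (uint32_t)before;          /* first half-read: old state  */
    uint32_t high = (uint32_t)(after >> 32);   /* second half-read: new state */
    return ((uint64_t)high << 32) | low;
}
```

For example, if the variable holds 0x00000000FFFFFFFF and the other CPU atomically increments it to 0x0000000100000000, the simulated reader assembles 0x00000001FFFFFFFF, a value the variable never held.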

Having said all that, that does not mean that unaligned operations perform well.

As you can see, it may well go beyond performance issues…

Anton Bassov

I believe you may still potentially get into trouble if you read the misaligned target variable with a simple MOV instruction.

Sure. But that’s not what the OP is asking about.

The abstract memory model is necessarily at least as restrictive as the union of all the processors restrictions.

Agreed.

ARM32 and ARM64 have atomic access only to aligned memory locations. (See, e.g., armarm A3.5.3 for v7, or B2.2.1 for v8).

I didn’t know that, so thanks (again) Mr. Tippet!

B2.2.1 of the ARM ARM talks about “ordinary” accesses, but B2.5.2 speaks to this… and very clearly.

All Load-Exclusive/Store-Exclusive, Load-Acquire/Store-Release, and Compare and Swap memory accesses that access a single element or a pair of elements generate an Alignment fault if the address being accessed is not aligned to the size of the data structure being accessed. A failed alignment check results in an Alignment fault, which is taken as a Data Abort exception

So… Good news: When alignment IS required, the processor will not “do the wrong thing” and let you merrily go on your way.

And thanks again to Mr. Tippet, for keeping us (er, ME) on our (er, MY) toes.

P

Sure. But that’s not what the OP is asking about.

Please look at his original question carefully

[begin quote]

I have come across the following post, which concludes that reading a variable that is being managed atomically for write need not be atomic:

[1] https://community.osr.com/discussion/comment/292626e

But suppose there is a variable in a structure which is n-byte aligned (n < x/8 on an x-bit architecture machine), and the memory access might take more than one cycle…

[end quote]

In other words, this is EXACTLY what his original question seems (at least to me) to be all about - he refers to our statements about the “simple read - interlocked_op” interactions, and wonders whether these interactions may be different if the target variable happens to be misaligned…

Anton Bassov

I will go through the latest manual. Meanwhile, after looking at the discussion so far, I take it that “Reading of a variable managed atomically for write” is not possible without a lock unless the variable is aligned on an even boundary, as it appears.

Thanks for pointing out the subtle difference here: coding for a software platform vs. coding for a processor. At least, I often consider only the x86 architecture when coding. I can now make more sense of why “InterlockedCompareExchange is atomic only with respect to other InterlockedXxx calls” is mentioned in URL [2] of the first post, along with similar statements in other InterlockedXxx docs. So it is better to perform reads with those functions as well. That is, when coding for the software platform, the right, future-proof way of reading a variable that is managed atomically for write using any of the InterlockedXxx() calls is to call something like InterlockedAnd() with …
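The idea of reading through an atomic read-modify-write operation can be sketched portably with C11 `<stdatomic.h>`. This is a stand-in, not the Windows API: it uses a compare-exchange against 0 rather than the InterlockedXxx family, and the function name `read_via_rmw` is hypothetical. How closely it maps onto a particular InterlockedXxx call is an assumption to verify against the platform documentation.

```c
#include <stdatomic.h>

/* Reads *target through an atomic compare-exchange instead of a
 * plain load. If the value is 0, the exchange stores 0 again, which
 * leaves the value unchanged; otherwise the exchange fails and the
 * current value is copied into `expected`. Either way, `expected`
 * ends up holding the value that was observed atomically. */
static long read_via_rmw(atomic_long *target)
{
    long expected = 0;
    atomic_compare_exchange_strong(target, &expected, 0);
    return expected;
}
```

The trade-off is that every such read is a full read-modify-write on the variable, which costs more than a plain aligned load, so it is mostly of interest when the platform's memory model gives no guarantee for the plain read.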

Btw, yes, it was the corrupted value of the variable that I was referring to, and I was indicating the same when I mentioned alignment at a non-even number, since at even numbers access seems to be guaranteed by the processor.

thanks
sudhakar