Atomic Memory Operations

RsQUE · June 30, 2025, 5:23pm

Hello.
How can I do atomic memcpy and memset operations in x64?
I can't use Interlocked* because the number of bytes I want to do an atomic operation on exceeds 64 bytes.
Any ideas?

Tim_Roberts · June 30, 2025, 6:14pm

Atomic in what sense? If you have a data structure accessed in multiple places that you do not want to leave in an inconsistent state, then you must do your own interlocking, using a spin lock or a critical section.

MBond2 · June 30, 2025, 9:57pm

Generally, when people say 'atomic' or 'lock free' what they mean is that the memory access is guarded by a hardware level lock.

Obviously, this is only useful when a block of memory is accessed by multiple threads (CPUs) concurrently. In the old days, this used to be implemented via a bus lock. For the last couple of decades, it has been implemented via the coherency protocol. Memory fences allow single interlocked operations to protect other memory accesses, etc. This is a complex topic.

If the available hardware primitives are too small to handle the size of memory block you need to work on, then you have no choice except to use a software level lock. All software level locks are built on top of the hardware level ones.
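To make that last point concrete, here is a minimal sketch in C++ of a software lock built on one small hardware-level atomic and used to guard a 128-byte block. The type `GuardedBlock` and its member names are hypothetical, chosen only for illustration. Real driver or kernel code would normally use the platform's own spin lock or critical section rather than a hand-rolled one.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical type for illustration: a 128-byte block guarded by a
// small spin lock built on an atomic exchange.
struct GuardedBlock {
    static constexpr std::size_t kSize = 128;

    std::atomic<bool> locked{false};   // the small hardware-level atomic
    unsigned char bytes[kSize] = {};   // the larger region it protects

    void acquire() {
        // exchange() is one atomic read-modify-write. Acquire ordering keeps
        // the guarded accesses from being reordered before the lock is taken.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // Busy-wait. Real code would add a pause or back-off here.
        }
    }

    void release() {
        // Release ordering publishes the guarded writes before the unlock.
        locked.store(false, std::memory_order_release);
    }

    // "Atomic" memset: no other lock holder can observe a partial fill.
    void set(int value) {
        acquire();
        std::memset(bytes, value, kSize);
        release();
    }

    // "Atomic" memcpy into the block. src must point to at least kSize bytes.
    void write(const void* src) {
        assert(src != nullptr);
        acquire();
        std::memcpy(bytes, src, kSize);
        release();
    }

    // "Atomic" memcpy out of the block. dst must have room for kSize bytes.
    void read(void* dst) {
        assert(dst != nullptr);
        acquire();
        std::memcpy(dst, bytes, kSize);
        release();
    }
};
```

The copy is only atomic with respect to code that also takes the lock. Any access that bypasses `acquire()` and `release()` can still see a half-written block.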

If you are worried about performance, don't discount the effectiveness of the memory coherency protocols for memory accesses that don't conflict. And transactional memory solutions can be even better. Sometimes a lot better.

But there is far too much to say on this topic for a single post, so if you want better help, a more specific question would help you to get it faster.

pulso · July 23, 2025, 11:51pm

Thanks for the detailed explanation, really helpful. I was wondering, are there any common patterns or best practices for implementing software-level atomic memcpy/memset for, say, 128 bytes? Especially in performance-critical code?

MBond2 · July 24, 2025, 12:24am

If there is no hardware primitive large enough, then the choices are:
1 - use a lock. A lock will use a small integer and a memory fence to guard access to your larger region of memory

2 - use hardware transactional memory. This requires a CPU that supports it, and there are lots of models that don't. These are special CPU instructions that allow your code to interact with the memory coherency protocol, make optimistic changes, and then be informed when another CPU has performed a conflicting operation. Then your code either re-does or un-does the work it was trying to do

3 - use a software transactional memory scheme. This requires no special hardware, and uses locks and versioning to perform what hardware transactional memory does
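As a rough illustration of the versioning idea in option 3, here is a sketch of a sequence counter (often called a seqlock). It is much simpler than a general software transactional memory scheme, and it rests on assumptions that are not from this thread: there is a single writer (or writers are serialized by a separate lock), and the 128 bytes are stored as sixteen 64-bit words. The type `VersionedBlock` and its member names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical type for illustration: 128 bytes published under a version
// counter. An odd version means a write is in progress.
struct VersionedBlock {
    static constexpr std::size_t kWords = 16;  // 16 * 8 bytes = 128 bytes

    std::atomic<std::uint64_t> version{0};
    std::atomic<std::uint64_t> words[kWords];

    VersionedBlock() {
        for (auto& w : words) w.store(0, std::memory_order_relaxed);
    }

    // Assumes a single writer at a time.
    void write(const std::uint64_t (&src)[kWords]) {
        std::uint64_t v = version.load(std::memory_order_relaxed);
        assert((v & 1) == 0);  // would fire if another write were in progress
        version.store(v + 1, std::memory_order_relaxed);  // mark "writing"
        // Keep the data stores below from moving above the odd version.
        std::atomic_thread_fence(std::memory_order_release);
        for (std::size_t i = 0; i < kWords; ++i) {
            words[i].store(src[i], std::memory_order_relaxed);
        }
        // Publish: the data stores are visible before the even version.
        version.store(v + 2, std::memory_order_release);
    }

    // Optimistic read. Returns false if a write overlapped the copy, in
    // which case dst may hold a mix of old and new words and must be ignored.
    bool try_read(std::uint64_t (&dst)[kWords]) const {
        std::uint64_t before = version.load(std::memory_order_acquire);
        if (before & 1) return false;  // writer is mid-update
        for (std::size_t i = 0; i < kWords; ++i) {
            dst[i] = words[i].load(std::memory_order_relaxed);
        }
        // Keep the data loads above from moving below the second check.
        std::atomic_thread_fence(std::memory_order_acquire);
        std::uint64_t after = version.load(std::memory_order_relaxed);
        return before == after;
    }

    // Retry until a consistent snapshot is obtained.
    void read(std::uint64_t (&dst)[kWords]) const {
        while (!try_read(dst)) {
            // A conflicting write was detected, so re-do the copy.
        }
    }
};
```

Readers never block the writer here. A reader that loses a race just re-does its copy, which is the same re-do behavior described for transactional memory, so this only pays off when writes are rare compared to reads.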

The choice is almost always to use a standard lock. If your hardware supports it, hardware transactional memory can improve the results. The improvement grows as your memory access pattern gets larger and more complex, provided there are few actual conflicts. A lot of conflicts will make this scale negatively. The same applies to software transactional schemes.

For 128 bytes, a standard lock is almost certainly the right choice.
