Implementing Volatile Variables!

Hi All,
I just came across a discussion about how volatile variables are implemented. Essentially, a volatile variable is one whose value can change without the running code actually modifying it, for example, from a separate thread or from hardware. I have seen the volatile keyword used in the Interlocked APIs.

https://msdn.microsoft.com/en-us/windows/desktop/ms683614
There are equivalent kernel APIs as well to achieve this.
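For example, the user-mode APIs take their argument as a pointer to a volatile LONG; a minimal sketch (the counter name is just illustrative):

    #include <windows.h>

    volatile LONG g_refCount = 0;   // the Interlocked APIs take a volatile LONG*

    void AddRef(void)  { InterlockedIncrement(&g_refCount); }  // atomic ++, full barrier
    void Release(void) { InterlockedDecrement(&g_refCount); }  // atomic --, full barrier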

My understanding is that volatile variables are never cached in order to achieve this. My questions:

  1. The volatile keyword is in itself a compiler directive. At the time the code is being compiled, the compiler does not know what memory location the variable will be kept in, so it cannot decide where the variable ends up during execution, and therefore cannot decide whether the variable will fall in a cached or an uncached region.

  2. The only option that the compiler has at this point is to add memory barriers around the code that accesses the variable. What this means is that the entire cache line (64 bytes) where this variable falls will be flushed out to memory (or to the next level of coherency), in order to reflect the updated value of the variable in memory.

  3. Now another dimension to this discussion would be the presence of architecture-specific primitives, like the “lock” prefix present in x86 processors. If lock guarantees an atomic increment of the variable in memory, we can do away with memory barriers IF WE CAN KEEP THE VARIABLE in memory, which would largely eliminate the cost of barriers. But the problem is, how does a variable land in an uncached region (the compiler does not know where the uncached region is)?

Please help with insights.
Thanks
Ajitabh

Actually, all volatile indicates is that the compiler needs to access the variable from memory, and not optimize it by storing it in a register or even a temp variable. There are some compilers that insert memory fences, but most do not, recognizing that the cache is coherent between processors on most modern architectures.
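For illustration, a minimal sketch of the register-hoisting problem this guards against (the flag name is made up):

    // Without volatile, the compiler may hoist the load out of the loop
    // and spin forever on a stale register copy.
    volatile bool g_done = false;   // set to true by another thread

    void WaitForFlag(void)
    {
        while (!g_done) {
            // each iteration issues a fresh load from memory;
            // declared as a plain bool, this could legally become while (true)
        }
    }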

As Don says, the language doesn’t legally require volatile to mean much at the hardware level. Historically, Microsoft’s C compiler silently added memory barriers to any access through volatile. That is “nice” because you don’t have to think much about which barrier to use. But it’s also often excessive.

When we ported Windows to ARM32, we discovered that all those automatic barriers were way too much, since full fences on ARM are much more expensive than on x86/x64. So we got the compiler to make this an option: you can either get the automatic barriers, or limit the compiler to the behavior required by the language (which isn’t all that much).

So you have a choice:

  • compile your code on the Microsoft C compiler and use /volatile:ms
  • write explicit barriers in your code everywhere

Personally, I prefer the explicit barriers. My rationale:

  • You’re going to learn more about how processors work if you have to think about where/which barriers are needed.
  • Your code can be faster (especially on ARM).
  • Your code is portable across compilers.

If you do use explicit barriers, please do be consistent across your code. Don’t mix-and-match volatile with explicit barriers. Also, consistently use the same library. The NT kernel offers methods like WriteRelease8 and ReadULongNoFence. (They’re sometimes implemented using volatile, but you can ignore that detail.) The Microsoft compiler offers methods like _InterlockedAdd and _InterlockedExchangePointer. Finally, the C++ standard offers methods like std::atomic<unsigned int>::operator+= and std::atomic<bool>::exchange. Our toolchain doesn’t currently make the C++ stuff available for kernel code, but it should be available eventually.
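As a rough sketch of the C++ flavor (the flag and payload names are mine), here is the usual publish/consume pattern; the same shape could be written against the kernel routines or the compiler intrinsics:

    #include <atomic>

    std::atomic<bool> g_ready{false};
    int g_payload = 0;

    void Producer()
    {
        g_payload = 42;                                  // ordinary write
        g_ready.store(true, std::memory_order_release);  // publish: orders the write before the store
    }

    int Consumer()
    {
        while (!g_ready.load(std::memory_order_acquire)) // pairs with the release store
            ;                                            // spin until published
        return g_payload;                                // guaranteed to observe 42
    }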

To answer your specific questions:

  1. The compiler does not use the cached/noncached flag on the memory page to implement volatile. It is implemented by automatically inserting barriers before/after each access. In general, we don’t use noncached memory for code/data that is handled only on the CPU. Noncached memory is really only used when interfacing with certain types of devices.
  2. Right.
  3. The compiler assumes all memory is cached. (AFAIK, this doesn’t matter, except for some corner cases with misaligned access.) If the processor supports it, the compiler will use tricks to reduce the cost of the barrier. As you mentioned, with certain types of accesses to an aligned, native-sized integer on x64, the compiler might just emit a lock prefix, or, for reads, no prefix at all. (The absence of these sorts of conveniences on ARM is why automatic barriers perform badly there; every access needs an expensive barrier.)
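To make #3 concrete, a rough x64 sketch (the counter name is mine):

    #include <intrin.h>

    volatile long g_counter = 0;

    long SampleAndBump(void)
    {
        long seen = g_counter;              // aligned 32-bit volatile read: a plain MOV, no prefix
        _InterlockedIncrement(&g_counter);  // read-modify-write: emitted with the lock prefix
        return seen;
    }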

The volatile keyword is in itself a compiler directive. At the time the code is being compiled, the compiler does not know what memory location the variable will be kept in, so it cannot decide where the variable ends up during execution, and therefore cannot decide whether the variable will fall in a cached or an uncached region.

To begin with, a compiler and a linker have a VERY good understanding of where variables are going to end up. This is what the executable image file sections are for in the first place. Therefore, had they any interest in placing certain variables in noncached memory, they could easily achieve this goal simply by placing them in a special section of the image file.
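For instance, with the Microsoft compiler it takes just two lines (the section and variable names are, of course, made up):

    // steer a variable into a named image section; nocache is the
    // driver-oriented section attribute that marks it noncacheable
    #pragma section("specdata", read, write, nocache)
    __declspec(allocate("specdata")) volatile int g_special = 0;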

In practical terms, the UC (uncached) caching type is normally used only for memory-mapped devices. If it were used for “conventional” memory, the overall system performance would be simply abysmal. Therefore, the caching type and the “volatile” keyword are totally unrelated to one another.
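That device case is, in fact, the one legitimate place where the two show up together; a sketch (the register address is a made-up placeholder; a real driver would obtain it from its mapped resources):

    // hypothetical memory-mapped device status register
    volatile unsigned int* const statusReg =
        reinterpret_cast<volatile unsigned int*>(0xFEDC0000u);

    void PollUntilReady(void)
    {
        // volatile forces a real load from the device on every iteration
        while ((*statusReg & 0x1u) == 0) {
        }
    }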

As Don pointed out already, the only thing that the “volatile” keyword means is that the target variable has to be read from/written to memory every time it is accessed. This part is defined by the C language specification and, hence, is compulsory for any compiler in existence.
There are no other requirements. Certainly, as Jeffrey pointed out, a compiler may optionally insert memory barriers, but this is not part of the language specification and, hence, may vary across compilers and/or architectures.

Now another dimension to this discussion would be the presence of architecture-specific primitives, like the “lock” prefix present in x86 processors. If lock guarantees an atomic increment of the variable in memory, we can do away with memory barriers IF WE CAN KEEP THE VARIABLE in memory, which would largely eliminate the cost of barriers.

Well, the LOCK prefix just acts as an implicit memory barrier. Every time you use it, you can think of it as having inserted an explicit memory barrier into your code. However, don’t forget that an interlocked op may be significantly more expensive compared to a simple FENCE.
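A quick sketch of this trade-off using the Microsoft intrinsics (the variable name is made up):

    #include <intrin.h>

    volatile long g_value = 0;

    void PublishWithFence(void)
    {
        g_value = 1;    // plain store
        _mm_mfence();   // standalone full barrier: only orders memory accesses
    }

    void PublishWithLockedOp(void)
    {
        _InterlockedExchange(&g_value, 1);  // LOCK XCHG: atomic RMW plus an implicit
                                            // full barrier, potentially costlier than the fence
    }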

But the problem is, how does a variable land in an uncached region (the compiler does not know where the uncached region is)?

As you can see for yourself, this “problem” is a 100% imaginary one…

Anton Bassov

Appreciate this, as always!