There seems to be a lot of confusion about the significance of using std::atomic<T> over the much more straightforward volatile T or plain T. In the context of concurrency, understanding the difference between the three cases is a fundamental cornerstone of ensuring correctness.
Consider the following sample:
#include <atomic>
#include <cstdint>
#include <thread>

int main()
{
    int64_t shared = 0;
    std::thread thread([&shared]() {
        shared = 42;
    });
    while (shared != 42) {
    }
    thread.join();
    return 0;
}
The code uses neither volatile nor std::atomic at this point and shares an integer variable shared with another thread. The expected order of events is as follows:
- Thread A initializes shared with 0
- Thread B sets shared to 42
- Thread A busy-waits until shared is 42
The code is simple and works as expected with GCC and default options (corresponding to -O0). The test seems to indicate that the sample is correct, but it is not: with -O3, the sample hangs indefinitely. Similar code could easily end up in production and might even run flawlessly for years.
To get to the root cause, we have to inspect the generated assembly. First, the GCC 4.8.5-based x86_64 assembly corresponding to the working binary:
// Thread B:
// shared = 42;
    movq -8(%rbp), %rax
    movq (%rax), %rax
    movq $42, (%rax)

// Thread A:
// while (shared != 42) {
// }
.L11:
    movq -32(%rbp), %rax
    cmpq $42, %rax
    jne .L11
Thread B executes a simple store of the value 42 in shared. Thread A reads shared on each loop iteration until the comparison indicates equality.
Now, we compare that to the -O3 outcome:
// Thread B:
// shared = 42;
    movq 8(%rdi), %rax
    movq $42, (%rax)

// Thread A:
// while (shared != 42) {
// }
    cmpq $42, (%rsp)
    je .L87
.L88:
    jmp .L88
.L87:
Optimizations associated with -O3 replaced the loop with a single comparison and, if the values are not equal, an infinite loop that matches the expected behavior. With GCC 10.2, the loop is optimized out entirely.
The problem is that the compiler and its optimizer are not aware of the implementation’s concurrency implications. Consequently, the optimizer concludes that shared cannot change in thread A, which makes the loop equivalent to dead code. Data races are undefined behavior (UB), and the optimizer is allowed to assume that the program doesn’t encounter UB.
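In source terms, the transformation amounts to something like the following sketch (an illustration of what the optimizer is allowed to assume, not actual compiler output; the function name busy_wait is invented):

#include <cstdint>

// Sketch of thread A's loop after optimization: shared is read only once,
// because no UB-free execution could change it concurrently.
void busy_wait(int64_t& shared)
{
    if (shared != 42) {
        for (;;) {
            // never exits
        }
    }
}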
The solution requires us to communicate to the compiler that shared is involved in inter-thread communication. One way to accomplish that may be volatile. While the actual meaning of volatile varies across compilers and guarantees, if any, are compiler-specific, the general consensus is that volatile prevents the compiler from optimizing volatile accesses in terms of register-based caching. This is essential for low-level code that interacts with hardware, and it has its place in concurrent programming, albeit with a downward trend due to the introduction of std::atomic.
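Applied to the sample, the change is limited to the declaration of shared; for clarity, here is the adjusted sample in full:

#include <cstdint>
#include <thread>

int main()
{
    // volatile forbids register-based caching of shared: every access
    // in the loop below must actually be performed.
    volatile int64_t shared = 0;
    std::thread thread([&shared]() {
        shared = 42;
    });
    while (shared != 42) {
    }
    thread.join();
    return 0;
}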
With volatile int64_t shared, the generated instructions change as follows:
// Thread B:
// shared = 42;
    movq 24(%rdi), %rax
    movq $42, (%rax)

// Thread A:
// while (shared != 42) {
// }
.L87:
    movq 8(%rsp), %rax
    cmpq $42, %rax
    jne .L87
The loop can no longer be eliminated, as it must be assumed that shared changed even though there’s no evidence of that in the code itself. As a result, the sample now works with -O3.
If volatile fixes the issue, why would you ever need std::atomic? Two aspects relevant for lock-free code make std::atomic essential: atomicity of memory operations and memory order.
To build the case for load/store atomicity, we review the generated assembly compiled with -m32 (the 32-bit version):
// Thread B:
// shared = 42;
    movl 4(%esp), %eax
    movl 12(%eax), %eax
    movl $42, (%eax)
    movl $0, 4(%eax)

// Thread A:
// while (shared != 42) {
// }
.L88:
    movl 40(%esp), %eax
    movl 44(%esp), %edx
    xorl $42, %eax
    movl %eax, %ecx
    orl %edx, %ecx
    jne .L88
For 32-bit x86 code generation, 64-bit loads and stores are usually split into two instructions. For single-threaded code, this is not an issue. For multi-threaded code, it means that another thread can see a partial result of the 64-bit memory operation: for example, the low 32 bits of the new value combined with the high 32 bits of the old one (a torn read). This leaves room for unexpected inconsistencies that might not cause problems 100 percent of the time but can occur at random, with a probability of occurrence heavily influenced by the surrounding code and software usage patterns. Even if GCC chose to generate instructions that guarantee atomicity by default, that still wouldn’t affect other compilers and might not hold true for all supported platforms.
To guard against partial loads/stores in all circumstances and across all compilers and supported platforms, std::atomic can be employed. Let’s review how std::atomic affects the generated assembly. The updated sample:
#include <atomic>
#include <cstdint>
#include <thread>

int main()
{
    // Note: explicit initialization matters here; before C++20, a
    // default-constructed std::atomic holds an indeterminate value.
    std::atomic<int64_t> shared { 0 };
    std::thread thread([&shared]() {
        shared.store(42, std::memory_order_relaxed);
    });
    while (shared.load(std::memory_order_relaxed) != 42) {
    }
    thread.join();
    return 0;
}
The generated 32-bit assembly based on GCC 10.2:
// Thread B:
// shared.store(42, std::memory_order_relaxed);
    movl $42, %ecx
    xorl %ebx, %ebx
    subl $8, %esp
    movl 16(%esp), %eax
    movl 4(%eax), %eax
    movl %ecx, (%esp)
    movl %ebx, 4(%esp)
    movq (%esp), %xmm0
    movq %xmm0, (%eax)
    addl $8, %esp

// Thread A:
// while (shared.load(std::memory_order_relaxed) != 42) {
// }
.L9:
    movq -16(%ebp), %xmm1
    movq %xmm1, -32(%ebp)
    movl -32(%ebp), %edx
    movl -28(%ebp), %ecx
    movl %edx, %eax
    movl %ecx, %edx
    xorl $42, %eax
    orl %eax, %edx
    jne .L9
To guarantee atomicity for loads and stores, the compiler emits a movq instruction operating on a 128-bit SSE register, which moves all 64 bits in a single instruction. Additionally, the assembly shows that the loop remains intact even though volatile was removed.
By using std::atomic in the sample, it is guaranteed that
- std::atomic loads and stores are not subject to register-based caching
- std::atomic loads and stores do not allow partial values to be observed
Please note that the C++ standard does not explicitly prohibit (register-based) caching of std::atomic operations:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
While that leaves room for interpretation, caching std::atomic loads as triggered in our sample would clearly be a violation: the store might never become visible.
On x86, naturally aligned loads/stores (where the address is a multiple of the load/store size) are atomic up to 8 bytes. For larger sizes and other platforms, the atomicity guarantee might require atomic read-modify-write (RMW) instructions (single-instruction atomic RMWs internally involve a cache-line lock or a bus lock), or even a higher-level locking primitive (e.g. a mutex).
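Whether a given specialization needs such a fallback can be checked directly; here is a minimal sketch, assuming a C++17 compiler (the 16-byte Pair type is invented for this example):

#include <atomic>
#include <cstdint>
#include <cstdio>

// Hypothetical 16-byte type: too large for a single naturally aligned
// 8-byte load/store, so the implementation may have to fall back to
// RMW instructions or a lock.
struct Pair {
    int64_t a;
    int64_t b;
};

int main()
{
    std::atomic<Pair> pair {};

    // Compile time (C++17): true if every object of the specialization
    // is lock-free on the target.
    std::printf("int64_t always lock-free: %d\n",
                std::atomic<int64_t>::is_always_lock_free);
    // Run time: whether this particular object avoids a locking fallback.
    std::printf("Pair lock-free: %d\n", pair.is_lock_free());
    return 0;
}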
What’s the second aspect mentioned above, and what does std::memory_order_relaxed mean?
Both the compiler and the CPU can reorder memory operations to optimize efficiency. The primary constraint of reordering is that all loads and stores must appear to have been executed in the order given by the code (program order). Therefore, in the case of inter-thread communication, the memory order must be taken into account to establish the required ordering despite reordering attempts. The required memory order can be specified for std::atomic loads and stores. std::memory_order_relaxed does not impose any particular order.
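To illustrate what a stronger order buys, here is a minimal sketch of a release/acquire handoff (the names producer, ready, and payload are invented for this example); with std::memory_order_relaxed on both sides, the visibility guarantee noted in the comments would not hold:

#include <atomic>
#include <cstdint>
#include <thread>

int64_t payload = 0;               // plain data, published via the flag
std::atomic<bool> ready { false }; // synchronization flag

int main()
{
    std::thread producer([]() {
        payload = 42; // 1. write the data
        // 2. publish: the release store forbids moving the payload
        //    write past it.
        ready.store(true, std::memory_order_release);
    });
    // The acquire load forbids moving later reads ahead of it and
    // synchronizes with the release store once it observes true.
    while (!ready.load(std::memory_order_acquire)) {
    }
    // At this point, payload == 42 is guaranteed to be visible.
    producer.join();
    return (payload == 42) ? 0 : 1;
}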
Mutual exclusion primitives enforce a specific memory order (acquire-release order) so that memory operations stay in the lock scope and stores executed by previous lock owners are guaranteed to be visible to subsequent lock owners. Thus, all the aspects raised here are addressed simply by using the locking facility. As soon as you break out of the comfort locks provide, you have to be mindful of the consequences and the factors that affect concurrency correctness.
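As an illustration, a lock-based variant of the sample might look like this (a sketch; busy-waiting under a mutex is shown only to mirror the original loop, not as a recommended pattern):

#include <cstdint>
#include <mutex>
#include <thread>

int main()
{
    int64_t shared = 0; // plain variable: the mutex provides both
                        // atomicity and acquire-release ordering
    std::mutex mutex;
    std::thread thread([&]() {
        std::lock_guard<std::mutex> lock(mutex);
        shared = 42;
    });
    for (;;) {
        std::lock_guard<std::mutex> lock(mutex);
        if (shared == 42) {
            break;
        }
    }
    thread.join();
    return 0;
}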
Being as explicit as possible about inter-thread communication is a good starting point, so that the compiler is aware of the load/store context and can generate code accordingly. Whenever possible, prefer std::atomic<T> with std::memory_order_relaxed (unless the scenario calls for a specific memory order) over volatile T (and, of course, plain T). Also, whenever possible, prefer not to roll your own lock-free code, to reduce code complexity and maximize the probability of correctness.