For the tl;dr crowd, here is the short version from Scott Meyers' Effective Modern C++:
std::atomic is for data accessed from multiple threads without using mutexes. It's a tool for writing concurrent software.
volatile is for memory where reads and writes should not be optimized away. It's a tool for working with special memory.
volatile is a type qualifier that you can use to declare that an object can be modified by something outside the program's control, such as hardware. It has nothing to do with concurrent programming. However, some compilers, e.g. the Microsoft C++ compiler, have imbued volatile with semantics that make it usable for concurrent software (under /volatile:ms, volatile reads and writes get acquire and release semantics).
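To make "special memory" concrete, here is a minimal sketch of the kind of code volatile is meant for: polling and writing a device register. The register, its address, and the meaning of its bits are all hypothetical; a global variable stands in for the hardware so the example runs on an ordinary machine. volatile only guarantees that every read and write is actually performed in program order; it provides no atomicity or ordering with respect to other threads.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a memory-mapped device register. On real hardware this
// would be a fixed address taken from the device's datasheet, e.g.
//   auto* reg = reinterpret_cast<volatile std::uint32_t*>(0x40000000);  // made-up address
// Here it is an ordinary global so the sketch can run anywhere.
volatile std::uint32_t fake_status_register = 0;

// Poll a (hypothetical) "ready" bit. Because the register is volatile, the compiler
// must perform a fresh load on every iteration instead of reading it once and
// reusing the cached value.
bool wait_for_ready(const volatile std::uint32_t& reg, int max_polls) {
    assert(max_polls >= 0);
    for (int i = 0; i < max_polls; ++i) {
        if (reg & 0x1u) {
            return true;
        }
    }
    return false;
}

// Write a two-step command sequence. Both stores must be emitted, in this order,
// because accesses to a volatile object are part of the program's observable behavior.
void pulse_reset(volatile std::uint32_t& reg) {
    reg = 0x2u;  // hypothetical "assert reset" value
    reg = 0x0u;  // hypothetical "release reset" value
}
```

Without volatile, an optimizer would be free to hoist the load out of the polling loop and to drop the first store in pulse_reset, since neither is visible to ordinary program logic.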
The std::atomic template offers operations that are guaranteed to be seen as atomic by other threads. Once a std::atomic object has been constructed, operations on it behave as if they were inside a mutex-protected critical section. These operations are generally implemented with special machine instructions that are more efficient than a mutex would be.
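As a sketch of what "as if inside a mutex-protected critical section" means, the two counters below give the same guarantee: every increment from every thread is counted. The struct and function names are made up for illustration; fetch_add, std::mutex and std::lock_guard are the standard facilities.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Counter backed by std::atomic: each fetch_add is a single indivisible
// read-modify-write, so no lock is needed.
struct AtomicCounter {
    std::atomic<int> value{0};
    void increment() { value.fetch_add(1); }  // equivalent to ++value
    int get() const { return value.load(); }
};

// Equivalent counter protected by a mutex: the critical section makes the
// read-modify-write indivisible from the other threads' point of view.
struct MutexCounter {
    int value = 0;
    mutable std::mutex m;
    void increment() {
        std::lock_guard<std::mutex> lock(m);
        ++value;
    }
    int get() const {
        std::lock_guard<std::mutex> lock(m);
        return value;
    }
};

// Increment a fresh Counter from two threads, `per_thread` times each, and return the total.
template <typename Counter>
int count_with_two_threads(int per_thread) {
    assert(per_thread >= 0);
    Counter counter;
    auto work = [&counter, per_thread] {
        for (int i = 0; i < per_thread; ++i) counter.increment();
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return counter.get();
}
```

Both count_with_two_threads<AtomicCounter>(n) and count_with_two_threads<MutexCounter>(n) return 2 * n; the atomic version simply avoids taking a lock on every increment.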
I ran some performance tests for std::atomic and volatile, adapting code from Comparing the performance of atomic, spinlock and mutex. The results below were obtained with GCC 4.8.5 (Ubuntu 4.8.5-1ubuntu1~14.04), GCC 5.2.1 (Ubuntu 5.2.1-22ubuntu2), and MSVC 2013 Update 5 on an Intel Core i7-4790 CPU. Without proper locking, the increment and decrement operations on volatile are not atomic, and the output is nondeterministic with both GCC and MSVC 2013. std::atomic is roughly 7 to 9 times slower than volatile with GCC and about 5 times slower with MSVC, but it guarantees the correct result, and the difference is unlikely to matter in practice.
| Type | GCC 4.8.5 on Ubuntu 14.04 VM | GCC 5.2.1 on Ubuntu 15.10 VM | MSVC 2013 Update 5 on Windows 7 |
|---|---|---|---|
| volatile int | 39.3956 ms / 39.988 ms / 42.4279 ms | 58.0358 ms / 58.5282 ms / 58.0955 ms | 100.02 ms / 109.522 ms / 106.021 ms |
| std::atomic<int> | 346.683 ms / 362.175 ms / 314.465 ms | 436.179 ms / 429.875 ms / 439.644 ms | 488.098 ms / 529.106 ms / 533.607 ms |
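Why the volatile counter produces varying results: ++vi is still a load, an add, and a store, and another thread can run between those steps. The sketch below is not the benchmark itself; it replays by hand one possible unlucky interleaving of two increments, single-threaded and deterministic on purpose, with local variables standing in for each thread's registers. In the real benchmark the interleaving depends on scheduling, which is why the output changes from run to run.

```cpp
#include <cassert>

// Replays one interleaving of two threads each executing ++shared.
// volatile forces every load and store to happen, but does not make the
// load-add-store sequence indivisible, so one update is lost.
int lost_update_demo() {
    volatile int shared = 0;

    int reg_a = shared;  // thread A loads 0
    int reg_b = shared;  // thread B loads 0 before A has stored
    reg_a = reg_a + 1;   // A computes 1
    reg_b = reg_b + 1;   // B computes 1
    shared = reg_a;      // A stores 1
    shared = reg_b;      // B stores 1, overwriting A's result

    assert(shared == 1);  // two increments executed, but the counter moved by one
    return shared;
}
```

std::atomic<int> rules this out because ++ai is performed as one indivisible read-modify-write.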
The volatile keyword in C++11 ISO Standard code is to be used only for hardware access; do not use it for inter-thread communication. For inter-thread communication, use mechanisms such as std::atomic<T> from the C++ Standard Library.
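A minimal sketch of inter-thread communication done with std::atomic rather than volatile: one thread writes a value and then sets an atomic "ready" flag with release ordering; the other waits for the flag with acquire ordering and then reads the value. The function name hand_off is hypothetical; store, load, std::memory_order_release and std::memory_order_acquire are standard.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hand a value from a producer thread to the calling thread.
// The release store publishes every write made before it; the acquire load
// that observes `true` is guaranteed to see those writes too.
// Using `volatile bool ready` instead would be a data race (undefined behavior).
int hand_off(int value_to_send) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // synchronization flag

    std::thread producer([&] {
        payload = value_to_send;                       // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });

    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();  // wait until the producer has published
    }
    int received = payload;  // safe: ordered after the producer's write
    assert(received == value_to_send);

    producer.join();
    return received;
}
```

The default ordering (std::memory_order_seq_cst) would also be correct here; release/acquire is simply the weakest ordering that makes this hand-off safe.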
Updated on March 21, 2020
Q: Why isn't being 5-10x slower a practical performance difference?
A: I should have made it clear that the numbers in the table are for 20 million increments and 10 million decrements of an integer, so the cost of each individual increment or decrement is almost negligible. Therefore, even a 10x slowdown is not a practical performance difference in most real-life applications.
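A rough back-of-the-envelope version of that argument, assuming we simply divide the slowest MSVC times in the table by the 30 million operations (this ignores that the two threads run concurrently, so it only gives an order of magnitude):

$$
t_{\text{atomic}} \approx \frac{533.607\ \text{ms}}{3 \times 10^{7}\ \text{ops}} \approx 17.8\ \text{ns/op},
\qquad
t_{\text{volatile}} \approx \frac{109.522\ \text{ms}}{3 \times 10^{7}\ \text{ops}} \approx 3.7\ \text{ns/op}
$$

So the extra cost of std::atomic<int> in this test is roughly 14 ns per operation, which becomes noticeable mainly in code that performs millions of contended updates in a tight loop.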
For volatile int:
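#include <chrono>
#include <future>
#include <iostream>

// Shared counter. volatile keeps every access from being optimized away, but
// ++vi and --vi are NOT atomic: the two threads race on it (undefined behavior).
volatile int vi = 0;

void loop(bool inc, int max) {
    for (int i = 0; i < max; ++i) {
        if (inc) {
            ++vi;
        } else {
            --vi;
        }
    }
}

int main() {
    auto start = std::chrono::steady_clock::now();
    // 20 million increments on a second thread...
    auto f = std::async(std::launch::async, loop, true, 20'000'000);
    // ...while this thread performs 10 million decrements.
    loop(false, 10'000'000);
    f.wait();
    auto end = std::chrono::steady_clock::now();
    std::cout << vi << std::endl;  // expected 10000000, but not guaranteed
    auto diff = end - start;
    auto ms = std::chrono::duration<double, std::milli>(diff).count();
    std::cout << ms << std::endl;  // elapsed wall-clock time in milliseconds
    return 0;
}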
For std::atomic<int>:
#include <atomic>
#include <chrono>
#include <future>
#include <iostream>

// Shared counter. Each ++ai / --ai is a single atomic read-modify-write,
// so no update can be lost.
std::atomic<int> ai(0);

void loop(bool inc, int max) {
    for (int i = 0; i < max; ++i) {
        if (inc) {
            ++ai;
        } else {
            --ai;
        }
    }
}

int main() {
    auto start = std::chrono::steady_clock::now();
    // 20 million increments on a second thread...
    auto f = std::async(std::launch::async, loop, true, 20'000'000);
    // ...while this thread performs 10 million decrements.
    loop(false, 10'000'000);
    f.wait();
    auto end = std::chrono::steady_clock::now();
    std::cout << ai << std::endl;  // always prints 10000000
    auto diff = end - start;
    auto ms = std::chrono::duration<double, std::milli>(diff).count();
    std::cout << ms << std::endl;  // elapsed wall-clock time in milliseconds
    return 0;
}
The result should be 10 million (20 million increments minus 10 million decrements), but with volatile int that value is not guaranteed. Of course, the test code is preliminary and there is room for improvement.
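One way the measurement could be tightened, sketched under assumptions that are not part of the original test: repeat each run several times, report the median instead of individual samples, and check the final counter value on every run. The helper names run_trial, median and benchmark are hypothetical. The use of std::memory_order_relaxed is also a variation not measured in the table above; the counter stays correct because every atomic read-modify-write is indivisible regardless of the memory order.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// One trial: `inc_count` increments on a second thread and `dec_count` decrements
// on this one (20 million and 10 million in the original test). Returns the final
// counter value and stores the elapsed wall-clock time in `elapsed_ms`.
int run_trial(int inc_count, int dec_count, double& elapsed_ms) {
    std::atomic<int> counter{0};
    auto start = std::chrono::steady_clock::now();
    std::thread incrementer([&] {
        for (int i = 0; i < inc_count; ++i) counter.fetch_add(1, std::memory_order_relaxed);
    });
    for (int i = 0; i < dec_count; ++i) counter.fetch_sub(1, std::memory_order_relaxed);
    incrementer.join();  // join makes all of the other thread's updates visible here
    auto end = std::chrono::steady_clock::now();
    elapsed_ms = std::chrono::duration<double, std::milli>(end - start).count();
    return counter.load();
}

// Median of the measured times; less sensitive to a single slow run than one sample.
double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    std::size_t mid = v.size() / 2;
    return v.size() % 2 ? v[mid] : (v[mid - 1] + v[mid]) / 2.0;
}

// Repeat the trial `trials` times, check every result, and return the median time in ms.
double benchmark(int inc_count, int dec_count, int trials) {
    std::vector<double> times;
    for (int t = 0; t < trials; ++t) {
        double ms = 0.0;
        int result = run_trial(inc_count, dec_count, ms);
        assert(result == inc_count - dec_count);  // atomic RMW never loses updates
        times.push_back(ms);
    }
    return median(times);
}
```

For example, benchmark(20'000'000, 10'000'000, 5) would repeat the original workload five times. On x86, fetch_add typically compiles to the same locked instruction for relaxed and sequentially consistent ordering, so a large difference from the default ordering is not expected there.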