Optimize x86 atomic_fence #328

Lastique · 2020-12-28T21:11:32Z

The first commit provides an optimized atomic_fence implementation for x86 on gcc-compatible compilers.

On x86 (32 and 64-bit) any lock-prefixed instruction provides sequential consistency guarantees for atomic operations and is more efficient than mfence. You can see some tests in this article.

We are choosing a "lock not" on a dummy byte on the stack for the following reasons:

- The "not" instruction does not affect flags or clobber any registers. The memory operand is presumably accessible through esp/rsp.
- The dummy byte variable is at the top of the stack, which is likely hot in cache.
- The dummy variable does not alias any other data on the stack, which means the "lock not" instruction won't introduce any false data dependencies with prior or following instructions.

In order to avoid various sanitizers and valgrind complaining, we have to initialize the dummy variable to zero prior to the operation.
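
As a rough illustration of the approach described above, a fence built on "lock not" could look like the sketch below. The function name `seq_cst_fence_sketch` and the fallback branch for non-x86 targets are assumptions made for this example; they are not taken from the patch itself.

```cpp
#include <atomic>

// Hypothetical name; a sketch of the technique, not the code from this PR.
inline void seq_cst_fence_sketch() noexcept {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // Zero-initialized so that sanitizers and valgrind do not report a read
    // of uninitialized memory when "not" reads the byte.
    unsigned char dummy = 0u;
    // "lock not" on a stack byte: acts as a full barrier on x86 and leaves
    // flags and registers untouched. The "memory" clobber additionally
    // prevents the compiler from reordering memory accesses across it.
    __asm__ __volatile__("lock; notb %0" : "+m"(dummy) : : "memory");
#else
    // Fallback for other targets or compilers (an assumption of this sketch).
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}
```

The "+m" constraint tells the compiler that the byte is both read and written in memory, so it has to keep the zero initialization and give the variable a real stack slot.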

Additionally, for memory orders weaker than seq_cst there is no need for any special instructions, and we only need a compiler fence. For the relaxed memory order we don't need even that.
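
The dispatch on the memory order described above can be sketched as follows. The helper name `x86_fence_sketch` is hypothetical, and `std::atomic_signal_fence` is used here as one portable way to express a compiler-only fence; the actual patch may spell it differently. The seq_cst branch simply defers to the standard fence to keep the example short; that is where a lock-prefixed instruction would go.

```cpp
#include <atomic>

// Hypothetical helper; a sketch under the assumption of an x86 target.
inline void x86_fence_sketch(std::memory_order order) noexcept {
#if defined(__x86_64__) || defined(__i386__)
    if (order == std::memory_order_relaxed) {
        // Relaxed: no ordering is requested, so nothing is emitted.
        return;
    }
    if (order == std::memory_order_seq_cst) {
        // Only seq_cst needs a real hardware barrier on x86.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return;
    }
    // acquire, release, acq_rel, consume: the x86 memory model already
    // provides the ordering, so it is enough to stop the compiler from
    // reordering. A signal fence emits no instructions.
    std::atomic_signal_fence(order);
#else
    // Other architectures need the real fence (an assumption of this sketch).
    std::atomic_thread_fence(order);
#endif
}
```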

The second commit removes explicit mfence on Windows. The existing std::atomic_thread_fence already provides the necessary instructions to maintain memory order according to its argument.

include/oneapi/tbb/detail/_machine.h

On x86 (32 and 64-bit) any lock-prefixed instruction provides sequential consistency guarantees for atomic operations and is more efficient than mfence.

We are choosing a "lock not" on a dummy byte on the stack for the following reasons:

- The "not" instruction does not affect flags or clobber any registers. The memory operand is presumably accessible through esp/rsp.
- The dummy byte variable is at the top of the stack, which is likely hot in cache.
- The dummy variable does not alias any other data on the stack, which means the "lock not" instruction won't introduce any false data dependencies with prior or following instructions.

In order to avoid various sanitizers and valgrind complaining, we have to initialize the dummy variable to zero prior to the operation.

Additionally, for memory orders weaker than seq_cst there is no need for any special instructions, and we only need a compiler fence. For the relaxed memory order we don't need even that.

This optimization is only enabled for gcc versions before 11. In gcc 11 the compiler implements a similar optimization for std::atomic_thread_fence. Compilers compatible with gcc (namely, clang up to 13 and icc up to 2021.3.0, inclusively) identify themselves as gcc < 11 and also benefit from this optimization, as they otherwise generate mfence for std::atomic_thread_fence(std::memory_order_seq_cst).

Signed-off-by: Andrey Semashev <[email protected]>
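
For illustration only, the compiler version gate mentioned in this commit message might be expressed along these lines. The macro name `SKETCH_USE_LOCKED_FENCE` and the function name are hypothetical, and the exact condition used in `_machine.h` may differ.

```cpp
#include <atomic>

// Hypothetical macro: 1 when the hand-written fence is selected. The check
// relies on __GNUC__, which gcc-compatible compilers also define.
#if defined(__GNUC__) && (__GNUC__ < 11) && \
    (defined(__x86_64__) || defined(__i386__))
#define SKETCH_USE_LOCKED_FENCE 1
#else
#define SKETCH_USE_LOCKED_FENCE 0
#endif

// Hypothetical name; picks the implementation at compile time.
inline void gated_seq_cst_fence_sketch() noexcept {
#if SKETCH_USE_LOCKED_FENCE
    unsigned char dummy = 0u;  // zero-initialized for sanitizers and valgrind
    __asm__ __volatile__("lock; notb %0" : "+m"(dummy) : : "memory");
#else
    // gcc 11 and later (and non-x86 targets): rely on the standard fence.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}
```

Keeping the decision in a single macro makes it easy to retire the workaround once the supported compilers all generate the cheaper instruction themselves.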

The necessary instructions according to the memory order argument should already be generated by std::atomic_thread_fence. Signed-off-by: Andrey Semashev <[email protected]>

The code uses memory_order_seq_cst in all call sites of atomic_fence, so remove the argument and simplify the implementation a bit. Also, renamed the function to make the memory order it implements apparent. Signed-off-by: Andrey Semashev <[email protected]>

* Added optimized x86 atomic_fence for gcc-compatible compilers.

On x86 (32 and 64-bit) any lock-prefixed instruction provides sequential consistency guarantees for atomic operations and is more efficient than mfence.

We are choosing a "lock not" on a dummy byte on the stack for the following reasons:

- The "not" instruction does not affect flags or clobber any registers. The memory operand is presumably accessible through esp/rsp.
- The dummy byte variable is at the top of the stack, which is likely hot in cache.
- The dummy variable does not alias any other data on the stack, which means the "lock not" instruction won't introduce any false data dependencies with prior or following instructions.

In order to avoid various sanitizers and valgrind complaining, we have to initialize the dummy variable to zero prior to the operation.

Additionally, for memory orders weaker than seq_cst there is no need for any special instructions, and we only need a compiler fence. For the relaxed memory order we don't need even that.

This optimization is only enabled for gcc versions before 11. In gcc 11 the compiler implements a similar optimization for std::atomic_thread_fence. Compilers compatible with gcc (namely, clang up to 13 and icc up to 2021.3.0, inclusively) identify themselves as gcc < 11 and also benefit from this optimization, as they otherwise generate mfence for std::atomic_thread_fence(std::memory_order_seq_cst).

Signed-off-by: Andrey Semashev <[email protected]>

* Removed explicit mfence in atomic_fence on Windows.

The necessary instructions according to the memory order argument should already be generated by std::atomic_thread_fence.

Signed-off-by: Andrey Semashev <[email protected]>

* Removed memory order argument from atomic_fence.

The code uses memory_order_seq_cst in all call sites of atomic_fence, so remove the argument and simplify the implementation a bit. Also, renamed the function to make the memory order it implements apparent.

Signed-off-by: Andrey Semashev <[email protected]>

Lastique force-pushed the optimize_x86_fence branch from 0ab68f2 to 06a56b3 Compare November 25, 2021 13:45

Lastique changed the base branch from onetbb_2021 to master November 25, 2021 13:45

alexey-katranov reviewed Nov 25, 2021


include/oneapi/tbb/detail/_machine.h

Lastique force-pushed the optimize_x86_fence branch from 06a56b3 to 1d22948 Compare November 25, 2021 17:23

Lastique added 3 commits November 26, 2021 01:22

Removed explicit mfence in atomic_fence on Windows.

de51c60

The necessary instructions according to the memory order argument should already be generated by std::atomic_thread_fence. Signed-off-by: Andrey Semashev <[email protected]>

Lastique force-pushed the optimize_x86_fence branch from 1d22948 to 8feefce Compare November 25, 2021 22:30

Lastique requested a review from alexey-katranov November 25, 2021 22:30

alexey-katranov approved these changes Nov 26, 2021


alexey-katranov added the enhancement label Dec 22, 2021

alexey-katranov merged commit 8a87469 into oneapi-src:master Dec 22, 2021

Lastique deleted the optimize_x86_fence branch December 22, 2021 11:11
