Optimize x86 atomic_fence #328

Merged
alexey-katranov merged 3 commits into oneapi-src:master from optimize_x86_fence on Dec 22, 2021

Lastique
Contributor

The first commit provides an optimized atomic_fence implementation for x86 on gcc-compatible compilers.

On x86 (32 and 64-bit) any lock-prefixed instruction provides sequential consistency guarantees for atomic operations and is more efficient than mfence. You can see some tests in this article.

We are choosing a "lock not" on a dummy byte on the stack for the following reasons:

  • The "not" instruction does not affect flags or clobber any registers. The memory operand is presumably accessible through esp/rsp.
  • The dummy byte variable is at the top of the stack, which is likely hot in cache.
  • The dummy variable does not alias any other data on the stack, which means the "lock not" instruction won't introduce any false data dependencies with prior or following instructions.

To keep sanitizers and valgrind from complaining about a read of uninitialized memory, we have to initialize the dummy variable to zero prior to the operation.
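The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the names `seq_cst_fence_lock_not`, `announce_and_peek` and `usage_example` are hypothetical, and on anything other than a gcc-compatible compiler targeting x86 the sketch simply falls back to `std::atomic_thread_fence`.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical name: a sketch of the "lock not" fence, not the patch itself.
inline void seq_cst_fence_lock_not() {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    // Zero-initialized so that sanitizers and valgrind see a defined value.
    unsigned char dummy = 0u;
    // Locked read-modify-write on a private stack byte: orders all earlier
    // stores before all later loads. "not" leaves the flags untouched and
    // needs no scratch register; the "memory" clobber is the compiler fence.
    __asm__ __volatile__("lock; notb %0" : "+m"(dummy) : : "memory");
#else
    // Assumption: elsewhere we just defer to the standard library.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Store-then-load is the one reordering x86 permits, so this is the
// pattern (Dekker-style flags) that actually needs the full fence.
inline bool announce_and_peek(std::atomic<bool>& mine,
                              const std::atomic<bool>& theirs) {
    mine.store(true, std::memory_order_relaxed);
    seq_cst_fence_lock_not();
    return theirs.load(std::memory_order_relaxed);
}

// Single-threaded usage check, so the results are deterministic.
inline void usage_example() {
    std::atomic<bool> a{false}, b{false};
    assert(!announce_and_peek(a, b));  // b has not announced yet
    assert(announce_and_peek(b, a));   // a announced in the line above
}
```

The `"+m"` constraint forces the byte into memory, so the compiler addresses it relative to the stack pointer (or frame pointer) without tying up another register.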

Additionally, for memory orders weaker than seq_cst there is no need for any special instructions, and we only need a compiler fence. For the relaxed memory order we don't need even that.
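As a rough sketch of that dispatch, assuming an x86 target: `fence_kind`, `x86_fence_kind` and `x86_atomic_fence` are hypothetical names used only for illustration, and the seq_cst branch uses `std::atomic_thread_fence` as a stand-in for whichever full fence is chosen.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical classification of what a fence needs on x86.
enum class fence_kind { none, compiler_only, full };

constexpr fence_kind x86_fence_kind(std::memory_order order) {
    return order == std::memory_order_relaxed ? fence_kind::none
         : order == std::memory_order_seq_cst ? fence_kind::full
         : fence_kind::compiler_only;
}

inline void x86_atomic_fence(std::memory_order order) {
    switch (x86_fence_kind(order)) {
    case fence_kind::none:
        break;  // relaxed: nothing to emit
    case fence_kind::compiler_only:
        // Prevents compiler reordering only; emits no CPU instruction.
        std::atomic_signal_fence(order);
        break;
    case fence_kind::full:
        // Stand-in for the full hardware fence.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        break;
    }
}

inline void usage_example_orders() {
    assert(x86_fence_kind(std::memory_order_relaxed) == fence_kind::none);
    assert(x86_fence_kind(std::memory_order_acquire) == fence_kind::compiler_only);
    assert(x86_fence_kind(std::memory_order_release) == fence_kind::compiler_only);
    assert(x86_fence_kind(std::memory_order_seq_cst) == fence_kind::full);
    x86_atomic_fence(std::memory_order_release);
}
```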

The second commit removes explicit mfence on Windows. The existing std::atomic_thread_fence already provides the necessary instructions to maintain memory order according to its argument.

Lastique changed the base branch from onetbb_2021 to master on November 25, 2021 13:45
alexey-katranov merged commit 8a87469 into oneapi-src:master on Dec 22, 2021
Lastique deleted the optimize_x86_fence branch on December 22, 2021 11:11
kboyarinov pushed a commit that referenced this pull request Dec 27, 2021
* Added optimized x86 atomic_fence for gcc-compatible compilers.

On x86 (32 and 64-bit) any lock-prefixed instruction provides sequential
consistency guarantees for atomic operations and is more efficient than
mfence.

We are choosing a "lock not" on a dummy byte on the stack for the following
reasons:

 - The "not" instruction does not affect flags or clobber any registers.
   The memory operand is presumably accessible through esp/rsp.
 - The dummy byte variable is at the top of the stack, which is likely
   hot in cache.
 - The dummy variable does not alias any other data on the stack, which
   means the "lock not" instruction won't introduce any false data
   dependencies with prior or following instructions.

In order to avoid various sanitizers and valgrind complaining, we have to
initialize the dummy variable to zero prior to the operation.

Additionally, for memory orders weaker than seq_cst there is no need for
any special instructions, and we only need a compiler fence. For the relaxed
memory order we don't need even that.

This optimization is only enabled for gcc versions older than 11. Starting
with gcc 11, the compiler itself implements a similar optimization for
std::atomic_thread_fence.
Compilers compatible with gcc (namely, clang up to 13 and icc up to 2021.3.0,
inclusively) identify themselves as gcc < 11 and also benefit from this
optimization, as they otherwise generate mfence for
std::atomic_thread_fence(std::memory_order_seq_cst).

Signed-off-by: Andrey Semashev <[email protected]>

* Removed explicit mfence in atomic_fence on Windows.

The necessary instructions according to the memory order argument
should already be generated by std::atomic_thread_fence.

Signed-off-by: Andrey Semashev <[email protected]>

* Removed memory order argument from atomic_fence.

The code uses memory_order_seq_cst in all call sites of atomic_fence,
so remove the argument and simplify the implementation a bit. Also, rename
the function to make the memory order it implements apparent.

Signed-off-by: Andrey Semashev <[email protected]>
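
Taken together, the three commits suggest a final shape along these lines. This is only a sketch under stated assumptions: the function name `fence_seq_cst` and the macro `SKETCH_USE_LOCK_NOT_FENCE` are hypothetical (the commit messages do not give the new name), and the version check mirrors the "gcc < 11" condition described in the first commit.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical gate: gcc-compatible compiler reporting a version below 11,
// targeting 32- or 64-bit x86.
#if defined(__GNUC__) && (__GNUC__ < 11) && \
    (defined(__i386__) || defined(__x86_64__))
#define SKETCH_USE_LOCK_NOT_FENCE 1
#else
#define SKETCH_USE_LOCK_NOT_FENCE 0
#endif

// No memory order argument: the function always implements seq_cst,
// and (hypothetically) says so in its name.
inline void fence_seq_cst() {
#if SKETCH_USE_LOCK_NOT_FENCE
    unsigned char dummy = 0u;  // zero-initialized for sanitizers/valgrind
    __asm__ __volatile__("lock; notb %0" : "+m"(dummy) : : "memory");
#else
    // Newer gcc, and every non-x86 or non-gcc-compatible target
    // (including Windows toolchains): rely on the standard fence.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

inline void usage_example_fence() {
    std::atomic<int> value{1};
    value.store(2, std::memory_order_relaxed);
    fence_seq_cst();
    assert(value.load(std::memory_order_relaxed) == 2);
}
```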