chacha: safer outputting #1181

Merged: 4 commits merged into rust-random:master on Sep 13, 2021

Conversation

@kazcw (Collaborator) commented Sep 11, 2021

mod guts was originally designed for the byte-slice interface that the RustCrypto APIs require, but the algorithm operates on u32 words internally and rand wants a word-slice interface, so we were converting words to bytes in mod guts and converting them back to words in mod chacha. We can instead output directly to a word slice in guts. It is simpler, it may be marginally faster, and it avoids an unsafe block (cf. #1170).
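
For illustration, a minimal sketch of the two output shapes (the function names, signatures, and fixed [u32; 16] block below are invented for the example; they are not the actual guts API):

use std::convert::TryInto;

// Old shape (sketch): guts serializes the state words to little-endian bytes,
// and the chacha side rebuilds u32 words from those bytes.
fn output_via_bytes(state: &[u32; 16], out_words: &mut [u32; 16]) {
    let mut bytes = [0u8; 64];
    for (chunk, &word) in bytes.chunks_exact_mut(4).zip(state.iter()) {
        chunk.copy_from_slice(&word.to_le_bytes());
    }
    for (word, chunk) in out_words.iter_mut().zip(bytes.chunks_exact(4)) {
        *word = u32::from_le_bytes(chunk.try_into().unwrap());
    }
}

// New shape (sketch): write the words straight into the word slice; no byte
// round trip and no unsafe cast on the consuming side.
fn output_words(state: &[u32; 16], out_words: &mut [u32; 16]) {
    out_words.copy_from_slice(state);
}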

@dhardy (Member) commented Sep 12, 2021

Thanks! Could you do cargo +nightly bench --bench generators chacha before/after?

@dhardy (Member) commented Sep 12, 2021

CC @joshlf @Ralith

@kazcw (Collaborator, Author) commented Sep 12, 2021

It is actually 1-2% slower. Looking at the ASM, the optimizer does the right thing with the bytewise loop (it unrolls the loop and moves data in SIMD chunks), but it doesn't see through the wordwise loop.

However, I found that if I manually unroll the loop, the optimizer produces SIMD output equivalent to the current unsafe version. While that adds 16 repetitious lines, it achieves safety without affecting current performance.
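
For illustration only, a self-contained sketch of the loop-versus-unroll difference using plain arrays (the real change operates on guts' SIMD row vectors and has 16 such lines, four rows for each of four blocks):

// Looped form: correct, but the optimizer may not see through it.
fn write_block_looped(rows: &[[u32; 4]; 4], out: &mut [u32; 16]) {
    for (i, row) in rows.iter().enumerate() {
        out[i * 4..i * 4 + 4].copy_from_slice(row);
    }
}

// Hand-unrolled form: the same stores spelled out explicitly, giving the
// optimizer a fixed sequence of contiguous copies it can lower to SIMD moves.
fn write_block_unrolled(rows: &[[u32; 4]; 4], out: &mut [u32; 16]) {
    out[0..4].copy_from_slice(&rows[0]);
    out[4..8].copy_from_slice(&rows[1]);
    out[8..12].copy_from_slice(&rows[2]);
    out[12..16].copy_from_slice(&rows[3]);
}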

Benchmarks (SSE4.1 machine):

Before this PR:
test gen_bytes_chacha12 ... bench: 1,176,279 ns/iter (+/- 9,206) = 870 MB/s
test gen_bytes_chacha20 ... bench: 1,774,785 ns/iter (+/- 17,849) = 576 MB/s
test gen_bytes_chacha8 ... bench: 869,230 ns/iter (+/- 951) = 1178 MB/s

After original PR:
test gen_bytes_chacha12 ... bench: 1,193,299 ns/iter (+/- 9,527) = 858 MB/s
test gen_bytes_chacha20 ... bench: 1,809,063 ns/iter (+/- 1,675) = 566 MB/s
test gen_bytes_chacha8 ... bench: 892,924 ns/iter (+/- 2,077) = 1146 MB/s

After updated PR:
test gen_bytes_chacha12 ... bench: 1,165,595 ns/iter (+/- 9,184) = 878 MB/s
test gen_bytes_chacha20 ... bench: 1,773,808 ns/iter (+/- 2,155) = 578 MB/s
test gen_bytes_chacha8 ... bench: 870,023 ns/iter (+/- 1,885) = 1176 MB/s

@Ralith (Contributor) commented Sep 12, 2021

What if you use explicit indexing (for i in 0..4) rather than zipped iterators? I've found that tends to optimize down more reliably in hecs, at least. Failing that, it seems worth leaving a comment justifying the hand-unroll, to provide context in the future.

Commit by @kazcw: "This reverts commit 7d9607a." (Had a bug; after fixing the bug, perf was poor.)

@kazcw (Collaborator, Author) commented Sep 12, 2021

@Ralith: Thanks for the idea, but in this case I'm getting poor performance with a 0..4 loop. I tried it as follows:

for i in 0..4 {
    let j = i * 16;
    out[j..(j+4)].copy_from_slice(&(a[i] + k).to_lanes());
    out[(j+4)..(j+8)].copy_from_slice(&(b[i] + sb).to_lanes());
    out[(j+8)..(j+12)].copy_from_slice(&(c[i] + sc).to_lanes());
    out[(j+12)..(j+16)].copy_from_slice(&(d[i] + sd[i]).to_lanes());
}

@Ralith (Contributor) commented Sep 12, 2021

Ah well, thanks for trying!

@vks merged commit 6e6b4ce into rust-random:master on Sep 13, 2021
@dhardy (Member) commented Sep 15, 2021

Perf numbers for another PR I'm working on weren't what I expected, but I narrowed the difference down to this:

# before:
test gen_bytes_chacha12      ... bench:     296,278 ns/iter (+/- 3,935) = 3456 MB/s
test gen_bytes_chacha20      ... bench:     434,789 ns/iter (+/- 4,822) = 2355 MB/s
test gen_bytes_chacha8       ... bench:     226,779 ns/iter (+/- 1,013) = 4515 MB/s
# after:
test gen_bytes_chacha12      ... bench:     233,626 ns/iter (+/- 2,274) = 4383 MB/s
test gen_bytes_chacha20      ... bench:     372,063 ns/iter (+/- 4,054) = 2752 MB/s
test gen_bytes_chacha8       ... bench:     162,006 ns/iter (+/- 6,354) = 6320 MB/s

That's 15-29% slower. (CPU is 5800X aka Vermeer/Zen 3.)

@vks (Collaborator) commented Sep 15, 2021

What is "before", and what is "after"? Your numbers look faster.

@kazcw Did you perform your benchmarks with native optimizations or without?

@dhardy (Member) commented Sep 15, 2021

"Before" is ceb25f8 (master before merging this), "after" is 6e6b4ce.

SSE4.1 was introduced in Intel Penryn in 2008, quite ancient by this point. I guess @kazcw likes vintage hardware?

@vks (Collaborator) commented Sep 15, 2021

"Before" is ceb25f8 (master before merging this), "after" is 6e6b4ce.

So now is faster? I think you switched "before" and "after".

@vks (Collaborator) commented Sep 15, 2021

I also observe the performance regression on a Ryzen 9 4900HS, independent of native optimizations. So it looks like the new code does not optimize properly for AVX?

Before:

# ceb25f8
test gen_bytes_chacha12      ... bench:     357,973 ns/iter (+/- 31,898) = 2860 MB/s
test gen_bytes_chacha20      ... bench:     537,607 ns/iter (+/- 71,716) = 1904 MB/s
test gen_bytes_chacha8       ... bench:     277,995 ns/iter (+/- 42,393) = 3683 MB/s
# ceb25f8 with RUSTFLAGS="-Ctarget-cpu=native"
test gen_bytes_chacha12      ... bench:     336,754 ns/iter (+/- 33,961) = 3040 MB/s
test gen_bytes_chacha20      ... bench:     530,485 ns/iter (+/- 113,055) = 1930 MB/s
test gen_bytes_chacha8       ... bench:     253,036 ns/iter (+/- 32,046) = 4046 MB/s

After:

# 6e6b4ce
test gen_bytes_chacha12      ... bench:     442,418 ns/iter (+/- 77,900) = 2314 MB/s
test gen_bytes_chacha20      ... bench:     595,200 ns/iter (+/- 90,416) = 1720 MB/s
test gen_bytes_chacha8       ... bench:     353,452 ns/iter (+/- 51,468) = 2897 MB/s
# 6e6b4ce with RUSTFLAGS="-Ctarget-cpu=native"
test gen_bytes_chacha12      ... bench:     413,945 ns/iter (+/- 40,007) = 2473 MB/s
test gen_bytes_chacha20      ... bench:     619,950 ns/iter (+/- 91,325) = 1651 MB/s
test gen_bytes_chacha8       ... bench:     329,435 ns/iter (+/- 45,898) = 3108 MB/s
