chacha: safer outputting #1181

Merged: 4 commits merged into rust-random:master on Sep 13, 2021

Conversation

@kazcw (Collaborator) commented Sep 11, 2021

mod guts was originally designed for the byte-slice interface that the RustCrypto APIs require, but the algorithm operates on u32 words internally and rand wants a word-slice interface, so we were converting words to bytes in mod guts and converting them back to words in mod chacha. We can instead output directly to a word slice in guts. It is simpler, it may be marginally faster, and it avoids an unsafe block (cf. #1170).
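
For illustration, a minimal sketch of the two output shapes (the function names, signatures, and fixed [u32; 16] block below are invented for the example; they are not the actual guts API):

use std::convert::TryInto;

// Old shape (sketch): guts serializes the state words to little-endian bytes,
// and the chacha side rebuilds u32 words from those bytes.
fn output_via_bytes(state: &[u32; 16], out_words: &mut [u32; 16]) {
    let mut bytes = [0u8; 64];
    for (chunk, &word) in bytes.chunks_exact_mut(4).zip(state.iter()) {
        chunk.copy_from_slice(&word.to_le_bytes());
    }
    for (word, chunk) in out_words.iter_mut().zip(bytes.chunks_exact(4)) {
        *word = u32::from_le_bytes(chunk.try_into().unwrap());
    }
}

// New shape (sketch): write the words straight into the word slice; no byte
// round trip and no unsafe cast on the consuming side.
fn output_words(state: &[u32; 16], out_words: &mut [u32; 16]) {
    out_words.copy_from_slice(state);
}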

@dhardy (Member) commented Sep 12, 2021

Thanks! Could you do cargo +nightly bench --bench generators chacha before/after?

@dhardy (Member) commented Sep 12, 2021

CC @joshlf @Ralith

@kazcw (Collaborator, Author) commented Sep 12, 2021

It is actually 1-2% slower. Looking at the ASM, the optimizer does the right thing with the bytewise loop (it unrolls the loop and moves data in SIMD chunks), but it doesn't see through the wordwise loop.

However, I found that if I manually unroll the loop, the optimizer produces SIMD output equivalent to the current unsafe version. While that adds 16 repetitious lines, it achieves safety without affecting current performance.
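
For illustration only, a self-contained sketch of the loop-versus-unroll difference using plain arrays (the real change operates on guts' SIMD row vectors and has 16 such lines, four rows for each of four blocks):

// Looped form: correct, but the optimizer may not see through it.
fn write_block_looped(rows: &[[u32; 4]; 4], out: &mut [u32; 16]) {
    for (i, row) in rows.iter().enumerate() {
        out[i * 4..i * 4 + 4].copy_from_slice(row);
    }
}

// Hand-unrolled form: the same stores spelled out explicitly, giving the
// optimizer a fixed sequence of contiguous copies it can lower to SIMD moves.
fn write_block_unrolled(rows: &[[u32; 4]; 4], out: &mut [u32; 16]) {
    out[0..4].copy_from_slice(&rows[0]);
    out[4..8].copy_from_slice(&rows[1]);
    out[8..12].copy_from_slice(&rows[2]);
    out[12..16].copy_from_slice(&rows[3]);
}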

Benchmarks (SSE4.1 machine):

Before this PR:
test gen_bytes_chacha12 ... bench: 1,176,279 ns/iter (+/- 9,206) = 870 MB/s
test gen_bytes_chacha20 ... bench: 1,774,785 ns/iter (+/- 17,849) = 576 MB/s
test gen_bytes_chacha8 ... bench: 869,230 ns/iter (+/- 951) = 1178 MB/s

After original PR:
test gen_bytes_chacha12 ... bench: 1,193,299 ns/iter (+/- 9,527) = 858 MB/s
test gen_bytes_chacha20 ... bench: 1,809,063 ns/iter (+/- 1,675) = 566 MB/s
test gen_bytes_chacha8 ... bench: 892,924 ns/iter (+/- 2,077) = 1146 MB/s

After updated PR:
test gen_bytes_chacha12 ... bench: 1,165,595 ns/iter (+/- 9,184) = 878 MB/s
test gen_bytes_chacha20 ... bench: 1,773,808 ns/iter (+/- 2,155) = 578 MB/s
test gen_bytes_chacha8 ... bench: 870,023 ns/iter (+/- 1,885) = 1176 MB/s

@Ralith (Contributor) commented Sep 12, 2021

What if you use explicit indexing (for i in 0..4) rather than zipped iterators? I've found that tends to optimize down more reliably in hecs, at least. Failing that, it seems worth leaving a comment justifying the hand-unroll, to provide context in the future.

Commit by @kazcw: "This reverts commit 7d9607a." (Had a bug; after fixing the bug, perf was poor.)

@kazcw (Collaborator, Author) commented Sep 12, 2021

@Ralith: Thanks for the idea, but in this case I'm getting poor performance with a 0..4 loop. I tried it as follows:

for i in 0..4 {
    let j = i * 16;
    out[j..(j+4)].copy_from_slice(&(a[i] + k).to_lanes());
    out[(j+4)..(j+8)].copy_from_slice(&(b[i] + sb).to_lanes());
    out[(j+8)..(j+12)].copy_from_slice(&(c[i] + sc).to_lanes());
    out[(j+12)..(j+16)].copy_from_slice(&(d[i] + sd[i]).to_lanes());
}

@Ralith (Contributor) commented Sep 12, 2021

Ah well, thanks for trying!

@vks merged commit 6e6b4ce into rust-random:master on Sep 13, 2021
@dhardy (Member) commented Sep 15, 2021

Perf numbers for another PR I'm working on weren't what I expected, but I narrowed the difference down to this:

# before:
test gen_bytes_chacha12      ... bench:     296,278 ns/iter (+/- 3,935) = 3456 MB/s
test gen_bytes_chacha20      ... bench:     434,789 ns/iter (+/- 4,822) = 2355 MB/s
test gen_bytes_chacha8       ... bench:     226,779 ns/iter (+/- 1,013) = 4515 MB/s
# after:
test gen_bytes_chacha12      ... bench:     233,626 ns/iter (+/- 2,274) = 4383 MB/s
test gen_bytes_chacha20      ... bench:     372,063 ns/iter (+/- 4,054) = 2752 MB/s
test gen_bytes_chacha8       ... bench:     162,006 ns/iter (+/- 6,354) = 6320 MB/s

That's 15-29% slower. (CPU is 5800X aka Vermeer/Zen 3.)

@vks (Collaborator) commented Sep 15, 2021

What is "before", and what is "after"? Your numbers look faster.

@kazcw Did you perform your benchmarks with native optimizations or without?

@dhardy (Member) commented Sep 15, 2021

"Before" is ceb25f8 (master before merging this), "after" is 6e6b4ce.

SSE4.1 was introduced in Intel Penryn in 2008, quite ancient by this point. I guess @kazcw likes vintage hardware?

@vks (Collaborator) commented Sep 15, 2021

"Before" is ceb25f8 (master before merging this), "after" is 6e6b4ce.

So now is faster? I think you switched "before" and "after".

@vks (Collaborator) commented Sep 15, 2021

I also observe the performance regression on a Ryzen 9 4900HS, independent of native optimizations. So it looks like the new code does not optimize properly for AVX?

Before:

# ceb25f8
test gen_bytes_chacha12      ... bench:     357,973 ns/iter (+/- 31,898) = 2860 MB/s
test gen_bytes_chacha20      ... bench:     537,607 ns/iter (+/- 71,716) = 1904 MB/s
test gen_bytes_chacha8       ... bench:     277,995 ns/iter (+/- 42,393) = 3683 MB/s
# ceb25f8 with RUSTFLAGS="-Ctarget-cpu=native"
test gen_bytes_chacha12      ... bench:     336,754 ns/iter (+/- 33,961) = 3040 MB/s
test gen_bytes_chacha20      ... bench:     530,485 ns/iter (+/- 113,055) = 1930 MB/s
test gen_bytes_chacha8       ... bench:     253,036 ns/iter (+/- 32,046) = 4046 MB/s

After:

# 6e6b4ce
test gen_bytes_chacha12      ... bench:     442,418 ns/iter (+/- 77,900) = 2314 MB/s
test gen_bytes_chacha20      ... bench:     595,200 ns/iter (+/- 90,416) = 1720 MB/s
test gen_bytes_chacha8       ... bench:     353,452 ns/iter (+/- 51,468) = 2897 MB/s
# 6e6b4ce with RUSTFLAGS="-Ctarget-cpu=native"
test gen_bytes_chacha12      ... bench:     413,945 ns/iter (+/- 40,007) = 2473 MB/s
test gen_bytes_chacha20      ... bench:     619,950 ns/iter (+/- 91,325) = 1651 MB/s
test gen_bytes_chacha8       ... bench:     329,435 ns/iter (+/- 45,898) = 3108 MB/s
