
fill_via_chunks: use safe code via chunks_exact_mut on BE #1180

Merged: 5 commits, Sep 15, 2021

Conversation

@dhardy (Member) commented Sep 11, 2021

Address #1170 (@joshlf):

  • fill_via_chunks: remove/replace first use of unsafe
  • fill_via_chunks: replace second use of unsafe
  • inline fill_via_chunks or use a generic function?
  • remove unsafe in BlockRngCore::generate for ChaChaXCore (chacha: safer outputting #1181)

This removes some unsafe code and actually yields a substantial performance boost (measured with the LE-specific path commented out). Specifically, using chunks_exact_mut is a massive boost to the gen_bytes_chacha* benches (the only ones affected).

We could remove the LE-specific branch, but as mentioned it is substantially faster than the chunking code (much more than the 8% previously mentioned). I don't think there's a better safe alternative than chunking.
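For illustration, a minimal sketch of the safe chunking path for a u32 source (the function name is hypothetical and this is not the PR's exact code, which goes through a macro/generic helper and also returns consumed/filled counts):

    // Sketch: fill `dest` with little-endian bytes of `src`, using safe
    // chunking. Assumes nothing about dest's length or alignment.
    fn fill_via_u32_chunks_safe(src: &[u32], dest: &mut [u8]) {
        let mut src_iter = src.iter();
        let mut chunks = dest.chunks_exact_mut(4);
        // Full 4-byte chunks: one little-endian word each.
        for (chunk, word) in (&mut chunks).zip(&mut src_iter) {
            chunk.copy_from_slice(&word.to_le_bytes());
        }
        // Trailing partial word, if dest's length is not a multiple of 4.
        let rem = chunks.into_remainder();
        if !rem.is_empty() {
            if let Some(word) = src_iter.next() {
                rem.copy_from_slice(&word.to_le_bytes()[..rem.len()]);
            }
        }
    }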

Some quick benchmarks:

# LE code (copy_nonoverlapping):
test gen_bytes_chacha12      ... bench:     235,341 ns/iter (+/- 2,141) = 4351 MB/s
test gen_bytes_chacha20      ... bench:     374,516 ns/iter (+/- 2,762) = 2734 MB/s
test gen_bytes_chacha8       ... bench:     167,612 ns/iter (+/- 1,709) = 6109 MB/s
# New chunking code:
test gen_bytes_chacha12      ... bench:     336,035 ns/iter (+/- 5,764) = 3047 MB/s
test gen_bytes_chacha20      ... bench:     472,304 ns/iter (+/- 3,544) = 2168 MB/s
test gen_bytes_chacha8       ... bench:     270,032 ns/iter (+/- 4,526) = 3792 MB/s
# Old chunking code:
test gen_bytes_chacha12      ... bench:     888,131 ns/iter (+/- 4,710) = 1152 MB/s
test gen_bytes_chacha20      ... bench:   1,022,726 ns/iter (+/- 2,474) = 1001 MB/s
test gen_bytes_chacha8       ... bench:     819,164 ns/iter (+/- 3,944) = 1250 MB/s

This (specifically, using chunks_exact_mut) actually improves performance substantially.
Comment on lines 63 to 68
unsafe {
    core::ptr::copy_nonoverlapping(
        $src.as_ptr() as *const u8,
        $dst.as_mut_ptr(),
        chunk_size_u8);
}
Contributor:
You could also try the following safe approach. It adds some up-front bounds checks (which the compiler might be able to optimize out), but I would guess that it won't be a noticeable performance hit:

let src_u8_slice = zerocopy::AsBytes::as_bytes($src);
(&mut $dst[..chunk_size_u8]).copy_from_slice(&src_u8_slice[..chunk_size_u8]);

Member Author:

Isn't this just trading one unsafe for another? We try to minimise rand's dependency count (people have complained multiple times in the past). Another point of compromise :-(

Contributor:

Technically, yes, although zerocopy's entire reason for existing is to encapsulate this unsafe code so that other crates don't have to think about it. The more crates that use zerocopy instead of writing their own unsafe, the fewer instances of unsafe the ecosystem has to get right. Plus, when the Lang Item for Transmutability is implemented and stabilized, AsBytes will be modified to be just a thin wrapper around those utilities, and will no longer require any unsafe itself.

Member Author:

I'm still not keen on so many extra dependencies just for this one use of unsafe:

rand_core v0.6.3 (/home/dhardy/projects/rand/rand/rand_core)
├── getrandom v0.2.3
│   ├── cfg-if v1.0.0
│   └── libc v0.2.101
├── serde v1.0.130
│   └── serde_derive v1.0.130 (proc-macro)
│       ├── proc-macro2 v1.0.29
│       │   └── unicode-xid v0.2.2
│       ├── quote v1.0.9
│       │   └── proc-macro2 v1.0.29 (*)
│       └── syn v1.0.76
│           ├── proc-macro2 v1.0.29 (*)
│           ├── quote v1.0.9 (*)
│           └── unicode-xid v0.2.2
└── zerocopy v0.6.0
    ├── byteorder v1.4.3
    └── zerocopy-derive v0.3.0 (proc-macro)
        ├── proc-macro2 v1.0.29 (*)
        ├── syn v1.0.76 (*)
        └── synstructure v0.12.5
            ├── proc-macro2 v1.0.29 (*)
            ├── quote v1.0.9 (*)
            ├── syn v1.0.76 (*)
            └── unicode-xid v0.2.2

That's four extra deps when using serde (optional), nine when not, in what is essentially just an API crate. I know there's a valid argument that we shouldn't care, but (1) we've had complaints: #713, #850 and (2) if any of those happen to want to use rand (and I note that zerocopy does as a dev-dependency), then you have circular dependencies.

Collaborator:

zerocopy-derive looks like it would increase compile time significantly.

Contributor:

just for this one use of unsafe

I've identified a few other places that could use zerocopy:

That's four extra deps when using serde (optional)

Which four? I count one (synstructure v0.12.5) that is used transitively by zerocopy-derive and not transitively by serde-derive.

I mention these only to provide support for using zerocopy (and most importantly for getting rid of unsafe - if another crate like bytemuck is a better tradeoff, you should go with that instead), not to suggest that the tradeoff is definitely worth it.

Collaborator:

bytemuck looks like a better option at the moment, because the derive support is optional, so we could disable it and avoid the dependency / compile-time overhead. I think depending on bytemuck might be acceptable to reduce the unsafe code in Rand.

@joshlf (Contributor) commented Sep 11, 2021

Those are some awesome perf numbers!

cc @Ralith

@joshlf (Contributor) commented Sep 11, 2021

One approach I've seen before for endianness handling is to just do the copy and then go back and do a second pass afterwards to fix the endianness in-place. This can be more cache friendly on large inputs since, during the endianness swap pass, you don't need both the source and destination memory ranges in cache.
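A hedged sketch of that copy-then-fixup idea for a u32 source (illustrative only, not code from this PR; assumes dest's length is a multiple of 4):

    // Pass 1 copies in native byte order; pass 2 (BE targets only) swaps
    // each word in place, so only `dest` needs to be in cache then.
    fn copy_then_fix_endianness(src: &[u32], dest: &mut [u8]) {
        for (chunk, word) in dest.chunks_exact_mut(4).zip(src) {
            chunk.copy_from_slice(&word.to_ne_bytes());
        }
        #[cfg(target_endian = "big")]
        for chunk in dest.chunks_exact_mut(4) {
            // Byte-swap in place so the output is LE on every target.
            chunk.reverse();
        }
    }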

@Ralith (Contributor) commented Sep 11, 2021

Nice, that worked out better than I expected! It's interesting to see what LLVM is and isn't capable of optimizing out here.

@dhardy (Member Author) commented Sep 12, 2021

One approach I've seen before for endianness handling is to just do the copy and then go back and do a second pass afterwards to fix the endianness in-place

Doing this is problematic since we know nothing about the alignment of the output byte-array, and may have a partial word at the end. But doing the reverse works fine, and there doesn't appear to be a reason we can't mutate the source (though it is a breaking change). These are the results I get (byte-flipping to BE on my machine, since I don't have a BE test machine):

test gen_bytes_chacha12      ... bench:     273,486 ns/iter (+/- 531) = 3744 MB/s
test gen_bytes_chacha20      ... bench:     411,202 ns/iter (+/- 1,400) = 2490 MB/s
test gen_bytes_chacha8       ... bench:     203,651 ns/iter (+/- 953) = 5028 MB/s
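A hedged sketch of that "reverse" variant (the function name is hypothetical and this is not the PR's code; mutating the source is the breaking-change part, and the unsafe bulk read is sound for u32 because it has no padding and every bit pattern is initialised):

    // Byte-swap the *source* words in place (BE only), then one bulk copy.
    fn swap_src_then_copy(src: &mut [u32], dest: &mut [u8]) {
        // Pass 1 (big-endian targets only): after this, the in-memory
        // representation of `src` is little-endian on every target.
        #[cfg(target_endian = "big")]
        for word in src.iter_mut() {
            *word = word.swap_bytes();
        }
        // Pass 2: a single byte copy; dest's alignment is irrelevant and a
        // trailing partial word is handled by the length clamp.
        let src_bytes = unsafe {
            core::slice::from_raw_parts(src.as_ptr() as *const u8, src.len() * 4)
        };
        let n = core::cmp::min(dest.len(), src_bytes.len());
        dest[..n].copy_from_slice(&src_bytes[..n]);
    }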

@dhardy (Member Author) commented Sep 12, 2021

Regarding Ben Sanders' comment:

This is sound if we assume src and dst are both slices with elements that have no invalid bit patterns. However, that's not statically checked, and this macro can be invoked in safe code, so it's a bit sketchy. It would be better as an unsafe generic function, though the comment suggests that we'd want to be careful to ensure it gets inlined.

My take is that (1) the macro is private to the module and (2) the only way to make it a generic function would be to define our own trait (or use num-traits), since to_le is not covered by any std trait. Alternatives are (a) removing all unsafe, (b) inlining (duplicating code) or (c) pushing the unsafe up to the function using the macro (thus making all of the macro code unsafe). My current preference is just to leave this as it is, but I'm open to further discussion.
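For concreteness, the generic-function alternative would need a local trait along these lines (a sketch; ToLeBytes is an illustrative name, not the PR's):

    // A local trait covering to_le_bytes, since no std trait does.
    trait ToLeBytes: Copy {
        type Bytes: AsRef<[u8]>;
        fn to_le_bytes(self) -> Self::Bytes;
    }

    impl ToLeBytes for u32 {
        type Bytes = [u8; 4];
        fn to_le_bytes(self) -> Self::Bytes { u32::to_le_bytes(self) }
    }

    impl ToLeBytes for u64 {
        type Bytes = [u8; 8];
        fn to_le_bytes(self) -> Self::Bytes { u64::to_le_bytes(self) }
    }

    // fn fill_via_chunks<T: ToLeBytes>(src: &[T], dest: &mut [u8]) would then
    // hold the chunking loop once, instead of duplicating it per type.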

@vks added the B-API Breakage: API label Sep 12, 2021
[Resolved review thread on rand_core/src/impls.rs (outdated)]
@joshlf (Contributor) commented Sep 12, 2021

Doing this is problematic since we know nothing about the alignment of the output byte-array, and may have a partial word at the end.

I believe you're guaranteed not to have a partial word at the end because the source's length is a multiple of the chunk size.

As for the alignment, I suppose I could see that being a performance issue, but it's probably worth benchmarking so you don't have to break the API. You obviously have a better sense than I do of how big of a deal that kind of breaking change is, though.

@Ralith (Contributor) commented Sep 12, 2021

(c) pushing the unsafe up to the function using the macro (thus making all of the macro code unsafe). My current preference is just to leave this as it is, but I'm open to further discussion.

Ben here. I think (c) is a large improvement because it makes it clear at the macro callsite that statically unchecked soundness requirements are in play. The current form makes it easy to miss that for a reader who's not already familiar with the code. Nothing wrong with unsafe private helpers, but it's a lot clearer when that unsafety isn't hidden.

I'm a fan of replacing macros with traits whenever possible since macros are a much bigger hammer, but I understand if you'd prefer to avoid the boilerplate.

@dhardy (Member Author) commented Sep 13, 2021

I believe you're guaranteed not to have a partial word at the end because the source's length is a multiple of the chunk size.

No, it's not. It often will be in practice, but the destination is just whatever byte slice was passed to rng.fill_bytes.

I think the API breakage is fine, though obviously it delays releasing this fix. We could split this PR and do that part later, I guess (it's only a perf boost). @vks @newpavlov?

Thanks for the arguments Ben. Traits and generics are also a pretty big hammer, but still it doesn't have a noticeable compile-time impact. Your argument:

This is sound if we assume src and dst are both slices with elements that have no invalid bit patterns

.. still partly applies with a generic function: now we know dest: &mut [u8], so writing any bit-pattern is valid, but src: &mut [T] implies we don't know what we're reading (and we don't have bytemuck::Pod to add as a trait bound). But I think this is good enough that local usage of unsafe is fine?

@vks (Collaborator) commented Sep 13, 2021

@dhardy

.. still partly applies with a generic function: now we know dest: &mut [u8], so writing any bit-pattern is valid, but src: &mut [T] implies we don't know what we're reading (and we don't have bytemuck::Pod to add as a trait bound). But I think this is good enough that local usage of unsafe is fine?

Shouldn't ToLe be an unsafe trait then?

@dhardy (Member Author) commented Sep 13, 2021

Shouldn't ToLe be an unsafe trait then?

But is there anything unsafe about performing a byte copy from the type? Sure, forgetting to run a destructor can break some things, but it can't break memory safety.

Anyway, I think I should move the breaking parts to another PR so that we can do another patch release.

@vks (Collaborator) commented Sep 13, 2021

But is there anything unsafe about performing a byte copy from the type? Sure, forgetting to run a destructor can break some things, but it can't break memory safety.

I thought the problem was that for some types T, src: &mut [T] may contain invalid byte patterns, which breaks memory safety.

@vks (Collaborator) commented Sep 13, 2021

It would also be good to check whether the safe implementation is still too slow:

        let mut iter_src = $src.iter();
        let mut chunks = $dst.chunks_exact_mut(SIZE);
        for (chunk, n) in (&mut chunks).zip(&mut iter_src) {
            chunk.copy_from_slice(&n.to_le_bytes());
        }
        let rem = chunks.into_remainder();
        if let Some(n) = iter_src.next() {
            rem.copy_from_slice(&n.to_le_bytes()[..rem.len()]);
        }

@dhardy (Member Author) commented Sep 13, 2021

I removed the breaking changes. The results are actually a little better than before, though I don't understand why:

test gen_bytes_chacha12      ... bench:     302,031 ns/iter (+/- 2,346) = 3390 MB/s
test gen_bytes_chacha20      ... bench:     437,592 ns/iter (+/- 4,405) = 2340 MB/s
test gen_bytes_chacha8       ... bench:     230,950 ns/iter (+/- 1,861) = 4433 MB/s

@dhardy (Member Author) commented Sep 13, 2021

src: &mut [T] may contain invalid byte patterns

What's an invalid byte pattern to [u8]?


Your code is marginally slower than mine.

@joshlf (Contributor) commented Sep 13, 2021

It often will be in practice but the destination is just whatever byte-slice was passed to rng.fill_bytes.

Ack.

[Two resolved review threads on rand_core/src/impls.rs (outdated)]
@vks (Collaborator) commented Sep 13, 2021

@dhardy

What's an invalid byte pattern to [u8]?

Fair enough; the only problem I can think of is undefined behavior introduced by an unsound implementation of ToLe.

Your code is marginally slower than mine.

Thanks for checking! How much slower? Is it fast enough so we can use the safe code?

@vks (Collaborator) commented Sep 13, 2021

Looks like value stability on big endian platforms was broken:

---- rngs::std::test::test_stdrng_construction stdout ----
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[10719222850664546238, 12020057239787804515]`,
 right: `[10719222850664546238, 14064965282130556830]`', src/rngs/std.rs:96:9

@vks (Collaborator) left a review:

I think we need to restore value stability across endianness.

@vks mentioned this pull request Sep 13, 2021
@vks removed the B-API Breakage: API label Sep 13, 2021
@Ralith (Contributor) commented Sep 13, 2021

In the current form, I believe either fill_via_chunks or (preferably) trait ToLe should, strictly speaking, be marked unsafe, because they allow the observation of uninitialized padding bytes in a hypothetical padded T on little-endian targets.

@dhardy (Member Author) commented Sep 14, 2021

Thanks for checking! How much slower? Is it fast enough so we can use the safe code?

Sorry, I meant it's a couple of percent slower than my "safe" code (which performs as below on LE):

test gen_bytes_chacha12      ... bench:     290,558 ns/iter (+/- 2,151) = 3524 MB/s
test gen_bytes_chacha20      ... bench:     427,874 ns/iter (+/- 7,506) = 2393 MB/s
test gen_bytes_chacha8       ... bench:     219,347 ns/iter (+/- 2,568) = 4668 MB/s

Good point @Ralith.

@@ -52,17 +52,19 @@ pub fn fill_bytes_via_next<R: RngCore + ?Sized>(rng: &mut R, dest: &mut [u8]) {
    }
}

trait ToLe: Copy {
    /// Contract: implementing type must be memory-safe to observe as a byte array
    /// (implies no uninitialised padding).
Collaborator:

I'm not sure this means the trait has to be unsafe. What would a safe implementation violating this contract look like?

@Ralith (Contributor) Sep 14, 2021:

An implementation of ToLe for any type that could have padding (e.g. any repr(Rust) struct), which may be uninitialized, would be unsound because fill_via_chunks is safe and could expose the uninitialized data on a little-endian target.

@kazcw (Collaborator) Sep 14, 2021:

@Ralith an unsound implementation of ToLe cannot be written without unsafe, can it? A trait is unsafe to indicate that, due to the way the trait is used, soundness requires that the impl maintain certain invariants (on top of the standard invariants of safe Rust), which doesn't seem to be the case here. If usage of ToLe is sound as long as the implementation is sound, it's a regular trait.

@Ralith (Contributor) Sep 14, 2021:

An unsound impl is trivial:

struct ProbablyPadded(u64, u8);
impl ToLe for ProbablyPadded {
    type Bytes = [u8; 0];
    fn to_le_bytes(self) -> Self::Bytes { [] }
}

The key is that fill_via_chunks doesn't use to_le_bytes on little-endian targets, it just relies on the unsafe guarantee made by the trait.

@newpavlov (Member) Sep 14, 2021:

@Ralith: Code dependent on to_le_bytes does not use any unsafe, so it does not make any assumptions about ToLe which may cause UB. Your impl will simply panic on chunk.copy_from_slice(src[i].to_le_bytes().as_ref()); since the chunk size will not match the length of the returned byte array. You simply cannot expose the padding bytes without using unsafe.

@Ralith (Contributor) Sep 14, 2021:

To reiterate, on little-endian targets, fill_via_chunks does not invoke to_le_bytes ever, instead using unsafe to access the ToLe implementer's representation directly.
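To make that concrete, the LE fast path is shaped roughly like this (illustrative sketch with a hypothetical function name, not the PR's exact code):

    // On LE targets the fast path reads T's in-memory representation
    // directly and never calls to_le_bytes.
    #[cfg(target_endian = "little")]
    fn observe_as_bytes<T: Copy>(src: &[T]) -> &[u8] {
        // Sound only if T has no uninitialised padding bytes: exactly the
        // contract that a safe ToLe impl cannot be forced to uphold.
        unsafe {
            core::slice::from_raw_parts(
                src.as_ptr() as *const u8,
                src.len() * core::mem::size_of::<T>(),
            )
        }
    }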

@kazcw (Collaborator) Sep 14, 2021:

Thanks @Ralith. There's a subtlety here in that the trait serves dual roles: as an (unsafe) marker trait and as a regular trait. @dhardy maybe we could use a comment that specifically mentions that memcpy is sometimes used in lieu of the given implementation, as that does open a can of worms. Worms that are certainly worth that much performance gain, but worms nonetheless.

Member Author:

I hadn't put quite so much effort into this trait since it is supposed to change again in the next breaking release, but sure. Perhaps it should be renamed Observable with extended documentation?

@dhardy (Member Author) Sep 15, 2021:

I added a new method to the trait, moving the usage of unsafe into the trait. Benchmarks are unaffected.
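Presumably the shape is along these lines (a hedged sketch assuming the new method carries the representation read; see the PR diff for the actual code):

    trait ToLe: Copy {
        type Bytes: AsRef<[u8]>;
        fn to_le_bytes(self) -> Self::Bytes;
        /// Contract: `Self` must be safe to observe as raw bytes
        /// (no uninitialised padding).
        fn as_byte_slice(src: &[Self]) -> &[u8];
    }

    impl ToLe for u32 {
        type Bytes = [u8; 4];
        fn to_le_bytes(self) -> Self::Bytes { u32::to_le_bytes(self) }
        fn as_byte_slice(src: &[Self]) -> &[u8] {
            // Sound for u32: no padding, every bit pattern initialised.
            unsafe {
                core::slice::from_raw_parts(src.as_ptr() as *const u8, src.len() * 4)
            }
        }
    }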

@vks (Collaborator) left a review:

@dhardy Thanks, the new trait encapsulates the unsafe code in a more understandable way.
