
fill_via_chunks: use safe code via chunks_exact_mut on BE #1180

Merged: 5 commits, Sep 15, 2021

Conversation

@dhardy (Member) commented Sep 11, 2021

Address #1170 (@joshlf):

  • fill_via_chunks: remove/replace first use of unsafe
  • fill_via_chunks: replace second use of unsafe
  • inline fill_via_chunks or use a generic function?
  • remove unsafe in BlockRngCore::generate for ChaChaXCore (chacha: safer outputting #1181)

This removes some unsafe code and actually yields a substantial performance boost (measured with the LE-specific path commented out). Specifically, using chunks_exact_mut is a massive boost to the gen_bytes_chacha* benches (the only ones affected).

We could remove the LE-specific branch, but as mentioned it is substantially faster than the chunking code (much more than the 8% previously mentioned). I don't think there's a better safe alternative than chunking.
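For illustration, a minimal sketch of the safe chunking path for a u32 source (the function name is hypothetical and this is not the PR's exact code, which goes through a macro/generic helper and also returns consumed/filled counts):

    // Sketch: fill `dest` with little-endian bytes of `src`, using safe
    // chunking. Assumes nothing about dest's length or alignment.
    fn fill_via_u32_chunks_safe(src: &[u32], dest: &mut [u8]) {
        let mut src_iter = src.iter();
        let mut chunks = dest.chunks_exact_mut(4);
        // Full 4-byte chunks: one little-endian word each.
        for (chunk, word) in (&mut chunks).zip(&mut src_iter) {
            chunk.copy_from_slice(&word.to_le_bytes());
        }
        // Trailing partial word, if dest's length is not a multiple of 4.
        let rem = chunks.into_remainder();
        if !rem.is_empty() {
            if let Some(word) = src_iter.next() {
                rem.copy_from_slice(&word.to_le_bytes()[..rem.len()]);
            }
        }
    }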

Some quick benchmarks:

# LE code (copy_nonoverlapping):
test gen_bytes_chacha12      ... bench:     235,341 ns/iter (+/- 2,141) = 4351 MB/s
test gen_bytes_chacha20      ... bench:     374,516 ns/iter (+/- 2,762) = 2734 MB/s
test gen_bytes_chacha8       ... bench:     167,612 ns/iter (+/- 1,709) = 6109 MB/s
# New chunking code:
test gen_bytes_chacha12      ... bench:     336,035 ns/iter (+/- 5,764) = 3047 MB/s
test gen_bytes_chacha20      ... bench:     472,304 ns/iter (+/- 3,544) = 2168 MB/s
test gen_bytes_chacha8       ... bench:     270,032 ns/iter (+/- 4,526) = 3792 MB/s
# Old chunking code:
test gen_bytes_chacha12      ... bench:     888,131 ns/iter (+/- 4,710) = 1152 MB/s
test gen_bytes_chacha20      ... bench:   1,022,726 ns/iter (+/- 2,474) = 1001 MB/s
test gen_bytes_chacha8       ... bench:     819,164 ns/iter (+/- 3,944) = 1250 MB/s

This (specifically, using chunks_exact_mut) actually improves performance substantially.
Comment on lines 63 to 68
unsafe {
    core::ptr::copy_nonoverlapping(
        $src.as_ptr() as *const u8,
        $dst.as_mut_ptr(),
        chunk_size_u8);
}
Contributor:
You could also try the following safe approach. It adds some up-front bounds checks (which the compiler might be able to optimize out), but I would guess that it won't be a noticeable performance hit:

let src_u8_slice = zerocopy::AsBytes::as_bytes($src);
(&mut $dst[..chunk_size_u8]).copy_from_slice(&src_u8_slice[..chunk_size_u8]);

Member Author:

Isn't this just trading one unsafe for another? We try to minimise rand's dependency count (people have complained multiple times in the past). Another point of compromise :-(

Contributor:

Technically, yes, although zerocopy's entire reason for existing is to encapsulate this unsafe code so that other crates don't have to think about it. The more crates that use zerocopy instead of writing their own unsafe, the fewer instances of unsafe the ecosystem has to get right. Plus, when the Lang Item for Transmutability is implemented and stabilized, AsBytes will be modified to be just a thin wrapper around those utilities, and will no longer require any unsafe itself.

Member Author:

I'm still not keen on so many extra dependencies just for this one use of unsafe:

rand_core v0.6.3 (/home/dhardy/projects/rand/rand/rand_core)
├── getrandom v0.2.3
│   ├── cfg-if v1.0.0
│   └── libc v0.2.101
├── serde v1.0.130
│   └── serde_derive v1.0.130 (proc-macro)
│       ├── proc-macro2 v1.0.29
│       │   └── unicode-xid v0.2.2
│       ├── quote v1.0.9
│       │   └── proc-macro2 v1.0.29 (*)
│       └── syn v1.0.76
│           ├── proc-macro2 v1.0.29 (*)
│           ├── quote v1.0.9 (*)
│           └── unicode-xid v0.2.2
└── zerocopy v0.6.0
    ├── byteorder v1.4.3
    └── zerocopy-derive v0.3.0 (proc-macro)
        ├── proc-macro2 v1.0.29 (*)
        ├── syn v1.0.76 (*)
        └── synstructure v0.12.5
            ├── proc-macro2 v1.0.29 (*)
            ├── quote v1.0.9 (*)
            ├── syn v1.0.76 (*)
            └── unicode-xid v0.2.2

That's four extra deps when using serde (optional), nine when not, in what is essentially just an API crate. I know there's a valid argument that we shouldn't care, but (1) we've had complaints: #713, #850 and (2) if any of those happen to want to use rand (and I note that zerocopy does as a dev-dependency), then you have circular dependencies.

Collaborator:

zerocopy-derive looks like it would increase compile time significantly.

Contributor:

just for this one use of unsafe

I've identified a few other places that could use zerocopy:

That's four extra deps when using serde (optional)

Which four? I count one (synstructure v0.12.5) that is used transitively by zerocopy-derive and not transitively by serde-derive.

I mention these only to provide support for using zerocopy (and most importantly for getting rid of unsafe - if another crate like bytemuck is a better tradeoff, you should go with that instead), not to suggest that the tradeoff is definitely worth it.

Collaborator:

bytemuck looks like a better option at the moment, because the derive support is optional, so we could disable it and avoid the dependency / compile-time overhead. I think depending on bytemuck might be acceptable to reduce the unsafe code in Rand.

@joshlf (Contributor) commented Sep 11, 2021

Those are some awesome perf numbers!

cc @Ralith

@joshlf (Contributor) commented Sep 11, 2021

One approach I've seen before for endianness handling is to just do the copy and then go back and do a second pass afterwards to fix the endianness in-place. This can be more cache friendly on large inputs since, during the endianness swap pass, you don't need both the source and destination memory ranges in cache.
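A hedged sketch of that copy-then-fixup idea for a u32 source (illustrative only, not code from this PR; assumes dest's length is a multiple of 4):

    // Pass 1 copies in native byte order; pass 2 (BE targets only) swaps
    // each word in place, so only `dest` needs to be in cache then.
    fn copy_then_fix_endianness(src: &[u32], dest: &mut [u8]) {
        for (chunk, word) in dest.chunks_exact_mut(4).zip(src) {
            chunk.copy_from_slice(&word.to_ne_bytes());
        }
        #[cfg(target_endian = "big")]
        for chunk in dest.chunks_exact_mut(4) {
            // Byte-swap in place so the output is LE on every target.
            chunk.reverse();
        }
    }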

@Ralith (Contributor) commented Sep 11, 2021

Nice, that worked out better than I expected! It's interesting to see what LLVM is and isn't capable of optimizing out here.

@dhardy (Member Author) commented Sep 12, 2021

One approach I've seen before for endianness handling is to just do the copy and then go back and do a second pass afterwards to fix the endianness in-place

Doing this is problematic since we know nothing about the alignment of the output byte-array, and may have a partial word at the end. But doing the reverse works fine, and there doesn't appear to be a reason we can't mutate the source (though it is a breaking change). These are the results I get (byte-flipping to BE on my machine, since I don't have a BE test machine):

test gen_bytes_chacha12      ... bench:     273,486 ns/iter (+/- 531) = 3744 MB/s
test gen_bytes_chacha20      ... bench:     411,202 ns/iter (+/- 1,400) = 2490 MB/s
test gen_bytes_chacha8       ... bench:     203,651 ns/iter (+/- 953) = 5028 MB/s
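A hedged sketch of that "reverse" variant (the function name is hypothetical and this is not the PR's code; mutating the source is the breaking-change part, and the unsafe bulk read is sound for u32 because it has no padding and every bit pattern is initialised):

    // Byte-swap the *source* words in place (BE only), then one bulk copy.
    fn swap_src_then_copy(src: &mut [u32], dest: &mut [u8]) {
        // Pass 1 (big-endian targets only): after this, the in-memory
        // representation of `src` is little-endian on every target.
        #[cfg(target_endian = "big")]
        for word in src.iter_mut() {
            *word = word.swap_bytes();
        }
        // Pass 2: a single byte copy; dest's alignment is irrelevant and a
        // trailing partial word is handled by the length clamp.
        let src_bytes = unsafe {
            core::slice::from_raw_parts(src.as_ptr() as *const u8, src.len() * 4)
        };
        let n = core::cmp::min(dest.len(), src_bytes.len());
        dest[..n].copy_from_slice(&src_bytes[..n]);
    }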

@dhardy (Member Author) commented Sep 12, 2021

Regarding Ben Sanders' comment:

This is sound if we assume src and dst are both slices with elements that have no invalid bit patterns. However, that's not statically checked, and this macro can be invoked in safe code, so it's a bit sketchy. It would be better as an unsafe generic function, though the comment suggests that we'd want to be careful to ensure it gets inlined.

My take is that (1) the macro is private to the module and (2) the only way to make it a generic function would be to define our own trait (or use num-traits), since to_le is not covered by any std trait. Alternatives are (a) removing all unsafe, (b) inlining (duplicating code) or (c) pushing the unsafe up to the function using the macro (thus making all of the macro code unsafe). My current preference is just to leave this as it is, but I'm open to further discussion.
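For concreteness, the generic-function alternative would need a local trait along these lines (a sketch; ToLeBytes is an illustrative name, not the PR's):

    // A local trait covering to_le_bytes, since no std trait does.
    trait ToLeBytes: Copy {
        type Bytes: AsRef<[u8]>;
        fn to_le_bytes(self) -> Self::Bytes;
    }

    impl ToLeBytes for u32 {
        type Bytes = [u8; 4];
        fn to_le_bytes(self) -> Self::Bytes { u32::to_le_bytes(self) }
    }

    impl ToLeBytes for u64 {
        type Bytes = [u8; 8];
        fn to_le_bytes(self) -> Self::Bytes { u64::to_le_bytes(self) }
    }

    // fn fill_via_chunks<T: ToLeBytes>(src: &[T], dest: &mut [u8]) would then
    // hold the chunking loop once, instead of duplicating it per type.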

@vks added the B-API Breakage: API label Sep 12, 2021
[Resolved review thread on rand_core/src/impls.rs (outdated)]
@joshlf (Contributor) commented Sep 12, 2021

Doing this is problematic since we know nothing about the alignment of the output byte-array, and may have a partial word at the end.

I believe you're guaranteed not to have a partial word at the end because the source's length is a multiple of the chunk size.

As for the alignment, I suppose I could see that being a performance issue, but it's probably worth benchmarking so you don't have to break the API. You obviously have a better sense than I do of how big of a deal that kind of breaking change is, though.

@Ralith (Contributor) commented Sep 12, 2021

(c) pushing the unsafe up to the function using the macro (thus making all of the macro code unsafe). My current preference is just to leave this as it is, but I'm open to further discussion.

Ben here. I think (c) is a large improvement because it makes it clear at the macro callsite that statically unchecked soundness requirements are in play. The current form makes it easy to miss that for a reader who's not already familiar with the code. Nothing wrong with unsafe private helpers, but it's a lot clearer when that unsafety isn't hidden.

I'm a fan of replacing macros with traits whenever possible since macros are a much bigger hammer, but I understand if you'd prefer to avoid the boilerplate.

@dhardy (Member Author) commented Sep 13, 2021

I believe you're guaranteed not to have a partial word at the end because the source's length is a multiple of the chunk size.

No, it's not. It often will be in practice, but the destination is just whatever byte slice was passed to rng.fill_bytes.

I think the API breakage is fine, though obviously it delays releasing this fix. We could split this PR and do that part later, I guess (it's only a perf boost). @vks @newpavlov?

Thanks for the arguments Ben. Traits and generics are also a pretty big hammer, but still it doesn't have a noticeable compile-time impact. Your argument:

This is sound if we assume src and dst are both slices with elements that have no invalid bit patterns

.. still partly applies with a generic function: now we know dest: &mut [u8], so writing any bit-pattern is valid, but src: &mut [T] implies we don't know what we're reading (and we don't have bytemuck::Pod to add as a trait bound). But I think this is good enough that local usage of unsafe is fine?

@vks (Collaborator) commented Sep 13, 2021

@dhardy

.. still partly applies with a generic function: now we know dest: &mut [u8], so writing any bit-pattern is valid, but src: &mut [T] implies we don't know what we're reading (and we don't have bytemuck::Pod to add as a trait bound). But I think this is good enough that local usage of unsafe is fine?

Shouldn't ToLe be an unsafe trait then?

@dhardy (Member Author) commented Sep 13, 2021

Shouldn't ToLe be an unsafe trait then?

But is there anything unsafe about performing a byte copy from the type? Sure, forgetting to run a destructor can break some things, but it can't break memory safety.

Anyway, I think I should move the breaking parts to another PR so that we can do another patch release.

@vks (Collaborator) commented Sep 13, 2021

But is there anything unsafe about performing a byte copy from the type? Sure, forgetting to run a destructor can break some things, but it can't break memory safety.

I thought the problem was that for some types T, src: &mut [T] may contain invalid byte patterns, which breaks memory safety.

@vks (Collaborator) commented Sep 13, 2021

It would also be good to check whether the safe implementation is still too slow:

        let mut iter_src = $src.iter();
        let mut chunks = $dst.chunks_exact_mut(SIZE);
        for (chunk, n) in (&mut chunks).zip(&mut iter_src) {
            chunk.copy_from_slice(&n.to_le_bytes());
        }
        let rem = chunks.into_remainder();
        if let Some(n) = iter_src.next() {
            rem.copy_from_slice(&n.to_le_bytes()[..rem.len()]);
        }

@dhardy (Member Author) commented Sep 13, 2021

I removed the breaking changes. The results are actually a little better than before, though I don't understand why:

test gen_bytes_chacha12      ... bench:     302,031 ns/iter (+/- 2,346) = 3390 MB/s
test gen_bytes_chacha20      ... bench:     437,592 ns/iter (+/- 4,405) = 2340 MB/s
test gen_bytes_chacha8       ... bench:     230,950 ns/iter (+/- 1,861) = 4433 MB/s

@dhardy (Member Author) commented Sep 13, 2021

src: &mut [T] may contain invalid byte patterns

What's an invalid byte pattern to [u8]?


Your code is marginally slower than mine.

@joshlf (Contributor) commented Sep 13, 2021

It often will be in practice but the destination is just whatever byte-slice was passed to rng.fill_bytes.

Ack.

[Two resolved review threads on rand_core/src/impls.rs (outdated)]
@vks (Collaborator) commented Sep 13, 2021

@dhardy

What's an invalid byte pattern to [u8]?

Fair enough; the only problem I can think of is undefined behavior introduced by an unsound implementation of ToLe.

Your code is marginally slower than mine.

Thanks for checking! How much slower? Is it fast enough so we can use the safe code?

@vks (Collaborator) commented Sep 13, 2021

Looks like value stability on big endian platforms was broken:

---- rngs::std::test::test_stdrng_construction stdout ----
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[10719222850664546238, 12020057239787804515]`,
 right: `[10719222850664546238, 14064965282130556830]`', src/rngs/std.rs:96:9

@vks (Collaborator) left a review:

I think we need to restore value stability across endianness.

@vks mentioned this pull request Sep 13, 2021
@vks removed the B-API Breakage: API label Sep 13, 2021
@Ralith (Contributor) commented Sep 13, 2021

In the current form, I believe either fill_via_chunks or (preferably) trait ToLe should, strictly speaking, be marked unsafe, because they allow the observation of uninitialized padding bytes in a hypothetical padded T on little-endian targets.

@dhardy (Member Author) commented Sep 14, 2021

Thanks for checking! How much slower? Is it fast enough so we can use the safe code?

Sorry, I meant it's a couple of percent slower than my "safe" code (which performs as below on LE):

test gen_bytes_chacha12      ... bench:     290,558 ns/iter (+/- 2,151) = 3524 MB/s
test gen_bytes_chacha20      ... bench:     427,874 ns/iter (+/- 7,506) = 2393 MB/s
test gen_bytes_chacha8       ... bench:     219,347 ns/iter (+/- 2,568) = 4668 MB/s

Good point @Ralith.

@@ -52,17 +52,19 @@ pub fn fill_bytes_via_next<R: RngCore + ?Sized>(rng: &mut R, dest: &mut [u8]) {
    }
}

trait ToLe: Copy {
    /// Contract: implementing type must be memory-safe to observe as a byte array
    /// (implies no uninitialised padding).
Collaborator:

I'm not sure this means the trait has to be unsafe. What would a safe implementation violating this contract look like?

@Ralith (Contributor) Sep 14, 2021:

An implementation of ToLe for any type that could have padding (e.g. any repr(Rust) struct), which may be uninitialized, would be unsound because fill_via_chunks is safe and could expose the uninitialized data on a little-endian target.

@kazcw (Collaborator) Sep 14, 2021:

@Ralith an unsound implementation of ToLe cannot be written without unsafe, can it? A trait is unsafe to indicate that, due to the way the trait is used, soundness requires that the impl maintain certain invariants (on top of the standard invariants of safe Rust), which doesn't seem to be the case here. If usage of ToLe is sound as long as the implementation is sound, it's a regular trait.

@Ralith (Contributor) Sep 14, 2021:

An unsound impl is trivial:

struct ProbablyPadded(u64, u8);
impl ToLe for ProbablyPadded {
    type Bytes = [u8; 0];
    fn to_le_bytes(self) -> Self::Bytes { [] }
}

The key is that fill_via_chunks doesn't use to_le_bytes on little-endian targets, it just relies on the unsafe guarantee made by the trait.

@newpavlov (Member) Sep 14, 2021:

@Ralith: Code dependent on to_le_bytes does not use any unsafe, so it does not make any assumptions about ToLe which may cause UB. Your impl will simply panic on chunk.copy_from_slice(src[i].to_le_bytes().as_ref()); since the chunk size will not match the length of the returned byte array. You simply cannot expose the padding bytes without using unsafe.

@Ralith (Contributor) Sep 14, 2021:

To reiterate, on little-endian targets, fill_via_chunks does not invoke to_le_bytes ever, instead using unsafe to access the ToLe implementer's representation directly.
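To make that concrete, the LE fast path is shaped roughly like this (illustrative sketch with a hypothetical function name, not the PR's exact code):

    // On LE targets the fast path reads T's in-memory representation
    // directly and never calls to_le_bytes.
    #[cfg(target_endian = "little")]
    fn observe_as_bytes<T: Copy>(src: &[T]) -> &[u8] {
        // Sound only if T has no uninitialised padding bytes: exactly the
        // contract that a safe ToLe impl cannot be forced to uphold.
        unsafe {
            core::slice::from_raw_parts(
                src.as_ptr() as *const u8,
                src.len() * core::mem::size_of::<T>(),
            )
        }
    }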

@kazcw (Collaborator) Sep 14, 2021:

Thanks @Ralith. There's a subtlety here in that the trait serves dual roles: as an (unsafe) marker trait and as a regular trait. @dhardy maybe we could use a comment that specifically mentions that memcpy is sometimes used in lieu of the given implementation, as that does open a can of worms. Worms that are certainly worth that much performance gain, but worms nonetheless.

Member Author:

I hadn't put quite so much effort into this trait since it is supposed to change again in the next breaking release, but sure. Perhaps it should be renamed Observable with extended documentation?

@dhardy (Member Author) Sep 15, 2021:

I added a new method to the trait, moving the usage of unsafe into the trait. Benchmarks are unaffected.
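Presumably the shape is along these lines (a hedged sketch assuming the new method carries the representation read; see the PR diff for the actual code):

    trait ToLe: Copy {
        type Bytes: AsRef<[u8]>;
        fn to_le_bytes(self) -> Self::Bytes;
        /// Contract: `Self` must be safe to observe as raw bytes
        /// (no uninitialised padding).
        fn as_byte_slice(src: &[Self]) -> &[u8];
    }

    impl ToLe for u32 {
        type Bytes = [u8; 4];
        fn to_le_bytes(self) -> Self::Bytes { u32::to_le_bytes(self) }
        fn as_byte_slice(src: &[Self]) -> &[u8] {
            // Sound for u32: no padding, every bit pattern initialised.
            unsafe {
                core::slice::from_raw_parts(src.as_ptr() as *const u8, src.len() * 4)
            }
        }
    }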

@vks (Collaborator) left a review:

@dhardy Thanks, the new trait encapsulates the unsafe code in a more understandable way.
