VanityGPG is slow #2

nwalfield · 2020-09-29T19:56:44Z

An OpenPGP fingerprint is a hash not only over the public key material, but also the key's meta data. Thus, if you vary the key's creation time, you'll get a different fingerprint. Given a certificate, that should be pretty easy to do: extract the primary key (let pk = cert.primary_key().key().clone()), get the v4 variant (let mut pk = match pk { Key::V4(k) => k, _ => unreachable!() }, modify the creation time (k.set_creation_time(k.creation_time() - delta)), and check the resulting fingerprint (k.fingerprint()). When you find a match, you'll need to generate new self-signatures. See CertBuilder's code for how to do that. Depending on your requirements, you can vary from a few hours in the past to weeks or months in the past. This should speed up the search by multiple orders of magnitude.

The text was updated successfully, but these errors were encountered:

RedL0tus · 2020-09-29T20:07:23Z

Thank you for the suggestion. If I did not understood it wrong, what you are talking about is similar to Scallion's approach. I'm not familiar with neither OpenPGP nor cryptography, so I took the simplest approach.

I'll try to do that, or if you are willing to help, PRs are extremely welcome.

By the way, I'm so glad to see suggestions like what you said here. To be honest, I started this as a meme project as I was joking about it with my friends, never thought I could receive such useful suggestion. I just want to say thank you again!

RedL0tus · 2020-09-30T04:40:29Z

I just wrote a working PoC, the updated version should come soon.

nwalfield · 2020-09-30T07:04:54Z

Thank you for the suggestion.

No problem!

If I did not understood it wrong, what you are talking about is similar to Scallion's approach.

How scallion works is described here. If I understand correctly, scallion increments the public key's exponent. I don't know how safe that is... generating RSA keys is a tricky business. Letting a cryptographic library generate the key pair, and then varying the creation time seems safer. But, I'm not a cryptographer. Perhaps the approach is sound.

I'm not familiar with neither OpenPGP nor cryptography, so I took the simplest approach.

No worries!

By the way, I'm so glad to see suggestions like what you said here. To be honest, I started this as a meme project as I was joking about it with my friends, never thought I could receive such useful suggestion. I just want to say thank you again!

:)

Signed-off-by: Kay Lin <[email protected]>

RedL0tus · 2020-10-03T06:09:57Z

@nwalfield Please checkout the 0.3 branch!

On my Jetson Nano I got an average speed of over 850,000 fingerprints per seconds (was ~1000/s before), which is, just as you described, several magnitudes faster than my original approach. Thank you!

nwalfield · 2020-10-03T09:53:06Z

Very cool!

A few more ideas for improving performance:

You create a 100k keys (by default) up front. I'd create the keys on demand: in your main loop, have a worker thread creates a key, and then searches that key's search space in a tight inner loop. This should increase cache locality a lot, and avoid the expensive startup time.
Avoid locking as much as possible. For instance:
- Don't push the progress bar for every try, but every 1000 tries, say.
Default to ECC, not RSA. RSA keys are longer (more bytes to hash), and much more computationally intensive to generate (lots of sanity checks) than ECC keys (pretty much any 32 byte value is a valid ECC key).

Some general comments:

When saving a key, only generate the primary, not the general purpose key. The subkeys don't change the fingerprint and the user can add them later by importing the key into, say, gpg and adding the subkeys there (--quick-add-key).
Allow the user to limit the search space. A vanity key that was created in 1992 is not that interesting. By default, I'd search up to say a month in the past. That gives you over 2.5 million possibilities, and probably amortizes the key creation cost.

FWIW, on my Xeon E3-1275 v6 (4 cores, 8 threads), I see about 800k tries per second using the following configuration: --pattern 'BADCAFE$' --pool_size 10000 -c cv25519. Interestingly, the "tries per second" seem to increase with time. That's a bit strange.

nwalfield · 2020-10-03T11:12:17Z

I just realized I didn't compile with --release. With that, I get about 8M tries per second.

RedL0tus · 2020-10-03T19:33:55Z

Very cool!

Thanks!

A few more ideas for improving performance:

You create a 100k keys (by default) up front. I'd create the keys on demand: in your main loop, have a worker thread creates a key, and then searches that key's search space in a tight inner loop. This should increase cache locality a lot, and avoid the expensive startup time.

I cannot find a proper way to do that. The main loop is controlled by Rayon for parallel processing and I cannot come up with a clean enough solution to do that.

Avoid locking as much as possible. For instance:

Don't push the progress bar for every try, but every 1000 tries, say.

The progress bar is controlled by a separate thread, and the counter uses atomic operations, so I don't think this is a big problem...

Default to ECC, not RSA. RSA keys are longer (more bytes to hash), and much more computationally intensive to generate (lots of sanity checks) than ECC keys (pretty much any 32 byte value is a valid ECC key).

Will do. I always use Ed25519 personally. but GPG defaults to RSA so I just used RSA here.

Some general comments:

When saving a key, only generate the primary, not the general purpose key. The subkeys don't change the fingerprint and the user can add them later by importing the key into, say, gpg and adding the subkeys there (--quick-add-key).

I choose to generate the full general-purpose key is because some older versions of GPG that does not support adding ECC subkeys are still in use. That function is also executed only when a match has been found. Maybe I can add an option for generating subkeys or not.

Allow the user to limit the search space. A vanity key that was created in 1992 is not that interesting. By default, I'd search up to say a month in the past. That gives you over 2.5 million possibilities, and probably amortizes the key creation cost.

Will do.

FWIW, on my Xeon E3-1275 v6 (4 cores, 8 threads), I see about 800k tries per second using the following configuration: --pattern 'BADCAFE$' --pool_size 10000 -c cv25519. Interestingly, the "tries per second" seem to increase with time. That's a bit strange.

That's probably because the status thread was spawned before the initialization finishes, and the "tries per second" is simply the result of dividing the total number of fingerprints by the number of seconds passed after the thread was spawned. I'll fix that.

Signed-off-by: Kay Lin <[email protected]>

And list possible values. Signed-off-by: Kay Lin <[email protected]>

- See RedL0tus#2 (comment)

nwalfield · 2020-10-03T20:36:32Z

Take a look at https://github.com/nwalfield/VanityGPG/tree/ideas for some ideas of how to speed it up in the way that I discussed.

nwalfield · 2020-10-03T20:37:44Z

The progress bar is controlled by a separate thread, and the counter uses atomic operations, so I don't think this is a big problem...

atomic operations create a memory fence, and lock the cache. They are better than a lock, but they are still expensive, especially when they are contended.

Signed-off-by: Kay Lin <[email protected]>

RedL0tus · 2020-10-03T23:33:49Z

Now it shows a reading of over 2 million tries per second on my Jetson Nano 🤣

nwalfield · 2020-10-04T09:02:28Z

Now it shows a reading of over 2 million tries per second on my Jetson Nano rofl

Nice! I suspect that most people will want to choose a short key id. It might make sense to provide matching against a fixed suffix (perhaps even a list taken either from the command line or a file) in addition to matching a regex.

RedL0tus · 2020-10-04T09:07:27Z

That may be over-complicating things. Regex it self can be used to achieve this and the regex crate has a decent performance. I'll write a brief description about how to do thatin the README.

…

On Oct 4, 2020, 02:02 -0700, nwalfield ***@***.***>, wrote: > Now it shows a reading of over 2 million tries per second on my Jetson Nano rofl Nice! I suspect that most people will want to choose a short key id. It might make sense to provide matching against a fixed suffix (perhaps even a list taken either from the command line or a file) in addition to matching a regex. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

nwalfield · 2020-10-04T10:04:26Z

I think it would be interesting to benchmark it. When I replaced the regex with a fixed string comparison, I saw a 5-10% improvement.

nwalfield · 2020-10-04T12:07:56Z

I bet that you could get a factor 4-5 improvement in throughput if you computed the fingerprint manually, and adjusted the timestamp in place.

I'd serialize the key into a buffer:

        let pk = Key::V4(self.primary_key.clone().take_secret().0);
        // Serialize them leaving space for the header.
        let mut data = [ 0u8; 2048 ];
        let len = pk.serialize_into(&mut data[3..]).unwrap();
        let data = &mut data[..len + 3];
        // Add the header: 0x99, big endian 16-bit length.
        data[0] = 0x99;
        data[1] = ((len >> 8) & 0xFF) as u8;
        data[2] = (len & 0xFF) as u8;

To compute the hash, you could use the sha1 crate.

nwalfield · 2020-10-04T12:22:04Z

FWIW, there is also https://crates.io/crates/sha-1 :D

RedL0tus · 2020-10-05T03:10:19Z

I think it would be interesting to benchmark it. When I replaced the regex with a fixed string comparison, I saw a 5-10% improvement.

Of course it would be faster with ends_with if I only have one rule, but it lacks flexibility and if we make it to accept multiple rules, it will be kind of messy and the efficiency may not be ideal. One way of improving this may be by using SIMD accelerated regex implementations, and seems the Rust Core Team has been working on this for a while.

I bet that you could get a factor 4-5 improvement in throughput if you computed the fingerprint manually, and adjusted the timestamp in place.

I'd serialize the key into a buffer:
        let pk = Key::V4(self.primary_key.clone().take_secret().0);
        // Serialize them leaving space for the header.
        let mut data = [ 0u8; 2048 ];
        let len = pk.serialize_into(&mut data[3..]).unwrap();
        let data = &mut data[..len + 3];
        // Add the header: 0x99, big endian 16-bit length.
        data[0] = 0x99;
        data[1] = ((len >> 8) & 0xFF) as u8;
        data[2] = (len & 0xFF) as u8;
To compute the hash, you could use the sha1 crate.

Checkout the newest commit on the 0.3 branch, I introduced a new rPGP backend. It is purely implemented in rust (so libclang is no longer needed) and utilizes the strategy you mentioned here. I know sequoia may be a better choice over all, but it is way too complicated and one of my server is using FreeBSD which does not offer libclang by default.

Edit: The performance has only been doubled on my Jetson Nano, but I don't know why.

nwalfield · 2020-10-05T06:51:58Z

but it is way too complicated

Could you provide some concrete feedback about that? sequoia-openpgp provides a low-level API, but we've tried to make it as easy to use as possible with lots of documentation and examples. What were the issues that you ran into? What did you find to be too complicated?

RedL0tus · 2020-10-05T07:17:36Z

but it is way too complicated

Could you provide some concrete feedback about that? sequoia-openpgp provides a low-level API, but we've tried to make it as easy to use as possible with lots of documentation and examples. What were the issues that you ran into? What did you find to be too complicated?

Hmmm... Let me do some clarafication. The sequoia-openpgp crate itself is user-friendly enough and is full of features, but for this project I only need the key generation functionality; As I mentioned above, nettle-sys requires libclang to be present in the system, but it could be a pain to achieve in certain situation (e.g. FreeBSD does not offer binary for it, and compiling LLVM+clang is kind of painful).

Don't get me wrong, it does not mean I don't like sequoia, it's just heavier than ideal for this project. If I'm going to start any new project that needs anything related to OpenPGP, sequoia is definitely my first choice. It is easier to use and may be faster than rPGP.

Back to VanityGPG, I will maintain these two backends in parallel since rPGP lacks security audit and does not support the ECC curves from NIST, while sequoia requires FFI. These two backends are feature-gated and can be switched on compilation.

nwalfield · 2020-10-05T07:34:23Z

Thanks for the feedback. To be clear, I'm not offended, if you don't like something about Sequoia. In fact, I'd appreciate any type of feedback.

I know there is a set of low-level API, but there's barely documentations for them, and sequoia is a large project.

What APIs were you thinking about here?

FWIW, sequoia should be in the FreeBSD ports tree these days. I'm not a FreeBSD user, so I don't know how one would go about using it.

RedL0tus · 2020-10-05T07:55:34Z

Thanks for the feedback. To be clear, I'm not offended, if you don't like something about Sequoia. In fact, I'd appreciate any type of feedback.

I hope sequoia can get better and better.

Actually I have another small question. I see there is a struct named KeyBlueprint but it is private and the actual generation work is left to the CertBuilder struct's impl blocks. Is it a good idea to move the key generation part to KeyBlueprint and make it public?

Also, is it a good idea to make the generate method of CipherSuite public?

In VanityGPG, I somehow have to recreate these two methods since they are private.

I know there is a set of low-level API, but there's barely documentations for them, and sequoia is a large project.

What APIs were you thinking about here?

Actually no... My brain got confused back then and sequoia's documentations are good, seems I've messed up my memory. Sorry for that.

FWIW, sequoia should be in the FreeBSD ports tree these days. I'm not a FreeBSD user, so I don't know how one would go about using it.

Yes it's in the FreeBSD ports, but as you know Rust does not have a stable ABI, and install nettle-sys from cargo requires libclang, so...

Actually the people here to be blamed should be someone from the maintainers of FreeBSD. From a message in the mailing list, they say they don't do binary builds for libclang is because they think building that .so would consume too much resources, and I think that's kind of ridiculous.

RedL0tus · 2020-10-06T17:49:23Z

I bet that you could get a factor 4-5 improvement in throughput if you computed the fingerprint manually, and adjusted the timestamp in place.

I'd serialize the key into a buffer:
        let pk = Key::V4(self.primary_key.clone().take_secret().0);
        // Serialize them leaving space for the header.
        let mut data = [ 0u8; 2048 ];
        let len = pk.serialize_into(&mut data[3..]).unwrap();
        let data = &mut data[..len + 3];
        // Add the header: 0x99, big endian 16-bit length.
        data[0] = 0x99;
        data[1] = ((len >> 8) & 0xFF) as u8;
        data[2] = (len & 0xFF) as u8;
To compute the hash, you could use the sha1 crate.

I've just pushed a commit to let the sequoia backend use this strategy.

RedL0tus · 2020-10-06T19:21:05Z

Do you think this issue can be closed now?

nwalfield · 2020-10-07T19:46:37Z

Looking good!

FWIW, my hacked up version was at 29.7 Mhashes/sec on my computer. This version is at 21.0 Mhashes/sec.

These are the things that could yields some more improvement, I suspect:

Use Nettle's sha instead of sha1, +1 Mhashes/sec for me.
Avoid allocating on the heap. Try changing Backend::fingerprint to place the fingerprint into a buffer passed by &mut instead of allocating a String on the heap.
https://crates.io/crates/faster-hex claims to be 10x faster than the hex crate (although it encodes the data as lower case).
Try and benchmark the suffix list vs. regex approach :)

Since the performance of the program has been increased dramatically (thanks to #2), the reporting rate can be further redused to improve a little bit of performance (idealy). Signed-off-by: Kay Lin <[email protected]>

RedL0tus · 2020-11-21T11:02:59Z

Sorry, I was kind of busy in the last month and did not reply in time.

I've just switched to nettle for the sequoia backend. The rpgp one still uses SHA-1 as it is intended to be pure-rust.

For the second part, I'm actually not sure about how to improve on that, because we need the string anyway. Do you mean to pass a &mut [u8] to the main program and encode the string there?

For faster-hex, I'm still hesitating for switching to that, because it looks like it is no longer being maintained.

For regex, I wrote a simple "benchmark", and the result is, when there is a relatively large number of suffixes, regex can be a lot faster than using starts_with.

My "benchmark" code: https://termbin.com/6tnlv

And the result on my Ryzen 3 2200G is:

>>> Result: regex_average=3928807.5, startswith_average=9898233

The unit is microseconds.

Another advantage of regex is that, it is just easier to write complex search patterns.

RedL0tus · 2021-03-21T15:29:29Z

@nwalfield I've implemented custom binary-to-hex conversion with AVX2 and SSE4.1, based on code from faster-hex. Because it's only used for converting SHA-1 digest to hex and the data length is fixed, I made some extra "optimization". This change almost doubled the hashrate on 3700X, achieving over 100,000,000 hash/s with 16 threads.

Also, I've added a feature to use jemalloc as the default allocator, and it seems to have an improvement of around 1M hash/s.

I've just gathered some perf data and generated this flamegraph:

So... it seems there's nothing more that can be easily optimized. And I guess it may be time to consider closing this issue...?

nwalfield · 2021-03-22T21:19:49Z

Sounds cool! I haven't taken a look at the changes, but they make sense.

RedL0tus added a commit that referenced this issue Oct 3, 2020

Refactor with strategy described in #2

f06680a

Signed-off-by: Kay Lin <[email protected]>

RedL0tus added a commit that referenced this issue Oct 3, 2020

summary: Start the timer only after the key pool is initialized (#2)

4e34b60

Signed-off-by: Kay Lin <[email protected]>

RedL0tus added a commit that referenced this issue Oct 3, 2020

options: Use Ed25519 as the default cipher suite (#2)

f668155

And list possible values. Signed-off-by: Kay Lin <[email protected]>

nwalfield pushed a commit to nwalfield/VanityGPG that referenced this issue Oct 3, 2020

Some ideas.

4c93043

- See RedL0tus#2 (comment)

RedL0tus added a commit that referenced this issue Oct 3, 2020

sequoia_backend: Remove duplicated record of the creation time (#2)

b864f16

Signed-off-by: Kay Lin <[email protected]>

RedL0tus added a commit that referenced this issue Oct 3, 2020

main: Use ThreadPool and avoid frequent context switching (#2)

5ee85fe

Signed-off-by: Kay Lin <[email protected]>

RedL0tus added a commit that referenced this issue Oct 3, 2020

workers: Reduce progress reporting frequency (#2)

26565e1

Signed-off-by: Kay Lin <[email protected]>

nwalfield closed this as completed Mar 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VanityGPG is slow #2

VanityGPG is slow #2

nwalfield commented Sep 29, 2020

RedL0tus commented Sep 29, 2020

RedL0tus commented Sep 30, 2020

nwalfield commented Sep 30, 2020

RedL0tus commented Oct 3, 2020 •

edited

Loading

nwalfield commented Oct 3, 2020

nwalfield commented Oct 3, 2020

RedL0tus commented Oct 3, 2020 •

edited

Loading

nwalfield commented Oct 3, 2020

nwalfield commented Oct 3, 2020

RedL0tus commented Oct 3, 2020

nwalfield commented Oct 4, 2020

RedL0tus commented Oct 4, 2020 via email

nwalfield commented Oct 4, 2020

nwalfield commented Oct 4, 2020

nwalfield commented Oct 4, 2020

RedL0tus commented Oct 5, 2020 •

edited

Loading

nwalfield commented Oct 5, 2020

RedL0tus commented Oct 5, 2020 •

edited

Loading

nwalfield commented Oct 5, 2020

RedL0tus commented Oct 5, 2020

RedL0tus commented Oct 6, 2020

RedL0tus commented Oct 6, 2020

nwalfield commented Oct 7, 2020

RedL0tus commented Nov 21, 2020 •

edited

Loading

RedL0tus commented Mar 21, 2021

nwalfield commented Mar 22, 2021

VanityGPG is slow #2

VanityGPG is slow #2

Comments

nwalfield commented Sep 29, 2020

RedL0tus commented Sep 29, 2020

RedL0tus commented Sep 30, 2020

nwalfield commented Sep 30, 2020

RedL0tus commented Oct 3, 2020 • edited Loading

nwalfield commented Oct 3, 2020

nwalfield commented Oct 3, 2020

RedL0tus commented Oct 3, 2020 • edited Loading

nwalfield commented Oct 3, 2020

nwalfield commented Oct 3, 2020

RedL0tus commented Oct 3, 2020

nwalfield commented Oct 4, 2020

RedL0tus commented Oct 4, 2020 via email

nwalfield commented Oct 4, 2020

nwalfield commented Oct 4, 2020

nwalfield commented Oct 4, 2020

RedL0tus commented Oct 5, 2020 • edited Loading

nwalfield commented Oct 5, 2020

RedL0tus commented Oct 5, 2020 • edited Loading

nwalfield commented Oct 5, 2020

RedL0tus commented Oct 5, 2020

RedL0tus commented Oct 6, 2020

RedL0tus commented Oct 6, 2020

nwalfield commented Oct 7, 2020

RedL0tus commented Nov 21, 2020 • edited Loading

RedL0tus commented Mar 21, 2021

nwalfield commented Mar 22, 2021

RedL0tus commented Oct 3, 2020 •

edited

Loading

RedL0tus commented Oct 3, 2020 •

edited

Loading

RedL0tus commented Oct 5, 2020 •

edited

Loading

RedL0tus commented Oct 5, 2020 •

edited

Loading

RedL0tus commented Nov 21, 2020 •

edited

Loading