Feat: --skip-content-hash, --max-prefix-size, --max-suffix-size options #202

Merged

Conversation

johnpyp
Contributor

@johnpyp johnpyp commented Jun 3, 2023

Partially fixes: #201

(Completely open to changes in the naming/wording/api/etc.)

Changes

  • Upgrade deps
  • Add --max-prefix-size and --max-suffix-size options
    • These options set the maximum prefix and suffix sizes for the prescan, reducing the chance that non-duplicate files reach the full hash scan as duplicate candidates.
  • Add --skip-content-hash option
    • Skips the final stage content hash, and just returns the result after the suffix stage (didn't implement for --transform)

Potential Follow-up

Random chunk checks:

--random-chunk-checks=5
--random-chunk-size=16MiB

Though prefix and suffix size checks are a great pre-filtering step, they are of course the parts of the file that would seem the most likely to be the same among different files. However, there are still cases where fully-hashing the file would take a prohibitively long time or be too expensive.

Instead of a full hash, we could use the file's byte-size as a seed to randomly select n chunks to read from, and group files by those chunks in the same fashion as the prefix and suffix checks. This should make false duplicate matches very unlikely while still being orders of magnitude faster than full content hashing. It also has the nice side effect of being a continuous tuning lever for balancing safety against speed.
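To make the proposal concrete, here is a minimal sketch of how the chunk positions could be derived from the file size alone. Nothing here is part of fclones: `chunk_offsets` and its parameters are hypothetical names, and SplitMix64 is just one convenient deterministic generator chosen for illustration. Because all files in a candidate group already share the same size, seeding with the size yields the same offsets for every file in the group, so their chunks are comparable.

```rust
/// One step of SplitMix64, a small deterministic PRNG (an illustrative choice,
/// not something fclones is known to use).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical helper: pick up to `checks` chunk start offsets for a file of
/// `file_len` bytes, using the file length itself as the seed.
fn chunk_offsets(file_len: u64, checks: usize, chunk_size: u64) -> Vec<u64> {
    // Degenerate cases: the whole file fits in one chunk, so read from the start.
    if chunk_size == 0 || file_len <= chunk_size {
        return vec![0];
    }
    let max_start = file_len - chunk_size; // last offset where a full chunk still fits
    let mut state = file_len; // same size => same seed => same offsets
    let mut offsets: Vec<u64> = (0..checks)
        .map(|_| splitmix64(&mut state) % (max_start + 1))
        .collect();
    offsets.sort_unstable(); // read in ascending order to keep seeks monotonic
    offsets.dedup(); // collisions are possible, so fewer than `checks` may remain
    offsets
}
```

With `--random-chunk-checks=5` and `--random-chunk-size=16MiB` as proposed above, a call would look like `chunk_offsets(file_len, 5, 16 << 20)`; each returned offset would then be read and hashed like the prefix and suffix are today.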

- dirs 4.0 -> 5.0.1
- fallible-iterator 0.2 -> 0.3
- sysinfo 0.28 -> 0.29
  - Required renaming DiskType to DiskKind in various places
--max-prefix-size - Configurable byte-size parameter for the max length of a file to hash for prefix checking

--max-suffix-size - Same as --max-prefix-size, but for the suffix check
--skip-content-hash will skip the final stage, returning the results from the
previous groupings as the final result.

This can speed up the checking by orders of magnitude on large files, and alongside
--max-prefix-size and --max-suffix-size, can still provide reasonable guarantees
on whether files are duplicates.
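
As a rough illustration of how the three options described in these commits interact, the sketch below groups in-memory byte slices by size, then capped prefix, then capped suffix, and optionally stops before the full content hash. It is a simplified model under stated assumptions, not the fclones implementation: `Options`, `refine`, and `find_duplicates` are hypothetical names, files are modelled as byte slices rather than paths, and the standard library's `DefaultHasher` stands in for whatever hash the tool really uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical options mirroring the three flags; not fclones' real config type.
struct Options {
    max_prefix_size: usize,
    max_suffix_size: usize,
    skip_content_hash: bool,
}

fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Split every group by `key`, keeping only subgroups with more than one member.
fn refine<'a>(groups: Vec<Vec<&'a [u8]>>, key: impl Fn(&[u8]) -> u64) -> Vec<Vec<&'a [u8]>> {
    let mut out = Vec::new();
    for group in groups {
        let mut by_key: HashMap<u64, Vec<&'a [u8]>> = HashMap::new();
        for file in group {
            by_key.entry(key(file)).or_default().push(file);
        }
        out.extend(by_key.into_values().filter(|g| g.len() > 1));
    }
    out
}

fn find_duplicates<'a>(files: &[&'a [u8]], opts: &Options) -> Vec<Vec<&'a [u8]>> {
    // Stage 1: only files of equal size can be duplicates.
    let groups = refine(vec![files.to_vec()], |f| f.len() as u64);
    // Stage 2: hash at most `max_prefix_size` leading bytes.
    let groups = refine(groups, |f| hash_bytes(&f[..f.len().min(opts.max_prefix_size)]));
    // Stage 3: hash at most `max_suffix_size` trailing bytes.
    let groups = refine(groups, |f| {
        hash_bytes(&f[f.len() - f.len().min(opts.max_suffix_size)..])
    });
    if opts.skip_content_hash {
        // Candidates only: bytes outside the prefix and suffix were never compared.
        return groups;
    }
    // Stage 4: full content hash.
    refine(groups, hash_bytes)
}
```

In this model, two equal-sized files that differ only in the middle end up in the same group when `skip_content_hash` is set, which is the trade-off the commit message describes; larger prefix and suffix caps shrink that window.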
Owner

@pkolaczk pkolaczk left a comment

Thank you so much! This is a very nice feature.

@pkolaczk pkolaczk merged commit ccb4e18 into pkolaczk:main Jun 4, 2023
Successfully merging this pull request may close these issues.

Option to skip full checking (maybe extended checksums)
2 participants