Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support BF with lower and/or upper case hashes #15

Closed
wants to merge 1 commit into from

Conversation

Hu6li
Copy link
Contributor

@Hu6li Hu6li commented Aug 29, 2023

On line 123 the value of a file is checked against a bloom filter passed by arguments. If the bloom filter was generated using upper case characters the result will be unknown even if the hash is inside the set.

my first approach was to use:
if value.encode() in map(str.lower,bf['bf']):

unfortunately bf is not iterable thus i solved it using an or

@adulau
Copy link
Member

adulau commented Sep 20, 2023

The default format of the hashlookup Bloom filter is SHA1 in upper-case. But it's indeed a good idea, if there are other sources using a different format. I'll update the PR to include it as an option to avoid doing the double check by default.

In a near future, we would like to create a hashlookup format definition which includes the type of encoding and canonization used in the Bloom filter.

adulau added a commit that referenced this pull request Sep 20, 2023
A new option has been added `--bloomfilters-lower-case` to
support now standard Bloom filter.

Based on discussion from pull-request #15
@adulau
Copy link
Member

adulau commented Sep 20, 2023

Thank you for the pull-request and the very good point. I fixed by adding an option for the lower-case lookup.

d0410cd

If you see something else, let me know.

@adulau adulau closed this Sep 20, 2023
@Hu6li
Copy link
Contributor Author

Hu6li commented Sep 20, 2023

Perfect, thanks for your reply.

I was concerned about the performance as well and therefore gave it another thought.
Maybe another approach could be to first check for upper-case hashes in the or-operation since python's logical or-operation works as a short-circuit evaluation:

In short-circuit evaluation, the second operand is only evaluated if the first operand does not determine the outcome of the entire expression.

This would mean by default the lookup would be as fast as normally but if there was a bloom filter with lower case values inserted the second one will be evaluated (and thus take longer).

Not sure which approach would be better but the optional one as his own advantages as well.

Thanks for accepting and adding this fix.

@adulau
Copy link
Member

adulau commented Sep 20, 2023

I see. Thank you very much for the feedback. Maybe we should improve the Bloom filter selection at some point when there are multiple ones to choose from.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants