support BF with lower and/or upper case hashes #15

Hu6li · 2023-08-29T08:55:57Z

On line 123, the hash value of a file is checked against a Bloom filter passed as an argument. If the Bloom filter was generated using upper-case characters, the result will be unknown even if the hash is in the set.

My first approach was to use:
if value.encode() in map(str.lower, bf['bf']):

Unfortunately, bf is not iterable, so I solved it using an `or` instead.
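For readers following along, here is a minimal sketch of why the `map()` idea cannot work and what the `or` variant might look like. `TinyBloomFilter` and `in_filter_any_case` are hypothetical names written for this illustration; they are not the library or the code used in this repository. A Bloom filter stores only bits, not the inserted elements, so it can answer membership queries but cannot be enumerated.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative stand-in for a real Bloom filter: supports `in`, not iteration."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def in_filter_any_case(value: str, bloom) -> bool:
    """Query both spellings of a hex digest; the filter itself is never iterated."""
    return value.upper().encode() in bloom or value.lower().encode() in bloom
```

Because the filter is only queried, this works with any object that supports `in`, regardless of which case was used when the filter was built.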

adulau · 2023-09-20T05:17:22Z

The default format of the hashlookup Bloom filter is SHA1 in upper-case. But it is indeed a good idea if other sources use a different format. I'll update the PR to include it as an option, to avoid doing the double check by default.

In the near future, we would like to create a hashlookup format definition that includes the type of encoding and canonicalization used in the Bloom filter.

A new option, `--bloomfilters-lower-case`, has been added to support standard Bloom filters. Based on the discussion in pull request #15.
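As a rough illustration of how an opt-in flag like this could avoid the double check, here is a sketch using `argparse`. Only the flag name `--bloomfilters-lower-case` comes from this thread; the parser setup, the `normalise` helper, and the choice of upper-case as the default are assumptions made for this example and may differ from what commit d0410cd actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the flag name is taken from the discussion.
    parser = argparse.ArgumentParser(description="Bloom filter lookup sketch")
    parser.add_argument(
        "--bloomfilters-lower-case",
        action="store_true",
        help="query the Bloom filter with lower-case hashes",
    )
    return parser


def normalise(value: str, lower_case: bool) -> bytes:
    # Assumption: upper-case is the default, matching the hashlookup format.
    return (value.lower() if lower_case else value.upper()).encode()


args = build_parser().parse_args(["--bloomfilters-lower-case"])
print(normalise("AbC123", args.bloomfilters_lower_case))  # b'abc123'
```

With this shape, each lookup costs a single query, and the case is chosen once on the command line.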

adulau · 2023-09-20T06:00:47Z

Thank you for the pull request and the very good point. I fixed it by adding an option for the lower-case lookup.

d0410cd

If you see something else, let me know.

Hu6li · 2023-09-20T06:11:42Z

Perfect, thanks for your reply.

I was concerned about the performance as well, so I gave it another thought.
Another approach could be to check for upper-case hashes first in the `or` expression, since Python's logical `or` uses short-circuit evaluation:

In short-circuit evaluation, the second operand is only evaluated if the first operand does not determine the outcome of the entire expression.

This would mean that, by default, the lookup is as fast as usual; only if a Bloom filter was built from lower-case values would the second operand be evaluated (and thus take longer).
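The short-circuit behaviour described above can be sketched as follows. `CountingSet` and `lookup_upper_first` are hypothetical names for this illustration; `CountingSet` is a plain set wrapper that counts membership queries and merely stands in for a Bloom filter.

```python
class CountingSet:
    """Hypothetical stand-in for a Bloom filter that counts membership queries."""

    def __init__(self, items):
        self._items = set(items)
        self.lookups = 0

    def __contains__(self, item):
        self.lookups += 1
        return item in self._items


def lookup_upper_first(value: str, bloom) -> bool:
    # `or` short-circuits: the lower-case query runs only if the upper-case one misses.
    return value.upper().encode() in bloom or value.lower().encode() in bloom


upper_filter = CountingSet([b"ABC123"])
lookup_upper_first("abc123", upper_filter)
print(upper_filter.lookups)  # 1

lower_filter = CountingSet([b"abc123"])
lookup_upper_first("abc123", lower_filter)
print(lower_filter.lookups)  # 2
```

One point worth keeping in mind: a value that is in neither form of the filter also costs two queries, so the saving applies to hits on an upper-case filter, not to misses.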

Not sure which approach would be better, but the optional one has its own advantages as well.

Thanks for accepting and adding this fix.

adulau · 2023-09-20T09:06:10Z

I see. Thank you very much for the feedback. Maybe we should improve the Bloom filter selection at some point when there are multiple ones to choose from.

support BF with lower and/or upper case hashes

a9f15de

adulau added a commit that referenced this pull request Sep 20, 2023

chg: [Bloom filter] a new option added for non-hashlookup BF

d0410cd

adulau closed this Sep 20, 2023
