
Suggestion: Significant speedup for difference_inner_join #86

Open

fburic opened this issue Jan 20, 2023 · 0 comments

fburic commented Jan 20, 2023

Hi!

The current implementation of difference_inner_join relies on the more general fuzzy_join and (conceptually) performs a Cartesian join followed by a filter that keeps the pairs within the given distance. This has O(N·M) time and memory cost (where N is the length of the shorter data frame and M the length of the longer).
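For illustration, a minimal pandas sketch of that strategy (the column names "x"/"y" and the max_dist parameter are made up for the example, not fuzzyjoin's actual interface):

```python
import pandas as pd

def difference_inner_join_naive(left, right, max_dist):
    # Cartesian product of the two frames: O(N * M) rows are materialized.
    pairs = left.merge(right, how="cross")
    # Keep only the pairs whose join values are within max_dist of each other.
    return pairs[(pairs["x"] - pairs["y"]).abs() <= max_dist]
```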

A more efficient way is to build an interval tree over the longer join column, with one interval [value - dist, value + dist] per value. Then, for each element of the shorter column, query this tree in O(log M) time. The total cost is O((N + M) log M), including the time to build the tree. When N is roughly constant and M grows large (e.g. a small sample searched against a huge database), the cost is O(M log M) instead of O(M^2).
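A rough pandas sketch of this idea, assuming the third-party intervaltree package (again with illustrative column names and function name, not fuzzyjoin's interface):

```python
import pandas as pd
from intervaltree import IntervalTree

def difference_inner_join_tree(left, right, max_dist):
    # Use positional labels so row lookups below are unambiguous.
    left = left.reset_index(drop=True)
    right = right.reset_index(drop=True)

    # Build one interval [y - max_dist, y + max_dist] per row of the longer
    # frame: O(M log M). intervaltree uses half-open intervals, so pad the
    # upper bound slightly and re-check the distance exactly when matching.
    eps = 1e-9
    tree = IntervalTree()
    for j, y in right["y"].items():
        tree.addi(y - max_dist, y + max_dist + eps, j)

    # Query each value of the shorter frame against the tree: O(N log M)
    # plus the number of hits.
    matches = []
    for i, x in left["x"].items():
        for iv in tree[x]:
            j = iv.data
            if abs(x - right.at[j, "y"]) <= max_dist:  # exact, closed-interval check
                matches.append((i, j))

    idx_l, idx_r = zip(*matches) if matches else ((), ())
    return pd.concat(
        [left.loc[list(idx_l)].reset_index(drop=True),
         right.loc[list(idx_r)].reset_index(drop=True)],
        axis=1,
    )
```

The explicit distance re-check inside the loop keeps the comparison inclusive on both ends, since the tree's intervals are half-open.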

I'm not fluent enough in R and unfortunately have very little time to help out, should this be of interest, but in case it helps, see my implementation for pandas here. I was motivated to write my own since I needed this functionality for a big-data project, and difference_inner_join sadly wasn't feasible there.
