
Suggestion: Significant speedup for difference_inner_join #86

Open

fburic opened this issue Jan 20, 2023 · 0 comments

fburic commented Jan 20, 2023

Hi!

The current implementation of difference_inner_join relies on the more general fuzzy_join and (conceptually) performs a Cartesian join followed by a filter that keeps the pairs within the given distance. This has O(N·M) time and memory cost (where N is the length of the shorter data frame and M the length of the longer).
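For illustration, a minimal pandas sketch of that strategy (the column names "x"/"y" and the max_dist parameter are made up for the example, not fuzzyjoin's actual interface):

```python
import pandas as pd

def difference_inner_join_naive(left, right, max_dist):
    # Cartesian product of the two frames: O(N * M) rows are materialized.
    pairs = left.merge(right, how="cross")
    # Keep only the pairs whose join values are within max_dist of each other.
    return pairs[(pairs["x"] - pairs["y"]).abs() <= max_dist]
```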

A more efficient way is to build an interval tree over the longer join column, with one interval [value - dist, value + dist] per value. Then, for each element of the shorter column, query this tree in O(log M) time. The total cost is O((N + M) log M), including the time to build the tree. When N is roughly constant and M grows large (e.g. a small sample searched against a huge database), the cost is O(M log M) instead of O(M^2).
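A rough pandas sketch of this idea, assuming the third-party intervaltree package (again with illustrative column names and function name, not fuzzyjoin's interface):

```python
import pandas as pd
from intervaltree import IntervalTree

def difference_inner_join_tree(left, right, max_dist):
    # Use positional labels so row lookups below are unambiguous.
    left = left.reset_index(drop=True)
    right = right.reset_index(drop=True)

    # Build one interval [y - max_dist, y + max_dist] per row of the longer
    # frame: O(M log M). intervaltree uses half-open intervals, so pad the
    # upper bound slightly and re-check the distance exactly when matching.
    eps = 1e-9
    tree = IntervalTree()
    for j, y in right["y"].items():
        tree.addi(y - max_dist, y + max_dist + eps, j)

    # Query each value of the shorter frame against the tree: O(N log M)
    # plus the number of hits.
    matches = []
    for i, x in left["x"].items():
        for iv in tree[x]:
            j = iv.data
            if abs(x - right.at[j, "y"]) <= max_dist:  # exact, closed-interval check
                matches.append((i, j))

    idx_l, idx_r = zip(*matches) if matches else ((), ())
    return pd.concat(
        [left.loc[list(idx_l)].reset_index(drop=True),
         right.loc[list(idx_r)].reset_index(drop=True)],
        axis=1,
    )
```

The explicit distance re-check inside the loop keeps the comparison inclusive on both ends, since the tree's intervals are half-open.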

I'm not fluent enough in R and unfortunately have very little time to help out, should this be of interest, but in case it helps, see my implementation for pandas here. I was motivated to write my own since I needed this functionality for a big-data project, and difference_inner_join sadly wasn't feasible there.
