fuzzy join based on similarity instead of distance #71

Open
fangzhou-xie opened this issue Jun 23, 2020 · 4 comments

Comments

@fangzhou-xie

Hi! Thanks for this wonderful package.

I am interested in matching two columns by similarity score, and I read in the README that only the stringdist_* family of functions is provided. Is there a way to use join functions based on stringsim?

Thanks a lot!

@fangzhou-xie
Author

fangzhou-xie commented Jun 23, 2020

It seems that, in the method = 'jw' case, setting max_dist = 0.1 is equivalent to setting a similarity threshold of 0.9. Is such a shortcut/workaround available for other distance functions as well?

(BTW, the default max_dist = 2 under method = 'jw' seems to always match.)
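
A minimal sketch of the workaround described above, plus one possible route for other methods. It assumes the released fuzzyjoin and stringdist packages; the toy tibbles, the `min_sim` threshold, and the helper name `sim_at_least` are made up for illustration. Since the Jaro-Winkler distance lies in [0, 1], similarity is simply 1 - distance (which would also explain why max_dist = 2 always matches). For methods whose distance is not bounded by 1, one option is to pass a stringsim-based predicate to the generic fuzzy_left_join:

```r
library(dplyr)
library(fuzzyjoin)

# Toy data, for illustration only
a <- tibble(id = 1, text = "Lorem ipsum dolor sit")
b <- tibble(id = 2, text = "Lorem ipsum dolor sit amet")

# Jaro-Winkler: distance is in [0, 1], so max_dist = 0.1 keeps pairs
# whose similarity (1 - distance) is at least 0.9
jw_matches <- a %>%
  stringdist_left_join(b, by = "text", method = "jw",
                       max_dist = 0.1, distance_col = "dist") %>%
  mutate(sim = 1 - dist)

# Other methods: threshold on stringdist::stringsim() directly.
# min_sim and sim_at_least are hypothetical names, not part of fuzzyjoin.
min_sim <- 0.8
sim_at_least <- function(x, y) {
  stringdist::stringsim(x, y, method = "lv") >= min_sim
}
lv_matches <- fuzzy_left_join(a, b, by = "text", match_fun = sim_at_least)
```

The second variant does not return the similarity itself; if it is needed, it can be recomputed afterwards with stringdist::stringsim() on the text.x and text.y columns.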

@JBGruber

I thought this was a pretty good idea and implemented the function(s). Not sure what @dgrtwo will think of it, but it was good practice. This is how it works:

library(dplyr)
library(fuzzyjoin)

a <- tibble(id = 1, text = "Lorem ipsum dolor sit")
b <- tibble(id = 2, text = "Lorem ipsum dolor sit amet")

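# Both strings share the same Soundex code, so their soundex similarity is 1
stringdist::stringsim(a$text[1], b$text[1], method = "soundex")
#> [1] 1

# stringsim_left_join() comes from the fork mentioned below, not the released fuzzyjoin;
# no method is given here, so the join's default method applies
a %>%
  stringsim_left_join(b, by = "text", similarity_col = "sim", min_sim = 0.8)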
#> # A tibble: 1 x 5
#>    id.x text.x                 id.y text.y                       sim
#>   <dbl> <chr>                 <dbl> <chr>                      <dbl>
#> 1     1 Lorem ipsum dolor sit     2 Lorem ipsum dolor sit amet 0.808

You can test it from my repo (remotes::install_github("JBGruber/fuzzyjoin")).

@fangzhou-xie
Author

@JBGruber Thanks a lot! I tried it out a bit and it seems that your implementation works fine. I am not sure what @dgrtwo would think but I personally like it!

Maybe you can try to send a PR and see whether they would like to merge it into the main branch?

@JBGruber

I already created the PR but haven't got a reply yet: #74
