Adds stringsim_join functions as requested in #71 #74

JBGruber · 2020-10-19T16:16:27Z

I love the fuzzyjoin package, and today I wanted to learn a little more about how exactly it works. By coincidence, I stumbled across #71 and thought that implementing it would be a good way to understand the inner workings of the package a bit better (but feel free to reject this, as it was mainly a practice exercise that turned out better than I expected).

The PR is still lacking some tests but I wanted to check if you are interested in adding these functions first.

For me, the main reason to work with similarities instead of distances is that they are standardized between 0 and 1 (at least for most methods). I usually work with longer texts of heterogeneous lengths: newspaper articles, for example, vary significantly in length, and trying to find duplicates based on distance alone is basically impossible.
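To illustrate the point about text length, here is a minimal sketch in plain Python. It assumes Levenshtein distance and a similarity defined as 1 minus the distance divided by the length of the longer string, which is one common normalization for edit-based methods. The function names `levenshtein` and `similarity` are hypothetical helpers for this example only; they are not part of fuzzyjoin or of this PR.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# Two unrelated short strings and two near-duplicate sentences.
short_a, short_b = "cat", "dog"
long_a = "The council approved the new budget on Monday evening."
long_b = "The council approved the old budget on Monday evening."

# Both pairs are exactly 3 edits apart, so a distance threshold
# cannot tell the unrelated pair from the near-duplicate pair.
print(levenshtein(short_a, short_b), levenshtein(long_a, long_b))

# The similarity separates them: 0.0 for the short pair,
# above 0.9 for the long pair.
print(similarity(short_a, short_b), similarity(long_a, long_b))
```

With a distance cutoff, any threshold loose enough to match the two sentences would also match "cat" and "dog". A similarity cutoff such as 0.9 behaves the same way regardless of how long the texts are.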

emilBeBri · 2020-10-30T09:51:22Z

Very nice, hopefully it will be implemented in the main branch! Thank you.

Adds stringsim_join functions as requested in dgrtwo#71

364c3bc

JBGruber mentioned this pull request Dec 27, 2020

fuzzy join based on similarity instead of distance #71

Open
