Adds stringsim_join functions as requested in #71 #74

Open. Wants to merge 1 commit into base: master.

@JBGruber commented Oct 19, 2020

I love the fuzzyjoin package, and today I wanted to learn a little more about how exactly it works. By coincidence, I stumbled across #71 and thought implementing it would be a good way to understand the workings of the package better (but feel free to reject this, as it was mainly a practice exercise that turned out better than I thought).

The PR is still lacking some tests, but I wanted to check first whether you are interested in adding these functions.

For me, the main reason to work with similarities instead of distances is that they are standardized between 0 and 1 (at least for most methods). I usually work with longer texts of heterogeneous lengths: newspaper articles, for example, vary significantly in length, and trying to find duplicates based on distance alone is basically impossible.
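
To illustrate the point about length, here is a minimal standalone sketch in Python. It is not the R code from this PR: the helper names `levenshtein` and `similarity` are hypothetical, and it assumes the similarity is the edit distance rescaled by the length of the longer string, which may differ from what a given method in the package does.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# Two short, completely different strings.
short_a, short_b = "cat", "dog"

# Two long, nearly identical strings (one letter changed in each repetition).
long_a = "the quick brown fox jumps over the lazy dog " * 5
long_b = long_a.replace("quick", "quack")

for x, y in [(short_a, short_b), (long_a, long_b)]:
    print(len(x), levenshtein(x, y), round(similarity(x, y), 3))
```

The near-duplicate long pair has a larger raw distance than the unrelated short pair, so a single distance threshold cannot separate them, while the rescaled similarity ranks them as expected.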

@emilBeBri
Very nice, hopefully it will be merged into the main branch! Thank you.
