Incorrect processing result for keywords having symbols #5

Open
brandonbai opened this issue Mar 1, 2019 · 4 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@brandonbai

Running profanity.censor with the default config on words such as "s&m", "s & m", "2 girls 1 cups" ... gives an incorrect result.
For example:

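from better_profanity import profanity  # assumes the better_profanity package

print(profanity.censor("s & m"))
# s & m   (the input comes back unchanged)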

Why?

@snguyenthanh
Owner

Thank you for reporting the issue.

I'm in the first stage of troubleshooting the problem. It seems to be caused by the function update_next_words_indices, which returns a wrong list of next words to be parsed.

I will keep this issue updated when I have any new findings.

@snguyenthanh
Owner

snguyenthanh commented Mar 14, 2019

From my side, 2 girls 1 cups returns the correct result.

The s & m failure seems to be caused by update_next_words_indices, which doesn't create the expected list of words because of the & character.

Take hello 123 as an example:

  1. The library works as follows: when a word is identified (hello), it checks whether any contiguous combination of that word and the following word(s) forms a swear word in the wordlist.
  2. The function update_next_words_indices returns a list of the following words, starting from the current one. So in this example it will return a List ['123', ' 123']

However, for s & m, the & character is treated as a separate value (just as , and ), instead of being grouped into the List of following words returned by update_next_words_indices. As a result, the combination s & m is never built and never compared against the wordlist.
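To make the behaviour described above concrete, here is a toy model of it. This is not the library's actual code: the tokenizer, the WORDLIST entries, and the names split_words, candidate_phrases and contains_listed_phrase are all assumptions made for illustration. It only shows why a phrase whose words are joined by & can never match when & is treated purely as a separator.

```python
import re

# Hypothetical wordlist entries, used only for this illustration.
WORDLIST = {"s & m", "hello 123"}


def split_words(text):
    """Toy tokenizer: only alphanumeric runs are words.

    Spaces, '&', ',' and every other character act as separators
    and are thrown away.
    """
    return re.findall(r"[A-Za-z0-9]+", text)


def candidate_phrases(words, start, max_extra=3):
    """Join words[start] with up to max_extra following words by one space."""
    return [
        " ".join(words[start:start + n])
        for n in range(1, max_extra + 2)
        if start + n <= len(words)
    ]


def contains_listed_phrase(text):
    """Return True if any candidate phrase built from text is in WORDLIST."""
    words = split_words(text)
    return any(
        phrase in WORDLIST
        for i in range(len(words))
        for phrase in candidate_phrases(words, i)
    )


print(contains_listed_phrase("hello 123"))  # candidates include "hello 123"
print(contains_listed_phrase("s & m"))      # candidates are "s", "s m", "m"
```

In this model "hello 123" is found because the space between the words is rebuilt when joining, while "s & m" is missed because the & is dropped before any combination is formed.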

As I'm very busy with my studies at the moment, I won't be able to fix this bug within the next ~1 month.
Please feel free to create a PR for this.

@snguyenthanh snguyenthanh added bug Something isn't working help wanted Extra attention is needed labels Mar 14, 2019
@snguyenthanh
Owner

snguyenthanh commented May 15, 2019

Fixing this is a major piece of development for the library, and I won't be able to do it in the near future because of my tight schedule as a final-year student.

One way to fix it would be to create a separate wordlist for special words: ones whose separators are something other than a single space ' ' and whose separator(s) must match exactly (such as s & m).
While parsing the text, if the current word and the next word(s) match a set of words in the special wordlist, return True only if the separator(s) are also identical; otherwise, return False.
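A minimal sketch of that suggestion follows. It is not part of the library: SPECIAL_WORDLIST, its (words, separators) layout, and the function name matches_special are hypothetical, and the word pattern is assumed to be ASCII letters and digits. The idea is to keep the separators during tokenization so that they can be compared exactly.

```python
import re

# Hypothetical special wordlist. Each entry pairs the words of a phrase
# with the exact separators that must appear between them.
SPECIAL_WORDLIST = {
    (("s", "m"), (" & ",)),
    (("s", "m"), ("&",)),
}

# The capturing group makes re.split keep the separators in its result.
SEPARATOR_RE = re.compile(r"([^a-z0-9]+)")


def matches_special(text):
    """Return True if text contains a special phrase with identical separators."""
    parts = SEPARATOR_RE.split(text.lower())
    words = parts[0::2]  # even positions: words (may be '' at the edges)
    seps = parts[1::2]   # odd positions: seps[i] sits between words[i] and words[i + 1]
    for entry_words, entry_seps in SPECIAL_WORDLIST:
        n = len(entry_words)
        for i in range(len(words) - n + 1):
            same_words = tuple(words[i:i + n]) == entry_words
            same_seps = tuple(seps[i:i + n - 1]) == entry_seps
            if same_words and same_seps:
                return True
    return False


print(matches_special("s & m"))             # words and separator both match
print(matches_special("I like S&M music"))  # matches the "&" entry
print(matches_special("s m"))               # words match, separator does not
```

Keeping the separators in the entry means "s m" and "s, m" are not flagged, which is the exact-match requirement described above. A real fix would still need to report the matched span so that it can be censored.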

@snguyenthanh snguyenthanh changed the title Incorrect processing result Incorrect processing result for keywords having symbols Jul 23, 2019
@oliver408i

Can't you just run the check on the text first, then, if nothing is detected, use a regex to remove duplicates and try again?
