using spacy for Hindi #10

Open
rohanrajpal opened this issue Oct 8, 2020 · 3 comments

@rohanrajpal

Hey man, sorry to open an issue here, but I saw your commit on the spaCy repo.

I was trying to use spaCy to do some simple stemming on Hindi text. Could you please share some examples? I can't find anything on the internet.

@rohanrajpal
Author

As far as I can understand, this is the way to do it right now:

from spacy.lang.hi.lex_attrs import norm
from spacy.lang.hi.examples import sentences

def stemming(texts):
    # Apply the Hindi norm function to each word.
    return [norm(word) for word in texts]

# Split the first example sentence on whitespace and stem each word.
print(stemming(sentences[0].split()))

Am I correct?

@rahul1990gupta
Owner

Hi @rohanrajpal. Sorry I couldn't get to you sooner.
The basic features of spaCy have the same API for all languages, so you shouldn't need to do anything special for Hindi. For example, the code below runs well on Hindi text.

from spacy.lang.hi import Hindi
sentence = "पाठशाला मे अभी जलपान की छुट्टी हुई थी। "
nlp = Hindi()
doc = nlp(sentence)
for token in doc:
  print(token.text, token.norm_, token.orth_)

It outputs

पाठशाला पाठशाल पाठशाला
मे मे मे
अभी अभी अभी
जलपान जलपान जलपान
की की की
छुट्टी छुट्ट छुट्टी
हुई हुई हुई
थी थी थी
। . ।

That said, the stemmer for Hindi is still in development. To use it reliably, you will need to improve it and open PRs.
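
For anyone wondering what kind of rule would need improving: the output above suggests that `norm_` drops a trailing vowel sign from some words and leaves short words alone. Here is a rough, self-contained sketch of that idea as plain suffix stripping. It is not spaCy's implementation; the name `toy_stem`, the suffix list, and the length threshold are all assumptions picked for illustration.

```python
# Hypothetical suffix list: a few Devanagari vowel signs (matras).
# Illustrative subset only, not the list spaCy uses.
SUFFIXES = ("ा", "ी", "े", "ो", "ु", "ू", "ि")

# Hypothetical threshold: words shorter than this many code points
# (often function words) are left untouched.
MIN_LENGTH = 5


def toy_stem(word, suffixes=SUFFIXES, min_length=MIN_LENGTH):
    """Strip at most one trailing suffix from `word` if it is long enough."""
    if len(word) < min_length:
        return word
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


# Same sentence as in the example above, split on whitespace.
words = "पाठशाला मे अभी जलपान की छुट्टी हुई थी".split()
stems = [toy_stem(w) for w in words]
for word, stem in zip(words, stems):
    print(word, stem)
```

On this sentence the sketch strips the final vowel sign from पाठशाला and छुट्टी and leaves the other words unchanged. A real stemmer would need a much larger suffix table and handling for exceptions, which is where the PRs come in.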

@rohanrajpal
Author

Thanks for the detailed reply!
Yes, I'm willing to improve on this. I plan to incorporate all of it into my library:
https://github.com/lingualytics/py-lingualytics
