Russian fastText embeddings trained on Araneum web corpus #27

akutuzov opened this issue May 7, 2018 · 4 comments

akutuzov commented May 7, 2018

Name: fasttext-ru_araneum-300
Link: http://rusvectores.org/static/models/rusvectores4/fasttext/araneum_none_fasttextcbow_300_5_2018.tgz
Description: fastText vectors trained on Araneum Russicum Maximum corpus (about 10 billion words). The model contains 196K words and 403K 3-4-5-grams.
License: CC-BY (http://rusvectores.org/en/about/)
Related papers: https://arxiv.org/abs/1801.06407, https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models
Preprocessing: The corpus was lemmatized with Mystem.
Parameters: vector size 300, window size 5
Code example:

$ tar xzf araneum_none_fasttextcbow_300_5_2018.tgz
$ python3
import gensim
model = gensim.models.KeyedVectors.load('araneum_none_fasttextcbow_300_5_2018.model')
for n in model.most_similar(positive=['уточка']):
    print(n[0], round(n[1], 3))
чуточка 0.754
досочка 0.726
пинеточка 0.724
деточка 0.704
улиточка 0.693
нямочка 0.693
белочка 0.69
квочка 0.69
выточка 0.689
козочка 0.683
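
Since this is a fastText model, vectors are composed from character n-grams, so out-of-vocabulary word forms can still be queried. A minimal sketch (the inflected form below is a hypothetical query, not necessarily among the 196K in-vocabulary words):

oov_vector = model['уточками']   # built from subword n-grams if the form is OOV
print(oov_vector.shape)          # (300,)
print(model.most_similar(positive=['уточками'])[0])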

akutuzov commented May 7, 2018

Russian text can be lemmatized before querying this model, for example with pymystem3:

from pymystem3 import Mystem

m = Mystem()  # start the Mystem analyzer once and reuse it for all calls

def tag(word):
    # take the first analysis of the word and return its lemma in lowercase
    processed = m.analyze(word)[0]
    lemma = processed["analysis"][0]["lex"].lower().strip()
    return lemma

tag('стульев')
'стул'
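
A short sketch tying the two steps together: lemmatize a raw word form first, then query the model with the lemma (assumes the model and tag() defined above are already in scope; the example word is hypothetical):

query = tag('уточек')   # e.g. yields the lemma 'уточка'
for word, score in model.most_similar(positive=[query]):
    print(word, round(score, 3))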


andrei-q commented Feb 11, 2019

I got the following error:

>>> model = gensim.models.fasttext.FastText.load('araneum_none_fasttextcbow_300_5_2018.model')
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/fasttext.py", line 936, in load
    model = super(FastText, cls).load(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/base_any2vec.py", line 1247, in load
    if not hasattr(model.vocabulary, 'ns_exponent'):
AttributeError: 'FastTextKeyedVectors' object has no attribute 'vocabulary'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/fasttext.py", line 945, in load
    return load_old_fasttext(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/fasttext.py", line 53, in load_old_fasttext
    old_model = FastText.load(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/word2vec.py", line 1618, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/old_saveload.py", line 87, in load
    obj = unpickle(fname)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/old_saveload.py", line 380, in unpickle
    return _pickle.loads(file_bytes, encoding='latin1')
AttributeError: Can't get attribute 'FastTextKeyedVectors' on <module 'gensim.models.deprecated.keyedvectors' from '/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/keyedvectors.py'>


akutuzov commented Feb 11, 2019

@andrei-q The Gensim fastText code has been refactored since this issue was created.
In recent versions of Gensim, you should use gensim.models.KeyedVectors.load() to load this model.
I've updated the code snippet above accordingly.
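
For reference, a minimal loading snippet following that recommendation (assuming the archive has been unpacked and any auxiliary .npy files sit next to the .model file):

import gensim

model = gensim.models.KeyedVectors.load('araneum_none_fasttextcbow_300_5_2018.model')
print(model.most_similar(positive=['стул'])[:3])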

@andrei-q

Thanks. It works.
