
Historical word embeddings #12

Open
piskvorky opened this issue Dec 16, 2017 · 12 comments

Comments

piskvorky commented Dec 16, 2017

…by Stanford, https://nlp.stanford.edu/projects/histwords/

We released pre-trained historical word embeddings (spanning all decades from 1800 to 2000) for multiple languages (English, French, German, and Chinese). Embeddings constructed from many different corpora and using different embedding approaches are included.

Paper: Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change
Code: Github
License: Public Domain Dedication and License

menshikh-iv commented Dec 18, 2017

@piskvorky can you be more specific about which embeddings need to be added (there are many)?

@piskvorky

All, preferably (and the non-English ones are particularly interesting).

@menshikh-iv

@piskvorky got it!

@menshikh-iv

@piskvorky problem: each zip contains many models, with files named like 1800-w.npy + 1800-vocab.pkl, 1810-w.npy + 1810-vocab.pkl, and so on. This only makes sense if we give the user all the embeddings at once, which is currently impossible.

It is probably worth closing this issue (because it does not fit our setup).

@piskvorky

I don't understand. What is the problem?

menshikh-iv commented Dec 20, 2017

@piskvorky for example, All English (1800s-1990s), built from the Google N-Grams eng-all corpus: http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip

This archive contains many files (pairs of matrix + vocab):

Archive:  eng-all_sgns.zip
sgns/1860-vocab.pkl
sgns/1850-w.npy
sgns/1900-vocab.pkl
sgns/1930-w.npy
sgns/1880-w.npy
sgns/1870-w.npy
sgns/1910-w.npy
sgns/1970-vocab.pkl
sgns/1810-vocab.pkl
sgns/1970-w.npy
sgns/1810-w.npy
sgns/1920-vocab.pkl
sgns/1840-vocab.pkl
sgns/1990-vocab.pkl
sgns/1950-w.npy
sgns/1880-vocab.pkl
sgns/1980-w.npy
sgns/1830-w.npy
sgns/1830-vocab.pkl
sgns/1950-vocab.pkl
sgns/1890-vocab.pkl
sgns/1820-vocab.pkl
sgns/1800-w.npy
sgns/1940-vocab.pkl
sgns/1960-w.npy
sgns/1930-vocab.pkl
sgns/1850-vocab.pkl
sgns/1990-w.npy
sgns/1820-w.npy
sgns/1940-w.npy
sgns/1980-vocab.pkl
sgns/1920-w.npy
sgns/1890-w.npy
sgns/1960-vocab.pkl
sgns/1800-vocab.pkl
sgns/1840-w.npy
sgns/1870-vocab.pkl
sgns/1910-vocab.pkl
sgns/1900-w.npy
sgns/1860-w.npy

I.e. this archive contains 20 distinct models (the same is true for the other links). To use these models for their intended purpose, you need all of them at once; they do not make sense separately. In our case, adding 20 models (which are useless in isolation) is a very bad idea, and it is also extremely inconvenient for the user to figure out how to use them all at once.
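For reference, a minimal sketch of how one decade from the extracted archive could be loaded, assuming the *-vocab.pkl file holds an ordered list of words whose positions match the rows of the *-w.npy matrix, and a gensim 4.x API (KeyedVectors.add_vectors). The load_decade helper name is illustrative, not part of any existing loader:

```python
import pickle

import numpy as np
from gensim.models import KeyedVectors


def load_decade(prefix):
    """Load one decade, e.g. prefix='sgns/1800', into a KeyedVectors object."""
    vectors = np.load(prefix + "-w.npy")  # shape: (vocab_size, dim)
    with open(prefix + "-vocab.pkl", "rb") as f:
        # encoding="latin1" in case the vocab was pickled under Python 2
        vocab = pickle.load(f, encoding="latin1")
    kv = KeyedVectors(vector_size=vectors.shape[1])
    kv.add_vectors(vocab, vectors)
    return kv


kv_1800 = load_decade("sgns/1800")
print(kv_1800)  # one decade's embeddings, useful only alongside the other 19
```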

@piskvorky

I see what you mean, but don't see it as a problem. Why couldn't the dataset loader just return a dictionary of models?

menshikh-iv commented Dec 20, 2017

You suggest joining all of this into one large pickle (a dict of KeyedVectors) and returning it to the user, am I right?

piskvorky commented Dec 20, 2017

No, I mean a dictionary where the key is a particular model name string (year?) and the value is the relevant Python object (Word2Vec or whatever).

If, as you say, the models are worthless in isolation, then we should return them all in bulk.
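A sketch of that "dictionary of models" idea under the same assumptions as above: each key is the decade as a string, each value is the loaded model, reusing the hypothetical load_decade helper. The sgns/ path refers to the extracted eng-all_sgns.zip:

```python
import glob
import os

# Build {decade: KeyedVectors} for every matrix/vocab pair found in sgns/.
models = {}
for vocab_path in sorted(glob.glob("sgns/*-vocab.pkl")):
    decade = os.path.basename(vocab_path).split("-")[0]  # e.g. "1800"
    models[decade] = load_decade(os.path.join("sgns", decade))

# The dataset loader would return `models`; e.g. models["1900"] is one model,
# so the user can compare a word's neighbours across decades.
```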

menshikh-iv commented Dec 20, 2017

We can only store one gz file per model right now; that is why I mentioned the large pickle earlier.

piskvorky commented Dec 20, 2017

Aha, I see. Yes, that is a possibility -- if the models are sufficiently small, we could pickle everything as a single dict (no separate .npy files etc).
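A sketch of that single-file option, assuming the combined dict fits in memory: dump everything into one gzipped pickle, and the dataset loader just unpickles it. The file name is illustrative, not an actual gensim-data artifact:

```python
import gzip
import pickle

# Pack the whole {decade: KeyedVectors} dict into a single gzipped pickle.
with gzip.open("histwords-eng-all-sgns.pkl.gz", "wb") as fout:
    pickle.dump(models, fout, protocol=pickle.HIGHEST_PROTOCOL)

# Loading it back is a single call -- no separate .npy / .pkl files to manage.
with gzip.open("histwords-eng-all-sgns.pkl.gz", "rb") as fin:
    models = pickle.load(fin)
```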

@ResearchLabDev

Sorry for exhuming an old issue, but I was wondering if adding these pre-trained historical word embeddings is still under consideration. These would be very valuable to research I am conducting. Thank you.
