This repository contains links to word embeddings for the Finnish language, as well as code for training your own embeddings. Word embeddings represent words as low-dimensional numerical vectors, which are useful in various NLP applications, such as building chatbots, calculating semantic similarities, or detecting fake news.
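As a toy illustration of what "semantic similarity" means here: related words end up with vectors pointing in similar directions, which is commonly measured with cosine similarity. The 3-dimensional vectors below are made up for demonstration only; the real embeddings linked in this repository are 300-dimensional, but the arithmetic is the same.

```python
# Toy example: cosine similarity between (made-up) word vectors.
# Real embeddings assign similar directions to semantically related words.
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Fabricated 3-d vectors for illustration ('koira' = dog, 'kissa' = cat, 'auto' = car)
vectors = {
    'koira': [0.9, 0.1, 0.0],
    'kissa': [0.8, 0.2, 0.1],
    'auto':  [0.0, 0.1, 0.9],
}

print(cosine(vectors['koira'], vectors['kissa']))  # high: related animals
print(cosine(vectors['koira'], vectors['auto']))   # low: unrelated concepts
```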
Source | Model | Dimension | Trained on | Download link
---|---|---|---|---
FastText | FastText | 300 | Wikipedia and Common Crawl | Binary / Text
FastText | FastText | 300 | Wikipedia | Binary + Text / Text
Turku NLP | Word2Vec | Unknown | Finnish Internet Parsebank | Binary
Turku NLP | Word2Vec | Unknown | Suomi24 | Binary
Turku NLP | Word2Vec | Unknown | Suomi24 with lemmatization | Binary
Yle | Word2Vec / FastText | Unknown | Wikipedia and Yle articles | Text (requires filling a form)
This repository | Word2Vec / FastText | 300 | Crawled from popular Finnish websites (details) | Binary files from Kaggle datasets (the only viable free hosting option for now; let me know if you are willing to host these)
```python
# Word embeddings in word2vec format can easily be loaded and queried with gensim
# See https://radimrehurek.com/gensim/models/keyedvectors.html for reference
from gensim.models.keyedvectors import KeyedVectors

# Load vectors into memory ('bin' in the filename means binary=True)
embeddings_path = './data/embeddings/fasttext.fi.all.1045M.100d.bin.gz'
kv = KeyedVectors.load_word2vec_format(embeddings_path, binary=True)

# Find the words most similar to 'koira' ("dog")
print(kv.most_similar('koira'))
```
This repository also contains the code used for crawling data from popular Finnish websites, extracting sentences from the crawled pages, and training word embeddings. The spiders used for web scraping can be found in the `crawling` folder, whereas the preprocessing and embedding-training code can be found in the `embeddings` folder.
Three steps are required:

1. Clone this repository with `git clone https://github.com/jmyrberg/finnish-word-embeddings` and install the required packages with `pip install -r requirements.txt`.
2. Crawl data by starting a spider: run `run_spider.bat` and type in the name of the spider, such as `iltalehti`. All available spider names can be found in the spider class definitions in `all_spiders.py`. See Scrapy for more information on how to create your own spiders. Optionally, you may also use your own source documents for training.
3. Preprocess the crawled material and train word embeddings by running `update.py`. Alternatively, prepare your own documents into sentence lines and train on them by running `train.py`.
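For intuition, the "sentence line" format mentioned above means one sentence per line with tokens separated by spaces. The sketch below shows one naive way to produce it; the actual preprocessing in `update.py` may differ (the regex-based splitter here is only illustrative).

```python
# Illustrative sketch of converting a document into sentence-line format:
# one sentence per line, lowercased tokens separated by spaces.
# NOTE: this naive splitter is an assumption, not the repository's actual logic.
import re

def to_sentence_lines(text):
    # Naively split into sentences after '.', '!' or '?'
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    lines = []
    for sentence in sentences:
        # Keep only word tokens, lowercased (handles Finnish ä/ö via Unicode \w)
        tokens = re.findall(r'\w+', sentence.lower())
        if tokens:
            lines.append(' '.join(tokens))
    return lines

doc = 'Koira haukkuu. Kissa ei välitä!'
print(to_sentence_lines(doc))  # ['koira haukkuu', 'kissa ei välitä']
```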
If you follow the steps above without modifying any code, you should be able to reproduce the custom word embeddings provided in this repository. The provided code should also automatically create the folder structure under `./data/` as follows:

- `crawl`: State of the spider, used to avoid duplicate scrapes
- `feed`: Crawled material in JSON line files named like `<spiderName>.jl`
- `processed`: Preprocessed crawled material in sentence line files like `all.sl`
- `embeddings`: Trained word embeddings named like `<modelName>.fi.<sentenceLineFilename>.<numberOfTokensTrainedOn>.<embeddingsDimension>.<format>.gz`
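The embedding filename convention above can be unpacked with a small helper. The function below (`parse_embedding_filename`) is a hypothetical illustration, not part of the repository's code.

```python
# Hypothetical helper (not in this repository) that parses the embedding
# filename scheme <modelName>.fi.<sentenceLineFilename>.<tokens>.<dim>.<format>.gz
def parse_embedding_filename(name):
    parts = name.split('.')
    if parts[-1] == 'gz':          # strip the trailing '.gz'
        parts = parts[:-1]
    model, lang, sentence_file, n_tokens, dim, fmt = parts
    return {
        'model': model,
        'language': lang,
        'sentence_file': sentence_file,
        'tokens': n_tokens,
        'dimension': dim,
        'format': fmt,
    }

# Example with the filename used in the loading snippet above
print(parse_embedding_filename('fasttext.fi.all.1045M.100d.bin.gz'))
```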
If you want to add, modify, or remove something in the list of word embeddings or the code, please feel free to make a pull request, file an issue, or contact me.
Jesse Myrberg ([email protected])