[PYTHON] We have released a trained model of fastText

We have released a pre-trained fastText model for Japanese. You can download it from:

Information on word embedding vectors is collected in the following repository, so please check it out as well: awesome-embedding-models

Motivation

In the following article, I linked to the vectors that icoxfog417 published on GitHub.

List of ready-to-use word embedding vectors

However, downloading the published vectors required Git LFS, and their location was hard to find. So this time I trained a model myself and published it where it can be downloaded easily.

How it was made

For how to use fastText, I referred to the following article, which explains both the theory and the usage of fastText well.

Getting distributed representations of words quickly with Facebook's fastText

The training data is the Japanese Wikipedia dump of 2017/01/01.

jawiki 20170101

The hyperparameters were set as follows. All other hyperparameters use their default values.

dim: 300
epoch: 10
minCount: 20
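
For reference, the settings above correspond to the fastText command-line flags `-dim`, `-epoch`, and `-minCount`. The sketch below only assembles such a command and does not run it. The training mode (`skipgram`), the binary path, and the file names are assumptions for illustration, since the article does not state them.

```python
def build_fasttext_command(corpus_path, output_prefix, mode="skipgram", binary="./fasttext"):
    """Assemble a fastText CLI call using the hyperparameters listed above.

    mode, binary, and both paths are illustrative assumptions, not taken from the article.
    """
    return [
        binary, mode,
        "-input", corpus_path,
        "-output", output_prefix,
        "-dim", "300",
        "-epoch", "10",
        "-minCount", "20",
    ]

# Hypothetical file names: a pre-tokenized jawiki text file and the output prefix "model".
command = build_fasttext_command("jawiki_tokenized.txt", "model")
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would start training and write `model.bin` and `model.vec`. Japanese text has to be split into space-separated tokens beforehand, because fastText treats whitespace as the word boundary.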

How to use

After downloading the data, you can load it with gensim as follows.

import gensim

# model.vec is the downloaded vector file in word2vec text format, hence binary=False
model = gensim.models.KeyedVectors.load_word2vec_format('model.vec', binary=False)
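
It can help to know what the loader expects. The word2vec text format starts with a header line giving the vocabulary size and the dimension, followed by one line per word with the word and its vector values. Here is a minimal sketch of a reader for that layout, using a tiny made-up sample in place of the real file. The function name `read_vec` is hypothetical, and gensim's own loader is what you would normally use.

```python
import io

def read_vec(stream):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... v_dim' per line."""
    header = stream.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors

# Made-up sample standing in for model.vec (the real file has 300 values per line).
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
count, dim, vectors = read_vec(sample)
print(count, dim, vectors["dog"])
```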

Words similar to a given word can be looked up as follows.

>>> model.most_similar(positive=['Japanese'])
[('Korean', 0.7338133454322815),
 ('Chinese', 0.717720627784729),
 ('American', 0.6725355982780457),
 ('Japanese woman', 0.6723321676254272),
 ('Foreigner', 0.6420464515686035),
 ('Filipino', 0.6264426708221436),
 ('Westerners', 0.621786892414093),
 ('Asian', 0.6192302703857422),
 ('Taiwanese', 0.6034690141677856),
 ('Nikkei', 0.5906497240066528)]
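
The scores in this output are cosine similarities between word vectors. Here is a minimal sketch of that ranking in plain Python. The words and three-dimensional vectors are made up for illustration and are not taken from the model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, query, topn=10):
    """Rank every other word by cosine similarity to the query word."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vocabulary (hypothetical values): "b" points almost the same way as "a", "c" is orthogonal.
toy = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
ranking = most_similar(toy, "a")
print(ranking)
```

A score close to 1 means the two vectors point in nearly the same direction, and a score near 0 means they are unrelated in the embedding space.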

Good NLP Life!