[Python] A list of ready-to-use word embedding vectors

Introduction

Word embedding is a technique for representing words as low-dimensional real-valued vectors (typically around 200 dimensions). Its key property is that words with similar meanings map to nearby vectors, and that vector addition and subtraction produce meaningful results (for example, king - man + woman ≈ queen).
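As a quick illustration of that analogy arithmetic, here is a toy sketch using made-up 3-dimensional vectors and cosine similarity (the numbers are invented purely for illustration; real pre-trained embeddings, introduced below, are learned and have hundreds of dimensions):

import numpy as np

# Toy vectors invented purely for illustration; real embeddings are learned from data.
vectors = {
    'king':  np.array([0.8, 0.7, 0.1]),
    'man':   np.array([0.6, 0.2, 0.1]),
    'woman': np.array([0.7, 0.3, 0.8]),
    'queen': np.array([0.9, 0.8, 0.8]),
    'apple': np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should end up closest to queen
target = vectors['king'] - vectors['man'] + vectors['woman']
candidates = [w for w in vectors if w not in ('king', 'man', 'woman')]
print(max(candidates, key=lambda w: cosine(vectors[w], target)))  # -> queen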

Word embedding vectors are an important technology used in many natural language processing applications such as part-of-speech tagging, information retrieval, and question answering. However, **actually preparing them yourself is a daunting task**: you have to download large amounts of data, preprocess it, train for a long time, and tune the parameters while inspecting the results.

So **if you just want to use word vectors, it is much easier to use pre-trained ones**. In this article I have picked out pre-trained word embedding vectors that you can use right away.

Information on embedding vectors is also summarized in the following repository, so please check it out as well: awesome-embedding-models

First of all, the three classics: Word2Vec, GloVe, fastText

Word2Vec

Comment: Needless to say, the pre-trained Word2Vec vectors. If you are not sure what to use, this is a safe default.
Year announced: 2013
URL: https://code.google.com/archive/p/word2vec/

Multilingual pre-trained vectors, including Japanese, can be obtained from the links below:

GloVe

Comment: GloVe, from Stanford, which claims better performance than Word2Vec. It learns word vectors by combining a global matrix factorization model with a local context window model.
Year announced: 2014
URL: http://nlp.stanford.edu/projects/glove/

fastText

Comment: fastText, created by Mikolov, the author of Word2Vec. Above all, training is fast. To take morphology into account, each word is represented by its character n-grams, and vector representations are learned for those n-grams.
Year announced: 2016
URL 1: Download Word Vectors
URL 2: Download Word Vectors (NEologd)

※ Japanese only

I wrote about how to use them in the following article: "The trained model of fastText has been released". A minimal loading sketch also follows below.
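As a rough sketch, the distributed fastText vectors (.vec files) are in word2vec text format, so gensim can load them directly. The filename below is only a placeholder; use whatever file the downloaded archive actually contains.

import gensim

# Minimal sketch: fastText's .vec files use the word2vec text format.
# 'model.vec' is a placeholder filename for the downloaded file.
model = gensim.models.KeyedVectors.load_word2vec_format('model.vec', binary=False)
print(model.most_similar('東京', topn=5))  # nearest neighbours of a Japanese word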

Three pre-trained vectors that incorporate recent research

Dependency-Based Word Embeddings

Comment: Word embedding vectors by Levy et al. By training on dependency-based contexts, they become stronger on syntactic tasks, so they may be a good choice if that is your use case.
Year announced: 2014
URL: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

Meta-Embeddings

Comment: Meta-Embeddings, presented at ACL 2016. It succeeds in obtaining better vectors by combining sets of word embedding vectors with different properties (meta-embedding). A further advantage is that combining vector sets increases vocabulary coverage.
Year announced: 2016
URL: http://cistern.cis.lmu.de/meta-emb/

LexVec

Comment: LexVec, also presented at ACL 2016. It outperforms Word2Vec on some evaluation sets of the word similarity task.
Year announced: 2016
URL: https://github.com/alexandres/lexvec

Bonus: How to load the downloaded vectors

Here is how to load the pre-trained Word2Vec vectors, which are probably the most commonly used.

Loading is very easy: just install gensim and write the following code.

import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
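Once loaded, individual word vectors can be looked up directly. A quick sanity check, assuming the 300-dimensional Google News model loaded above:

print(model['king'].shape)              # -> (300,): the raw vector for a word
print(model.similarity('king', 'queen'))  # cosine similarity between two words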

To run an evaluation, you can write the following code. Note that you need to download the evaluation data questions-words.txt beforehand.

import logging
import pprint

# for logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Load evaluation dataset of analogy task
model.accuracy('questions-words.txt')
# execute analogy task like king - man + woman = queen
pprint.pprint(model.most_similar(positive=['woman', 'king'], negative=['man']))

When this code is executed, the following evaluation results are output.

2017-01-20 09:29:11,767 : INFO : loading projection weights from ./GoogleNews-vectors-negative300.bin
2017-01-20 09:30:10,891 : INFO : loaded (3000000, 300) matrix from ./GoogleNews-vectors-negative300.bin
2017-01-20 09:30:10,994 : INFO : precomputing L2-norms of word weight vectors
2017-01-20 09:30:42,097 : INFO : capital-common-countries: 83.6% (423/506)
2017-01-20 09:30:49,899 : INFO : capital-world: 82.7% (1144/1383)
2017-01-20 09:30:50,795 : INFO : currency: 39.8% (51/128)
2017-01-20 09:31:03,579 : INFO : city-in-state: 74.6% (1739/2330)
2017-01-20 09:31:05,574 : INFO : family: 90.1% (308/342)
2017-01-20 09:31:09,928 : INFO : gram1-adjective-to-adverb: 32.3% (262/812)
2017-01-20 09:31:12,052 : INFO : gram2-opposite: 50.5% (192/380)
2017-01-20 09:31:19,719 : INFO : gram3-comparative: 91.9% (1224/1332)
2017-01-20 09:31:23,574 : INFO : gram4-superlative: 88.0% (618/702)
2017-01-20 09:31:28,210 : INFO : gram5-present-participle: 79.8% (694/870)
2017-01-20 09:31:35,082 : INFO : gram6-nationality-adjective: 97.1% (1193/1229)
2017-01-20 09:31:43,390 : INFO : gram7-past-tense: 66.5% (986/1482)
2017-01-20 09:31:49,136 : INFO : gram8-plural: 85.6% (849/992)
2017-01-20 09:31:53,394 : INFO : gram9-plural-verbs: 68.9% (484/702)
2017-01-20 09:31:53,396 : INFO : total: 77.1% (10167/13190)
[('queen', 0.7118192315101624),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235946178436279),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087412595748901)]

Looking at these results, we can see that the total accuracy is **77.1%**.

By the way, word vectors such as GloVe can be loaded in almost the same way; a sketch follows below.
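GloVe's text files lack the header line that the word2vec format expects, so one approach (assuming a gensim version of that era) is to convert them first with gensim's glove2word2vec script. The filenames are assumptions: 'glove.6B.300d.txt' is one of the files in the Stanford download, and the output name is arbitrary.

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Convert the headerless GloVe file into word2vec text format, then load it.
glove2word2vec('glove.6B.300d.txt', 'glove.6B.300d.w2v.txt')
model = KeyedVectors.load_word2vec_format('glove.6B.300d.w2v.txt', binary=False)
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))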

In conclusion

This article introduced a selection of pre-trained word embedding vectors. Unless you have a specific reason to train them yourself, I recommend using these pre-trained vectors.
