Word embedding is a technique for representing words as low-dimensional real-valued vectors (low-dimensional, though still around 200 dimensions). Its appeal is that words with similar meanings map to nearby vectors, and that adding and subtracting vectors gives meaningful results (for example, king - man + woman = queen).
Word embedding vectors are an important technology used in many natural language processing applications such as part-of-speech tagging, information retrieval, and question answering. However, **preparing them yourself is a daunting task**: you have to download a large amount of data, preprocess it, train for a long time, and tune parameters while inspecting the results.
So **if you just want to use word vectors, it is much easier to use pre-trained ones**. Here I have collected pre-trained word embedding vectors that you can use right away.
Information on embedding vectors is also summarized in the following repository, so please check it out as well: awesome-embedding-models
Word2Vec
| One-word comment | The classic pre-trained Word2Vec vectors, needless to say. If you are not sure what to use, these are a safe choice. |
|---|---|
| Year announced | 2013 |
| URL | https://code.google.com/archive/p/word2vec/ |
Pre-trained vectors for multiple languages, including Japanese, can be obtained from the links below:
GloVe
| One-word comment | Stanford's GloVe. It is claimed to outperform Word2Vec. The word vectors are trained by combining a global matrix factorization model with a local context window model. |
|---|---|
| Year announced | 2014 |
| URL | http://nlp.stanford.edu/projects/glove/ |
fastText
| One-word comment | fastText, created by Mikolov of Word2Vec fame. Training is remarkably fast. To capture subword (morpheme-level) information, each word is represented by its character n-grams, and vectors are learned for those n-grams. |
|---|---|
| Year announced | 2016 |
| URL1 | Download Word Vectors |
| URL2 | Download Word Vectors (NEologd) |

※ Japanese only
I also wrote about these vectors, including how to use them, in a separate post: "The trained model of fastText has been released". A minimal loading sketch follows below as well.
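For reference, here is a minimal loading sketch with gensim. It assumes the downloaded vectors are in the standard word2vec text format, and the file name `model.vec` is just a placeholder for whatever file you actually downloaded.

```python
import gensim

# 'model.vec' is a placeholder; replace it with the .vec file you downloaded.
# Assumes the file is in the standard word2vec text format (header line, then one word per line).
model = gensim.models.KeyedVectors.load_word2vec_format('model.vec', binary=False)

# These particular downloads are Japanese-only, so query with a Japanese word.
print(model.most_similar('猫', topn=5))
```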
Dependency-Based Word Embeddings
| One-word comment | Word embeddings by Levy et al. Because they are trained on dependency contexts, they perform better on syntactic tasks, so they may be a good choice if that is your use case. A minimal loading sketch follows the table. |
|---|---|
| Year announced | 2014 |
| URL | https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/ |
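If you want to try them, a simple loader like the one below should be enough. This is a sketch under the assumption that the released file (here called deps.words, based on the release page) is plain text with one word followed by its vector values per line.

```python
import numpy as np

def load_text_vectors(path):
    """Load vectors from a plain-text file with one word and its values per line."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

# 'deps.words' is assumed to be the extracted vector file from the download page.
deps = load_text_vectors('deps.words')
print(len(deps), 'words loaded')
```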
Meta-Embeddings
| One-word comment | Meta-Embeddings, presented at ACL 2016. They obtain better vectors by combining (ensembling) sets of word embeddings with different properties. A further advantage is that combining vector sets increases vocabulary coverage. |
|---|---|
| Year announced | 2016 |
| URL | http://cistern.cis.lmu.de/meta-emb/ |
LexVec
| One-word comment | LexVec, also presented at ACL 2016. It outperforms Word2Vec on some word-similarity evaluation sets. |
|---|---|
| Year announced | 2016 |
| URL | https://github.com/alexandres/lexvec |
Here is how to load the Word2Vec pre-trained vectors, which are probably the ones you will use most often.
Loading them is very easy: just install gensim and write the following code.
```python
import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
```
To evaluate the model, you can write the following code. Note that you need to download the evaluation dataset questions-words.txt before running it.
```python
import logging
import pprint

# Enable logging so the evaluation progress is printed.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Evaluate the model on the analogy task dataset.
model.accuracy('questions-words.txt')

# Run an analogy query like king - man + woman = queen.
pprint.pprint(model.most_similar(positive=['woman', 'king'], negative=['man']))
```
Running this code produces the following evaluation output.
```
2017-01-20 09:29:11,767 : INFO : loading projection weights from ./GoogleNews-vectors-negative300.bin
2017-01-20 09:30:10,891 : INFO : loaded (3000000, 300) matrix from ./GoogleNews-vectors-negative300.bin
2017-01-20 09:30:10,994 : INFO : precomputing L2-norms of word weight vectors
2017-01-20 09:30:42,097 : INFO : capital-common-countries: 83.6% (423/506)
2017-01-20 09:30:49,899 : INFO : capital-world: 82.7% (1144/1383)
2017-01-20 09:30:50,795 : INFO : currency: 39.8% (51/128)
2017-01-20 09:31:03,579 : INFO : city-in-state: 74.6% (1739/2330)
2017-01-20 09:31:05,574 : INFO : family: 90.1% (308/342)
2017-01-20 09:31:09,928 : INFO : gram1-adjective-to-adverb: 32.3% (262/812)
2017-01-20 09:31:12,052 : INFO : gram2-opposite: 50.5% (192/380)
2017-01-20 09:31:19,719 : INFO : gram3-comparative: 91.9% (1224/1332)
2017-01-20 09:31:23,574 : INFO : gram4-superlative: 88.0% (618/702)
2017-01-20 09:31:28,210 : INFO : gram5-present-participle: 79.8% (694/870)
2017-01-20 09:31:35,082 : INFO : gram6-nationality-adjective: 97.1% (1193/1229)
2017-01-20 09:31:43,390 : INFO : gram7-past-tense: 66.5% (986/1482)
2017-01-20 09:31:49,136 : INFO : gram8-plural: 85.6% (849/992)
2017-01-20 09:31:53,394 : INFO : gram9-plural-verbs: 68.9% (484/702)
2017-01-20 09:31:53,396 : INFO : total: 77.1% (10167/13190)
[('queen', 0.7118192315101624),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235946178436279),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087412595748901)]
```
Looking at these results, we can see that the total accuracy is **77.1%**.
Incidentally, word vectors such as GloVe can be loaded in almost the same way, as sketched below.
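For example, here is a minimal sketch for GloVe, assuming you downloaded glove.6B.zip from the Stanford page and extracted glove.6B.100d.txt. GloVe files lack the header line that the word2vec text format expects, so gensim's glove2word2vec script is used to convert the file first.

```python
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Convert the GloVe text file into word2vec text format (adds the header line).
glove2word2vec('glove.6B.100d.txt', 'glove.6B.100d.w2v.txt')

# Load the converted file exactly like the Word2Vec example above.
glove = KeyedVectors.load_word2vec_format('glove.6B.100d.w2v.txt', binary=False)
print(glove.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))
```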
This article introduced several pre-trained word embedding vectors. Unless you have a specific reason to train your own, I recommend using these pre-trained vectors.