[PYTHON] 100 language processing knock-90 (using Gensim): learning with word2vec

This is a record of the 90th exercise, "Learning with word2vec", from Language Processing 100 Knock 2015. The task asks us to redo what we did in Chapter 9 the easy way, using a package. It is humbling to realize that what I desperately worked out while worrying about running out of memory can be reproduced in about 3 lines of code. This time, instead of Google's word2vec specified in the question, I use the open-source [Gensim](https://radimrehurek.com/gensim/). I hear it is updated frequently and widely used (I have not researched it thoroughly, given my limited knowledge).

Reference link

|Link|Remarks|
|:--|:--|
|090.Learning with word2vec.ipynb|Answer program GitHub link|
|100 amateur language processing knocks:90|The article series I always rely on for the 100 Language Processing Knock|

environment

|type|version|Contents|
|:--|:--|:--|
|OS|Ubuntu18.04.01 LTS|It is running virtually|
|pyenv|1.2.15|I use pyenv because I sometimes use multiple Python environments|
|Python|3.6.9|I'm using python3.6.9 on pyenv. There is no deep reason not to use the 3.7 or 3.8 series. Packages are managed using venv|

In the above environment, I am using the following additional Python packages. Just install with regular pip.

|type|version|
|:--|:--|
|gensim|3.8.1|
|numpy|1.17.4|

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

90. Learning with word2vec

Apply word2vec to the corpus created in 81 and learn word vectors. Furthermore, convert the format of the learned word vectors and run the programs of 86-89.

Answer

Answer program [090.word2vec learning.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/090.word2vec%E3%81%AB%E3%82%88%E3%82%8B%E5%AD%A6%E7%BF%92.ipynb)

from pprint import pprint

import numpy as np
from gensim.models import word2vec

corpus = word2vec.Text8Corpus('./../09.Vector space method(I)/081.corpus.txt')

model = word2vec.Word2Vec(corpus, size=300)
model.save('090.word2vec.model')

# 86.Display word vector
pprint(model.wv['United_States'])

# 87.Word similarity
print(np.dot(model.wv['United_States'], model.wv['U.S']) / (np.linalg.norm(model.wv['United_States']) * np.linalg.norm(model.wv['U.S'])))

# 88.10 words with high similarity
pprint(model.wv.most_similar('England'))

# 89.Analogy by additive construct
# vec("Spain") - vec("Madrid") + vec("Athens")
pprint(model.wv.most_similar(positive=['Spain', 'Athens'], negative=['Madrid']))

Answer commentary

Word vector generation

First, read the file. I saw many examples using the Text8Corpus function, so I wondered what text8 actually is. According to the article [Making a Japanese version of the text8 corpus and learning distributed representations](https://hironsan.hatenablog.com/entry/japanese-text8-corpus), text8 is Wikipedia data that has been processed as follows (a small sketch of these rules follows the list):

- Keep text and image captions
- Remove links to tables and foreign-language versions
- Remove citations, footnotes, and markup
- For hypertext, keep only the anchor text and remove everything else
- Spell out numbers: for example, "20" is converted to "two zero"
- Convert uppercase to lowercase
- Convert characters outside the a-z range to spaces
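
Out of curiosity, the last three rules can be sketched in a few lines of Python. This is my own illustration, not the actual text8 conversion script:

import re

# Rough sketch of the text8-style normalization rules above (illustrative only)
DIGIT_NAMES = {str(i): name for i, name in enumerate(
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])}

def text8_normalize(text):
    text = text.lower()                                                  # uppercase -> lowercase
    text = re.sub(r'\d', lambda m: f' {DIGIT_NAMES[m.group()]} ', text)  # "20" -> "two zero"
    text = re.sub(r'[^a-z]', ' ', text)                                  # non a-z -> space
    return ' '.join(text.split())

print(text8_normalize('In 20 years, AI!'))  # -> 'in two zero years ai'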

My corpus still contains capital letters, but I felt it generally met these conditions, so I used Text8Corpus.

corpus = word2vec.Text8Corpus('./../09.Vector space method(I)/081.corpus.txt')

All you have to do is call the Word2Vec function and the 300-dimensional word vectors are complete. Training took less than 4 minutes. Wow... I didn't use any options, but the gensim word2vec option list was easy to understand.

model = word2vec.Word2Vec(corpus, size=300)
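
For reference, here is the same call with the main hyperparameters written out explicitly; apart from size, the values below are the gensim 3.x defaults as far as I know:

# Same training call with the main options spelled out
# (apart from size, these should be the gensim 3.x defaults)
model = word2vec.Word2Vec(
    corpus,
    size=300,     # dimensionality of the word vectors
    window=5,     # maximum distance between target and context word
    min_count=5,  # ignore words that appear fewer than 5 times
    sg=0,         # 0: CBOW, 1: skip-gram
    workers=3,    # number of training threads
)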

Then save the file for subsequent knocks.

model.save('090.word2vec.model')

Saving produces the following three files. It bothers me a little that it is not a single file.

|File|Size|
|:--|:--|
|090.word2vec.model|5MB|
|090.word2vec.model.trainables.syn1neg.npy|103MB|
|090.word2vec.model.wv.vectors.npy|103MB|
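
For the subsequent knocks, the saved model can simply be reloaded with load. Since the task also mentions converting the format of the learned word vectors, note that gensim can export them in the original word2vec format as well; the output filename below is just an example I made up:

# Reload the model saved above (used in the following knocks)
model = word2vec.Word2Vec.load('090.word2vec.model')

# Optionally export the vectors in the original word2vec binary format
# (the filename is illustrative)
model.wv.save_word2vec_format('090.word2vec.bin', binary=True)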

86. Display of word vector

Read the word meaning vector obtained in 85 and display the "United States" vector. However, note that "United States" is internally referred to as "United_States".

The vectors are stored in model.wv, so just index it with a word.

pprint(model.wv['United_States'])
array([ 2.3478289 , -0.61461514,  0.0478639 ,  0.6709404 ,  1.1090833 ,
       -1.0814637 , -0.78162867, -1.2584596 , -0.04286158,  1.2928476 ,
Result omitted
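
As a quick sanity check (my addition), the returned object is a plain numpy array whose length matches the size=300 specified at training time:

print(type(model.wv['United_States']), model.wv['United_States'].shape)
# <class 'numpy.ndarray'> (300,)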

87. Word similarity

Read the word meaning vector obtained in 85 and calculate the cosine similarity between "United States" and "U.S.". However, note that "U.S." is internally expressed as "U.S".

Using the model, calculate the cosine similarity between the same vectors as in Chapter 9. In Chapter 9 the value was 0.837516976284694, so this time we get a higher similarity.

print(np.dot(model.wv['United_States'], model.wv['U.S']) / (np.linalg.norm(model.wv['United_States']) * np.linalg.norm(model.wv['U.S'])))
0.8601596
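
Incidentally, gensim has a built-in helper that computes the same cosine similarity, so the manual numpy calculation above can be replaced by one call; the two should agree up to floating-point precision:

# Same cosine similarity via gensim's helper
print(model.wv.similarity('United_States', 'U.S'))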

88. 10 words with high similarity

Read the word meaning vector obtained in 85, and output 10 words with high cosine similarity to "England", together with their similarity.

You can output it just by using the most_similar function (10 words is its default topn).

pprint(model.wv.most_similar('England'))
[('Scotland', 0.7884809970855713),
 ('Wales', 0.7721374034881592),
 ('Ireland', 0.6838206052780151),
 ('Britain', 0.6335258483886719),
 ('Hampshire', 0.6147407293319702),
 ('London', 0.6021863222122192),
 ('Cork', 0.5809425115585327),
 ('Manchester', 0.5767091512680054),
 ('Liverpool', 0.5765234231948853),
 ('Orleans', 0.5624016523361206)]

By the way, the result in Chapter 9 was as follows. This time, words related to the United Kingdom rank higher, so you can see that more accurate results are output.

Scotland    0.6364961613062289
Italy   0.6033905306935802
Wales   0.5961887337227456
Australia   0.5953277272306978
Spain   0.5752511915429617
Japan   0.5611603300967408
France  0.5547284075334182
Germany 0.5539239745925412
United_Kingdom  0.5225684232409136
Cheshire    0.5125286144779688

89. Analogy by additive construct

Read the word meaning vector obtained in 85, calculate vec("Spain") - vec("Madrid") + vec("Athens"), and output 10 words with high similarity to that vector, together with their similarity.

If you pass positive and negative to the most_similar function, it computes this expression and outputs the 10 most similar words.

pprint(model.wv.most_similar(positive=['Spain', 'Athens'], negative=['Madrid']))
[('Denmark', 0.7606724500656128),
 ('Italy', 0.7585107088088989),
 ('Austria', 0.7528095841407776),
 ('Greece', 0.7401891350746155),
 ('Egypt', 0.7314825057983398),
 ('Russia', 0.7225484848022461),
 ('Great_Britain', 0.7184625864028931),
 ('Norway', 0.7148114442825317),
 ('Rome', 0.7076312303543091),
 ('kingdom', 0.6994863748550415)]
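
For clarity, here is a sketch of what most_similar is doing with the positive and negative arguments. Gensim uses unit-normalized vectors internally and excludes the input words from the results, so this raw-vector version can rank slightly differently:

# vec("Spain") - vec("Madrid") + vec("Athens"), composed by hand
vec = model.wv['Spain'] - model.wv['Madrid'] + model.wv['Athens']

# Words closest to the composed vector (input words are not excluded here)
pprint(model.wv.similar_by_vector(vec, topn=10))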

By the way, the result in Chapter 9 was as follows. This time Greece appears in 4th place, so you can see that more accurate results are output.

Spain   0.8178213952646727
Sweden  0.8071582503798717
Austria 0.7795030693787409
Italy   0.7466099164394225
Germany 0.7429125848677439
Belgium 0.729240312232219
Netherlands 0.7193045612969573
Télévisions   0.7067876635156688
Denmark 0.7062857691945504
France  0.7014078181006329
