This is the record of the 92nd knock, "Application to analogy data," from the 100 Language Processing Knocks 2015.
Word-vector calculation and extraction of similar words are done in two ways: with the hand-made NumPy-format word vectors from Chapter 9, and with Gensim. You can experience how great Gensim is, for example in calculation speed.
Link | Remarks |
---|---|
092.Application to analogy data_1.ipynb | Answer program GitHub link (self-made version) |
092.Application to analogy data_2.ipynb | Answer program GitHub link (Gensim version) |
100 amateur language processing knocks: 92 | A site I am always indebted to when working through the 100 language processing knocks |
Type | Version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running on a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the above environment, I use the following additional Python packages. Just install them with regular pip.
Type | Version |
---|---|
gensim | 3.8.1 |
numpy | 1.17.4 |
pandas | 0.25.3 |
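For reference, they can be installed in one line (assuming the venv mentioned above is already created and activated):

```
pip install gensim==3.8.1 numpy==1.17.4 pandas==0.25.3
```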
In Chapter 10, we continue studying word vectors from the previous chapter.

> For each case in the evaluation data created in 91, compute vec(word in the second column) - vec(word in the first column) + vec(word in the third column), and find the word whose vector is most similar to the result, together with that similarity. Append the obtained word and similarity to the end of each case. Apply this program to the word vectors created in 85 and to the word vectors created in 90.
```python
import csv

import numpy as np
import pandas as pd

# No name was specified when saving, so the array is stored under 'arr_0'
matrix_x300 = np.load('./../09.Vector space method(I)/085.matrix_x300.npz')['arr_0']
print('matrix_x300 Shape:', matrix_x300.shape)

group_t = pd.read_pickle('./../09.Vector space method(I)/083_group_t.zip')

# Cosine similarity calculation
def get_cos_similarity(v1, v1_norm, v2):
    # If the vector is all zeros, return -1
    if np.count_nonzero(v2) == 0:
        return -1
    else:
        return np.dot(v1, v2) / (v1_norm * np.linalg.norm(v2))

# Get the word with the highest similarity
def get_similar_word(cols):
    try:
        vec = matrix_x300[group_t.index.get_loc(cols[1])] \
              - matrix_x300[group_t.index.get_loc(cols[0])] \
              + matrix_x300[group_t.index.get_loc(cols[2])]
        vec_norm = np.linalg.norm(vec)
        # Exclude the 3 words used in the calculation itself
        cos_sim = [-1 if group_t.index[i] in cols[:3]
                   else get_cos_similarity(vec, vec_norm, matrix_x300[i])
                   for i in range(len(group_t))]
        index = np.argmax(cos_sim)
        cols.extend([group_t.index[index], cos_sim[index]])
    except KeyError:
        # Word not found in the corpus
        cols.extend(['', -1])
    return cols

# Read the evaluation data
with open('./091.analogy_family.txt') as file_in:
    result = [get_similar_word(line.split()) for line in file_in]

with open('092.analogy_word2vec_1.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)
```
This is where the similar word is obtained. The question does not ask for it, but I exclude the three words used in the calculation itself. I am not sure whether this is acceptable, but excluding them raises the rate of correct answers.

```python
cos_sim = [-1 if group_t.index[i] in cols[:3]
           else get_cos_similarity(vec, vec_norm, matrix_x300[i])
           for i in range(len(group_t))]
```
Words that are not in the corpus are given a similarity of -1.

```python
except KeyError:
    # Word not found in the corpus
    cols.extend(['', -1])
```
Beyond that, it is mostly material already covered in earlier knocks, and there is nothing special in the code, so no further explanation. Since the run takes roughly 17 minutes, I wrote as much as possible with list comprehensions. The first 10 lines of the output file look like this; the appended answers may or may not match the expected words. A sketch of a fully vectorized alternative follows after the output.
```csv:092.analogy_word2vec_1.txt
boy girl brother sister son 0.8804225566858075
boy girl brothers sisters sisters 0.8426790631091488
boy girl dad mom mum 0.8922065515297802
boy girl father mother mother 0.847494164274725
boy girl grandfather grandmother grandmother 0.820584129035444
boy girl grandpa grandma -1
boy girl grandson granddaughter grandfather 0.6794604718339272
boy girl groom bride seduce 0.5951703092628703
boy girl he she she 0.8144501058726975
boy girl his her Mihailov 0.5752869854520882
```

(The rest is omitted.)
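As an aside, most of those 17 minutes are spent in the per-row Python loop. Here is a minimal sketch of a vectorized alternative, with a hypothetical helper `cos_sim_all` that is not part of the answer program: one matrix-vector product replaces the whole list comprehension, with the same -1 handling for all-zero rows.

```python
import numpy as np

def cos_sim_all(vec, matrix, norms):
    # Cosine similarity of vec against every row of matrix at once.
    # norms holds the precomputed L2 norm of each row.
    safe = np.where(norms == 0, 1, norms)   # avoid division by zero
    sims = (matrix @ vec) / (safe * np.linalg.norm(vec))
    sims[norms == 0] = -1                   # all-zero rows, as in get_cos_similarity
    return sims

# Usage sketch with the variables from the program above:
# norms = np.linalg.norm(matrix_x300, axis=1)   # precompute once
# cos_sim = cos_sim_all(vec, matrix_x300, norms)
# cos_sim[[group_t.index.get_loc(w) for w in cols[:3]]] = -1  # exclude the 3 input words
# index = int(np.argmax(cos_sim))
```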
```python
import csv

from gensim.models import Word2Vec

model = Word2Vec.load('./090.word2vec.model')
print(model)

# Get the word with the highest similarity
def get_similar_word(cols):
    try:
        # topn=4 leaves a candidate even if the 3 input words were returned
        cos_sim = model.wv.most_similar(positive=[cols[1], cols[2]], negative=[cols[0]], topn=4)
        for word, similarity in cos_sim:
            # Exclude the 3 words used in the calculation
            if word not in cols[:3]:
                cols.extend([word, similarity])
                break
    # For words not in the original corpus
    except KeyError:
        cols.extend(['', -1])
    return cols

# Read the evaluation data
with open('./091.analogy_family.txt') as file_in:
    result = [get_similar_word(line.split()) for line in file_in]

with open('./092.analogy_word2vec_2.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)
```
Since this version leans on a package, it is a little slimmer than the self-made one. And as you can see when you run it, the processing is fast: it takes about 4 seconds, **more than 200 times faster than the self-made version**. Gensim is amazing. Here is the output; **the rate of correct answers also improves.** A small demo of `most_similar` on its own follows after the output.
```csv:092.analogy_word2vec_2.txt
boy girl brother sister sister 0.745887041091919
boy girl brothers sisters sisters 0.8522343039512634
boy girl dad mom mum 0.7720432281494141
boy girl father mother mother 0.8608728647232056
boy girl grandfather grandmother granddaughter 0.8341050148010254
boy girl grandpa grandma -1
boy girl grandson granddaughter granddaughter 0.8497666120529175
boy girl groom bride bride 0.7476662397384644
boy girl he she she 0.7702984809875488
boy girl his her her 0.6540039777755737
```
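For reference, `most_similar` by itself returns a list of `(word, similarity)` tuples sorted by similarity, with the input words already excluded from the candidates. A quick interactive check, assuming the model above is loaded (the expected first candidate is taken from the first output line above):

```python
from gensim.models import Word2Vec

model = Word2Vec.load('./090.word2vec.model')

# boy : girl = brother : ?
print(model.wv.most_similar(positive=['girl', 'brother'], negative=['boy'], topn=3))
# First candidate should be ('sister', 0.7458...), matching the output above
```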