[PYTHON] 100 language processing knock-92 (using Gensim): application to analogy data

This is a record of knock 92, "Application to analogy data", from the 2015 edition of the 100 Language Processing Knocks. Word vector arithmetic and extraction of similar words are done in two ways: with the handmade NumPy-format word vectors from Chapter 9, and with Gensim. Comparing the two lets you appreciate what Gensim offers, such as its calculation speed.

Reference links

| Link | Remarks |
| --- | --- |
| 092.Application to analogy data_1.ipynb | Answer program GitHub link (self-made version) |
| 092.Application to analogy data_2.ipynb | Answer program GitHub link (Gensim version) |
| 100 amateur language processing knocks: 92 | A site I always rely on when working through the 100 knocks |

Environment

| Type | Version | Contents |
| --- | --- | --- |
| OS | Ubuntu 18.04.01 LTS | Running on a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |

On top of the above environment, I use the following additional Python packages. They can be installed with regular pip.

| Type | Version |
| --- | --- |
| gensim | 3.8.1 |
| numpy | 1.17.4 |
| pandas | 0.25.3 |

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

92. Application to analogy data

For each case of the evaluation data created in 91, calculate vec(word in the second column) - vec(word in the first column) + vec(word in the third column), and find the word whose vector is most similar to the result, together with its similarity. Append the word and the similarity to the end of each case. Apply this program to both the word vectors created in 85 and the word vectors created in 90.
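
To make the arithmetic in the task concrete, here is a minimal sketch on toy 2-dimensional vectors. The vectors and the `cosine` helper are invented for illustration and are not the data from 85 or 90.

```python
import numpy as np

# Toy 2-dimensional vectors, invented for illustration only
vectors = {
    'boy':     np.array([1.0, 0.0]),
    'girl':    np.array([1.0, 1.0]),
    'brother': np.array([2.0, 0.1]),
    'sister':  np.array([2.0, 1.1]),
    'table':   np.array([-1.0, 0.5]),
}

def cosine(v1, v2):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# vec(second column) - vec(first column) + vec(third column)
target = vectors['girl'] - vectors['boy'] + vectors['brother']

# Rank every word by its cosine similarity to the target vector
ranking = sorted(vectors, key=lambda w: cosine(target, vectors[w]), reverse=True)
best = ranking[0]
print(best, cosine(target, vectors[best]))
```

In this toy setup the "girl - boy" offset is exactly the same as the "sister - brother" offset, so the target lands on `sister`. Real word vectors are only approximately this regular.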

Answer

Self-made answer program [092. Application to analogy data_1.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/092.%E3%82%A2%E3%83%8A%E3%83%AD%E3%82%B8%E3%83%BC%E3%83%87%E3%83%BC%E3%82%BF%E3%81%B8%E3%81%AE%E9%81%A9%E7%94%A8_1.ipynb)

import csv

import numpy as np
import pandas as pd

# No name was given when saving, so the array is stored under 'arr_0'
matrix_x300 = np.load('./../09.Vector space method(I)/085.matrix_x300.npz')['arr_0']

print('matrix_x300 Shape:', matrix_x300.shape)

group_t = pd.read_pickle('./../09.Vector space method(I)/083_group_t.zip')

# Cosine similarity calculation
def get_cos_similarity(v1, v1_norm, v2):

    # Return -1 if the vector is all zeros
    if np.count_nonzero(v2) == 0:
        return -1
    else:
        return np.dot(v1, v2) / (v1_norm * np.linalg.norm(v2))

#Get words with high similarity
def get_similar_word(cols):
    
    try:        
        vec = matrix_x300[group_t.index.get_loc(cols[1])] \
              - matrix_x300[group_t.index.get_loc(cols[0])] \
              + matrix_x300[group_t.index.get_loc(cols[2])]
        vec_norm = np.linalg.norm(vec)
        
        # Exclude the 3 words used in the calculation
        cos_sim = [-1 if group_t.index[i] in cols[:3] else get_cos_similarity(vec, vec_norm, matrix_x300[i]) for i in range(len(group_t))]
        index = np.argmax(cos_sim)
        
        cols.extend([group_t.index[index], cos_sim[index]])
        
    except KeyError:
        cols.extend(['', -1])
    return cols

#Read evaluation data
with open('./091.analogy_family.txt') as file_in:
    result = [get_similar_word(line.split()) for line in file_in]

with open('092.analogy_word2vec_1.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)

Answer commentary

This is where the most similar word is obtained. The task statement does not say so, but I exclude the three words used in the calculation. I'm not sure whether this is the right thing to do, but excluding them raises the percentage of correct answers.

cos_sim = [-1 if group_t.index[i] in cols[:3] else get_cos_similarity(vec, vec_norm, matrix_x300[i]) for i in range(len(group_t))]
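
The effect of the exclusion can be seen on a small example. The sketch below uses invented toy vectors and a hypothetical helper `best_word` that follows the same "-1 for excluded words" pattern as the line above. When the "girl - boy" offset is small compared with the vectors themselves, the nearest word to the result tends to be one of the input words.

```python
import numpy as np

# Toy vocabulary and vectors, invented for illustration only
words = ['boy', 'girl', 'brother', 'sister', 'table']
matrix = np.array([
    [1.0, 0.0],
    [1.0, 0.3],
    [2.0, 0.0],
    [1.8, 0.6],
    [-1.0, 0.5],
])
index = {w: i for i, w in enumerate(words)}

def cosine(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def best_word(cols, exclude_inputs):
    """Nearest word to vec(cols[1]) - vec(cols[0]) + vec(cols[2])."""
    vec = matrix[index[cols[1]]] - matrix[index[cols[0]]] + matrix[index[cols[2]]]
    # Same pattern as the article: excluded words get similarity -1
    sims = [-1 if exclude_inputs and w in cols[:3] else cosine(vec, matrix[i])
            for i, w in enumerate(words)]
    return words[int(np.argmax(sims))]

cols = ['boy', 'girl', 'brother', 'sister']
without_exclusion = best_word(cols, exclude_inputs=False)
with_exclusion = best_word(cols, exclude_inputs=True)
print(without_exclusion, with_exclusion)
```

With these toy numbers the unfiltered answer is one of the three input words, and only the filtered search reaches the expected fourth column.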

For words that do not appear in the corpus, the similarity is set to -1.

except KeyError:
    cols.extend(['', -1])
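
The `KeyError` comes from `Index.get_loc`, which raises it when the label is not in the index. A minimal sketch, using a hypothetical stand-in `vocab` for `group_t` (only its index matters here) and a hypothetical helper `lookup`:

```python
import pandas as pd

# Hypothetical stand-in for group_t: the index plays the role of the vocabulary
vocab = pd.DataFrame({'count': [10, 8, 5]}, index=['boy', 'girl', 'brother'])

def lookup(word):
    """Return the row position of word, or None if it is not in the vocabulary."""
    try:
        return vocab.index.get_loc(word)
    except KeyError:
        return None

print(lookup('girl'), lookup('grandpa'))
```

In the answer program the same exception is caught once for the whole line, so a single unknown word among the three is enough to produce the empty word and -1.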

The rest repeats things already covered in earlier knocks, so there is nothing special in the code and I'll skip the explanation. It takes about 17 minutes to run, so I tried to write as much as possible as list comprehensions. The first 10 lines of the output file look like this. Some answers match and some do not.

csv:092.analogy_word2vec_1.txt


boy	girl	brother	sister	son	0.8804225566858075
boy	girl	brothers	sisters	sisters	0.8426790631091488
boy	girl	dad	mom	mum	0.8922065515297802
boy	girl	father	mother	mother	0.847494164274725
boy	girl	grandfather	grandmother	grandmother	0.820584129035444
boy	girl	grandpa	grandma		-1
boy	girl	grandson	granddaughter	grandfather	0.6794604718339272
boy	girl	groom	bride	seduce	0.5951703092628703
boy	girl	he	she	she	0.8144501058726975
boy	girl	his	her	Mihailov	0.5752869854520882
(The rest is omitted)
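
Much of the 17 minutes is probably spent in the Python-level loop that calls `get_cos_similarity` once per vocabulary word for every line of the evaluation data. As a sketch of one possible alternative (not the code in the notebook), the similarities against all rows can be computed with a single matrix-vector product. The function name `most_similar_index` and the toy matrix are assumptions made for illustration, and the query vector is assumed to be non-zero.

```python
import numpy as np

def most_similar_index(vec, matrix, exclude=()):
    """Return (row index, cosine similarity) of the row of matrix closest to vec."""
    row_norms = np.linalg.norm(matrix, axis=1)
    # All-zero rows keep similarity -1, mirroring get_cos_similarity
    sims = np.full(len(matrix), -1.0)
    nonzero = row_norms > 0
    sims[nonzero] = (matrix[nonzero] @ vec) / (row_norms[nonzero] * np.linalg.norm(vec))
    # Rows of the words used in the calculation are excluded
    sims[np.asarray(list(exclude), dtype=int)] = -1.0
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Toy data, invented for illustration: rows are boy, girl, brother, sister, (all zeros)
toy_matrix = np.array([[1.0, 0.0], [1.0, 0.3], [2.0, 0.0], [1.8, 0.6], [0.0, 0.0]])
toy_vec = toy_matrix[1] - toy_matrix[0] + toy_matrix[2]
best_index, best_sim = most_similar_index(toy_vec, toy_matrix, exclude=(0, 1, 2))
print(best_index, best_sim)
```

In the real program the row norms do not depend on the query, so they could be computed once outside the function and reused for every line.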

Answer program using Gensim [092. Application to analogy data_2.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/092.%E3%82%A2%E3%83%8A%E3%83%AD%E3%82%B8%E3%83%BC%E3%83%87%E3%83%BC%E3%82%BF%E3%81%B8%E3%81%AE%E9%81%A9%E7%94%A8_2.ipynb)

import csv

from gensim.models import Word2Vec

model = Word2Vec.load('./090.word2vec.model')

print(model)

#Get words with high similarity
def get_similar_word(cols):
    try:
        cos_sim = model.wv.most_similar(positive=[cols[1], cols[2]], negative=[cols[0]], topn=4)
        for word, similarity in cos_sim:

            # Exclude the 3 words used in the calculation
            if word not in cols[:3]:
                cols.extend([word, similarity])
                break
                
    #For words not in the original corpus
    except KeyError:
        cols.extend(['', -1])
    
    return cols

#Read evaluation data
with open('./091.analogy_family.txt') as file_in:
    result = [get_similar_word(line.split()) for line in file_in]

with open('./092.analogy_word2vec_2.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)

Answer commentary

Because it uses a package, the code is a little slimmer than the self-made version. And, as you can see when you run it, it is fast! It takes about 4 seconds, which is **more than 200 times faster than the self-made version**. Gensim is amazing. This is the output. **The percentage of correct answers is also higher.**

csv:092.analogy_word2vec_2.txt


boy	girl	brother	sister	sister	0.745887041091919
boy	girl	brothers	sisters	sisters	0.8522343039512634
boy	girl	dad	mom	mum	0.7720432281494141
boy	girl	father	mother	mother	0.8608728647232056
boy	girl	grandfather	grandmother	granddaughter	0.8341050148010254
boy	girl	grandpa	grandma		-1
boy	girl	grandson	granddaughter	granddaughter	0.8497666120529175
boy	girl	groom	bride	bride	0.7476662397384644
boy	girl	he	she	she	0.7702984809875488
boy	girl	his	her	her	0.6540039777755737
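
Speed is not the only difference between the two programs. They load different vectors (`085.matrix_x300.npz` versus `090.word2vec.model`), and, as far as I understand Gensim 3.8, `most_similar` does not do the raw vector arithmetic either: it unit-normalizes each input vector, averages them with weight +1 for positive and -1 for negative words, and never returns the input words. The NumPy sketch below is my own approximation of that behavior on invented toy vectors, not Gensim's code, and `most_similar_sketch` is a hypothetical name.

```python
import numpy as np

def most_similar_sketch(words, matrix, positive, negative, topn=1):
    """Approximate a most_similar-style query on unit-normalized vectors."""
    index = {w: i for i, w in enumerate(words)}
    normed = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    # Mean of unit vectors: +1 weight for positive words, -1 for negative words
    parts = [normed[index[w]] for w in positive] + [-normed[index[w]] for w in negative]
    query = np.mean(parts, axis=0)
    query = query / np.linalg.norm(query)
    sims = normed @ query
    # Input words are skipped when collecting the results
    inputs = set(positive) | set(negative)
    order = [i for i in np.argsort(-sims) if words[i] not in inputs]
    return [(words[i], float(sims[i])) for i in order[:topn]]

# Toy data, invented for illustration only
toy_words = ['boy', 'girl', 'brother', 'sister', 'table']
toy_matrix = np.array([[1.0, 0.0], [1.0, 0.3], [2.0, 0.0], [1.8, 0.6], [-1.0, 0.5]])
result = most_similar_sketch(toy_words, toy_matrix,
                             positive=['girl', 'brother'], negative=['boy'])
print(result)
```

Because of the normalization, the similarity values from the two programs are not directly comparable even for the same word.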
