[PYTHON] 100 language processing knock-94 (using Gensim): similarity calculation with WordSimilarity-353

This is the record of the 94th task, "Similarity calculation with WordSimilarity-353", from the 2015 edition of the 100 Language Processing Knock. It calculates the similarity between word pairs listed in a file. Technically, it is just a small coding change to what has been done so far.

Reference link

| Link | Remarks |
|------|---------|
| 094.Similarity calculation with WordSimilarity-353_1.ipynb | Answer program GitHub link |
| 094.Similarity calculation with WordSimilarity-353_2.ipynb | Gensim version answer program GitHub link |
| 100 amateur language processing knocks: 94 | I am always indebted to this site when working through the 100 language processing knocks |

Environment

| type | version | Contents |
|------|---------|----------|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Using python 3.6.9 on pyenv; there is no deep reason not to use the 3.7 or 3.8 series. Packages are managed using venv |

In the above environment, I use the following additional Python packages. Just install them with regular pip.

| type | version |
|------|---------|
| gensim | 3.8.1 |
| numpy | 1.17.4 |
| pandas | 0.25.3 |

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

94. Similarity calculation with WordSimilarity-353

Read the evaluation data of The WordSimilarity-353 Test Collection and create a program that computes the similarity between the words in the first and second columns, appending the similarity value to the end of each line. Apply this program to the word vectors created in problem 85 and the word vectors created in problem 90.

Problem supplement

The downloaded ZIP file contains several files, and I used combined.tab among them. The first row is a header row; each subsequent row has a word pair in the first two columns and the human-judged similarity (out of 10 points) in the third column. The program computes the cosine similarity of the pair and sets it in the 4th column (a minimal sketch of the formula follows).
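As a reminder before the code below, cosine similarity is the dot product of two vectors divided by the product of their norms, ranging from -1 to 1. A minimal sketch with toy vectors (not real word vectors):

```python
import numpy as np

# Toy vectors, purely for illustration
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])

# cos(v1, v2) = v1 . v2 / (|v1| * |v2|)
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # -> 1.0 (parallel vectors)
```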

combined.tab


Word 1	Word 2	Human (mean)
love	sex	6.77
tiger	cat	7.35
tiger	tiger	10.00
book	paper	7.46
computer	keyboard	7.62
computer	internet	7.58
plane	car	5.77
train	car	6.31
telephone	communication	7.50
(The rest is omitted)

Answer

Self-made answer program [094.Similarity calculation with WordSimilarity-353](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/094.WordSimilarity-353%E3%81%A7%E3%81%AE%E9%A1%9E%E4%BC%BC%E5%BA%A6%E8%A8%88%E7%AE%97_1.ipynb)

import csv

import numpy as np
import pandas as pd

# No keyword argument was given when saving, so the array is stored under the default key 'arr_0'
matrix_x300 = np.load('./../09.Vector space method(I)/085.matrix_x300.npz')['arr_0']

print('matrix_x300 Shape:', matrix_x300.shape)

# Word-to-row mapping created in problem 83 (the words are in the pandas index)
group_t = pd.read_pickle('./../09.Vector space method(I)/083_group_t.zip')

# Compute cosine similarity for a word pair and append it to the end of the line
def get_cos_similarity(line):
    
    try:
        v1 = matrix_x300[group_t.index.get_loc(line[0])]
        v2 = matrix_x300[group_t.index.get_loc(line[1])]
    
        # If either vector is all zeros, cosine similarity is undefined, so set -1
        if np.count_nonzero(v1) == 0 \
         or np.count_nonzero(v2) == 0:
            line.extend([-1])
        else:
            line.extend([np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))])
    except KeyError:
        # The word is not in the index
        line.extend([-1])
    return line

# Read the evaluation data
with open('./combined.tab') as file_in:
    reader = csv.reader(file_in, delimiter='\t')
    header = next(reader)
    
    result = [get_cos_similarity(line) for line in reader]

with open('094.combine_1.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)

Answer commentary

There is no technical explanation, since this is just a combination of the contents so far. The result is output as a tab-delimited text file without a header line. Perhaps unsurprisingly, the computed similarities are quite different from the similarities assigned by humans (one way to quantify this is sketched after the output below). Since this does not extract similar words, it runs in less than a second.

text:094.combine_1.txt


love	sex	6.77	0.28564147035983395
tiger	cat	7.35	0.848285056343736
tiger	tiger	10.00	1.0000000000000002
book	paper	7.46	0.4900762715360672
computer	keyboard	7.62	0.09513773584009234
computer	internet	7.58	0.2659421289876719
plane	car	5.77	0.48590778050802136
train	car	6.31	0.2976902017313069
telephone	communication	7.50	0.1848868997304664
television	radio	6.77	0.7724947668094843
(The rest is omitted)
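As one way to quantify how different the computed similarities are from the human scores, the two columns can be rank-correlated, which is essentially what problem 95 asks for. A minimal sketch, assuming the file 094.combine_1.txt produced above; the column names are hypothetical:

```python
import pandas as pd

# Load the tab-delimited output written above (it has no header line)
df = pd.read_csv('094.combine_1.txt', sep='\t',
                 names=['word1', 'word2', 'human', 'cosine'])

# Spearman rank correlation between human judgments and cosine similarity;
# note that the -1 placeholder rows are included here as-is
print(df['human'].corr(df['cosine'], method='spearman'))
```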

Gensim version answer program [094.Similarity calculation with WordSimilarity-353](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/094.WordSimilarity-353%E3%81%A7%E3%81%AE%E9%A1%9E%E4%BC%BC%E5%BA%A6%E8%A8%88%E7%AE%97_2.ipynb)

import csv

import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load('./090.word2vec.model')

print(model)

# Compute cosine similarity for a word pair and append it to the end of the line
def get_cos_similarity(line):

    try:
        v1 = model.wv[line[0]]
        v2 = model.wv[line[1]]

        # If either vector is all zeros, cosine similarity is undefined, so set -1
        if  np.count_nonzero(v1) == 0 \
         or np.count_nonzero(v2) == 0:
            line.extend([-1])
        else:
            line.extend([np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))])
    except KeyError:
        # The word is not in the model's vocabulary
        line.extend([-1])
    return line

# Read the evaluation data
with open('./combined.tab') as file_in:
    reader = csv.reader(file_in, delimiter='\t')
    header = next(reader)
    
    result = [get_cos_similarity(line) for line in reader]

with open('094.combine_2.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)

Answer commentary

It is not much different from the self-made program. As for the results, the Gensim version matches the human judgments much better than the self-made program does. (A sketch after the output below shows Gensim's built-in way to compute the same cosine similarity.)

text:094.combine_2.txt


love	sex	6.77	0.5481953
tiger	cat	7.35	0.7811356
tiger	tiger	10.00	1.0
book	paper	7.46	0.5549785
computer	keyboard	7.62	0.6746693
computer	internet	7.58	0.6775914
plane	car	5.77	0.5873176
train	car	6.31	0.6229327
telephone	communication	7.50	0.52026355
television	radio	6.77	0.7744317
(The rest is omitted)
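Incidentally, Gensim can compute the same cosine similarity directly with `wv.similarity`, so the hand-rolled dot/norm code above is mainly useful for handling all-zero vectors and out-of-vocabulary words explicitly. A minimal sketch:

```python
from gensim.models import Word2Vec

model = Word2Vec.load('./090.word2vec.model')

# Equivalent to the manual dot/norm computation for in-vocabulary words;
# raises KeyError for words the model has not seen
print(model.wv.similarity('tiger', 'cat'))
```

Either way, when comparing against the human scores, it is the relative ranking of the pairs that matters rather than the absolute similarity values.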
