This is a record of problem 89, "Analogy by Additive Composition", from Language Processing 100 Knocks 2015. As the name "additive composition" suggests, the result is obtained through vector arithmetic. This is the famous "king + woman - man = queen" calculation. It is the kind of calculation you want to try on all sorts of things in the world, such as "boss - competent = ?".

Reference links

Link	Remarks
089.Analogy by Additive Composition.ipynb	GitHub link to the answer program
Amateur Language Processing 100 Knocks: 89	A series I always rely on when working through the 100 knocks

Environment

Type	Version	Notes
OS	Ubuntu 18.04.01 LTS	Running in a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes use multiple Python environments
Python	3.6.9	Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv

In the above environment, I use the following additional Python packages. They can be installed with regular pip.

Type	Version
numpy	1.17.4
pandas	0.25.3

Task

Chapter 9: Vector Space Method (I)

enwiki-20150112-400-r10-105752.txt.bz2 is the bzip2-compressed text of 105,752 articles, randomly sampled at a rate of 1/10 from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meanings of words. In the first half of Chapter 9, principal component analysis is applied to the word-context co-occurrence matrix built from the corpus, and the process of learning word vectors is implemented in several steps. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to compute word similarities and perform analogies.
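
To make the "principal component analysis on a co-occurrence matrix" step concrete, here is a minimal sketch using only NumPy. The tiny matrix and the helper name `pca_reduce` are made up for illustration, and this is not necessarily how problems 83 to 85 of this series implement it; it only shows the general idea of compressing co-occurrence rows into shorter word vectors.

```python
import numpy as np

# Hypothetical 4-word x 3-context co-occurrence counts (made-up numbers).
cooc = np.array([
    [4.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
    [0.0, 5.0, 2.0],
    [1.0, 4.0, 3.0],
])

def pca_reduce(matrix, n_components):
    """Project the rows onto the top principal components via SVD."""
    # PCA works on mean-centered columns.
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of each row in the reduced space.
    return centered @ vt[:n_components].T

word_vectors = pca_reduce(cooc, n_components=2)
print(word_vectors.shape)  # (4, 2)
```

In the actual task the matrix has one row per word and is reduced to 300 columns rather than 2.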

Note that a straightforward implementation of problem 83 requires a large amount (about 7 GB) of main memory. If you run out of memory, either devise a workaround or use the 1/100 sampling corpus enwiki-20150112-400-r100-10576.txt.bz2.

This time, the 1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2) is used.

89. Analogy by Additive Composition

Read the word meaning vectors obtained in problem 85, calculate vec("Spain") - vec("Madrid") + vec("Athens"), and output the 10 words most similar to that vector together with their similarities.
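
As a rough illustration of what the task is asking for, the following sketch runs the same arithmetic on hand-made 2-dimensional vectors. The vectors and the `toy` dictionary are assumptions chosen so that the analogy works out perfectly; real learned vectors are far noisier.

```python
import numpy as np

# Hypothetical 2-D vectors: axis 0 ~ "which country", axis 1 ~ "is a capital".
toy = {
    'Spain':  np.array([1.0, 0.0]),
    'Madrid': np.array([1.0, 1.0]),
    'Greece': np.array([-1.0, 0.0]),
    'Athens': np.array([-1.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Spain - Madrid removes the "Spanish capital" part; adding Athens moves to Greece.
query = toy['Spain'] - toy['Madrid'] + toy['Athens']
ranked = sorted(toy, key=lambda w: cosine(query, toy[w]), reverse=True)
print(ranked[0])  # Greece
```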

Answer

Answer program: [089.Analogy by Additive Composition.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/09.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(I)/089.%E5%8A%A0%E6%B3%95%E6%A7%8B%E6%88%90%E6%80%A7%E3%81%AB%E3%82%88%E3%82%8B%E3%82%A2%E3%83%8A%E3%83%AD%E3%82%B8%E3%83%BC.ipynb)

import numpy as np
import pandas as pd

# No name was given when saving, so the array is stored under 'arr_0'
matrix_x300 = np.load('085.matrix_x300.npz')['arr_0']

print('matrix_x300 Shape:', matrix_x300.shape)

group_t = pd.read_pickle('./083_group_t.zip')


# Compute vec("Spain") - vec("Madrid") + vec("Athens")
vec = matrix_x300[group_t.index.get_loc('Spain')] \
      - matrix_x300[group_t.index.get_loc('Madrid')] \
      + matrix_x300[group_t.index.get_loc('Athens')]
vec_norm = np.linalg.norm(vec)

# Cosine similarity between the analogy vector `vec` and v2
def get_cos_similarity(v2):
    # Return -1 if v2 is all zeros (avoids division by zero)
    if np.count_nonzero(v2) == 0:
        return -1
    else:
        return np.dot(vec, v2) / (vec_norm * np.linalg.norm(v2))

cos_sim = [get_cos_similarity(matrix_x300[i]) for i in range(len(group_t))]
print('Cosine similarity result length:', len(cos_sim))

# argsort returns the indices that would sort the list in ascending order
cos_sim_sorted = np.argsort(cos_sim)

# Walk backwards from the end of the ascending order and print the top 10 (stopping before index -11)
for index in cos_sim_sorted[:-11:-1]:
    print('{}\t{}'.format(group_t.index[index], cos_sim[index]))
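
The list comprehension above calls the similarity function once per word. As an alternative sketch (not the code used in the notebook), the same cosine similarities can be computed for all rows at once with NumPy. The function name `cosine_to_all` and the small `demo` matrix are made up for illustration; with the real data, `matrix_x300` and `vec` would be passed instead.

```python
import numpy as np

def cosine_to_all(matrix, query):
    """Cosine similarity between `query` and every row of `matrix`.

    All-zero rows get -1, mirroring the loop version.
    """
    norms = np.linalg.norm(matrix, axis=1)
    sims = np.full(matrix.shape[0], -1.0)
    nonzero = norms > 0
    sims[nonzero] = (matrix[nonzero] @ query) / (norms[nonzero] * np.linalg.norm(query))
    return sims

# Small made-up matrix standing in for matrix_x300.
demo = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [-2.0, 0.0]])
sims = cosine_to_all(demo, np.array([1.0, 0.0]))

# Indices of the two most similar rows, best first.
top = np.argsort(sims)[::-1][:2]
print(top, sims[top])
```

This avoids the Python-level loop over the whole vocabulary, which tends to matter more as the vocabulary grows.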

Answer explanation

This is the core of this problem. It is nothing more than vector addition and subtraction.

# Compute vec("Spain") - vec("Madrid") + vec("Athens")
vec = matrix_x300[group_t.index.get_loc('Spain')] \
      - matrix_x300[group_t.index.get_loc('Madrid')] \
      + matrix_x300[group_t.index.get_loc('Athens')]

This is the final output. Since the capital Madrid is subtracted from Spain and Athens is added, Greece would presumably be the semantically correct answer. Greece, however, came in 12th place with a cosine similarity of 0.686.

Spain	0.8178213952646727
Sweden	0.8071582503798717
Austria	0.7795030693787409
Italy	0.7466099164394225
Germany	0.7429125848677439
Belgium	0.729240312232219
Netherlands	0.7193045612969573
Télévisions	0.7067876635156688
Denmark	0.7062857691945504
France	0.7014078181006329
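
A rank such as Greece's 12th place can be looked up from the similarity list with a small helper like the one sketched below. The name `rank_of` and the sample numbers are hypothetical and are not the real results; with the real data, `words` would be `group_t.index` and `sims` would be `cos_sim`. The optional `exclude` argument reflects a common convention in analogy tasks of dropping the input words (here Spain itself tops the list), which the code above does not do.

```python
import numpy as np

def rank_of(word, words, sims, exclude=()):
    """1-based rank of `word` by descending similarity, skipping `exclude`."""
    order = [words[i] for i in np.argsort(sims)[::-1] if words[i] not in exclude]
    return order.index(word) + 1

# Made-up similarities for illustration only.
words = ['Spain', 'Sweden', 'Greece', 'Athens']
sims = np.array([0.9, 0.8, 0.7, 0.6])

print(rank_of('Greece', words, sims))                               # 3
print(rank_of('Greece', words, sims, exclude=('Spain', 'Athens')))  # 2
```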
