[PYTHON] 100 amateur language processing knocks: 92

This is a record of my challenge of Language Processing 100 Knock 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). Click here for a list of past knocks (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 10: Vector Space Methods (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

92. Application to analogy data

For each case in the evaluation data created in problem 91, compute vec(word in the second column) - vec(word in the first column) + vec(word in the third column), then find the word whose vector is most similar to that vector, along with its similarity. Append the obtained word and similarity to the end of each case. Apply this program to the word vectors created in problem 85 and to the word vectors created in problem 90.
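To make the arithmetic concrete: for a case such as "boy girl brother sister" from family.txt, the program computes vec("girl") - vec("boy") + vec("brother"), and the word nearest to that result should ideally be "sister". Here is a minimal sketch of that one step, using the same dict_index_t (word-to-row-index dictionary) and matrix_x300 (one word vector per row) that the finished code below loads:

#Analogy vector for the case "boy girl brother sister":
#vec(girl) - vec(boy) + vec(brother), ideally close to vec(sister)
vec = matrix_x300[dict_index_t['girl']] \
		- matrix_x300[dict_index_t['boy']] \
		+ matrix_x300[dict_index_t['brother']]
#The answer is the vocabulary word whose vector has the highest
#cosine similarity to vec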

The finished code:

main.py


# coding: utf-8
import pickle
from collections import OrderedDict
from scipy import io
import numpy as np

fname_dict_index_t = 'dict_index_t'
fname_matrix_x300 = 'matrix_x300'
fname_input = 'family.txt'
fname_output = 'family_out.txt'


def cos_sim(vec_a, vec_b):
	'''Compute the cosine similarity of vectors vec_a and vec_b.

	Return value:
	Cosine similarity
	'''
	norm_ab = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
	if norm_ab != 0:
		return np.dot(vec_a, vec_b) / norm_ab
	else:
		#If either norm is 0, the similarity cannot even be determined,
		#so return the lowest possible value
		return -1


#Read the dictionary
with open(fname_dict_index_t, 'rb') as data_file:
	dict_index_t = pickle.load(data_file)
keys = list(dict_index_t.keys())

#Read the matrix
matrix_x300 = io.loadmat(fname_matrix_x300)['matrix_x300']

#Read evaluation data
with open(fname_input, 'rt') as data_file, \
		open(fname_output, 'wt') as out_file:

	for line in data_file:
		cols = line.split(' ')

		try:

			#Vector calculation
			vec = matrix_x300[dict_index_t[cols[1]]] \
					- matrix_x300[dict_index_t[cols[0]]] \
					+ matrix_x300[dict_index_t[cols[2]]]

			#Extract words with the highest cosine similarity
			dist_max = -1
			index_max = 0
			result = ''
			for i in range(len(dict_index_t)):
				dist = cos_sim(vec, matrix_x300[i])
				if dist > dist_max:
					index_max = i
					dist_max = dist

			result = keys[index_max]

		except KeyError:

			#If a word is not in the vocabulary, output an empty string
			#as the word and -1 as the cosine similarity
			result = ''
			dist_max = -1

		#output
		print('{} {} {}'.format(line.strip(), result, dist_max), file=out_file)
		print('{} {} {}'.format(line.strip(), result, dist_max))

Execution result:

The result is saved in "family_out.txt". Since the processing takes a long time, I output the same contents to the screen as well as to the file. On my machine, the execution took about 1 hour and 50 minutes for the word vectors from problem 85 and about 25 minutes for the word vectors from problem 90. Is there a way to make it a little faster...
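One idea for speeding it up (my own sketch, not something the program above does) is to normalize all the row vectors once up front and replace the inner Python loop with a single matrix-vector product, assuming matrix_x300 is a dense NumPy array:

#Hypothetical vectorized version of the similarity search:
#precompute the row norms once; all-zero rows are masked so they
#get a similarity of -1, matching cos_sim() above
norms = np.linalg.norm(matrix_x300, axis=1)
zero_rows = (norms == 0)
norms[zero_rows] = 1  #avoid division by zero
matrix_norm = matrix_x300 / norms[:, np.newaxis]

def most_similar(vec):
	norm = np.linalg.norm(vec)
	if norm == 0:
		return 0, -1
	sims = matrix_norm.dot(vec / norm)  #cosine similarity to every row
	sims[zero_rows] = -1
	index_max = int(np.argmax(sims))
	return index_max, float(sims[index_max])

Inside the main loop, "index_max, dist_max = most_similar(vec)" followed by "result = keys[index_max]" would then replace the per-word comparison.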

Below is the beginning of the result for Problem 90.

The beginning of family_out.txt for the word vectors from problem 90


boy girl brother sister brother 0.9401630421547305
boy girl brothers sisters brothers 0.8945072765275828
boy girl dad mom girl 0.7280971994658558
boy girl father mother father 0.9436608943376003
boy girl grandfather grandmother grandfather 0.8252139862667345
boy girl grandpa grandma  -1
boy girl grandson granddaughter granddaughter 0.8146889309237173
boy girl groom bride girl 0.7017715459762993
boy girl he she he 0.9651317504873835
boy girl his her his 0.9587287668802774
boy girl husband wife husband 0.9468113068648676
boy girl king queen king 0.9286736850752637
boy girl man woman man 0.9452997293828569
boy girl nephew niece niece 0.8271499425140075
boy girl policeman policewoman girl 0.7420750545104479
boy girl prince princess prince 0.7707574165422014
boy girl son daughter son 0.9564752654538731
boy girl sons daughters sons 0.9366514358470139
boy girl stepbrother stepsister  -1
boy girl stepfather stepmother girl 0.680540253333323
(Omitted below)

Application to analogy data

What this problem asks for is almost the same as problem 89. We simply repeat that computation for each case in the evaluation data created in problem 91.

Note that a word in the evaluation data created in problem 91 may not be present in the word vectors: words that do not appear in the Wikipedia sample data are missing, and since word2vec is configured to remove low-frequency words, those are missing as well. No computation is possible for a missing word, so the program outputs an empty string as the resulting word and -1 as the similarity.
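The program handles this with the try/except around the dictionary lookups. An equivalent up-front membership test (just an alternative sketch, reusing cols, dict_index_t, and matrix_x300 from the main program) would look like this:

#Alternative sketch: check vocabulary membership explicitly instead of
#catching KeyError; the behavior is the same as the main program
if all(w in dict_index_t for w in cols[:3]):
	vec = matrix_x300[dict_index_t[cols[1]]] \
			- matrix_x300[dict_index_t[cols[0]]] \
			+ matrix_x300[dict_index_t[cols[2]]]
else:
	#At least one word is out of vocabulary
	result = ''
	dist_max = -1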

The execution result above is for the word2vec word vectors created in problem 90. Looking at just this beginning part, the only correct answers are "boy girl grandson granddaughter" and "boy girl nephew niece"; the rest are almost all wrong... The accuracy rate will be calculated in problem 93.

That's all for the 92nd knock. If you find any mistakes, I would appreciate it if you could point them out.

