This is a challenge record of the 100 Language Processing Knock 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2, installed via Anaconda 4.1.1 (64-bit). A list of past knocks is available here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).
enwiki-20150112-400-r10-105752.txt.bz2 is the text of 105,752 articles, randomly sampled at a rate of 1/10 from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words, compressed in bzip2 format. Using this text as a corpus, I want to learn vectors (distributed representations) that express the meanings of words. In the first half of Chapter 9, word vectors are learned step by step by applying principal component analysis to a word-context co-occurrence matrix built from the corpus. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to calculate word similarity and perform analogies.
Note that if problem 83 is implemented straightforwardly, a large amount (about 7GB) of main memory is required. If you run out of memory, devise a workaround or use the 1/100 sampled corpus enwiki-20150112-400-r100-10576.txt.bz2 (/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).
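As a rough, toy-scale sketch of the first-half process described above (counting word-context co-occurrences and compressing the matrix to a fixed number of dimensions), here is a minimal example. It is not the knock code: the function name, the toy sentences, the window width, and the use of scikit-learn's TruncatedSVD in place of the PCA step are all my own assumptions.
from collections import Counter
from scipy import sparse
from sklearn.decomposition import TruncatedSVD


def cooccurrence_matrix(sentences, window=5):
    '''Count word-context pairs within a +/-window and return a sparse matrix and vocabulary.'''
    vocab = {}
    pairs = Counter()
    for words in sentences:
        for i, word in enumerate(words):
            wi = vocab.setdefault(word, len(vocab))
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    ci = vocab.setdefault(words[j], len(vocab))
                    pairs[(wi, ci)] += 1
    rows, cols, counts = zip(*((r, c, v) for (r, c), v in pairs.items()))
    matrix = sparse.csr_matrix((counts, (rows, cols)),
                               shape=(len(vocab), len(vocab)), dtype=float)
    return matrix, vocab


sentences = [['Madrid', 'is', 'the', 'capital', 'of', 'Spain'],
             ['Athens', 'is', 'the', 'capital', 'of', 'Greece']]
matrix, vocab = cooccurrence_matrix(sentences)
svd = TruncatedSVD(n_components=2)  # 300 dimensions for the real corpus, 2 for this toy one
word_vectors = svd.fit_transform(matrix)
print(word_vectors[vocab['Spain']])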
Read the word meaning vectors obtained in problem 85, calculate vec("Spain") - vec("Madrid") + vec("Athens"), and output 10 words with high similarity to that vector together with their similarity.
main.py
# coding: utf-8
import pickle
from collections import OrderedDict
from scipy import io
import numpy as np

fname_dict_index_t = 'dict_index_t'
fname_matrix_x300 = 'matrix_x300'


def cos_sim(vec_a, vec_b):
    '''Calculate cosine similarity

    Computes the cosine similarity of the vectors vec_a and vec_b.

    Return value:
    cosine similarity
    '''
    norm_ab = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if norm_ab != 0:
        return np.dot(vec_a, vec_b) / norm_ab
    else:
        # Return the lowest value, since similarity cannot even be judged when a vector norm is 0
        return -1


# Read the dictionary
with open(fname_dict_index_t, 'rb') as data_file:
    dict_index_t = pickle.load(data_file)

# Read the matrix
matrix_x300 = io.loadmat(fname_matrix_x300)['matrix_x300']

# Cosine similarity with vec("Spain") - vec("Madrid") + vec("Athens")
vec = matrix_x300[dict_index_t['Spain']] \
    - matrix_x300[dict_index_t['Madrid']] \
    + matrix_x300[dict_index_t['Athens']]
distances = [cos_sim(vec, matrix_x300[i])
             for i in range(0, len(dict_index_t))]

# Show top 10
index_sorted = np.argsort(distances)
keys = list(dict_index_t.keys())
for index in index_sorted[:-11:-1]:
    print('{}\t{}'.format(keys[index], distances[index]))
Execution result
Spain 0.8915792748600528
Sweden 0.8719563254078373
Italy 0.8157221349558227
Austria 0.8086425542832402
Netherlands 0.7820356485764023
Denmark 0.7785976171354217
Belgium 0.7654520863664993
Greece 0.7513058649568729
Norway 0.749115358268825
France 0.7441934553247148
Additive compositionality seems to mean that meanings can be computed by adding and subtracting vectors. An analogy is reasoning by analogy; here it seems to refer to inferring something by applying a relationship that has already been confirmed to something else.
Since the vectors created in problem 85 have additive compositionality, word relationships can be extracted by computing differences. For example, subtracting "Madrid" from "Spain" as in this problem yields something like the meaning of "the relationship between a country and its capital". This problem uses that relationship to infer the country for the capital "Athens".
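As a toy illustration of this additive compositionality: the 2-dimensional numbers below are made up for illustration and have nothing to do with the learned 300-dimensional matrix.
import numpy as np

# Made-up 2-dimensional "word vectors", chosen so that the analogy works exactly
vec = {'Spain':  np.array([0.9, 0.2]),
       'Madrid': np.array([0.7, 0.8]),
       'Athens': np.array([0.1, 0.9]),
       'Greece': np.array([0.3, 0.3])}

# The "country minus capital" direction extracted from the Spain/Madrid pair
country_minus_capital = vec['Spain'] - vec['Madrid']   # roughly [0.2, -0.6]

# Adding that direction to "Athens" lands on the corresponding country
guess = vec['Athens'] + country_minus_capital
print(guess, vec['Greece'])   # both print as [0.3 0.3], up to floating-point noise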
vec("Spain") - vec("Madrid") $ \fallingdotseq $ vec("?") - vec("Athens")
Assuming that this relationship holds, rearranging the formula gives
vec("?") $ \fallingdotseq $ vec("Spain") - vec("Madrid") + vec("Athens")
You can compute this right-hand side and look up the words close to that vector, just as in problem 88.
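As an aside, the per-word loop in main.py can also be written as a single matrix-vector product. This is only an alternative sketch, under the assumption that matrix_x300 is a dense NumPy array as in the code above; it should print the same top 10.
import pickle

import numpy as np
from scipy import io

with open('dict_index_t', 'rb') as data_file:
    dict_index_t = pickle.load(data_file)
matrix_x300 = io.loadmat('matrix_x300')['matrix_x300']

vec = matrix_x300[dict_index_t['Spain']] \
    - matrix_x300[dict_index_t['Madrid']] \
    + matrix_x300[dict_index_t['Athens']]

# Cosine similarity of vec against every row at once
norms = np.linalg.norm(matrix_x300, axis=1) * np.linalg.norm(vec)
similarities = np.full(len(norms), -1.0)          # -1 where the norm is 0, as in cos_sim()
nonzero = norms != 0
similarities[nonzero] = (matrix_x300 @ vec)[nonzero] / norms[nonzero]

keys = list(dict_index_t.keys())
for index in np.argsort(similarities)[:-11:-1]:
    print('{}\t{}'.format(keys[index], similarities[index]))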
For the country-capital relationship the correct answer is "Greece", but it came in 8th. Still, since only European countries line up in the results, the analogy worked roughly as expected. There is a limit to what vectors built only from surrounding-word information can do, but it is quite interesting.
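For reference, the rank of a specific word can be checked with the variables from main.py still in scope; this is a small snippet of my own, not part of the knock.
order = np.argsort(distances)[::-1]                           # word indices, most similar first
rank = int(np.where(order == dict_index_t['Greece'])[0][0]) + 1
print(rank)                                                   # 8 in the run above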
That's all for the 90th knock. If there are any mistakes, I would appreciate it if you could point them out.