[PYTHON] 100 amateur language processing knocks: 88

It is a challenge record of Language processing 100 knock 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 : : Anaconda 4.1.1 (64-bit). Click here for a list of past knocks (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 9: Vector Space Law (I)

enwiki-20150112-400-r10-105752.txt.bz2 Is the text of 105,752 articles randomly sampled 1/10 from the English Wikipedia articles as of January 12, 2015, which consist of more than 400 words, compressed in bzip2 format. is there. Using this text as a corpus, I want to learn a vector (distributed expression) that expresses the meaning of a word. In the first half of Chapter 9, principal component analysis is applied to the word context co-occurrence matrix created from the corpus, and the process of learning word vectors is implemented by dividing it into several processes. In the latter half of Chapter 9, the word vector (300 dimensions) obtained by learning is used to calculate the similarity of words and perform analogy.

Note that if problem 83 is implemented obediently, a large amount (about 7GB) of main memory is required. If you run out of memory, devise a process or 1/100 sampling corpus enwiki-20150112-400-r100-10576.txt.bz2 Use /nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).

88. 10 words with high similarity

Read the meaning vector of the word obtained in> 85, and output 10 words with high cosine similarity to "England" and their similarity.

The finished code:

main.py


# coding: utf-8
import pickle
from collections import OrderedDict
from scipy import io
import numpy as np

fname_dict_index_t = 'dict_index_t'
fname_matrix_x300 = 'matrix_x300'


def cos_sim(vec_a, vec_b):
	'''Calculation of cosine similarity
Vector vec_a、vec_Find the cosine similarity of b

Return value:
Cosine similarity
	'''
	norm_ab = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
	if norm_ab != 0:
		return np.dot(vec_a, vec_b) / norm_ab
	else:
		#The lowest value because it is not even possible to determine if the vector norm is similar to 0
		return -1


#Read dictionary
with open(fname_dict_index_t, 'rb') as data_file:
		dict_index_t = pickle.load(data_file)

#Matrix reading
matrix_x300 = io.loadmat(fname_matrix_x300)['matrix_x300']

# 'England'Cosine similarity calculation with
vec_England = matrix_x300[dict_index_t['England']]
distances = [cos_sim(vec_England, matrix_x300[i])
		for i in range(0, len(dict_index_t))]

#Show top 10
index_sorted = np.argsort(distances)
keys = list(dict_index_t.keys())
for index in index_sorted[-2:-12:-1]:		#Excluding myself who comes to the top
	print('{}\t{}'.format(keys[index], distances[index]))

Execution result:

Execution result


Scotland	0.6780631362432838
Australia	0.6439496692044923
Wales	0.6352223096061712
Italy	0.5993389833593241
Spain	0.5810143958505265
France	0.5711030646029182
Japan	0.5709618229888032
Germany	0.5377148103064543
Ireland	0.5374312543293124
Europe	0.4868884673753479

About execution result

Looking at the results, the country names are lined up, so it seems that the fact that "England" is a country is a characteristic. It's also surprising that British countries such as "Scotland", "Wales" and "Ireland" are in the Top 10. I'm not parsing sentences or using synonym dictionaries, but simply vectorizing the contextual words used around the word can present words that resemble human senses so far. Is interesting because it makes me feel various possibilities.

That's all for the 89th knock. If you have any mistakes, I would appreciate it if you could point them out.


Recommended Posts

100 amateur language processing knocks: 41
100 amateur language processing knocks: 71
100 amateur language processing knocks: 56
100 amateur language processing knocks: 24
100 amateur language processing knocks: 59
100 amateur language processing knocks: 70
100 amateur language processing knocks: 60
100 amateur language processing knocks: 92
100 amateur language processing knocks: 30
100 amateur language processing knocks: 06
100 amateur language processing knocks: 84
100 amateur language processing knocks: 81
100 amateur language processing knocks: 33
100 amateur language processing knocks: 88
100 amateur language processing knocks: 89
100 amateur language processing knocks: 40
100 amateur language processing knocks: 45
100 amateur language processing knocks: 43
100 amateur language processing knocks: 55
100 amateur language processing knocks: 22
100 amateur language processing knocks: 61
100 amateur language processing knocks: 94
100 amateur language processing knocks: 54
100 amateur language processing knocks: 04
100 amateur language processing knocks: 63
100 amateur language processing knocks: 78
100 amateur language processing knocks: 12
100 amateur language processing knocks: 14
100 amateur language processing knocks: 08
100 amateur language processing knocks: 42
100 amateur language processing knocks: 19
100 amateur language processing knocks: 73
100 amateur language processing knocks: 75
100 amateur language processing knocks: 98
100 amateur language processing knocks: 83
100 amateur language processing knocks: 95
100 amateur language processing knocks: 32
100 amateur language processing knocks: 96
100 amateur language processing knocks: 87
100 amateur language processing knocks: 72
100 amateur language processing knocks: 79
100 amateur language processing knocks: 23
100 amateur language processing knocks: 05
100 amateur language processing knocks: 00
100 amateur language processing knocks: 02
100 amateur language processing knocks: 37
100 amateur language processing knocks: 21
100 amateur language processing knocks: 68
100 amateur language processing knocks: 11
100 amateur language processing knocks: 90
100 amateur language processing knocks: 74
100 amateur language processing knocks: 66
100 amateur language processing knocks: 28
100 amateur language processing knocks: 64
100 amateur language processing knocks: 34
100 amateur language processing knocks: 36
100 amateur language processing knocks: 77
100 amateur language processing knocks: 01
100 amateur language processing knocks: 16
100 amateur language processing knocks: 27
100 amateur language processing knocks: 10