[PYTHON] 100 amateur language processing knocks: 95

This is a record of my attempt at the Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of past knocks is here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 10: Vector Space Method (II)

In Chapter 10, we continue working with the word vectors from the previous chapter.

95. Evaluation by WordSimilarity-353

Using the data created in 94, calculate the Spearman correlation coefficient between the similarity ranking output by each model and the ranking from human similarity judgments.

The finished code:

main.py


# coding: utf-8
import numpy as np

fname_input = 'combined_out.tab'

with open(fname_input, 'rt') as data_file:

	# Similarity scores: human judgments and model output
	human_score = []
	my_score = []
	N = 0

	for line in data_file:
		cols = line.split('\t')
		human_score.append(float(cols[2]))
		my_score.append(float(cols[3]))
		N += 1

# Indices that would sort each score list in ascending order
human_index_sorted = np.argsort(human_score)
my_index_sorted = np.argsort(my_score)

# Rank arrays: order[k] is the rank of row k
human_order = [0] * N
my_order = [0] * N
for i in range(N):
	human_order[human_index_sorted[i]] = i
	my_order[my_index_sorted[i]] = i

#Spearman correlation coefficient calculation
total = 0
for i in range(N):
	total += pow(human_order[i] - my_order[i], 2)
result = 1 - (6 * total) / (pow(N, 3) - N)

print(result)

Execution result:

Results for the word vectors from Problem 85

0.22645511508225769

Results for the word vectors from Problem 90

0.5013384068756902

Spearman's rank correlation coefficient

Spearman's rank correlation coefficient indicates how strongly two rankings, like the ones here, are correlated. Searching for "Spearman's rank correlation coefficient" turns up plenty of detailed explanations, so I will skip those here. It can be computed with the following formula.

\rho = 1 - \frac{6 \sum D^2}{N^3 - N}

Here $ D $ is the difference between the two ranks of each data item, and $ N $ is the number of data items. The maximum value is 1, and the higher the value, the stronger the correlation. In this problem, a larger value means the model's ranking is closer to human judgment.
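As a small worked example of the formula, the sketch below applies it to two made-up rankings of five items. The function name `spearman_from_ranks` and the toy rank lists are assumptions for illustration only; they are not part of main.py.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Apply rho = 1 - 6 * sum(D^2) / (N^3 - N) to two rank lists."""
    n = len(rank_a)
    # D is the per-item difference between the two ranks
    total = sum((ra - rb) ** 2 for ra, rb in zip(rank_a, rank_b))
    return 1 - (6 * total) / (n ** 3 - n)


# Toy data (hypothetical): two neighbouring pairs are swapped
human_rank = [0, 1, 2, 3, 4]
model_rank = [1, 0, 2, 4, 3]

# D = [-1, 1, 0, -1, 1], so sum(D^2) = 4 and N^3 - N = 120
rho = spearman_from_ranks(human_rank, model_rank)
print(rho)
```

Identical rankings give 1, and a fully reversed ranking gives -1, which is a quick sanity check on any implementation.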

To find $ D $, each line of "combined_out.tab" created in Problem 94 first needs two ranks: its rank by human similarity and its rank by word-vector similarity. A plain sort loses track of the original lines, so the rank difference can no longer be computed. Instead I used Numpy's argsort(), which I also used in Problem 75, to get the sorted indices, and then used them to build a new array holding the rank of each line.
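The index-to-rank step can be easier to follow on a tiny input. The sketch below uses a pure-Python stand-in for `numpy.argsort` so that it runs without Numpy; the helper names `argsort` and `ranks_from_scores` and the four toy scores are assumptions for illustration.

```python
def argsort(values):
    """Indices that would sort values in ascending order
    (a pure-Python stand-in for numpy.argsort)."""
    return sorted(range(len(values)), key=lambda k: values[k])


def ranks_from_scores(scores):
    """Return a list where element k is the rank of scores[k] (0 = smallest)."""
    index_sorted = argsort(scores)
    order = [0] * len(scores)
    # index_sorted[rank] is the original row that ends up at that rank,
    # so writing rank into order[row] inverts the mapping
    for rank, row in enumerate(index_sorted):
        order[row] = rank
    return order


# Toy scores (hypothetical), one per line of the input file
scores = [7.35, 1.20, 9.10, 4.00]
print(argsort(scores))            # [1, 3, 0, 2]
print(ranks_from_scores(scores))  # [2, 0, 3, 1]
```

The second list stays aligned with the original lines, which is what makes the per-line rank difference computable.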

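The result is an overwhelming victory for word2vec: about 0.50 for the Problem 90 word vectors against about 0.23 for the Problem 85 word vectors. word2vec is amazing.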

main2.py (version suggested by shiracamus)


# coding: utf-8

fname_input = 'combined_out.tab'

class Data:
    def __init__(self, human_score, my_score):
        self.human_score = human_score
        self.my_score = my_score

    def __repr__(self):
        return 'Data%s' % repr(self.__dict__)

# Build the list of Data objects, one per input line
with open(fname_input) as data_file:
    def read_data():
        for line in data_file:
            word1, word2, human_score, my_score = line.split('\t')
            yield Data(float(human_score), float(my_score))
    data = list(read_data())

# Assign ranks by human score, then by model score
data_sorted_by_human_score = sorted(data, key=lambda data: data.human_score)
for order, d in enumerate(data_sorted_by_human_score):
    d.human_order = order

data_sorted_by_my_score = sorted(data, key=lambda data: data.my_score)
for order, d in enumerate(data_sorted_by_my_score):
    d.my_order = order

#Spearman correlation coefficient calculation
N = len(data)
total = sum((d.human_order - d.my_order) ** 2 for d in data)
result = 1 - (6 * total) / (N ** 3 - N)

print(result)
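One caveat about both versions above: the formula with $ D $ is exact only when no two lines share the same score. `np.argsort` and `sorted()` both give tied scores distinct consecutive ranks, so if "combined_out.tab" contains ties, the result depends slightly on how they happen to be ordered. The usual tie-aware definition gives tied items the average of their ranks and then takes the Pearson correlation of the two rank lists. The following is a minimal sketch under that definition, not the method used in this article; all function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values equal to values[order[i]]
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in order[i:j + 1]:
            ranks[k] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman_with_ties(a, b):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


print(average_ranks([5.0, 7.0, 7.0, 9.0]))  # [1.0, 2.5, 2.5, 4.0]
```

When there are no ties, this agrees with the $ D $ formula, so it can be swapped in for comparison without changing the input handling.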

That's all for the 96th knock. If you find any mistakes, I would appreciate it if you could point them out.

