This is a record of the 95th knock, "Evaluation with WordSimilarity-353", of the Language Processing 100 Knocks 2015. It calculates the **Spearman correlation coefficient** for the results of the previous knock. The result for my self-made program is about 23%, while the result when using Gensim is 52%, which is quite a large gap.
Link | Remarks |
---|---|
095.WordSimilarity-Rating at 353.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:95 | A resource I always rely on when working through the 100 Language Processing Knocks |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running on a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | I use Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |
In addition to the environment above, I use the following Python package. It can be installed with regular pip.
type | version |
---|---|
pandas | 0.25.3 |
In Chapter 10, we will continue to study word vectors from the previous chapter.
> Using the data created in 94, calculate the Spearman correlation coefficient between the similarity ranking output by each model and the human similarity judgment ranking.

**["Spearman's rank correlation coefficient"](https://ja.wikipedia.org/wiki/%E3%82%B9%E3%83%94%E3%82%A2%E3%83%9E%E3%83%B3%E3%81%AE%E9%A0%86%E4%BD%8D%E7%9B%B8%E9%96%A2%E4%BF%82%E6%95%B0)** is a correlation coefficient computed from ranks. The more commonly heard ["Pearson's product-moment correlation coefficient"](https://ja.wikipedia.org/wiki/%E3%83%94%E3%82%A2%E3%82%BD%E3%83%B3%E3%81%AE%E7%A9%8D%E7%8E%87%E7%9B%B8%E9%96%A2%E4%BF%82%E6%95%B0) is computed from the values themselves rather than their ranks. Both take values in the range -1 to 1, and a value close to 1 indicates a strong positive correlation.
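import pandas as pd

def calc_corr(file):
    # Columns 2 and 3 hold the human rating and the calculated similarity
    df = pd.read_table(file, header=None, usecols=[2, 3], names=['original', 'calculated'])
    # Print the Spearman rank correlation matrix
    print(df.corr(method='spearman'))

calc_corr('./094.combine_1.txt')  # result of my own program
calc_corr('./094.combine_2.txt')  # result when using Gensim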
The code is very short because I use pandas. Just pass `spearman` as the `method` argument of the `corr` function.
print(df.corr(method='spearman'))
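To make the difference between `method='spearman'` and the default Pearson method concrete, here is a minimal sketch on made-up toy data (the column names `x` and `y` and the values are hypothetical, not taken from the knock data). It assumes that computing Pearson on the ranks of each column gives the same value as `method='spearman'`, which is the usual definition of Spearman's coefficient.

```python
import pandas as pd

# Hypothetical toy data: y increases monotonically with x, but not linearly.
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1, 8, 27, 64, 125]})

# Pearson uses the raw values, so the non-linear relationship lowers it.
pearson = df.corr(method='pearson').loc['x', 'y']

# Spearman uses only the ranks, so a monotonic relationship scores 1.
spearman = df.corr(method='spearman').loc['x', 'y']

# Spearman computed by hand: rank each column, then apply Pearson.
pearson_on_ranks = df.rank().corr(method='pearson').loc['x', 'y']

print(pearson, spearman, pearson_on_ranks)
```

Because only the ordering matters for Spearman, it suits this task, where the human ratings and the cosine similarities are on different scales.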
This is the result for my own program. The intersection of `original` and `calculated` is the resulting value, which is 0.227916. It's low...
original calculated
original 1.000000 0.227916
calculated 0.227916 1.000000
This is the result when using Gensim. It has risen considerably to 0.516526.
original calculated
original 1.000000 0.516526
calculated 0.516526 1.000000
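If only the single coefficient is needed rather than the whole matrix, the off-diagonal cell can be picked out with `.loc`. The following is a minimal sketch, not the program used above: the function name `spearman_score` and the four in-memory rows are hypothetical stand-ins, and it assumes the same tab-separated layout as the files read earlier (two words, then the human rating, then the calculated similarity).

```python
import io
import pandas as pd

# Hypothetical stand-in for a tab-separated file such as 094.combine_1.txt.
# The words and numbers are made up purely for illustration.
sample = io.StringIO(
    "word_a\tword_b\t1.0\t0.1\n"
    "word_c\tword_d\t2.0\t0.3\n"
    "word_e\tword_f\t3.0\t0.2\n"
    "word_g\tword_h\t4.0\t0.4\n"
)

def spearman_score(file):
    # Read only the human rating (column 2) and calculated similarity (column 3)
    df = pd.read_table(file, header=None, usecols=[2, 3],
                       names=['original', 'calculated'])
    # Take the off-diagonal cell of the 2x2 correlation matrix
    return df.corr(method='spearman').loc['original', 'calculated']

score = spearman_score(sample)
print(f'{score:.6f}')
```

Returning the value instead of printing the matrix makes it easier to compare several models side by side.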