[PYTHON] 100 Language Processing Knock-95 (using pandas): Evaluation with WordSimilarity-353

This is my record of knock 95, "Evaluation with WordSimilarity-353", from the 2015 edition of the 100 Language Processing Knocks. It calculates the **Spearman correlation coefficient** for the results of the previous knock. My self-made program scores about 23%, while the Gensim version scores about 52%, so my program is left far behind.

Reference link

| Link | Remarks |
|:--|:--|
| 095.Evaluation by WordSimilarity-353.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 95 | I always rely on this series when working through the 100 knocks |

Environment

| type | version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |

On top of the above environment, I use the following additional Python package. It installs with regular pip.

| type | version |
|:--|:--|
| pandas | 0.25.3 |

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

95. Evaluation by WordSimilarity-353

Use the data created in knock 94 to calculate the Spearman correlation coefficient between the similarity ranking output by each model and the human similarity judgment ranking.

Problem supplement (Spearman correlation coefficient)

**[Spearman's rank correlation coefficient](https://ja.wikipedia.org/wiki/%E3%82%B9%E3%83%94%E3%82%A2%E3%83%9E%E3%83%B3%E3%81%AE%E9%A0%86%E4%BD%8D%E7%9B%B8%E9%96%A2%E4%BF%82%E6%95%B0)** is a correlation coefficient computed from ranks. The more commonly heard [Pearson's product-moment correlation coefficient](https://ja.wikipedia.org/wiki/%E3%83%94%E3%82%A2%E3%82%BD%E3%83%B3%E3%81%AE%E7%A9%8D%E7%8E%87%E7%9B%B8%E9%96%A2%E4%BF%82%E6%95%B0) is computed from the values themselves rather than from their ranks. Both take values in the range from -1 to 1, where 1 means a strong positive correlation.

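The relationship between the two coefficients can be checked with a small sketch. The toy numbers below are hypothetical and are not taken from WordSimilarity-353. The sketch assumes only pandas and shows that the Spearman coefficient should equal the Pearson coefficient applied to the ranks, while Pearson on the raw values generally gives a different number.

```python
import pandas as pd

# Hypothetical toy data (not from WordSimilarity-353).
df = pd.DataFrame({
    'original':   [1.0, 2.0, 3.0, 4.0, 5.0],
    'calculated': [0.1, 0.4, 0.3, 0.9, 0.8],
})

# Spearman: correlation computed from the ranks of each column.
spearman = df.corr(method='spearman').loc['original', 'calculated']

# Same thing by hand: rank each column, then apply Pearson to the ranks.
pearson_on_ranks = df.rank().corr(method='pearson').loc['original', 'calculated']

# Pearson on the raw values uses the magnitudes, not the ranks.
pearson_raw = df.corr(method='pearson').loc['original', 'calculated']

print(spearman, pearson_on_ranks, pearson_raw)
```

Because Spearman looks only at the ordering, it is a natural fit for comparing a model's similarity ranking with a human ranking, where the two score scales differ.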
Answer

Answer program [095.Evaluation by WordSimilarity-353.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/095.WordSimilarity-353%E3%81%A7%E3%81%AE%E8%A9%95%E4%BE%A1.ipynb)

import pandas as pd

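def calc_corr(file):
    # Read only columns 2 and 3 of the tab-separated file and name them
    df = pd.read_table(file, header=None, usecols=[2, 3], names=['original', 'calculated'])
    # Print the Spearman rank correlation matrix of the two columns
    print(df.corr(method='spearman'))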

calc_corr('./094.combine_1.txt')

calc_corr('./094.combine_2.txt')

Answer commentary

The code is very short because I use pandas. Just pass `'spearman'` as the `method` argument of the `corr` function.

print(df.corr(method='spearman'))

This is the result of my own program. The value where `original` and `calculated` intersect is the result: 0.227916. It's low...

            original  calculated
original    1.000000    0.227916
calculated  0.227916    1.000000

This is the result when using Gensim. It has risen considerably to 0.516526.

            original  calculated
original    1.000000    0.516526
calculated  0.516526    1.000000
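
If only the single coefficient is needed rather than the whole matrix, it can be picked out with `.loc`. The following is a minimal sketch, not the original notebook's code: the function name `spearman_from_file` is hypothetical, and the in-memory sample assumes a tab-separated layout of two words, a human score, and a computed similarity, with made-up values.

```python
import io
import pandas as pd

def spearman_from_file(file):
    """Return the Spearman coefficient between columns 2 and 3 as a number."""
    df = pd.read_table(file, header=None, usecols=[2, 3],
                       names=['original', 'calculated'])
    # The off-diagonal cell of the 2x2 matrix is the coefficient itself.
    return df.corr(method='spearman').loc['original', 'calculated']

# Hypothetical stand-in for a combined file (assumed layout:
# word1, word2, human score, computed similarity; values are made up).
sample = io.StringIO(
    "w1\tw2\t1.0\t0.1\n"
    "w3\tw4\t2.0\t0.4\n"
    "w5\tw6\t3.0\t0.3\n"
    "w7\tw8\t4.0\t0.9\n"
    "w9\tw10\t5.0\t0.8\n"
)

rho = spearman_from_file(sample)
print(f'{rho:.6f}')
```

Returning the value instead of printing it makes it easier to compare the two models side by side or to store the scores for later.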
