[PYTHON] 100 Language Processing Knock-93 (using pandas): Calculate the accuracy rate of analogy tasks

This is the record of the 93rd "Calculation of the accuracy rate of analogy tasks" in Language processing 100 knocks 2015. It's easy to win just by calculating the correct answer rate for the previous knock result. The result of the self-made program is about 25%, and the result when using Gensim is 58%, which is a big difference (there is some doubt as to how to calculate the correct answer rate).

Reference link

Link Remarks
093.Calculation of accuracy rate of analogy task.ipynb Answer program GitHub link
100 amateur language processing knocks:93 I am always indebted to you by knocking 100 language processing

environment

type version Contents
OS Ubuntu18.04.01 LTS It is running virtually
pyenv 1.2.15 I use pyenv because I sometimes use multiple Python environments
Python 3.6.9 python3 on pyenv.6.I'm using 9
3.7 or 3.There is no deep reason not to use 8 series
Packages are managed using venv

In the above environment, I am using the following additional Python packages. Just install with regular pip.

type version
pandas 0.25.3

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

93. Calculation of the accuracy rate of analogy tasks

Use the data created in> 92 to find the correct answer rate for the analogy task of each model.

Answer

Answer program [093. Calculation of accuracy rate of analogy task.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83 % 88% E3% 83% AB% E7% A9% BA% E9% 96% 93% E6% B3% 95% 20 (II) /093.%E3%82%A2%E3%83%8A%E3%83 % AD% E3% 82% B8% E3% 83% BC% E3% 82% BF% E3% 82% B9% E3% 82% AF% E3% 81% AE% E6% AD% A3% E8% A7% A3 % E7% 8E% 87% E3% 81% AE% E8% A8% 88% E7% AE% 97.ipynb)

import pandas as pd

def calc_accuracy(file):
    df = pd.read_table(file, header=None, usecols=[3, 4, 5], names=['word4', 'result', 'similarity'])
    print(df.info())
    print('Total records:', len(df))
    print('Available records:', (df['similarity'] != -1).sum())
    print('Correct records:', (df['word4'] == df['result']).sum())
    print('Accuracy', (df['word4'] == df['result']).sum() / (df['similarity'] != -1).sum())

calc_accuracy('092.analogy_word2vec_1.txt')

calc_accuracy('092.analogy_word2vec_2.txt')

Answer commentary

I think there is a smarter way to write it, but I don't really focus on time. The file is read and the correct answer rate is calculated. I wasn't sure what to do with the denominator, but if I couldn't find a word in the corpus, I excluded it from the denominator.

df = pd.read_table(file, header=None, usecols=[3, 4, 5], names=['word4', 'result', 'similarity'])
print(df.info())
print('Total records:', len(df))
print('Available records:', (df['similarity'] != -1).sum())
print('Correct records:', (df['word4'] == df['result']).sum())
print('Accuracy', (df['word4'] == df['result']).sum() / (df['similarity'] != -1).sum())

This is the result of my own program.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 3 columns):
word4         506 non-null object
result        462 non-null object
similarity    504 non-null float64
dtypes: float64(1), object(2)
memory usage: 12.0+ KB
None
Total records: 506
Available records: 462
Correct records: 114
Accuracy 0.24675324675324675

This is the result when using Gensim. I'm worried that "Available records" are decreasing. Certainly Gensim seems to have logic that Word2Vec does not target when it is infrequent ...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 3 columns):
word4         506 non-null object
result        400 non-null object
similarity    506 non-null float64
dtypes: float64(1), object(2)
memory usage: 12.0+ KB
None
Total records: 506
Available records: 400
Correct records: 231
Accuracy 0.5775

Recommended Posts

100 Language Processing Knock-93 (using pandas): Calculate the accuracy rate of analogy tasks
100 Language Processing Knock-32 (using pandas): Prototype of verb
100 Language Processing Knock-36 (using pandas): Frequency of word occurrence
100 Language Processing Knock-31 (using pandas): Verb
100 Language Processing Knock-38 (using pandas): Histogram
100 language processing knock-77 (using scikit-learn): measurement of correct answer rate
100 Language Processing Knock-33 (using pandas): Sahen noun
100 Language Processing Knock-91: Preparation of Analogy Data
100 Language Processing Knock-35 (using pandas): Noun concatenation
100 Language Processing Knock-39 (using pandas): Zipf's Law
100 Language Processing Knock-34 (using pandas): "A B"
100 language processing knock-20 (using pandas): reading JSON data
100 language processing knock-98 (using pandas): Ward's method clustering
100 language processing knock-75 (using scikit-learn): weight of features
100 language processing knock-99 (using pandas): visualization by t-SNE
100 language processing knock-95 (using pandas): Rating with WordSimilarity-353
100 language processing knock-92 (using Gensim): application to analogy data
100 Language Processing Knock: Chapter 2 UNIX Command Basics (using pandas)
100 Language Processing Knock-83 (using pandas): Measuring word / context frequency
100 language processing knock-30 (using pandas): reading morphological analysis results
100 Language Processing Knock-84 (using pandas): Creating a word context matrix
100 language processing knock-42: Display of the phrase of the person concerned and the person concerned
100 language processing knock-29: Get the URL of the national flag image
100 language processing knock-76 (using scikit-learn): labeling
100 Language Processing Knock-59: Analysis of S-expressions
100 language processing knock-73 (using scikit-learn): learning
100 language processing knock-74 (using scikit-learn): Prediction
100 Language Processing Knock-96 (using Gensim): Extraction of vector for country name
100 language processing knock-97 (using scikit-learn): k-means clustering
100 Language Processing Knock-44: Visualization of Dependent Tree
100 Language Processing Knock-89: Analogy by Additive Constitutiveness
100 Language Processing Knock-26: Removal of emphasized markup
100 Language Processing Knock-71 (using Stanford NLP): Stopword
100 language processing knock-90 (using Gensim): learning with word2vec
100 Language Processing Knock (2020): 28
100 language processing knock-79 (using scikit-learn): precision-recall graph drawing
Calculate the memory sharing rate of Linux processes
[Pandas] Basics of processing date data using dt
100 Language Processing Knock-45: Extraction of verb case patterns
100 Language Processing Knock (2020): 38
100 language processing knock 00 ~ 02
100 language processing knock-72 (using Stanford NLP): feature extraction
Drawing on Jupyter using the plot function of pandas
100 Language Processing Knock-49: Extraction of Dependency Paths Between Nouns
100 language processing knock-94 (using Gensim): similarity calculation with WordSimilarity-353
100 language processing knocks-37 (using pandas): Top 10 most frequent words
Easy learning of 100 language processing knock 2020 with "Google Colaboratory"
100 language processing knock 2020 [00 ~ 39 answer]
100 language processing knock 2020 [00-79 answer]
100 language processing knock 2020 [00 ~ 69 answer]
100 Language Processing Knock 2020 Chapter 1
100 Amateur Language Processing Knock: 17
100 language processing knock 2020 [00 ~ 49 answer]
100 Language Processing Knock-52: Stemming
100 Language Processing Knock Chapter 1
100 Amateur Language Processing Knock: 07
100 Language Processing Knock 2020 Chapter 3
100 Language Processing Knock 2020 Chapter 2
100 Amateur Language Processing Knock: 09
100 Amateur Language Processing Knock: 47
100 Language Processing Knock-53: Tokenization