100 language processing knock-77 (using scikit-learn): measurement of correct answer rate

This is a record of problem 77, "Measurement of accuracy rate", from Language Processing 100 Knocks 2015. The problem asks for the accuracy on the training data, but as in the previous article I deliberately use the test data instead. Until now I had not posted my answers to my blog because they were basically the same as "Amateur language processing 100 knocks", but for "Chapter 8: Machine Learning" I worked through the problems seriously and changed things to some extent, so I am posting them. I mainly use scikit-learn.

Reference links

| Link | Remarks |
|:--|:--|
| 077.Measurement of correct answer rate.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 77 | The article I always rely on for the 100 language processing knocks |
| Getting started with Python with 100 language processing knocks #77 - Machine learning, measurement of correct answer rate with scikit-learn | Knock results using scikit-learn |

Environment

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | I'm using Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| Type | Version |
|:--|:--|
| matplotlib | 3.1.1 |
| numpy | 1.17.4 |
| pandas | 0.25.3 |
| scikit-learn | 0.21.3 |

Task

Chapter 8: Machine Learning

In this chapter, we use the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from the Movie Review Data published by Bo Pang and Lillian Lee to work on the task of classifying sentences as positive or negative (polarity analysis).

77. Measurement of correct answer rate

> Create a program that receives the output of 76 and calculates the accuracy of the predictions, the precision for the positive examples, the recall, and the F1 score.

This time, I ignored the part about "receiving the output of 76" and implemented it against the test data. As in the previous article, I thought the test data would be more useful than the training data.
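
For reference, the four metrics named in the task can be written out directly from the counts of true/false positives and negatives. The following is a minimal sketch in plain Python; the function name `binary_metrics` and the sample labels are made up for illustration and are not part of the answer program.

```python
def binary_metrics(y_true, y_pred, positive='1'):
    """Return accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


# Made-up labels for illustration only (strings, like the labels in this article)
sample_true = ['1', '1', '1', '0', '0', '0', '0', '1']
sample_pred = ['1', '0', '1', '0', '1', '0', '0', '1']
metrics = binary_metrics(sample_true, sample_pred)
print(metrics)
```

In the answer below, scikit-learn computes these values, so a hand-written function like this is only useful for checking what each number means.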

Answer

Answer program: [077. Measurement of correct answer rate.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/077.%E6%AD%A3%E8%A7%A3%E7%8E%87%E3%81%AE%E8%A8%88%E6%B8%AC.ipynb)

Basically, this is just the previous ["076. Labeling.ipynb"](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/076.%E3%83%A9%E3%83%99%E3%83%AB%E4%BB%98%E3%81%91.ipynb) with output logic for the accuracy and related metrics added.

import csv

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Class that lets GridSearchCV switch between the two word vectorization methods
class myVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, method='tfidf', min_df=0.0005, max_df=0.10):
        self.method = method
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, x, y=None):
        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        self.vectorizer.fit(x)
        return self

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)
		
# Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method': ['tfidf', 'count'],
        'vectorizer__min_df': [0.0003, 0.0004],
        'vectorizer__max_df': [0.07, 0.10],
        'classifier__C': [1, 3],    # I also tried 10, but it was only slower and the score was lower
        'classifier__solver': ['newton-cg', 'liblinear'],
    },
]

# Read one column of the tab-separated file
def read_csv_column(col):
    with open('./sentiment_stem.txt') as file:
        reader = csv.reader(file, delimiter='\t')
        next(reader)  # Skip the header row
        return [row[col] for row in reader]

x_all = read_csv_column(1)
y_all = read_csv_column(0)
x_train, x_test, y_train, y_test = train_test_split(x_all, y_all)

def train(x_train, y_train, file):
    pipeline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])

    # clf stands for classification
    clf = GridSearchCV(
            pipeline,             # Pipeline to optimize (vectorizer + classifier)
            PARAMETERS,           # Parameter set you want to optimize
            cv = 5)               # Number of cross-validation folds
    
    clf.fit(x_train, y_train)
    pd.DataFrame.from_dict(clf.cv_results_).to_csv(file)

    print('Grid Search Best parameters:', clf.best_params_)
    print('Grid Search Best validation score:', clf.best_score_)
    print('Grid Search Best training score:', clf.best_estimator_.score(x_train, y_train))    
    
    #Feature weight output
    output_coef(clf.best_estimator_)
    
    return clf.best_estimator_

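# Feature weight output
def output_coef(estimator):
    vec = estimator.named_steps['vectorizer']
    clf = estimator.named_steps['classifier']

    coef_df = pd.DataFrame([clf.coef_[0]]).T.rename(columns={0: 'Coefficients'})
    # get_feature_names() works on scikit-learn 0.21.3; from 1.0 onward its replacement is get_feature_names_out()
    coef_df.index = vec.vectorizer.get_feature_names()
    coef_sort = coef_df.sort_values('Coefficients')
    coef_sort[:10].plot.barh()        # 10 lowest (most negative) weights
    coef_sort.tail(10).plot.barh()    # 10 highest (most positive) weights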

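def validate(estimator, x_test, y_test):

    # Print TP/TN/FN/FP for the first 30 test sentences
    for i, (x, y) in enumerate(zip(x_test, y_test)):
        y_pred = estimator.predict_proba([x])
        # Labels are the strings '0' and '1', so the argmax index is compared as a string
        if y == np.argmax(y_pred).astype(str):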
            if y == '1':
                result = 'TP:The correct answer is Positive and the prediction is Positive'
            else:
                result = 'TN:The correct answer is Negative and the prediction is Negative'
        else:
            if y == '1':
                result = 'FN:The correct answer is Positive and the prediction is Negative'
            else:
                result = 'FP:The correct answer is Negative and the prediction is Positive'
        print(result, y_pred, x)
        if i == 29:
            break

    # Output the list as TSV
    y_pred = estimator.predict(x_test)
    y_prob = estimator.predict_proba(x_test)

    results = pd.DataFrame([y_test, y_pred, y_prob.T[1], x_test]).T.rename(columns={0: 'Correct answer', 1: 'Prediction', 2: 'Prediction probability (positive)', 3: 'Word string'})
    results.to_csv('./predict.txt', sep='\t')

    print('\n', classification_report(y_test, y_pred))
    print('\n', confusion_matrix(y_test, y_pred))

estimator = train(x_train, y_train, 'gs_result.csv')
validate(estimator, x_test, y_test)

Answer commentary

I simply use scikit-learn's classification_report, so there is not much to explain. It takes y_pred, the result of the predict function used in the previous article on "labeling".

y_pred = estimator.predict(x_test)

All that remains is to pass it to the classification_report function together with the true labels y_test.

print('\n', classification_report(y_test, y_pred))

Precision, recall, F1 score, accuracy, and so on are output.

              precision    recall  f1-score   support

           0       0.75      0.73      0.74      1351
           1       0.73      0.75      0.74      1315

    accuracy                           0.74      2666
   macro avg       0.74      0.74      0.74      2666
weighted avg       0.74      0.74      0.74      2666
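
If the individual numbers are needed as values rather than as a printed table, scikit-learn also provides one function per metric. The sketch below uses made-up string labels in the same '0' / '1' form as this article's data; the variables are illustrative and not taken from the answer program. One point to watch is that these functions default to `pos_label=1` (an integer), so with string labels the positive class has to be named explicitly.

```python
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score)

# Made-up string labels for illustration only
y_test = ['1', '1', '1', '0', '0', '0', '0', '1']
y_pred = ['1', '0', '1', '0', '1', '0', '0', '1']

# pos_label is given explicitly because the labels are strings, not the default integer 1
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='1')
recall = recall_score(y_test, y_pred, pos_label='1')
f1 = f1_score(y_test, y_pred, pos_label='1')
print(accuracy, precision, recall, f1)

# The same per-class numbers are also available as a nested dict
report = classification_report(y_test, y_pred, output_dict=True)
print(report['1']['precision'], report['0']['recall'])
```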

Since it takes the same arguments, I also output the confusion matrix with the confusion_matrix function.

print('\n', confusion_matrix(y_test, y_pred))

The confusion matrix is printed in a simple form. For details on the confusion matrix, see the separate article ["\[For beginners\] Explanation of evaluation metrics for classification problems in machine learning (accuracy, precision, recall, etc.)"](https://qiita.com/FukuharaYohei/items/be89a99c53586fa4e2e4).

 [[992 359]
 [329 986]]
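
As a cross-check, the figures in the classification report can be recomputed from this confusion matrix. The sketch below assumes scikit-learn's layout, where rows are the true labels and columns are the predicted labels in sorted order ('0', then '1'); the variable names are chosen for illustration.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label
(tn, fp), (fn, tp) = [[992, 359], [329, 986]]

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Metrics for the positive class ('1')
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Metrics for the negative class ('0')
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f'accuracy          : {accuracy:.2f}')
print(f'precision (1 / 0) : {precision_pos:.2f} / {precision_neg:.2f}')
print(f'recall    (1 / 0) : {recall_pos:.2f} / {recall_neg:.2f}')
print(f'f1 (1)            : {f1_pos:.2f}')
```

Rounded to two decimal places, these agree with the report: accuracy 0.74, precision 0.73 and recall 0.75 for class 1, and precision 0.75 and recall 0.73 for class 0. The row sums (1351 and 1315) also match the support column.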
