[PYTHON] 100 language processing knock-79 (using scikit-learn): precision-recall graph drawing

This is my record of problem 79, "Drawing a precision-recall graph", from the 2015 edition of 100 Language Processing Knocks. The graph shows how precision and recall, which are in a trade-off relationship, relate to each other. I also output the ROC curve, which is a similar kind of graph. Until now I had not posted my answers to the blog because they were basically the same as "Amateur language processing 100 knocks", but for "Chapter 8: Machine Learning" I worked through the problems seriously and changed things to some extent, so I am posting them. I mainly use scikit-learn.

Reference links

| Link | Remarks |
|---|---|
| 079. Drawing a precision-recall graph.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 79 | I am always indebted to this series while working through the 100 knocks |
| Getting started with Python with 100 knocks on language processing #79 - Machine learning: precision, recall, and drawing a graph with scikit-learn | Knock results using scikit-learn |

Environment

| Type | Version | Notes |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no particular reason I am not using the 3.7 or 3.8 series. Packages are managed with venv |

In the above environment, I use the following additional Python packages. They can be installed with regular pip.

| Type | Version |
|---|---|
| matplotlib | 3.1.1 |
| numpy | 1.17.4 |
| pandas | 0.25.3 |
| scikit-learn | 0.21.3 |

Task

Chapter 8: Machine Learning

In this chapter, we use the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from the Movie Review Data published by Bo Pang and Lillian Lee to work on the task of classifying sentences as positive or negative (polarity analysis).

79. Drawing a precision-recall graph

Draw a precision-recall graph by changing the classification threshold of the logistic regression model.

In addition to the precision-recall graph, I also output the ROC curve. As a bonus, I output the learning curve as well, which is unrelated to the precision-recall graph and the ROC curve.

Answer

Answer program: [079. Drawing a precision-recall graph.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/079.%E9%81%A9%E5%90%88%E7%8E%87-%E5%86%8D%E7%8F%BE%E7%8E%87%E3%82%B0%E3%83%A9%E3%83%95%E3%81%AE%E6%8F%8F%E7%94%BB.ipynb)

This is basically the previous [077. Measuring the accuracy.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/077.%E6%AD%A3%E8%A7%A3%E7%8E%87%E3%81%AE%E8%A8%88%E6%B8%AC.ipynb) with the logic for three graph outputs added.

import csv

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split, learning_curve
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

#Classes for using word vectorization in GridSearchCV
class myVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, method='tfidf', min_df=0.0005, max_df=0.10):
        self.method = method
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, x, y=None):
        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        self.vectorizer.fit(x)
        return self

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)
		
#Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method':['tfidf', 'count'], 
        'vectorizer__min_df': [0.0003, 0.0004], 
        'vectorizer__max_df': [0.07, 0.10], 
        'classifier__C': [1, 3],    # I also tried 10, but it was only slower and the score was lower
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]

# Read one column from the TSV file
def read_csv_column(col):
    with open('./sentiment_stem.txt') as file:
        reader = csv.reader(file, delimiter='\t')
        next(reader)  # Skip the header row
        return [row[col] for row in reader]
		
x_all = read_csv_column(1)
y_all = read_csv_column(0)
x_train, x_test, y_train, y_test = train_test_split(x_all, y_all)

def train(x_train, y_train, file):
    pipeline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])
    
    # clf stands for classification
    clf = GridSearchCV(
            pipeline,             # Pipeline to optimize
            PARAMETERS,           # Parameter sets to search
            cv = 5)               # Number of cross-validation folds
    
    clf.fit(x_train, y_train)
    pd.DataFrame.from_dict(clf.cv_results_).to_csv(file)

    print('Grid Search Best parameters:', clf.best_params_)
    print('Grid Search Best validation score:', clf.best_score_)
    print('Grid Search Best training score:', clf.best_estimator_.score(x_train, y_train))    
    
    #Feature weight output
    output_coef(clf.best_estimator_)
    
    return clf.best_estimator_

#Feature weight output
def output_coef(estimator):
    vec = estimator.named_steps['vectorizer']
    clf = estimator.named_steps['classifier']

    coef_df = pd.DataFrame([clf.coef_[0]]).T.rename(columns={0: 'Coefficients'})
    coef_df.index = vec.vectorizer.get_feature_names()
    coef_sort = coef_df.sort_values('Coefficients')
    coef_sort[:10].plot.barh()
    coef_sort.tail(10).plot.barh()

def validate(estimator, x_test, y_test):
    
    for i, (x, y) in enumerate(zip(x_test, y_test)):
        y_pred = estimator.predict_proba([x])
        if y == np.argmax(y_pred).astype( str ):
            if y == '1':
                result = 'TP:The correct answer is Positive and the prediction is Positive'
            else:
                result = 'TN:The correct answer is Negative and the prediction is Negative'
        else:
            if y == '1':
                result = 'FN:The correct answer is Positive and the prediction is Negative'
            else:
                result = 'FP:The correct answer is Negative and the prediction is Positive'
        print(result, y_pred, x)
        if i == 29:
            break

    # Output the prediction list as TSV
    y_pred = estimator.predict(x_test)
    y_prob = estimator.predict_proba(x_test)

    results = pd.DataFrame([y_test, y_pred, y_prob.T[1], x_test]).T.rename(columns={ 0: 'Correct answer', 1 : 'Prediction', 2: 'Predicted probability (positive)', 3 :'Word string'})
    results.to_csv('./predict.txt' , sep='\t')

    print('\n', classification_report(y_test, y_pred))
    print('\n', confusion_matrix(y_test, y_pred))

#Graph output
def output_graphs(estimator, x_all, y_all, x_test, y_test):
    
    #Learning curve output
    output_learning_curve(estimator, x_all, y_all)
    
    y_pred = estimator.predict_proba(x_test)
    
    #ROC curve output
    output_roc(y_test, y_pred)
    
    # Precision-recall graph output
    output_pr_curve(y_test, y_pred)

#Learning curve output
def output_learning_curve(estimator, x_all, y_all):
    training_sizes, train_scores, test_scores = learning_curve(estimator,
                                                               x_all, y_all, cv=5,
                                                               train_sizes=[0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
    plt.plot(training_sizes, train_scores.mean(axis=1), label="training scores")
    plt.plot(training_sizes, test_scores.mean(axis=1), label="test scores")
    plt.legend(loc="best")
    plt.show()

#ROC curve output
def output_roc(y_test, y_pred):
    # Compute FPR, TPR (and thresholds)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred[:,1], pos_label='1')

    # Also compute AUC
    auc_ = auc(fpr, tpr)

    #Plot ROC curve
    plt.plot(fpr, tpr, label='ROC curve (area = %.2f)'%auc_)
    plt.legend()
    plt.title('ROC curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.grid(True)
    plt.show()    

# Precision-recall graph output
def output_pr_curve(y_test, y_pred):
    # Get precision and recall at each threshold, along with the threshold values
    precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred[:,1], pos_label='1')
    
    # Plot a circle marker at each threshold from 0 to 1 in increments of 0.05
    for i in range(21):
        close_point = np.argmin(np.abs(thresholds - (i * 0.05)))
        plt.plot(precisions[close_point], recalls[close_point], 'o')

    # Precision-recall curve
    plt.plot(precisions, recalls)
    plt.xlabel('Precision')
    plt.ylabel('Recall')
 
    plt.show()

estimator = train(x_train, y_train, 'gs_result.csv')
validate(estimator, x_test, y_test)
output_graphs(estimator, x_all, y_all, x_test, y_test)
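To get a feel for how much work the grid search above does, the following is a minimal sketch that expands the `PARAMETERS` grid by hand using only the standard library. The helper name `expand_grid` is made up for this illustration and is not part of scikit-learn; the sketch only counts combinations and does not train anything.

```python
from itertools import product

# Same grid as PARAMETERS in the answer program
PARAMETERS = [
    {
        'vectorizer__method': ['tfidf', 'count'],
        'vectorizer__min_df': [0.0003, 0.0004],
        'vectorizer__max_df': [0.07, 0.10],
        'classifier__C': [1, 3],
        'classifier__solver': ['newton-cg', 'liblinear'],
    },
]

def expand_grid(param_grids):
    """Yield every parameter combination as a dict (hypothetical helper)."""
    for grid in param_grids:
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            yield dict(zip(keys, values))

candidates = list(expand_grid(PARAMETERS))
cv = 5  # Number of cross-validation folds used in train()

print(len(candidates))       # 32 candidate settings
print(len(candidates) * cv)  # 160 fits during the search, before the final refit
```

Five parameters with two values each give 32 candidates, so adding even one more value to a parameter noticeably lengthens the search.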

Answer commentary

Precision-recall graph

scikit-learn's `precision_recall_curve` function returns the precision, recall, and threshold values, which are then output as a graph.

# Precision-recall graph output
def output_pr_curve(y_test, y_pred):
    # Get precision and recall at each threshold, along with the threshold values
    precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred[:,1], pos_label='1')
    
    # Plot a circle marker at each threshold from 0 to 1 in increments of 0.05
    for i in range(21):
        close_point = np.argmin(np.abs(thresholds - (i * 0.05)))
        plt.plot(precisions[close_point], recalls[close_point], 'o')

    # Precision-recall curve
    plt.plot(precisions, recalls)
    plt.xlabel('Precision')
    plt.ylabel('Recall')
 
    plt.show()

The resulting graph is shown below.

image.png

You can see the following trade-offs.

- High recall: cases are judged Positive even when the model is not confident, so precision drops as a result (many cases judged Positive are actually Negative).
- High precision: cases are judged Positive only when the model is confident, so recall drops as a result (many cases that are actually Positive are judged Negative).

For details on the confusion matrix, see the separate article ["[For beginners] Explanation of evaluation metrics for classification problems in machine learning (accuracy, precision, recall, etc.)"](https://qiita.com/FukuharaYohei/items/be89a99c53586fa4e2e4).
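The trade-off can be reproduced on a small scale without scikit-learn. The following is a minimal sketch, assuming made-up labels and positive-class probabilities; `precision_recall_at` is a hypothetical helper, not a library function. It treats a score at or above the threshold as a Positive prediction, and returning a precision of 1.0 when nothing is predicted Positive is a convention chosen for this sketch.

```python
def precision_recall_at(y_true, scores, threshold, pos_label='1'):
    """Precision and recall when scores >= threshold are predicted Positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == pos_label:
            tp += 1   # Correctly judged Positive
        elif predicted_positive:
            fp += 1   # Judged Positive but actually Negative
        elif label == pos_label:
            fn += 1   # Judged Negative but actually Positive
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical data: string labels as in the answer program, scores for label '1'
y_true = ['1', '1', '1', '1', '0', '1', '0', '0', '0', '0']
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.45, 0.40, 0.30, 0.20, 0.05]

for threshold in (0.25, 0.50, 0.75):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print('threshold', threshold, 'precision', precision, 'recall', recall)
```

With this toy data, raising the threshold from 0.25 to 0.75 moves precision up and recall down, which is the same movement along the curve that the markers in the graph trace out.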

ROC curve and AUC

Use the `roc_curve` function to get the False Positive rate, True Positive rate, and thresholds. Then compute the AUC with the `auc` function. Finally, output the graph (the AUC value is shown in the legend).

def output_roc(y_test, y_pred):
    # Compute FPR, TPR (and thresholds)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred[:,1], pos_label='1')

    # Also compute AUC
    auc_ = auc(fpr, tpr)

    #Plot ROC curve
    plt.plot(fpr, tpr, label='ROC curve (area = %.2f)'%auc_)
    plt.legend()
    plt.title('ROC curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.grid(True)
    plt.show()

You can see that as more cases are judged Positive even when the model is not confident (that is, as the threshold is lowered), the True Positive rate rises and the False Positive rate rises with it.

image.png
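To make the ROC curve and AUC less of a black box, here is a minimal sketch that builds the curve by hand and integrates it with the trapezoidal rule. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the data is made up, the helper names `roc_points` and `trapezoid_area` are hypothetical, and the scores are assumed to contain no ties.

```python
def roc_points(y_true, scores, pos_label='1'):
    """Return (fpr, tpr) lists obtained by lowering the threshold one sample at a time."""
    n_pos = sum(1 for label in y_true if label == pos_label)
    n_neg = len(y_true) - n_pos
    # Highest score first: each step admits one more sample as Positive
    ranked = sorted(zip(scores, y_true), reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for _, label in ranked:
        if label == pos_label:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2 for i in range(1, len(x)))

# Hypothetical labels and positive-class probabilities
y_true = ['1', '1', '1', '1', '0', '1', '0', '0', '0', '0']
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.45, 0.40, 0.30, 0.20, 0.05]

fpr, tpr = roc_points(y_true, scores)
auc_value = trapezoid_area(fpr, tpr)
print('AUC:', round(auc_value, 2))
```

The curve always starts at (0, 0), where nothing is judged Positive, and ends at (1, 1), where everything is. The closer the curve hugs the top-left corner, the closer the area is to 1.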

Learning curve

Finally, although its purpose differs from the previous two graphs, I output the learning curve as a bonus. I wanted to see whether the model had high bias or high variance. I wrote about high bias and high variance in a separate article, "Coursera Machine Learning Introductory Course (week 6: various advice)" (it is a rough article...). Pass the explanatory variables and labels, the number of cross-validation folds (5), and a list of training data sizes to the `learning_curve` function. It returns the training and test accuracy for each training data size.

#Learning curve output
def output_learning_curve(estimator, x_all, y_all):
    training_sizes, train_scores, test_scores = learning_curve(estimator,
                                                               x_all, y_all, cv=5,
                                                               train_sizes=[0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
    plt.plot(training_sizes, train_scores.mean(axis=1), label="training scores")
    plt.plot(training_sizes, test_scores.mean(axis=1), label="test scores")
    plt.legend(loc="best")
    plt.show()

Looking at the output, the gap between the training and test scores was still narrowing, so I judged it to be high variance and tried the following, but accuracy did not improve. Both are grid search options.

- Reduce the number of features (increase the min_df parameter of TfidfVectorizer / CountVectorizer)
- Increase the value of the regularization parameter C of logistic regression (note that in scikit-learn C is the inverse of the regularization strength, so a larger C means weaker regularization)

image.png
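The reading of the learning curve described above can be sketched numerically. The following is a minimal illustration with made-up scores, laid out like the arrays returned by `learning_curve` (one row per training size, one column per fold; three folds here for brevity instead of five). The helper name `mean_gap_per_size` is hypothetical.

```python
from statistics import mean

def mean_gap_per_size(train_scores, test_scores):
    """Average the fold scores per training size and return the train-minus-test gaps."""
    gaps = []
    for train_folds, test_folds in zip(train_scores, test_scores):
        gaps.append(mean(train_folds) - mean(test_folds))
    return gaps

# Hypothetical scores: rows are increasing training sizes, columns are CV folds
train_scores = [[0.98, 0.97, 0.99], [0.95, 0.96, 0.94], [0.92, 0.93, 0.91]]
test_scores = [[0.70, 0.72, 0.71], [0.74, 0.75, 0.73], [0.76, 0.77, 0.75]]

gaps = mean_gap_per_size(train_scores, test_scores)

# True when every gap is smaller than the one before it
still_narrowing = all(later < earlier for earlier, later in zip(gaps, gaps[1:]))

print([round(gap, 2) for gap in gaps])
print('gap still narrowing:', still_narrowing)
```

A large gap that keeps shrinking as data is added is the pattern usually associated with high variance, while two curves that have already converged at a low score point toward high bias.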
