100 Language Processing Knock-74 (using scikit-learn): Prediction

This is the record of the 74th task, "Prediction", of Language Processing 100 Knock 2015. The polarity (negative / positive) of sentences is predicted (inferred) using the previously trained model, and the prediction probability is also calculated. Until now I had not posted my solutions to this blog because they were basically the same as "Amateur language processing 100 knocks", but I took "Chapter 8: Machine Learning" seriously and changed my approach to some extent, so I am posting it. I mainly use scikit-learn.

Reference link

| Link | Remarks |
|------|---------|
| 074.Forecast.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 74 | I am always indebted to "100 amateur language processing knocks" |
| Getting started with Python with 100 knocks on language processing #74 - Machine learning, prediction with logistic regression in scikit-learn | Knock result using scikit-learn |

Environment

| Type | Version | Contents |
|------|---------|----------|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Using python 3.6.9 on pyenv. There is no deep reason not to use the 3.7 or 3.8 series. Packages are managed using venv |

In the above environment, I am using the following additional Python packages. Just install them with regular pip.

| Type | Version |
|------|---------|
| numpy | 1.17.4 |
| pandas | 0.25.3 |
| scikit-learn | 0.21.3 |

Task

Chapter 8: Machine Learning

In this chapter, the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) of Movie Review Data published by Bo Pang and Lillian Lee is used to work on the task of classifying sentences as positive or negative (polarity analysis).

74. Prediction

> Using the logistic regression model learned in 73, implement a program that calculates the polarity label ("+1" for a positive example, "-1" for a negative example) of a given sentence, along with its prediction probability.

Answer

Answer program: [074.Forecast.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/074.%E4%BA%88%E6%B8%AC.ipynb)

Basically, it is the [previous "Answer Program (Training) 073_2.Learning(Training).ipynb"](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/073_2.%E5%AD%A6%E7%BF%92%28%E8%A8%93%E7%B7%B4%29.ipynb) with just a prediction part added.

```python
import csv

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Class for making the word vectorizer selectable within GridSearchCV
class myVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, method='tfidf', min_df=0.0005, max_df=0.10):
        self.method = method
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, x, y=None):
        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        self.vectorizer.fit(x)
        return self

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)
		
#Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method':['tfidf', 'count'], 
        'vectorizer__min_df': [0.0003, 0.0004], 
        'vectorizer__max_df': [0.07, 0.10], 
        'classifier__C': [1, 3],    # Also tried 10, but it was just slow and the score was low
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]

#Read file
def read_csv_column(col):
    with open('./sentiment_stem.txt') as file:
        reader = csv.reader(file, delimiter='\t')
        header = next(reader)
        return [row[col] for row in reader]    
		
x_all = read_csv_column(1)
y_all = read_csv_column(0)
x_train, x_test, y_train, y_test = train_test_split(x_all, y_all)

def train(x_train, y_train, file):
    pipeline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])

    # clf stands for classifier
    clf = GridSearchCV(
            pipeline,             # Pipeline (vectorizer + classifier) to optimize
            PARAMETERS,           # Parameter sets you want to optimize over
            cv=5)                 # Number of cross-validation folds
    
    clf.fit(x_train, y_train)
    pd.DataFrame.from_dict(clf.cv_results_).to_csv(file)

    print('Grid Search Best parameters:', clf.best_params_)
    print('Grid Search Best validation score:', clf.best_score_)
    print('Grid Search Best training score:', clf.best_estimator_.score(x_train, y_train))    

    return clf.best_estimator_
	
def validate(estimator, x_test, y_test):
    
    for i, (x, y) in enumerate(zip(x_test, y_test)):
        y_pred = estimator.predict_proba([x])
        # The labels are the strings '0' and '1', so the argmax index (0 or 1)
        # converted to a string can be compared with the true label
        if y == np.argmax(y_pred).astype(str):
            if y == '1':
                result = 'TP:The correct answer is Positive and the prediction is Positive'
            else:
                result = 'TN:The correct answer is Negative and the prediction is Negative'
        else:
            if y == '1':
                result = 'FN:The correct answer is Positive and the prediction is Negative'
            else:
                result = 'FP:The correct answer is Negative and the prediction is Positive'
        print(result, y_pred, x)
        if i == 29:    # stop after printing 30 results
            break

estimator = train(x_train, y_train, 'gs_result.csv')
validate(estimator, x_test, y_test)
```
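Incidentally, the `'vectorizer__method'` key in `PARAMETERS` works because `Pipeline` forwards it as `set_params(method=...)` to `myVectorizer`, which inherits `set_params` from `BaseEstimator`. Here is a minimal sketch of that switching behavior on toy data of my own (assuming the `myVectorizer` class defined above):

```python
toy_x = ['good great fun', 'bad boring dull', 'great fun', 'dull bad']

vec = myVectorizer(method='count', min_df=1, max_df=1.0)
print(vec.fit_transform(toy_x).toarray())   # raw term counts

vec.set_params(method='tfidf')              # what GridSearchCV does per candidate
print(vec.fit_transform(toy_x).toarray())   # TF-IDF weights for the same texts
```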

Answer commentary

Data split

The function train_test_split divides the data into training data and test data. A model naturally scores better on the data it was trained on, so data that was not used for training is set aside for prediction. I studied this in the past article "Coursera Machine Learning Introductory Course (Week 6 - Various Advice)".

```python
x_train, x_test, y_train, y_test = train_test_split(x_all, y_all)
```
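For reference, a minimal sketch with options I did not use here (these settings are my own suggestions, not part of the answer program): by default train_test_split holds out 25% of the data, random_state makes the split reproducible, and stratify keeps the positive/negative ratio equal in both splits.

```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x_all, y_all,
    test_size=0.25,     # the default holdout fraction, made explicit
    random_state=42,    # fix the seed so the split is reproducible
    stratify=y_all)     # preserve the label ratio in both splits
```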

Prediction

Predictions are made with the function predict_proba. There is a similar function, predict, but it returns only the resulting label (0 or 1), not the probability.

```python
def validate(estimator, x_test, y_test):
    
    for i, (x, y) in enumerate(zip(x_test, y_test)):
        y_pred = estimator.predict_proba([x])
        # The labels are the strings '0' and '1', so the argmax index (0 or 1)
        # converted to a string can be compared with the true label
        if y == np.argmax(y_pred).astype(str):
            if y == '1':
                result = 'TP:The correct answer is Positive and the prediction is Positive'
            else:
                result = 'TN:The correct answer is Negative and the prediction is Negative'
        else:
            if y == '1':
                result = 'FN:The correct answer is Positive and the prediction is Negative'
            else:
                result = 'FP:The correct answer is Negative and the prediction is Positive'
        print(result, y_pred, x)
        if i == 29:    # stop after printing 30 results
            break
```
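To make the difference concrete, here is a minimal sketch using the fitted `estimator` from above (the printed values are illustrative, not actual run results):

```python
sentence = x_test[0]
print(estimator.predict([sentence]))        # label only, e.g. ['0']
print(estimator.predict_proba([sentence]))  # e.g. [[0.78 0.22]], one probability per class
print(estimator.classes_)                   # column order of predict_proba, e.g. ['0' '1']
```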

The result of outputting 30 lines looks like this. For TP, TN, FP, and FN, please refer to the article ["[For beginners] Explanation of evaluation indicators for classification problems of machine learning (correct answer rate, precision rate, recall rate, etc.)"](https://qiita.com/FukuharaYohei/items/be89a99c53586fa4e2e4); a sketch for tallying these counts follows the output below.

```
TN:The correct answer is Negative and the prediction is Negative [[0.7839262 0.2160738]] restrain freak show mercenari obviou cerebr dull pretenti engag isl defi easi categor
FN:The correct answer is Positive and the prediction is Negative[[0.6469949 0.3530051]] chronicl man quest presid man singl handedli turn plane full hard bitten cynic journalist essenti campaign end extend public depart
TN:The correct answer is Negative and the prediction is Negative[[0.87843253 0.12156747]] insuffer movi mean make think existenti suffer instead put sleep
TN:The correct answer is Negative and the prediction is Negative[[0.90800564 0.09199436]] minut condens episod tv seri pitfal expect
TP:The correct answer is Positive and the prediction is Positive[[0.12240474 0.87759526]] absorb unsettl psycholog drama
TP:The correct answer is Positive and the prediction is Positive[[0.42977787 0.57022213]] rodriguez chop smart aleck film school brat imagin big kid
FN:The correct answer is Positive and the prediction is Negative[[0.59805784 0.40194216]] gangster movi capac surpris
TP:The correct answer is Positive and the prediction is Positive[[0.29473058 0.70526942]] confront stanc todd solondz take aim polit correct suburban famili
TP:The correct answer is Positive and the prediction is Positive[[0.21660554 0.78339446]] except act quietli affect cop drama
TP:The correct answer is Positive and the prediction is Positive[[0.47919199 0.52080801]] steer unexpectedli adam streak warm blood empathi dispar manhattan denizen especi hole
TN:The correct answer is Negative and the prediction is Negative[[0.67294895 0.32705105]] standard gun martial art clich littl new add
TN:The correct answer is Negative and the prediction is Negative[[0.66582407 0.33417593]] sweet gentl jesu screenwrit cut past everi bad action movi line histori
TP:The correct answer is Positive and the prediction is Positive[[0.41463847 0.58536153]] malcolm mcdowel cool paul bettani cool paul bettani play malcolm mcdowel cool
TP:The correct answer is Positive and the prediction is Positive[[0.33183064 0.66816936]] center humor constant ensembl give buoyant deliveri
TN:The correct answer is Negative and the prediction is Negative[[0.63371373 0.36628627]] let subtitl fool movi prove holli wood longer monopoli mindless action
TP:The correct answer is Positive and the prediction is Positive[[0.25740295 0.74259705]] taiwanes auteur tsai ming liang good news fall sweet melancholi spell uniqu director previou film
FN:The correct answer is Positive and the prediction is Negative[[0.57810652 0.42189348]] turntabl outsel electr guitar
FN:The correct answer is Positive and the prediction is Negative[[0.52506635 0.47493365]] movi stay afloat thank hallucinatori product design
TN:The correct answer is Negative and the prediction is Negative[[0.57268778 0.42731222]] non-mysteri mysteri
TP:The correct answer is Positive and the prediction is Positive[[0.07663805 0.92336195]] beauti piec count heart import humor
TN:The correct answer is Negative and the prediction is Negative[[0.86860199 0.13139801]] toothless dog alreadi cabl lose bite big screen
FP:The correct answer is Negative and the prediction is Positive[[0.4918716 0.5081284]] sandra bullock hugh grant make great team predict romant comedi get pink slip
TN:The correct answer is Negative and the prediction is Negative[[0.61861307 0.38138693]] movi comedi work better ambit say subject willing
FP:The correct answer is Negative and the prediction is Positive[[0.47041114 0.52958886]] like lead actor lot manag squeez laugh materi tread water best forgett effort
TP:The correct answer is Positive and the prediction is Positive[[0.26767592 0.73232408]] writer director juan carlo fresnadillo make featur debut fulli form remark assur
FP:The correct answer is Negative and the prediction is Positive[[0.40931838 0.59068162]] grand fart come director begin resembl crazi french grandfath
FP:The correct answer is Negative and the prediction is Positive[[0.43081731 0.56918269]] perform sustain intellig stanford anoth subtl humour bebe neuwirth older woman seduc oscar film founder lack empathi social milieu rich new york intelligentsia
TP:The correct answer is Positive and the prediction is Positive[[0.29555115 0.70444885]] perform uniformli good
TP:The correct answer is Positive and the prediction is Positive[[0.34561148 0.65438852]] droll well act charact drive comedi unexpect deposit feel
TP:The correct answer is Positive and the prediction is Positive [[0.31537766 0.68462234]] great participatori spectat sport
```
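As a follow-up to the TP/TN/FP/FN labels above, here is a minimal sketch of tallying the same counts over the entire test set with scikit-learn (my addition, not part of the answer program):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = estimator.predict(x_test)
# With labels=['1', '0'] the matrix reads:
# [[TP FN]
#  [FP TN]]  (rows: true label, columns: predicted label)
print(confusion_matrix(y_test, y_pred, labels=['1', '0']))
print('Accuracy:', accuracy_score(y_test, y_pred))
```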
