Japanese language processing by Python3 (5) Ensemble learning of different models by Voting Classifier

What is ensemble learning?

When you have several reasonably accurate classifiers for a task that assigns a given sentence to one of several classes, such as sentiment analysis in natural language processing, you can sometimes build a stronger model by combining them through ensemble learning.

Ensemble learning: a method of building a high-accuracy learner by combining weak learners, i.e., predictors that are more accurate than a predictor that outputs answers at random (the worst possible predictor). Techniques such as bagging and boosting are well known. ([Crested Ibis Forest Wiki: Ensemble Learning](http://ibisforest.org/index.php?%E3%82%A2%E3%83%B3%E3%82%B5%E3%83%B3%E3%83%96%E3%83%AB%E5%AD%A6%E7%BF%92))

In everyday terms, it is often better to have several experts discuss a policy proposal than to ask a single expert for advice. Roughly speaking, each expert here is a learner (a random forest, a support vector machine, and so on), and combining the results (predicted values) obtained from multiple learners is ensemble learning. Incidentally, Random Forest is itself called an ensemble learner, because it obtains its prediction by majority vote over the results of multiple decision trees. This time, I will look into Voting Classifier, which lets you quickly combine multiple conceptually different models that already have decent accuracy.
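
As a side note, the bagging / majority-vote idea behind Random Forest is easy to try directly. The following is a minimal sketch I added (not from the original article) that bags plain decision trees with scikit-learn's BaggingClassifier on the iris dataset, which also appears later in this post:

# Bagging sketch: many decision trees fit on bootstrap samples, combined by majority vote.
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# 50 decision trees, each trained on a bootstrap sample of the data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
scores = cross_val_score(bagging, X, y, cv=5, scoring='accuracy')
print("Bagging accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))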

What is Voting Classifier?

A class in sklearn.ensemble, available since scikit-learn v0.17.

The idea behind the voting classifier implementation is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.

Voting Classifier combines the results of different kinds of learners that each already have a certain level of accuracy (such as Random Forest, Logistic Regression, and Gaussian NB (a naive Bayes classifier)) by majority vote or by averaging predicted probabilities. The concept itself is very simple, but it is easy to use and surprisingly powerful.

Hard Vote

A method that adopts the label chosen by majority among the labels predicted by the individual models. For a given input X, if the three learners decide as follows, the majority label "1" wins and X is classified as class 1.

Learner 1 -> class 1
Learner 2 -> class 1
Learner 3 -> class 2
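
To make the majority vote concrete, here is a toy sketch (mine, not from the article) that picks the most frequent label among the three predictions above:

# Toy hard vote: adopt the most frequent label among the learners' predictions.
from collections import Counter

predictions = [1, 1, 2]  # learner 1 -> class 1, learner 2 -> class 1, learner 3 -> class 2
majority_label, votes = Counter(predictions).most_common(1)[0]
print(majority_label)  # 1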

Soft Vote

A method that weights the class probabilities predicted by each learner, averages them, and adopts the label with the highest average probability. See the official example for details: 1.11.5.2. Weighted Average Probabilities (Soft Voting)
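
As a rough illustration of that calculation (the probabilities and weights below are made up for illustration, not taken from the official example):

# Toy soft vote: weighted average of each learner's predicted class probabilities.
import numpy as np

# Hypothetical predict_proba outputs of three learners for one sample (classes 0 and 1)
probas = np.array([[0.4, 0.6],   # learner 1
                   [0.7, 0.3],   # learner 2
                   [0.2, 0.8]])  # learner 3
weights = np.array([1, 1, 2])    # weight given to each learner

avg = np.average(probas, axis=0, weights=weights)
print(avg)           # weighted average probability per class
print(avg.argmax())  # label with the highest average probability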

Be careful with Voting Classifier

One thing to keep in mind is that **equally well performing models** is the catch: if one of the combined models does not perform well, voting may not improve the results. See: Why is my VotingClassifier accuracy less than my individual classifier?
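
If you want to see this effect for yourself, one way (a sketch I added, not from the article) is to swap one member of the ensemble for a deliberately weak DummyClassifier and compare cross-validated accuracy:

# Compare an ensemble of three decent models with one where a member is a majority-class dummy.
from sklearn import datasets
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X, y = iris.data, iris.target

for name, third in [('three decent models', ('gnb', GaussianNB())),
                    ('one weak member', ('dummy', DummyClassifier(strategy='most_frequent')))]:
    eclf = VotingClassifier(estimators=[('lr', LogisticRegression(random_state=1)),
                                        ('rf', RandomForestClassifier(random_state=1)),
                                        third],
                            voting='hard')
    scores = cross_val_score(eclf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), name))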

Actually using it

Following the official docs, let's first run a hard vote on the iris dataset using Voting Classifier.

voting_classifier.py


from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

def execute_voting_classifier():
    # Load the iris dataset
    iris = datasets.load_iris()
    X = iris.data[:, [0,2]]
    y = iris.target

    # Set up the classifiers. Here we use logistic regression, a random forest classifier, and Gaussian naive Bayes.
    clf1 = LogisticRegression(random_state=1)
    clf2 = RandomForestClassifier(random_state=1)
    clf3 = GaussianNB()

    # Create the ensemble learner. Setting voting='hard' decides the class by a simple majority vote.
    eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

    for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
        scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

execute_voting_classifier()

Output result:
    Accuracy: 0.92 (+/- 0.03) [Logistic Regression]
    Accuracy: 0.91 (+/- 0.05) [Random Forest]
    Accuracy: 0.91 (+/- 0.06) [naive Bayes]
    Accuracy: 0.93 (+/- 0.06) [Ensemble]

The accuracy has certainly (if only slightly) improved. You can also combine models that each take several parameters and run a Grid Search over those parameters.
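
Before the combined example below, here is a minimal sketch of such a grid search; parameter names follow scikit-learn's '<estimator name>__<parameter>' convention, and the candidate values here are arbitrary examples, not tuned settings:

# Grid search over the parameters of the ensemble's sub-models.
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X, y = iris.data, iris.target

eclf = VotingClassifier(estimators=[('lr', LogisticRegression(random_state=1)),
                                    ('rf', RandomForestClassifier(random_state=1)),
                                    ('gnb', GaussianNB())],
                        voting='soft')

params = {'lr__C': [0.1, 1.0, 10.0], 'rf__n_estimators': [50, 100, 200]}
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)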

ensemble.py


from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Combine an SVM, logistic regression, and a random forest, each with its own parameters
clf1 = SVC(kernel='rbf', random_state=0, gamma=0.3, C=5, class_weight='balanced')
clf2 = LogisticRegression(C=5, random_state=0, class_weight='balanced')
clf3 = RandomForestClassifier(criterion='entropy', n_estimators=250, random_state=1, max_depth=20, n_jobs=2, class_weight='balanced')

eclf = VotingClassifier(estimators=[('svm', clf1), ('lr', clf2), ('rfc', clf3)], voting='hard')
eclf.fit(X_train, y_train)  # X_train and y_train are assumed to be prepared beforehand

In this way, Voting Classifier comes in handy when you have several models that seem worth combining and want an easy way to combine them.
