[PYTHON] Sample program and execution example of ensemble learning (Stacked generalization)

Introduction

This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work in Jupyter Notebook.

When solving simple classification problems on Kaggle, I often refer to MLWave's Kaggle Ensembling Guide. There weren't many Japanese resources on the topic, so I wrote a sample program as well.

Table of contents

  1. Data generation
  2. Preparation of classifier for use in ensemble learning
  3. Implementation of ensemble learning
  4. Evaluation
  5. Reference

1. Data generation

If you have your own data, please ignore this.

Use scikit-learn's make_classification to create 2,000 samples of 3-feature, 2-class data. Passing flip_y=0 to make_classification keeps the label ratio at 1:1.

After that, the generated data is split in half into train and test sets.

make_classification.py


from sklearn.datasets import make_classification
import pandas as pd
import numpy as np

n_features = 3
n_samples = 2000

# Generate 2-class data with balanced labels (flip_y=0) and stack X and y column-wise
data = np.c_[make_classification(n_samples=n_samples, n_features=n_features, n_redundant=1,
                                 n_informative=2, n_clusters_per_class=2, n_classes=2, flip_y=0)]

# Split each class in half so train and test both keep the 1:1 label ratio
train = test = np.empty((0, n_features+1), float)
for d in [data[data[:, n_features]==0], data[data[:, n_features]==1]]:
    np.random.shuffle(d)
    train = np.append(train, d[:(n_samples/4)], axis=0)
    test = np.append(test, d[(n_samples/4):], axis=0)
map(lambda x: np.random.shuffle(x), [train, test])  # shuffle rows of train and test (Python 2: map is eager)

The contents of train and test look like this.

array([[-0.96155185, -0.49879683,  0.65487916,  1.        ],
       [-0.95225926, -1.00853786, -0.97598077,  0.        ],
       [-0.11578056,  2.51579129, -1.23724233,  0.        ],
       ..., 
       [-0.93715662,  0.41894292, -1.56002152,  0.        ],
       [-0.69759832, -0.20810317, -0.01283087,  0.        ],
       [ 0.31519506, -1.75498218,  0.89115054,  1.        ]])

Now each row has three numerical features plus a label.
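As a quick sanity check (a minimal sketch; it assumes the train and test arrays created above are still in memory), you can confirm the shapes and the 1:1 class balance:

print train.shape, test.shape                # expected: (1000, 4) (1000, 4)
print np.bincount(train[:, -1].astype(int))  # expected: [500 500]
print np.bincount(test[:, -1].astype(int))   # expected: [500 500]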

2. Preparation of classifier for use in ensemble learning

Prepare the machine learning algorithms. Six types are used this time: RandomForest, KNN, ExtraTrees, GradientBoosting, Naive Bayes, and XGBoost. All except XGBoost come from scikit-learn and can simply be imported; XGBoost can be installed with pip or built from the git repository.

Create a list of classifiers to use in ensemble learning.

set_clfs.py


from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier

clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50),
        KNeighborsClassifier(n_neighbors=10, n_jobs=-1),
        GaussianNB(),
        XGBClassifier(learning_rate =0.1, n_estimators=1000, max_depth=5, min_child_weight=1,
              gamma=0, subsample=0.8, colsample_bytree=0.5, objective= 'binary:logistic',
              scale_pos_weight=1, seed=0
             )
       ]

Let's try binary classification with KNN.

knn.py


from sklearn.metrics import accuracy_score

nbrs = KNeighborsClassifier().fit(train[:, :-1], train[:, -1])
print "Acc: ", accuracy_score(test[:, -1], nbrs.predict(test[:, :-1]))
Acc:  0.90

The accuracy is 90%. We will compare how this accuracy changes as a result of ensemble learning.

3. Implementation of ensemble learning

When dealing with simple classification problems (not images or continuous-value prediction) in competitions such as Kaggle, it is rare to use a single learner; ensemble learning, which combines multiple models, is used instead. As far as I know, the site that explains ensemble learning in the most detail is MLWave's article.

The linked article describes the main ensemble learning approaches and details actual use cases on Kaggle. Very kindly, the author has published a framework for using them on GitHub. This time, I will introduce stacked generalization & blending (on GitHub, it is the program called blend_proba.py).

3.1 Example of using blend_proba()

This function is a framework for solving binary classification with ensemble learning. The implementation is quite simple, so check out the code if you have time. Here, learning proceeds by the method shown below; a conceptual sketch follows the figure.

(Figure: Picture1.png — overview of the stacking steps)
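Before running blend_proba, here is a minimal sketch of the idea behind stacked generalization (this is not blend_proba itself; the file name oof_sketch.py and the helper make_oof_preds are my own): out-of-fold predicted probabilities on the train set become the features for the next stage, while test-set predictions from the fold models are averaged.

oof_sketch.py


import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_oof_preds(clf, X_train, y_train, X_test, n_folds=3):
    # Out-of-fold probabilities on train become the next stage's features;
    # test probabilities from the fold models are averaged
    oof_train = np.zeros((X_train.shape[0], 2))
    test_sum = np.zeros((X_test.shape[0], 2))
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for tr_idx, val_idx in skf.split(X_train, y_train):
        clf.fit(X_train[tr_idx], y_train[tr_idx])
        oof_train[val_idx] = clf.predict_proba(X_train[val_idx])
        test_sum += clf.predict_proba(X_test)
    return oof_train, test_sum / n_folds

# e.g. oof_tr, avg_te = make_oof_preds(clfs[0], train[:, :-1], train[:, -1], test[:, :-1])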

step 1

After importing blend_proba, execute the following program.

step1.py


import blend_proba as bp
[bp.blend_proba(clf, X_train=train[:, :-1], y=train[:, -1], X_test=test[:, :-1], save_preds="1", nfolds=3) for clf in clfs]

After execution, npy files are generated in the working directory.

1GB_0.303855837305_16482164617e7c9d188bc75bafc06a08_test.npy
1GB_0.303855837305_16482164617e7c9d188bc75bafc06a08_train.npy
1Ne_0.455167671362_cddd24af66706c9fa26f6601910c92c5_test.npy
1Ne_0.455167671362_cddd24af66706c9fa26f6601910c92c5_train.npy
1an_0.249015612417_825e1ad5956801c2225da656822caebb_test.npy
1an_0.249015612417_825e1ad5956801c2225da656822caebb_train.npy
1au_0.22545173232_4b57dac04bbc037494cb592143a1c09c_test.npy
1au_0.22545173232_4b57dac04bbc037494cb592143a1c09c_train.npy
1ra_0.207753858339_a0cb35c894f0ad378f6bb824e1019748_test.npy
1ra_0.207753858339_a0cb35c894f0ad378f6bb824e1019748_train.npy
1xt_0.270981174382_e130a295809821efc1db2f64c228169c_test.npy
1xt_0.270981174382_e130a295809821efc1db2f64c228169c_train.npy

By setting save_preds="?", the predicted probabilities are saved to files named "? + 2nd-3rd characters of the classifier name + hash value + _test (or _train).npy". Since six classifiers are used here, 12 npy files are generated.
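To inspect what was saved (a small sketch assuming the files are still in the working directory), you can load the train-side npy files and check their shapes; each should contain two probability columns per sample:

import glob

for path in sorted(glob.glob('1*_train.npy')):
    preds = np.load(path)
    print path, preds.shape   # expected: (1000, 2) for each classifier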

step 2

First, create a function that reads the files generated in step 1.

read_first_stage.py


import os

def read_npy(tr_p, te_p):
    # Assumes the *_train.npy / *_test.npy files from step 1 have been moved
    # into the directories passed as tr_p and te_p (e.g. ./first/train/, ./first/test/).
    # Sorting keeps the train and test file lists paired in the same order.
    train_file_names = map(lambda x: tr_p + x, sorted(os.listdir(tr_p)))
    test_file_names = map(lambda x: te_p + x, sorted(os.listdir(te_p)))

    list_train, list_test = [], []
    for path_train, path_test in zip(train_file_names, test_file_names):
        frame_train, frame_test = np.load(path_train), np.load(path_test)
        list_train.append(frame_train)
        list_test.append(frame_test)

    # Concatenate the per-classifier probability columns side by side
    l_train, l_test = list_train[0], list_test[0]
    for train_, test_ in zip(list_train[1:], list_test[1:]):
        l_train = np.concatenate([l_train, train_], axis=1)
        l_test = np.concatenate([l_test, test_], axis=1)
    return l_train, l_test

first_train, first_test = read_npy('./first/train/', './first/test/')
print first_train

Here is the result of reading and concatenating the npy files for the train data. Each learner contributes its predicted probabilities for both classes, so first_train and first_test each contain 12 columns (a quick shape check follows the array output below).

array([[  1.07884407e-04,   9.99892116e-01,   0.00000000e+00, ...,
          9.93333333e-01,   2.50875433e-04,   9.99749125e-01],
       [  9.96784627e-01,   3.21540073e-03,   9.76666667e-01, ...,
          2.00000000e-02,   9.53099981e-01,   4.69000190e-02],
       [  5.11407852e-05,   9.99948859e-01,   5.33333333e-02, ...,
          9.06666667e-01,   1.66652470e-06,   9.99998333e-01],
       ..., 
       [  4.93575096e-01,   5.06424904e-01,   6.30000000e-01, ...,
          4.03333333e-01,   9.49199952e-01,   5.08000478e-02],
       [  3.96782160e-03,   9.96032178e-01,   2.66666667e-02, ...,
          9.46666667e-01,   2.46422552e-06,   9.99997536e-01],
       [  9.99466836e-01,   5.33185899e-04,   9.03333333e-01, ...,
          8.00000000e-02,   9.54109081e-01,   4.58909185e-02]])
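A quick shape check (assuming the same 2000-sample split as above) should show six classifiers times two probability columns:

print first_train.shape, first_test.shape   # expected: (1000, 12) (1000, 12)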

The learning of step 2 is performed using the step 1 predictions as input features.

step2.py


[bp.blend_proba(clf, X_train=first_train, y=train[:, -1], X_test=first_test, save_preds="2", nfolds=3) for clf in clfs]

After execution, 12 npy files will be generated.

2GB_0.37311622448_16482164617e7c9d188bc75bafc06a08_test.npy
2GB_0.37311622448_16482164617e7c9d188bc75bafc06a08_train.npy
2Ne_0.784523345103_cddd24af66706c9fa26f6601910c92c5_test.npy
2Ne_0.784523345103_cddd24af66706c9fa26f6601910c92c5_train.npy
2an_0.421335902473_825e1ad5956801c2225da656822caebb_test.npy
2an_0.421335902473_825e1ad5956801c2225da656822caebb_train.npy
2au_1.9348828025_4b57dac04bbc037494cb592143a1c09c_test.npy
2au_1.9348828025_4b57dac04bbc037494cb592143a1c09c_train.npy
2ra_0.292331269114_a0cb35c894f0ad378f6bb824e1019748_test.npy
2ra_0.292331269114_a0cb35c894f0ad378f6bb824e1019748_train.npy
2xt_0.451990280749_e130a295809821efc1db2f64c228169c_test.npy
2xt_0.451990280749_e130a295809821efc1db2f64c228169c_train.npy

step 3

Read the data generated in step 2 and learn with XGBoost.

step3.py


# Assumes the step 2 npy files have been moved into ./second/train/ and ./second/test/
second_train, second_test = read_npy('./second/train/', './second/test/')

clf = XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5, min_child_weight=1,
                    gamma=0, subsample=0.8, colsample_bytree=0.5, objective='binary:logistic',
                    scale_pos_weight=1, seed=0
                   )

# Train on the stage-2 features; y is the original train labels, and only the
# final test predictions are saved
bp.blend_proba(clf, X_train=second_train, y=train[:, -1], X_test=second_test, save_test_only="3", nfolds=3)

A file named "3 + GB + hash value + _test.txt" is generated:

3GB_0.338917307945_16482164617e7c9d188bc75bafc06a08_test.txt

check_ans.py


ans = np.loadtxt('./3GB_0.338917307945_16482164617e7c9d188bc75bafc06a08_test.txt')
print "Acc: ", accuracy_score(test[:, -1], ans)

The accuracy is as follows.

Acc:  0.90

The accuracy does not seem to have improved in particular.

4. Evaluation

After several trials, the accuracy did not improve (and sometimes decreased) compared to KNN alone. The generated dataset may simply have been too small, or the artificially generated data itself may have been the problem. As MLWave shows, on Kaggle data with a large number of variables, stacking often improves accuracy by a few percent.
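For a fuller comparison (a minimal sketch; it reuses the clfs list and the train/test arrays defined earlier), each base classifier can be scored on the test set and compared with the stacked result:

from sklearn.metrics import accuracy_score

# Fit each base classifier on the raw features and report its test accuracy
for clf in clfs:
    clf.fit(train[:, :-1], train[:, -1])
    pred = clf.predict(test[:, :-1])
    print clf.__class__.__name__, accuracy_score(test[:, -1], pred)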

About Kaggle-Ensemble-Guide / correlations.py

The sample code includes a file called correlations.py. Looking inside, it calculates the correlation coefficients between the prediction results of the different classifiers. As the MLWave article mentions, the less correlated the classifiers you combine, the better the prediction accuracy you can expect (which may be obvious). While checking the correlation coefficients in this way, I would like to try various things to improve accuracy; a rough equivalent is sketched below.
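As a rough equivalent (a sketch, not correlations.py itself; it assumes first_train from step 2 is still in memory and that the files were read in the alphabetical order shown in the step 1 listing), the correlation between the class-1 probability columns of the six first-stage classifiers can be checked with pandas:

import pandas as pd

# Class-1 probability of each classifier sits in every second column (indices 1, 3, ..., 11)
class1_probs = pd.DataFrame(first_train[:, 1::2],
                            columns=['GB', 'Ne', 'an', 'au', 'ra', 'xt'])
print class1_probs.corr()   # weakly correlated classifiers make better ensemble partners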

About multi-class classification

MLWave also introduces multi-class classification with ensemble learning. In that article, an ensemble is trained for each class and the per-class prediction results are combined into a multi-class prediction (one-vs-the-rest). A rough sketch of this idea follows.
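As a rough illustration of the one-vs-the-rest idea (a sketch assuming a generic multi-class X, y rather than this article's binary data; the helper one_vs_rest_predict is my own), each class gets its own binary model and the class with the highest probability wins. In practice, each binary model could itself be a stacked ensemble as described in the MLWave article.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def one_vs_rest_predict(X_train, y_train, X_test, classes):
    scores = np.zeros((X_test.shape[0], len(classes)))
    for i, c in enumerate(classes):
        # Binary problem: "class c" vs. "everything else"
        clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
        clf.fit(X_train, (y_train == c).astype(int))
        scores[:, i] = clf.predict_proba(X_test)[:, 1]
    # Pick the class whose one-vs-rest model is most confident
    return np.array(classes)[scores.argmax(axis=1)]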

5. Reference

Kaggle Ensemble Guide
