This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work in Jupyter Notebook.
When solving simple classification problems on Kaggle, I often refer to MLWave's Ensembling Guide. There weren't many sites covering it in Japanese, so I wrote this up together with a sample program.
If you have your own data, you can skip this part.
Use make_classification (documented here) to create 2,000 samples of 3-feature, 2-class data. Passing flip_y=0 to make_classification keeps the label ratio at 1:1.
The generated data is then split in half into train and test sets.
make_classification.py
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
n_features = 3
n_samples = 2000

# Generate 2000 samples of 3-feature, 2-class data; flip_y=0 keeps the labels exactly balanced.
# np.c_ appends the label vector as the last column.
data = np.c_[make_classification(n_samples=n_samples, n_features=n_features, n_redundant=1,
                                 n_informative=2, n_clusters_per_class=2, n_classes=2, flip_y=0)]

# Split each class in half so that train and test both keep the 1:1 label ratio.
train = test = np.empty((0, n_features + 1), float)
for d in [data[data[:, n_features] == 0], data[data[:, n_features] == 1]]:
    np.random.shuffle(d)
    train = np.append(train, d[:(n_samples / 4)], axis=0)
    test = np.append(test, d[(n_samples / 4):], axis=0)

# Shuffle the rows of train and test (map is eager in Python 2).
map(lambda x: np.random.shuffle(x), [train, test])
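As a quick sanity check that the split behaves as described, here is a minimal sketch (just for illustration) using the train and test arrays built above.
check_split.py
import numpy as np

# Each split should hold 1000 samples (4 columns: 3 features + label) with a 1:1 label ratio.
print train.shape, test.shape                   # (1000, 4) (1000, 4)
print np.bincount(train[:, -1].astype(int))     # [500 500]
print np.bincount(test[:, -1].astype(int))      # [500 500]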
The contents of train and test look like this.
array([[-0.96155185, -0.49879683, 0.65487916, 1. ],
[-0.95225926, -1.00853786, -0.97598077, 0. ],
[-0.11578056, 2.51579129, -1.23724233, 0. ],
...,
[-0.93715662, 0.41894292, -1.56002152, 0. ],
[-0.69759832, -0.20810317, -0.01283087, 0. ],
[ 0.31519506, -1.75498218, 0.89115054, 1. ]])
Now we have three numeric features plus one label column.
Next, prepare the machine learning algorithms. Six classifiers are used this time: RandomForest, KNN, ExtraTrees, GradientBoosting, Naive Bayes, and XGBoost. All except XGBoost come from scikit-learn, so you can simply import them. XGBoost can be installed with pip or built from the git repository.
Create a list of classifiers to use in ensemble learning.
set_clfs.py
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50),
        KNeighborsClassifier(n_neighbors=10, n_jobs=-1),
        GaussianNB(),
        XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5, min_child_weight=1,
                      gamma=0, subsample=0.8, colsample_bytree=0.5, objective='binary:logistic',
                      scale_pos_weight=1, seed=0)
        ]
Let's try binary classification with KNN.
knn.py
from sklearn.metrics import accuracy_score
nbrs = KNeighborsClassifier().fit(train[:, :-1], train[:, -1])
print "Acc: ", accuracy_score(test[:, -1], nbrs.predict(test[:, :-1]))
Acc: 0.90
The accuracy is 90%. We will compare how this accuracy changes as a result of ensemble learning.
When dealing with simple (non-image, non-continuous-value) classification problems in competitions such as Kaggle, a single learner is rarely used on its own. Instead, ensemble learning, which combines multiple learners, is the norm. As far as I know, the most detailed explanation of ensemble learning is this MLWave article.
These four are the main ensemble methods, and the linked article details actual use cases from Kaggle. Very kindly, the author has published a framework for using them on GitHub. This time I will introduce stacked generalization & blending (on GitHub it is the program called blend_proba.py).
This function is a framework for solving binary classification with ensemble learning. The contents are very simple, so check out the code if you have time. Conceptually, stacked generalization works roughly as in the hand-rolled sketch below; the actual learning here proceeds through the following steps.
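The following is not blend_proba itself, just my own minimal illustration of the idea: out-of-fold probabilities from the first-stage classifiers become the features of a second-stage learner. It reuses train, test, and clfs from above; the LogisticRegression meta-learner, the 3-fold split, and the filename stacking_sketch.py are arbitrary choices.
stacking_sketch.py
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_tr, y_tr = train[:, :-1], train[:, -1]
X_te, y_te = test[:, :-1], test[:, -1]

# First stage: out-of-fold class probabilities on train, refit-on-all probabilities on test.
meta_train = np.hstack([cross_val_predict(clf, X_tr, y_tr, cv=3, method='predict_proba')
                        for clf in clfs])
meta_test = np.hstack([clf.fit(X_tr, y_tr).predict_proba(X_te) for clf in clfs])

# Second stage: a simple meta-learner trained on the stacked probabilities.
meta = LogisticRegression().fit(meta_train, y_tr)
print "Acc: ", accuracy_score(y_te, meta.predict(meta_test))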
step 1
After importing blend_proba, execute the following program.
step1.py
import blend_proba as bp
[bp.blend_proba(clf, X_train=train[:, :-1], y=train[:, -1], X_test=test[:, :-1], save_preds="1", nfolds=3) for clf in clfs]
After execution, an npy file is generated in the executed directory.
1GB_0.303855837305_16482164617e7c9d188bc75bafc06a08_test.npy
1GB_0.303855837305_16482164617e7c9d188bc75bafc06a08_train.npy
1Ne_0.455167671362_cddd24af66706c9fa26f6601910c92c5_test.npy
1Ne_0.455167671362_cddd24af66706c9fa26f6601910c92c5_train.npy
1an_0.249015612417_825e1ad5956801c2225da656822caebb_test.npy
1an_0.249015612417_825e1ad5956801c2225da656822caebb_train.npy
1au_0.22545173232_4b57dac04bbc037494cb592143a1c09c_test.npy
1au_0.22545173232_4b57dac04bbc037494cb592143a1c09c_train.npy
1ra_0.207753858339_a0cb35c894f0ad378f6bb824e1019748_test.npy
1ra_0.207753858339_a0cb35c894f0ad378f6bb824e1019748_train.npy
1xt_0.270981174382_e130a295809821efc1db2f64c228169c_test.npy
1xt_0.270981174382_e130a295809821efc1db2f64c228169c_train.npy
By setting save_preds="?", the predicted class probabilities are saved to files named "? + 2nd and 3rd characters of the classifier name + hash value + _test.npy (or _train.npy)". Since there are six classifiers, 12 npy files are generated.
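As a quick sanity check (a minimal sketch; the hash in the filename is specific to my run, so yours will differ), the saved probabilities can be loaded directly:
check_npy.py
import numpy as np

preds = np.load('1ra_0.207753858339_a0cb35c894f0ad378f6bb824e1019748_train.npy')
print preds.shape    # expected (1000, 2): one row per train sample, one probability column per class
print preds[:3]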
step 2
First, create a function that reads the files generated in step 1. (The paths below assume the step-1 npy files have been moved into ./first/train/ and ./first/test/.)
read_first_stage.py
import os
import numpy as np

def read_npy(tr_p, te_p):
    # Collect the npy file paths for the train and test predictions.
    train_file_names = map(lambda x: tr_p + x, os.listdir(tr_p))
    test_file_names = map(lambda x: te_p + x, os.listdir(te_p))
    list_train, list_test = [], []
    for path_train, path_test in zip(train_file_names, test_file_names):
        frame_train, frame_test = np.load(path_train), np.load(path_test)
        list_train.append(frame_train)
        list_test.append(frame_test)
    # Concatenate the per-classifier probability columns side by side.
    l_train, l_test = list_train[0], list_test[0]
    for train_, test_ in zip(list_train[1:], list_test[1:]):
        l_train = np.concatenate([l_train, train_], axis=1)
        l_test = np.concatenate([l_test, test_], axis=1)
    return l_train, l_test

first_train, first_test = read_npy('./first/train/', './first/test/')
print first_train
Here is the result of reading and concatenating the train npy files. Each learner contributes its predicted probability for both classes, so first_train and first_test each contain 12 columns.
array([[ 1.07884407e-04, 9.99892116e-01, 0.00000000e+00, ...,
9.93333333e-01, 2.50875433e-04, 9.99749125e-01],
[ 9.96784627e-01, 3.21540073e-03, 9.76666667e-01, ...,
2.00000000e-02, 9.53099981e-01, 4.69000190e-02],
[ 5.11407852e-05, 9.99948859e-01, 5.33333333e-02, ...,
9.06666667e-01, 1.66652470e-06, 9.99998333e-01],
...,
[ 4.93575096e-01, 5.06424904e-01, 6.30000000e-01, ...,
4.03333333e-01, 9.49199952e-01, 5.08000478e-02],
[ 3.96782160e-03, 9.96032178e-01, 2.66666667e-02, ...,
9.46666667e-01, 2.46422552e-06, 9.99997536e-01],
[ 9.99466836e-01, 5.33185899e-04, 9.03333333e-01, ...,
8.00000000e-02, 9.54109081e-01, 4.58909185e-02]])
Step 2 trains the same set of classifiers again, this time using the step-1 prediction results as the input features.
step2.py
[bp.blend_proba(clf, X_train=first_train, y=train[:, -1], X_test=first_test, save_preds="2", nfolds=3) for clf in clfs]
After execution, 12 npy files will be generated.
2GB_0.37311622448_16482164617e7c9d188bc75bafc06a08_test.npy
2GB_0.37311622448_16482164617e7c9d188bc75bafc06a08_train.npy
2Ne_0.784523345103_cddd24af66706c9fa26f6601910c92c5_test.npy
2Ne_0.784523345103_cddd24af66706c9fa26f6601910c92c5_train.npy
2an_0.421335902473_825e1ad5956801c2225da656822caebb_test.npy
2an_0.421335902473_825e1ad5956801c2225da656822caebb_train.npy
2au_1.9348828025_4b57dac04bbc037494cb592143a1c09c_test.npy
2au_1.9348828025_4b57dac04bbc037494cb592143a1c09c_train.npy
2ra_0.292331269114_a0cb35c894f0ad378f6bb824e1019748_test.npy
2ra_0.292331269114_a0cb35c894f0ad378f6bb824e1019748_train.npy
2xt_0.451990280749_e130a295809821efc1db2f64c228169c_test.npy
2xt_0.451990280749_e130a295809821efc1db2f64c228169c_train.npy
step 3
Read the data generated in step 2 and learn with XGBoost.
step3.py
# Read the step-2 predictions (assumed to have been moved into ./second/train/ and ./second/test/).
second_train, second_test = read_npy('./second/train/', './second/test/')
clf = XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5, min_child_weight=1,
                    gamma=0, subsample=0.8, colsample_bytree=0.5, objective='binary:logistic',
                    scale_pos_weight=1, seed=0)

# The labels still come from the original train data; only the test predictions are saved.
bp.blend_proba(clf, X_train=second_train, y=train[:, -1], X_test=second_test, save_test_only="3", nfolds=3)
A file named "3 + GB + hash value + _test.txt" is generated.
3GB_0.338917307945_16482164617e7c9d188bc75bafc06a08_test.txt
check_ans.py
ans = np.loadtxt('./3GB_0.338917307945_16482164617e7c9d188bc75bafc06a08_test.txt')
print "Acc: ", accuracy_score(test[:, -1], ans)
The accuracy is as follows.
Acc: 0.90
It seems the accuracy did not particularly improve.
Across several trials, the accuracy stayed the same as (or dropped below) the plain KNN baseline. It may be that the generated dataset was simply too small, or that the artificially generated data itself was the problem. As the MLWave article shows, on Kaggle data with a large number of variables, ensembling often improves accuracy by a few percent.
The sample code also includes a file called correlations.py, which computes the correlation coefficients between the prediction results of the different classifiers. As the MLWave article notes, the less correlated the classifiers you combine, the more improvement in prediction accuracy you can expect (which is perhaps obvious). Checking the correlation coefficients in this way, as sketched below, I will keep trying various things to improve the accuracy.
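The following is not correlations.py itself, just a rough check in the same spirit, assuming the first_train matrix built in step 2 (columns 1, 3, 5, ... hold each classifier's class-1 probability):
correlation_check.py
import pandas as pd

# Pairwise Pearson correlation between each learner's predicted probability of class 1.
class1_probs = pd.DataFrame(first_train[:, 1::2])
print class1_probs.corr()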
MLWave also introduces multi-class classification with ensemble learning. In that article, an ensemble is built for each class and the per-class prediction results are combined to perform multi-class classification (one-vs-the-rest). A rough sketch follows.
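As a minimal illustration of the one-vs-the-rest setup with scikit-learn: a single RandomForest stands in here for the per-class ensembles, and the data is freshly generated 3-class toy data, not the binary data above.
one_vs_rest.py
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score

# Toy 3-class data; OneVsRestClassifier fits one binary classifier per class
# and predicts the class with the highest score.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, n_classes=3)
ovr = OneVsRestClassifier(RandomForestClassifier(n_estimators=100))
ovr.fit(X[:1000], y[:1000])
print "Acc: ", accuracy_score(y[1000:], ovr.predict(X[1000:]))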