This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work in Jupyter Notebook.
When solving simple classification problems on Kaggle, I often refer to MLWave's Ensembling Guide. There weren't many sites covering it in Japanese, so I wrote this up together with a sample program.
If you have your own data, you can skip this part.
Use make_classification (documented here) to create 2,000 samples of 3-feature, 2-class data. Passing flip_y=0 to make_classification keeps the label ratio at 1:1.
The generated data is then split in half into train and test sets.
make_classification.py
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
n_features = 3
n_samples = 2000

# Generate 2000 samples of 3-feature, 2-class data; flip_y=0 keeps the labels exactly balanced.
# np.c_ appends the label vector as the last column.
data = np.c_[make_classification(n_samples=n_samples, n_features=n_features, n_redundant=1,
                                 n_informative=2, n_clusters_per_class=2, n_classes=2, flip_y=0)]

# Split each class in half so that train and test both keep the 1:1 label ratio.
train = test = np.empty((0, n_features + 1), float)
for d in [data[data[:, n_features] == 0], data[data[:, n_features] == 1]]:
    np.random.shuffle(d)
    train = np.append(train, d[:(n_samples / 4)], axis=0)
    test = np.append(test, d[(n_samples / 4):], axis=0)

# Shuffle the rows of train and test (map is eager in Python 2).
map(lambda x: np.random.shuffle(x), [train, test])
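As a quick sanity check that the split behaves as described, here is a minimal sketch (just for illustration) using the train and test arrays built above.
check_split.py
import numpy as np

# Each split should hold 1000 samples (4 columns: 3 features + label) with a 1:1 label ratio.
print train.shape, test.shape                   # (1000, 4) (1000, 4)
print np.bincount(train[:, -1].astype(int))     # [500 500]
print np.bincount(test[:, -1].astype(int))      # [500 500]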
The contents of train and test look like this.
array([[-0.96155185, -0.49879683, 0.65487916, 1. ],
[-0.95225926, -1.00853786, -0.97598077, 0. ],
[-0.11578056, 2.51579129, -1.23724233, 0. ],
...,
[-0.93715662, 0.41894292, -1.56002152, 0. ],
[-0.69759832, -0.20810317, -0.01283087, 0. ],
[ 0.31519506, -1.75498218, 0.89115054, 1. ]])
Now we have three numeric features plus one label column.
Next, prepare the machine learning algorithms. Six classifiers are used this time: RandomForest, KNN, ExtraTrees, GradientBoosting, Naive Bayes, and XGBoost. All except XGBoost come from scikit-learn, so you can simply import them. XGBoost can be installed with pip or built from the git repository.
Create a list of classifiers to use in ensemble learning.
set_clfs.py
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50),
        KNeighborsClassifier(n_neighbors=10, n_jobs=-1),
        GaussianNB(),
        XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5, min_child_weight=1,
                      gamma=0, subsample=0.8, colsample_bytree=0.5, objective='binary:logistic',
                      scale_pos_weight=1, seed=0)
        ]
Let's try binary classification with KNN.
knn.py
from sklearn.metrics import accuracy_score
nbrs = KNeighborsClassifier().fit(train[:, :-1], train[:, -1])
print "Acc: ", accuracy_score(test[:, -1], nbrs.predict(test[:, :-1]))
Acc: 0.90
The accuracy is 90%. We will compare how this accuracy changes as a result of ensemble learning.
When dealing with simple (non-image, non-continuous-value) classification problems in competitions such as Kaggle, a single learner is rarely used on its own. Instead, ensemble learning, which combines multiple learners, is the norm. As far as I know, the most detailed explanation of ensemble learning is this MLWave article.
These four are the main ensemble methods, and the linked article details actual use cases from Kaggle. Very kindly, the author has published a framework for using them on GitHub. This time I will introduce stacked generalization & blending (on GitHub it is the program called blend_proba.py).
This function is a framework for solving binary classification with ensemble learning. The contents are very simple, so check out the code if you have time. Conceptually, stacked generalization works roughly as in the hand-rolled sketch below; the actual learning here proceeds through the following steps.
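The following is not blend_proba itself, just my own minimal illustration of the idea: out-of-fold probabilities from the first-stage classifiers become the features of a second-stage learner. It reuses train, test, and clfs from above; the LogisticRegression meta-learner, the 3-fold split, and the filename stacking_sketch.py are arbitrary choices.
stacking_sketch.py
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_tr, y_tr = train[:, :-1], train[:, -1]
X_te, y_te = test[:, :-1], test[:, -1]

# First stage: out-of-fold class probabilities on train, refit-on-all probabilities on test.
meta_train = np.hstack([cross_val_predict(clf, X_tr, y_tr, cv=3, method='predict_proba')
                        for clf in clfs])
meta_test = np.hstack([clf.fit(X_tr, y_tr).predict_proba(X_te) for clf in clfs])

# Second stage: a simple meta-learner trained on the stacked probabilities.
meta = LogisticRegression().fit(meta_train, y_tr)
print "Acc: ", accuracy_score(y_te, meta.predict(meta_test))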
step 1
After importing blend_proba, execute the following program.
step1.py
import blend_proba as bp
[bp.blend_proba(clf, X_train=train[:, :-1], y=train[:, -1], X_test=test[:, :-1], save_preds="1", nfolds=3) for clf in clfs]
After execution, an npy file is generated in the executed directory.
1GB_0.303855837305_16482164617e7c9d188bc75bafc06a08_test.npy
1GB_0.303855837305_16482164617e7c9d188bc75bafc06a08_train.npy
1Ne_0.455167671362_cddd24af66706c9fa26f6601910c92c5_test.npy
1Ne_0.455167671362_cddd24af66706c9fa26f6601910c92c5_train.npy
1an_0.249015612417_825e1ad5956801c2225da656822caebb_test.npy
1an_0.249015612417_825e1ad5956801c2225da656822caebb_train.npy
1au_0.22545173232_4b57dac04bbc037494cb592143a1c09c_test.npy
1au_0.22545173232_4b57dac04bbc037494cb592143a1c09c_train.npy
1ra_0.207753858339_a0cb35c894f0ad378f6bb824e1019748_test.npy
1ra_0.207753858339_a0cb35c894f0ad378f6bb824e1019748_train.npy
1xt_0.270981174382_e130a295809821efc1db2f64c228169c_test.npy
1xt_0.270981174382_e130a295809821efc1db2f64c228169c_train.npy
By setting save_preds="?", the predicted class probabilities are saved to files named "? + 2nd and 3rd characters of the classifier name + hash value + _test.npy (or _train.npy)". Since there are six classifiers, 12 npy files are generated.
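As a quick sanity check (a minimal sketch; the hash in the filename is specific to my run, so yours will differ), the saved probabilities can be loaded directly:
check_npy.py
import numpy as np

preds = np.load('1ra_0.207753858339_a0cb35c894f0ad378f6bb824e1019748_train.npy')
print preds.shape    # expected (1000, 2): one row per train sample, one probability column per class
print preds[:3]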
step 2
First, create a function that reads the files generated in step 1. (The paths below assume the step-1 npy files have been moved into ./first/train/ and ./first/test/.)
read_first_stage.py
import os
import numpy as np

def read_npy(tr_p, te_p):
    # Collect the npy file paths for the train and test predictions.
    train_file_names = map(lambda x: tr_p + x, os.listdir(tr_p))
    test_file_names = map(lambda x: te_p + x, os.listdir(te_p))
    list_train, list_test = [], []
    for path_train, path_test in zip(train_file_names, test_file_names):
        frame_train, frame_test = np.load(path_train), np.load(path_test)
        list_train.append(frame_train)
        list_test.append(frame_test)
    # Concatenate the per-classifier probability columns side by side.
    l_train, l_test = list_train[0], list_test[0]
    for train_, test_ in zip(list_train[1:], list_test[1:]):
        l_train = np.concatenate([l_train, train_], axis=1)
        l_test = np.concatenate([l_test, test_], axis=1)
    return l_train, l_test

first_train, first_test = read_npy('./first/train/', './first/test/')
print first_train
Here is the result of reading and concatenating the train npy files. Each learner contributes its predicted probability for both classes, so first_train and first_test each contain 12 columns.
array([[ 1.07884407e-04, 9.99892116e-01, 0.00000000e+00, ...,
9.93333333e-01, 2.50875433e-04, 9.99749125e-01],
[ 9.96784627e-01, 3.21540073e-03, 9.76666667e-01, ...,
2.00000000e-02, 9.53099981e-01, 4.69000190e-02],
[ 5.11407852e-05, 9.99948859e-01, 5.33333333e-02, ...,
9.06666667e-01, 1.66652470e-06, 9.99998333e-01],
...,
[ 4.93575096e-01, 5.06424904e-01, 6.30000000e-01, ...,
4.03333333e-01, 9.49199952e-01, 5.08000478e-02],
[ 3.96782160e-03, 9.96032178e-01, 2.66666667e-02, ...,
9.46666667e-01, 2.46422552e-06, 9.99997536e-01],
[ 9.99466836e-01, 5.33185899e-04, 9.03333333e-01, ...,
8.00000000e-02, 9.54109081e-01, 4.58909185e-02]])
Step 2 trains the same set of classifiers again, this time using the step-1 prediction results as the input features.
step2.py
[bp.blend_proba(clf, X_train=first_train, y=train[:, -1], X_test=first_test, save_preds="2", nfolds=3) for clf in clfs]
After execution, 12 npy files will be generated.
2GB_0.37311622448_16482164617e7c9d188bc75bafc06a08_test.npy
2GB_0.37311622448_16482164617e7c9d188bc75bafc06a08_train.npy
2Ne_0.784523345103_cddd24af66706c9fa26f6601910c92c5_test.npy
2Ne_0.784523345103_cddd24af66706c9fa26f6601910c92c5_train.npy
2an_0.421335902473_825e1ad5956801c2225da656822caebb_test.npy
2an_0.421335902473_825e1ad5956801c2225da656822caebb_train.npy
2au_1.9348828025_4b57dac04bbc037494cb592143a1c09c_test.npy
2au_1.9348828025_4b57dac04bbc037494cb592143a1c09c_train.npy
2ra_0.292331269114_a0cb35c894f0ad378f6bb824e1019748_test.npy
2ra_0.292331269114_a0cb35c894f0ad378f6bb824e1019748_train.npy
2xt_0.451990280749_e130a295809821efc1db2f64c228169c_test.npy
2xt_0.451990280749_e130a295809821efc1db2f64c228169c_train.npy
step 3
Read the data generated in step 2 and learn with XGBoost.
step3.py
# Read the step-2 predictions (assumed to have been moved into ./second/train/ and ./second/test/).
second_train, second_test = read_npy('./second/train/', './second/test/')
clf = XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5, min_child_weight=1,
                    gamma=0, subsample=0.8, colsample_bytree=0.5, objective='binary:logistic',
                    scale_pos_weight=1, seed=0)

# The labels still come from the original train data; only the test predictions are saved.
bp.blend_proba(clf, X_train=second_train, y=train[:, -1], X_test=second_test, save_test_only="3", nfolds=3)
A file named "3 + GB + hash value + _test.txt" is generated.
3GB_0.338917307945_16482164617e7c9d188bc75bafc06a08_test.txt
check_ans.py
ans = np.loadtxt('./3GB_0.338917307945_16482164617e7c9d188bc75bafc06a08_test.txt')
print "Acc: ", accuracy_score(test[:, -1], ans)
The accuracy is as follows.
Acc: 0.90
It seems the accuracy did not particularly improve.
Across several trials, the accuracy stayed the same as (or dropped below) the plain KNN baseline. It may be that the generated dataset was simply too small, or that the artificially generated data itself was the problem. As the MLWave article shows, on Kaggle data with a large number of variables, ensembling often improves accuracy by a few percent.
The sample code also includes a file called correlations.py, which computes the correlation coefficients between the prediction results of the different classifiers. As the MLWave article notes, the less correlated the classifiers you combine, the more improvement in prediction accuracy you can expect (which is perhaps obvious). Checking the correlation coefficients in this way, as sketched below, I will keep trying various things to improve the accuracy.
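The following is not correlations.py itself, just a rough check in the same spirit, assuming the first_train matrix built in step 2 (columns 1, 3, 5, ... hold each classifier's class-1 probability):
correlation_check.py
import pandas as pd

# Pairwise Pearson correlation between each learner's predicted probability of class 1.
class1_probs = pd.DataFrame(first_train[:, 1::2])
print class1_probs.corr()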
MLWave also introduces multi-class classification with ensemble learning. In that article, an ensemble is built for each class and the per-class prediction results are combined to perform multi-class classification (one-vs-the-rest). A rough sketch follows.
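As a minimal illustration of the one-vs-the-rest setup with scikit-learn: a single RandomForest stands in here for the per-class ensembles, and the data is freshly generated 3-class toy data, not the binary data above.
one_vs_rest.py
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score

# Toy 3-class data; OneVsRestClassifier fits one binary classifier per class
# and predicts the class with the highest score.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, n_classes=3)
ovr = OneVsRestClassifier(RandomForestClassifier(n_estimators=100))
ovr.fit(X[:1000], y[:1000])
print "Acc: ", accuracy_score(y[1000:], ovr.predict(X[1000:]))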