[PYTHON] Classification / regression by stacking (scikit-learn)

Stacking is one way to combine multiple machine learning models. In this post I try out scikit-learn's StackingClassifier and StackingRegressor in Python.

StackingClassifier

Classification by stacking

Let's use breast cancer data to see the performance of the classification model.

from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

There is room for improvement because I haven't tuned the hyperparameters, but as a first attempt the stacked model looks like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

estimators = [
        ('svc', make_pipeline(StandardScaler(), SVC())),
        ('rf', RandomForestClassifier()),
        ('mlp', MLPClassifier(max_iter=10000))
        ]
clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=10000)
)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.972027972027972
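
To make it clearer what the stacking step does, here is a minimal hand-written sketch of the same idea. As I understand the scikit-learn documentation, StackingClassifier trains the final estimator on cross-validated (out-of-fold) predictions of the base estimators and refits the base estimators on the whole training set. The sketch below follows that recipe with cross_val_predict. It is a simplification, not the library's implementation: the MLP is left out to keep it fast, and random_state=0, n_estimators=50 and the helper as_column are my own choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# random_state is fixed here only so the sketch is reproducible (assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Each base model is paired with the method used to produce its meta-feature
base_models = [
    (make_pipeline(StandardScaler(), SVC()), "decision_function"),
    (RandomForestClassifier(n_estimators=50, random_state=0), "predict_proba"),
]


def as_column(output):
    """Reduce a binary-class output to one meta-feature column (hypothetical helper)."""
    output = np.asarray(output)
    return output[:, 1] if output.ndim == 2 else output


# Step 1: out-of-fold predictions on the training set become the meta-features
meta_train = np.column_stack([
    as_column(cross_val_predict(model, X_train, y_train, cv=5, method=method))
    for model, method in base_models
])

# Step 2: the final estimator is trained on those meta-features
final = LogisticRegression(max_iter=10000).fit(meta_train, y_train)

# Step 3: base models are refit on the full training set and produce
# the meta-features for the test set
meta_test = np.column_stack([
    as_column(getattr(model.fit(X_train, y_train), method)(X_test))
    for model, method in base_models
])

manual_score = final.score(meta_test, y_test)
print(manual_score)
```

Using out-of-fold predictions in step 1 matters: if the final estimator saw predictions made on the same rows the base models were trained on, it would learn to trust overfitted outputs.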

Performance of a single classification model

As a comparison, let's calculate the accuracy of each classification model on its own, using the same split.

make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train).score(X_test, y_test)
0.965034965034965
RandomForestClassifier().fit(X_train, y_train).score(X_test, y_test)
0.951048951048951
MLPClassifier(max_iter=10000).fit(X_train, y_train).score(X_test, y_test)
0.9090909090909091
LogisticRegression(max_iter=10000).fit(X_train, y_train).score(X_test, y_test)
0.958041958041958

On this split, the stacked model (0.972) scores higher than any of the single models (0.909 to 0.965).

However, the ranking depends on the split: if you rerun everything from train_test_split, a single classification model sometimes comes out ahead.

For a fair performance comparison, I think it is better to repeat the calculation many times without fixing the random seed and check how stable the scores are.
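
The repeated-split check described above can be sketched as follows. This is only an illustration under my own assumptions: the helper names make_models and repeated_scores are hypothetical, the MLP is omitted and the forest is reduced to 50 trees so the loop runs quickly, and five repeats is far fewer than you would want in practice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_models():
    """Return fresh, unfitted models keyed by name (hypothetical helper)."""
    return {
        "svc": make_pipeline(StandardScaler(), SVC()),
        "rf": RandomForestClassifier(n_estimators=50),
        "stack": StackingClassifier(
            estimators=[
                ("svc", make_pipeline(StandardScaler(), SVC())),
                ("rf", RandomForestClassifier(n_estimators=50)),
            ],
            final_estimator=LogisticRegression(max_iter=10000),
        ),
    }


def repeated_scores(X, y, n_repeats=5):
    """Score every model on n_repeats random splits, without fixing the seed."""
    scores = {}
    for _ in range(n_repeats):
        # A new random split each time; all models share the same split
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y)
        for name, model in make_models().items():
            acc = model.fit(X_tr, y_tr).score(X_te, y_te)
            scores.setdefault(name, []).append(acc)
    return {name: np.array(vals) for name, vals in scores.items()}


X, y = load_breast_cancer(return_X_y=True)
results = repeated_scores(X, y, n_repeats=5)
for name, vals in results.items():
    print(f"{name}: mean={vals.mean():.3f} std={vals.std():.3f}")
```

Comparing the mean and standard deviation across splits gives a better picture than a single score, since differences of one or two test samples can flip the ranking on a dataset this small.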

StackingRegressor

Regression by stacking

Let's use diabetes data to see the performance of the regression model.

from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

There is room for improvement here as well because the parameters have not been tuned.

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.ensemble import StackingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.cross_decomposition import PLSRegression

estimators = [
        ('svr', make_pipeline(StandardScaler(), SVR())),
        ('rf', RandomForestRegressor()),
        ('mlp', MLPRegressor(max_iter=10000))
        ]
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=PLSRegression(),
)
reg.fit(X_train, y_train)
reg.score(X_test, y_test)
0.4940607294168183

Performance of a single regression model

For comparison, let's calculate the R² value (coefficient of determination) of each regression model on its own.

make_pipeline(StandardScaler(), SVR()).fit(X_train, y_train).score(X_test, y_test)
0.17571936903725216
RandomForestRegressor().fit(X_train, y_train).score(X_test, y_test)
0.46261715392586217
MLPRegressor(max_iter=10000).fit(X_train, y_train).score(X_test, y_test)
0.4936782755875562
PLSRegression().fit(X_train, y_train).score(X_test, y_test)
0.4927059150604132
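
As a reminder of what these numbers mean: for scikit-learn regressors, score returns the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. The short check below computes it by hand and compares it with score and r2_score. The fixed random_state and the choice of a 50-tree random forest are my own assumptions, made only to keep the example quick and reproducible.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# R^2 computed by hand from its definition
ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
manual_r2 = 1.0 - ss_res / ss_tot

# All three values should agree up to floating-point error
print(manual_r2, reg.score(X_test, y_test), r2_score(y_test, y_pred))
```

An R² of 1 means perfect predictions, 0 means no better than always predicting the mean of the test targets, and negative values are possible for models that do worse than that.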

Again, on this split the stacked model (0.494) scores higher than any single model, although its margin over MLPRegressor (0.4937) and PLSRegression (0.4927) is very small.

However, if you rerun everything from train_test_split, a single regression model may come out ahead depending on how the data is split.

As with classification, I think it is better to repeat the calculation many times without fixing the random seed and check how stable the scores are.
