[Python] Ensemble learning summary!! (With implementation)

Introduction

Ensemble learning is a very important topic in machine learning. This article summarizes it so that it is easy to understand and review.

What is ensemble learning?

A method that builds a single learning model by combining multiple models (learners).

It was born from the idea that combining multiple models should give better prediction accuracy than training just one model!

What does it mean to improve prediction accuracy?

It means minimizing the error between the predicted values and the actual values (for example, the mean squared error used later in this article).

The important keywords for actually checking the prediction accuracy are "bias" and "variance"!

Bias

○ The average error between the actual values and the predicted values.
・ If the value is small, the prediction accuracy is high.
・ If the value is large, the prediction accuracy is low.

Variance

○ A value that indicates how scattered the predicted values are.
・ If the value is small, the predictions are stable.
・ If the value is large, the predictions are scattered. (This can be a sign of overfitting.)

Bias and variance are in a trade-off relationship!!

・ If you push the bias down too far (very high accuracy on the training data), overfitting becomes likely.
・ If the predicted values are scattered (high variance), the prediction accuracy drops.

It is important to adjust the balance between these two!
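
To make the two terms concrete, here is a minimal NumPy sketch that computes them in the loose sense used above (the numbers are made up purely for illustration):

python.py

import numpy as np

# Toy data: actual values and one model's predictions (made-up numbers)
y_true = np.array([100.0, 120.0, 140.0, 160.0, 180.0])
y_pred = np.array([110.0, 115.0, 150.0, 155.0, 190.0])

# Bias (in the loose sense above): average error between actual and predicted values
bias = np.mean(y_pred - y_true)

# Variance (in the loose sense above): how scattered the predictions are
variance = np.var(y_pred)

print("bias: {:.2f}, variance: {:.2f}".format(bias, variance))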

Typical ensemble learning methods

① Bagging

Different subsets of the data are drawn (the bootstrap method) to build multiple different models (weak learners). The average of those models is then used as the final model.

○ Features
・ Variance can be reduced.
・ Training time is short because of parallel processing. (Multiple datasets are drawn with the bootstrap method and learned at the same time.)

○ Representative model: Random forest
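
As a reference, here is a minimal scikit-learn sketch of the bagging idea on made-up toy data (BaggingRegressor uses a decision tree as its default weak learner):

python.py

import numpy as np
from sklearn.ensemble import BaggingRegressor

# Made-up toy regression data, just for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, size=200)

# Bagging: each of the 100 weak learners is trained on its own bootstrap sample,
# and the final prediction is the average of all of them
model = BaggingRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))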

② Boosting

The same data is learned over and over, each round building a model that corrects the previous one, producing a more accurate model.

○ Features
・ Bias can be reduced. (Better accuracy than bagging can be expected.)
・ Training time is long because of serial processing. (A model that improves on the previous model's results is built repeatedly.)

○ Representative models: XGBoost / LightGBM
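
Here is a similarly minimal sketch of the boosting idea, using scikit-learn's GradientBoostingRegressor as a stand-in (XGBoost and LightGBM follow the same serial, error-correcting scheme), again on made-up toy data:

python.py

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Made-up toy regression data, just for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, size=200)

# Boosting: the 100 trees are built one after another (serially),
# each one fitting the remaining errors of the ensemble so far
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))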

③ Stacking

A model is created by combining multiple models in stages.

Specifically, the flow here is that the values predicted by "multiple regression analysis", "random forest", and "LightGBM" are used as features for another multiple regression analysis, which makes the final prediction.

In other words, the three predicted values produced by the three models become the input values for the final multiple regression analysis.

You may be wondering which models to combine; it is common to combine tree-based models (random forest, XGBoost, etc.) with regression models (multiple regression analysis). (By combining models from different families, each may capture features the others cannot discover on their own.)

○ Features
・ Prediction accuracy improves. (Basically better than any single model.)
・ The results become harder to interpret and analyze.
・ Training time becomes longer.

Stacking implementation

Data preprocessing is omitted here. (Only simple preprocessing was done, so your results may differ.)

The flow: the predicted values from multiple regression analysis, random forest, and LightGBM are fed into another multiple regression analysis (the meta-model), whose output is the final predicted value.

Data reading

python.py


import pandas as pd
import numpy as np

# Read the data
df = pd.read_csv('train.tsv', delimiter='\t')
df = df.drop(['id', 'dteday', 'yr', 'atemp'], axis=1)


### Preprocess the data here if necessary.


# Explanatory variables (features)
X = df.drop('cnt', axis=1)
# Objective variable (target)
y = df['cnt']

# Split the data into three sets: train / validation / test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
print(X_train.shape)
print(X_valid.shape)
print(X_test.shape)
print(y_train.shape)
print(y_valid.shape)
print(y_test.shape)

(5532, 8)
(1384, 8)
(1729, 8)
(5532,)
(1384,)
(1729,)

First-stage predictions

python.py


from sklearn.linear_model import LinearRegression  # Multiple regression analysis
from sklearn.ensemble import RandomForestRegressor  # Random forest
import lightgbm as lgb  # LightGBM
# Evaluation metric (mean squared error)
from sklearn.metrics import mean_squared_error

# Model instances
model_1 = LinearRegression()
model_2 = RandomForestRegressor()
model_3 = lgb.LGBMRegressor()

# Train each model on the training data
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)

# Create the predicted values on the test data
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)

# Check each model's accuracy with the mean squared error
print("Multiple regression mean squared error: {:.4f}".format(mean_squared_error(y_test, pred_1)))
print("Random forest mean squared error: {:.4f}".format(mean_squared_error(y_test, pred_2)))
print("LightGBM mean squared error: {:.4f}".format(mean_squared_error(y_test, pred_3)))


Multiple regression mean squared error: 6825.7104
Random forest mean squared error: 4419.4774
LightGBM mean squared error: 4043.2921

Stacking the predictions (meta-model)

python.py


# First-stage predictions on the validation data
first_pred_1 = model_1.predict(X_valid)
first_pred_2 = model_2.predict(X_valid)
first_pred_3 = model_3.predict(X_valid)

# Stack the first-stage predictions side by side (features for the meta-model)
stack_pred = np.column_stack((first_pred_1, first_pred_2, first_pred_3))

# Create the meta-model
meta_model = LinearRegression()
# The answers for the first-stage predictions are y_valid
meta_model.fit(stack_pred, y_valid)

# Check the stacking accuracy with the test predictions made earlier
stack_test_pred = np.column_stack((pred_1, pred_2, pred_3))
meta_test_pred = meta_model.predict(stack_test_pred)
print("Meta-model mean squared error: {:.4f}".format(mean_squared_error(y_test, meta_test_pred)))

Meta-model mean squared error: 4030.9495
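
Incidentally, scikit-learn also provides a StackingRegressor that wraps this whole flow in one estimator (by default it builds the meta-features with cross-validation rather than a single hold-out split). A minimal sketch, reusing the imports and variables from the code above:

python.py

from sklearn.ensemble import StackingRegressor

# First-stage models and the linear meta-model combined in one estimator
stack = StackingRegressor(
    estimators=[
        ('lr', LinearRegression()),
        ('rf', RandomForestRegressor()),
        ('lgbm', lgb.LGBMRegressor()),
    ],
    final_estimator=LinearRegression(),
)
stack.fit(X_train, y_train)
print("StackingRegressor mean squared error: {:.4f}".format(
    mean_squared_error(y_test, stack.predict(X_test))))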

In conclusion

The prediction accuracy ended up slightly better than any of the single models.

○ Improvement points
・ Tune the parameters of each model.
・ Change or add first-stage models.
・ Keep the same model types but add versions with different parameters, as sketched below. (e.g. Random Forest with n_estimators=50, n_estimators=100, n_estimators=1000, etc.)
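
The last idea could look like the following hypothetical sketch (variable names follow the article's code above):

python.py

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same model family, different parameters, as the first-stage learners
models = [
    RandomForestRegressor(n_estimators=50, random_state=0),
    RandomForestRegressor(n_estimators=100, random_state=0),
    RandomForestRegressor(n_estimators=1000, random_state=0),
]
for m in models:
    m.fit(X_train, y_train)

# Their validation predictions become the meta-model features,
# exactly as in the stacking code above
first_preds = np.column_stack([m.predict(X_valid) for m in models])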
