[PYTHON] What is ensemble learning?

What is ensemble learning?

- **A technique that combines multiple learners** to obtain better predictions. In most cases it gives better results than a single model alone.

- Specifically, the predictions of multiple predictors are combined by operations such as **taking the average** or **taking a majority vote**.

- **Boosting** and **random forests**, which have attracted attention in the field of data analysis in recent years, are also types of ensemble learning.

Bagging

Although we say "combine multiple learners (predictors)", there is no point in combining models trained on **the same data** with **the same algorithm**.

That said, we usually have only one training dataset.

This is where a technique called the **bootstrap** comes in.

- **Bootstrap**: sample n data points at random, with replacement, from the training data.

Generate N bootstrap datasets of size n from the training data.

Train N prediction models on these datasets, and let the prediction of the n-th model be $y_n(x)$.

The final prediction of the bagging model is then:

$$\hat{y}(x) = \frac{1}{N} \sum_{n=1}^{N} y_n(x)$$
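A minimal hand-rolled sketch of bagging (the 1-D toy data, the base learner, and the number of bootstrap sets N are illustrative assumptions, not part of the experiments below):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

# illustrative 1-D regression data
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)

N = 20          # number of bootstrap datasets
n = len(X)      # size of each bootstrap sample
models = []
for _ in range(N):
    idx = rng.randint(0, n, n)                       # sample n points with replacement
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# bagging prediction = simple average of the N individual predictions
X_new = np.linspace(0, 5, 10).reshape(-1, 1)
y_bag = np.mean([m.predict(X_new) for m in models], axis=0)
print(y_bag)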

Stacking

In bagging above, we took a simple average of the N predicted values. In other words, every prediction is weighted equally, and **the importance of each model cannot be taken into account**.

- In stacking, a **weighted average** of the individual predictions is used as the final prediction, so the importance of each model is taken into account.

The final prediction is therefore:

$$\hat{y}(x) = \sum_{n=1}^{N} w_n \, y_n(x)$$
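One way to sketch this with scikit-learn (requires scikit-learn ≥ 0.22) is StackingRegressor with a linear final estimator, which learns exactly such weights from the base models' predictions; the dataset and the base learners here are illustrative assumptions, not the article's own setup:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import StackingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[("tree", DecisionTreeRegressor(max_depth=4)),
                ("knn", KNeighborsRegressor())],
    final_estimator=LinearRegression(),   # learns the weight of each base prediction
)
stack.fit(X, y)

# coefficients of the final estimator = learned weight (importance) of each base model
print(stack.final_estimator_.coef_)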

Bumping

- Bumping is a method for **searching for the best single model among multiple predictors**.

- Generate N models from bootstrap datasets, apply each of them to the original training data, and select the one that minimizes the prediction error as the best model.

- At first glance it may seem to have no advantage over bagging or stacking, but **when poor-quality data points lead to an undesirable solution, a bootstrap dataset that happens to exclude those points can yield a better one**.
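A minimal sketch of bumping (the dataset, base learner, and number of bootstrap sets are illustrative): train one model per bootstrap dataset, evaluate every model on the original training data, and keep the best one.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

N = 20
best_model, best_error = None, np.inf
for _ in range(N):
    idx = rng.randint(0, len(X), len(X))                 # bootstrap sample
    model = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
    error = 1.0 - model.score(X, y)                      # error on the ORIGINAL training data
    if error < best_error:
        best_model, best_error = model, error

print(best_error)   # training error of the selected (best) model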

Random forest

- A random forest is a method that uses a **decision tree** as the base learner for the bagging described above. The specific algorithm is as follows:

(1) Draw N bootstrap datasets from the training data.
(2) Use these datasets to grow N trees $T_n$. At this step, only m features are randomly selected out of the p features.
(3) Take the **average** for regression, or a **majority vote** for classification, as the final prediction.
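A hand-rolled sketch of steps (1) to (3) for classification; the dataset, m, and the number of trees are illustrative assumptions, and for simplicity each tree picks its m features once, whereas a true random forest re-samples them at every split. In practice, the RandomForestClassifier used in the experiments below does all of this internally.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

N, m = 50, 3                      # number of trees, features per tree
p = X.shape[1]
trees, feat_sets = [], []
for _ in range(N):
    idx = rng.randint(0, len(X), len(X))       # (1) bootstrap sample
    feats = rng.choice(p, m, replace=False)    # (2) random subset of m out of p features
    trees.append(DecisionTreeClassifier().fit(X[idx][:, feats], y[idx]))
    feat_sets.append(feats)

# (3) majority vote over the N trees
votes = np.array([t.predict(X[:, f]) for t, f in zip(trees, feat_sets)])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((y_pred == y).mean())       # training accuracy of the ensemble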

Why use a decision tree for the base learner?

- The basic idea of bagging is to reduce error by combining multiple models with **large variance and small bias**.

(1) Large variance / small bias → complex models (decision trees, nearest neighbor methods)
(2) Small variance / large bias → simple models (linear regression)

- With its large variance and small bias, the decision tree is an ideal base learner for bagging **(its tendency to overfit can be corrected by averaging multiple models)**.

- It also has other merits: it is fast, it works regardless of the data types of the variables, and it is invariant to feature scaling.

Why use only some features?

- In ensemble learning, **the lower the correlation between models, the higher the accuracy of the final prediction**.

→ There is little point in collecting many similar models; performance is higher when models trained on different data are combined.

- In addition to bootstrapping, the correlation between models is lowered by changing the variables (features) each model is trained on.
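A sketch of this idea with scikit-learn's BaggingClassifier (the dataset and hyperparameters are illustrative): the only difference between the two ensembles is whether each tree sees all features or a random half of them. Whether the feature subsampling actually improves the score depends on the data.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# plain bagging: every tree is trained on all 20 features
bag_all = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# bagging + feature subsampling: each tree sees only a random half of the features
bag_sub = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_features=0.5, random_state=0)

print(cross_val_score(bag_all, X, y, cv=5).mean())
print(cross_val_score(bag_sub, X, y, cv=5).mean())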

What is boosting?

-"Boosting" is one of the ensemble learning methods.

--Train the base learner ** sequentially **. (Generate the next learner based on the previous learner) Techniques such as bagging and stacking ultimately combine multiple base learners to produce predictive values. (There is no relationship between the learners before and after)

-** Two methods called "AdaBoost" ** and ** "gradient boosting" ** are typical.

-The algorithm realized by the library ** "Xgboost" **, which is quite popular in ** Kaggle **.

AdaBoost

- A **weighted dataset** is used when training each learner.
- Data points that the previous learner **misclassified** are given a larger weight.
- At first, all data points are given equal weight.
- The predictions of the base learners are finally combined to obtain the final prediction.
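A minimal AdaBoost example with scikit-learn (the dataset and hyperparameters are illustrative; by default AdaBoostClassifier uses depth-1 decision trees, so-called stumps, as base learners):

from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each new stump focuses on the samples the previous ones misclassified
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(ada.score(X_test, y_test))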

What is Gradient Boosting?

- Each new learner is fitted to the **residuals** of the previous learner.
- A **decision tree** is often used as the base learner.
- This is the algorithm implemented by **XGBoost**.
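A minimal hand-rolled sketch of the residual-fitting idea for regression (the data, tree depth, and learning rate are illustrative assumptions); in practice libraries such as scikit-learn's GradientBoostingRegressor or XGBoost are used instead.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 100)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=100)

learning_rate = 0.1
F = np.zeros_like(y)            # current prediction of the ensemble
trees = []
for _ in range(100):
    residual = y - F                                   # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F += learning_rate * tree.predict(X)               # add a small correction
    trees.append(tree)

# prediction for new points = sum of all the small corrections
X_new = np.linspace(0, 5, 5).reshape(-1, 1)
y_pred = sum(learning_rate * t.predict(X_new) for t in trees)
print(y_pred)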


Experiment ①

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

moons=make_moons(n_samples=200,noise=0.2,random_state=0)

X=moons[0]
y=moons[1]

from matplotlib.colors import ListedColormap

def plot_decision_boundary(model,X,y):
    _x1 = np.linspace(X[:,0].min()-0.5,X[:,0].max()+0.5,100)
    _x2 = np.linspace(X[:,1].min()-0.5,X[:,1].max()+0.5,100)
    x1,x2 = np.meshgrid(_x1,_x2)
    X_new=np.c_[x1.ravel(),x2.ravel()]
    y_pred=model.predict(X_new).reshape(x1.shape)
    custom_cmap=ListedColormap(["mediumblue","orangered"])
    plt.contourf(x1,x2,y_pred,cmap=custom_cmap,alpha=0.3)
    
def plot_dataset(X,y):
    plt.plot(X[:,0][y==0],X[:,1][y==0],"bo",ms=15)
    plt.plot(X[:,0][y==1],X[:,1][y==1],"r^",ms=15)
    plt.xlabel("$x_1$",fontsize=30)
    plt.ylabel("$x_2$",fontsize=30,rotation=0)

plt.figure(figsize=(12,8))
plot_dataset(X,y)
plt.show()

(Figure: scatter plot of the two-moons dataset)

Decision tree analysis

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # decision tree (CART) implementation in scikit-learn

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)

tree_clf=DecisionTreeClassifier().fit(X_train,y_train)  # by default the tree depth is not limited

plt.figure(figsize=(12,8))
plot_decision_boundary(tree_clf,X,y)
plot_dataset(X,y)
plt.show()

(Figure: decision boundary of the decision tree)

Random forest

from sklearn.ensemble import RandomForestClassifier

random_forest=RandomForestClassifier(n_estimators=100,random_state=0).fit(X_train,y_train)
# n_estimators specifies the number of decision trees used for bagging (the default was 10 in older scikit-learn, 100 in recent versions)

plt.figure(figsize=(12,8))
plot_decision_boundary(random_forest,X,y)
plot_dataset(X,y)
plt.show()

(Figure: decision boundary of the random forest)

Experiment ②

from sklearn.datasets import load_iris

iris=load_iris()
X_iris=iris.data
y_iris=iris.target

random_forest_iris=RandomForestClassifier(random_state=0).fit(X_iris,y_iris)

#How important each feature is
random_forest_iris.feature_importances_

plt.figure(figsize=(12,8))
plt.barh(range(iris.data.shape[1]),random_forest_iris.feature_importances_,height=0.5)
plt.yticks(range(iris.data.shape[1]),iris.feature_names,fontsize=20)
plt.xlabel("Feature importance",fontsize=30)
plt.show()

(Figure: feature importances of the four iris features)

Experiment ③

The dataset used was Kaggle's Titanic. https://www.kaggle.com/c/titanic

import pandas as pd

df=pd.read_csv("train.csv")
df["Age"]=df["Age"].fillna(df["Age"].mean())
df["Embarked"]=df["Embarked"].fillna(df["Embarked"].mode()[0])#Mode

from sklearn.preprocessing import LabelEncoder

cat_features=["Sex","Embarked"]

for col in cat_features:
    lbl = LabelEncoder()
    df[col]=lbl.fit_transform(list(df[col].values))

X=df.drop(columns=["PassengerId","Survived","Name","Ticket","Cabin"])
y=df["Survived"]

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)

tree=DecisionTreeClassifier().fit(X_train,y_train)
print(tree.score(X_test,y_test))

rnd_forest=RandomForestClassifier(n_estimators=500,max_depth=5,random_state=0).fit(X_train,y_train)

print(rnd_forest.score(X_test,y_test))

(Output: test accuracy of the decision tree and the random forest)

Submission file (output as submisson.csv)

#Submission form
test_df=pd.read_csv("test.csv")
test_df["Age"]=test_df["Age"].fillna(test_df["Age"].mean())
test_df["Fare"]=test_df["Fare"].fillna(test_df["Fare"].mean())
test_df["Embarked"]=test_df["Embarked"].fillna(test_df["Embarked"].mode()[0])#Mode

for col in cat_features:
    lbl = LabelEncoder()
    test_df[col]=lbl.fit_transform(list(test_df[col].values))

X_pred=test_df.drop(columns=["PassengerId","Name","Ticket","Cabin"])
ID=test_df["PassengerId"]

prediction=rnd_forest.predict(X_pred)

submisson=pd.DataFrame({
    "PassengerId":ID,
    "Survived":prediction
})

submisson.to_csv("submisson.csv",index=False)



