[PYTHON] Gradient boosting tree modeling with xgboost

What is a gradient boosting tree?

An algorithm often used in data analysis competitions, commonly abbreviated as GBDT.

- G ... Gradient (gradient descent)
- B ... Boosting (one of the ensemble methods)
- D ... Decision
- T ... Tree

In other words, it is a method that combines "Gradient", "Boosting (ensemble)", and "Decision Tree".

Gradient descent

An algorithm that updates the weights little by little, following the gradient of the error, to find the point where the error is smallest. Think of it as "smaller error = more accurate prediction".
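As a rough illustration (not part of xgboost itself), here is a minimal sketch of gradient descent minimizing a simple squared error. The toy function and learning rate are made-up values for illustration only.

import numpy as np

# Toy example: minimize the squared error (w - 3)**2 by gradient descent.
# The target value 3 and the learning rate 0.1 are arbitrary choices.
w = 0.0
learning_rate = 0.1
for step in range(50):
    gradient = 2 * (w - 3)         # derivative of (w - 3)**2 with respect to w
    w -= learning_rate * gradient  # move the weight a little in the downhill direction

print(w)  # approaches 3, where the error is smallest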

Boosting

One of the ensemble methods, which builds a model by combining multiple models. Models of the same type are connected in series, and each model is trained to correct the predictions made so far. By combining many weak learners (models whose prediction accuracy is not very high), a strong learner (high accuracy) can be created.

Decision tree

A method of analyzing data using a tree of conditions. For example, when predicting "whether someone will buy ice cream":

- Temperature 30°C or above => will buy
- Temperature below 30°C => will not buy

Conditions like these are prepared and used to make a prediction.
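The ice-cream rule above can be expressed as a one-level decision tree. Below is a minimal sketch using scikit-learn; the temperatures and labels are made-up data for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: temperature (°C) and whether ice cream was bought (1 = buy, 0 = not buy)
X = np.array([[25], [28], [31], [33], [36]])
y = np.array([0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(max_depth=1)  # a single split, roughly "temperature >= 30°C"
tree.fit(X, y)
print(tree.predict([[32]]))  # -> [1] (would buy)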

Gradient boosting tree features

- Features must be numerical values: each split in a decision tree checks whether a feature is larger or smaller than a threshold, so the features need to be numerical.
- Missing values can be handled: predictions are made through the splits of the decision trees, so missing values can be used without imputation (see the short sketch below).
- Interactions between variables are reflected: because splits are applied repeatedly, interactions between variables are captured.
- No feature scaling is needed: only the magnitude relationship of the feature values matters, so scaling such as standardization is not required.
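As a quick check of the "missing values are fine, scaling is unnecessary" points, the sketch below passes data containing NaN and unscaled columns straight to xgboost. The column names and values are made up for illustration.

import numpy as np
import pandas as pd
import xgboost as xgb

# Made-up data with a missing value and very different feature scales
df = pd.DataFrame({
    'temperature': [25.0, np.nan, 31.0, 36.0],  # NaN is left as-is
    'price_yen':   [300, 500, 300, 200],        # no standardization needed
    'bought':      [0, 0, 1, 1],
})
dtrain = xgb.DMatrix(df[['temperature', 'price_yen']], label=df['bought'])
bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=5)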

Gradient boosting flow

  1. Calculate the average of the objective variables
  2. Calculate the error
  3. Build a decision tree
  4. Use the ensemble to find new predictions
  5. Calculate the error again
  6. Repeat 3 ~ 5
  7. Make a final forecast

Accuracy improves because each new decision tree corrects the difference (residual) between the current predictions and the objective variable, as sketched below.
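The flow above can be sketched by hand with shallow regression trees fitted to the residuals. This is only a simplified illustration of the idea (the real xgboost adds regularization and other refinements); the data, tree depth, and learning rate are made up.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up regression data
X = np.random.rand(100, 1) * 10
y = np.sin(X).ravel()

# 1. Start from the average of the objective variable
pred = np.full_like(y, y.mean())
learning_rate = 0.1
trees = []
for _ in range(50):
    residual = y - pred                                          # 2./5. calculate the error
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # 3. build a tree on the error
    pred += learning_rate * tree.predict(X)                      # 4. ensemble: correct the prediction
    trees.append(tree)                                           # 6. repeat

# 7. Final forecast = average + sum of the small corrections
print(np.mean((y - pred) ** 2))  # the error shrinks as rounds are repeated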

Implementation procedure

This time, binary classification will be performed.

Loading the library

import xgboost as xgb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

%matplotlib inline

Data reading

df = pd.read_csv('hoge.csv')
df.head()    # Check that the data was read correctly

Select a feature

# Use 'bar' as the objective variable y, and drop 'foo' and 'bar' from the features X

X = df.drop(['foo', 'bar'], axis=1)
y = df['bar']
X.head()    # Check that the columns were dropped

Separation of training data and evaluation data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, shuffle=True)

- test_size: ratio to split off as evaluation data (30% for 0.3)
- random_state: seed used when generating random numbers
- shuffle: whether to shuffle the data randomly before splitting

Convert to DMatrix format

xgboost requires the dataset to be in DMatrix format.

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

Set learning parameters

xgb_params = {
    # Binary classification
    'objective': 'binary:logistic',
    # Evaluation metric: logloss
    'eval_metric': 'logloss',
}

Learn model

bst = xgb.train(xgb_params,
                dtrain,
                # Number of boosting rounds
                num_boost_round=100,
                # Data to monitor during training; required when using early stopping
                evals=[(dtrain, 'train'), (dtest, 'eval')],
                # Stop training if the metric does not improve for this many rounds
                early_stopping_rounds=10,
               )

Making predictions

    # predict() returns probabilities for binary:logistic, so convert them to 0/1 labels
    y_pred_proba = bst.predict(dtest)
    y_pred = np.where(y_pred_proba > 0.5, 1, 0)

Accuracy verification

    acc = accuracy_score(y_test, y_pred)
    print('Accuracy:', acc)

This will output the accuracy, so adjust the parameters as needed.
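For reference, a few commonly tuned xgboost parameters can be added to xgb_params. The values below are just example starting points, not recommendations; tune them with cross-validation on your own data.

xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 6,            # depth of each tree
    'eta': 0.1,                # learning rate
    'subsample': 0.8,          # row sampling ratio per tree
    'colsample_bytree': 0.8,   # column sampling ratio per tree
}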


With the flow above, you should be able to run an analysis with a gradient boosting tree. Analysis of real data is omitted here. In addition, xgboost has convenient functions such as visualizing feature importance, so please check them out.
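For example, feature importance can be visualized with xgb.plot_importance. A minimal example, following on from the model trained above:

# Plot how often each feature was used for splitting in the trained model
xgb.plot_importance(bst)
plt.show()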

References

Python: Try using XGBoost https://blog.amedama.jp/entry/2019/01/29/235642

Intuitively understand the mechanism and procedure of GBDT with figures and concrete examples https://www.acceluniverse.com/blog/developers/2019/12/gbdt.html

Kaggle Master Explains Gradient Boosting https://qiita.com/woody_egg/items/232e982094cd3c80b3ee

Books: Daisuke Kadowaki, Takashi Sakata, Keisuke Hosaka, Yuji Hiramatsu (2019) "Technology for Data Analysis to Win with Kaggle" Gijutsu-Hyoronsha
