A confusing story: two ways to implement XGBoost in Python, plus general notes

Introduction

XGBoost is a GBDT (gradient boosted decision tree) library that can be used from Python. However, when I looked at implementation examples, I was confused to find that there are multiple ways to write the code even though they all use the same library. So, partly as a memo to myself, the purpose of this article is to do the same thing in each style. (Please note that a detailed explanation of XGBoost itself is omitted.)

Implementation environment

pip install xgboost

Implementation details

Implementation ~ Common part ~

Data set loading

import_boston_datasets.py


#First, import the libraries used in this article
import pandas as pd
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

boston = load_boston()

df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)

#Append the objective variable to the end of the dataframe as PRICE so everything can be displayed together
df_boston['PRICE'] = boston.target
print(df_boston.head())
#       CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICE
# 0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98   24.0
# 1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14   21.6
# 2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03   34.7
# 3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94   33.4
# 4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33   36.2

Creation of train data and test data

make_train_test.py


#Split the data into features x and objective variable y
x = df_boston.loc[:,'CRIM':'LSTAT']
y = df_boston['PRICE']

#Split into train data and test data at a 7:3 ratio
trainX, testX, trainY, testY = train_test_split(x, y, test_size=0.3)

#Print the shape of each
print('x.shape = {}'.format(x.shape))
print('y.shape = {}'.format(y.shape))
print('trainX.shape = {}'.format(trainX.shape))
print('trainY.shape = {}'.format(trainY.shape))
print('testX.shape = {}'.format(testX.shape))
print('testY.shape = {}'.format(testY.shape))
# x.shape = (506, 13)
# y.shape = (506,)
# trainX.shape = (354, 13)
# trainY.shape = (354,)
# testX.shape = (152, 13)
# testY.shape = (152,)

Implementation ~ Training ~

Method 1

Method 1 uses the **scikit-learn compatible API**, which you may already be familiar with, so I will start with it. First, the simplest implementation, with no parameters specified at all.

regression1-1.py


#Since this is a regression task, use XGBRegressor
reg = xgb.XGBRegressor()

#Set the validation data in eval_set
reg.fit(trainX, trainY,
        eval_set=[(trainX, trainY),(testX, testY)])
#[0]	validation_0-rmse:21.5867	validation_1-rmse:21.7497
#[1]	validation_0-rmse:19.5683	validation_1-rmse:19.7109
#[2]	validation_0-rmse:17.7456	validation_1-rmse:17.8998
#(omitted)
#[97]	validation_0-rmse:1.45198	validation_1-rmse:2.7243
#[98]	validation_0-rmse:1.44249	validation_1-rmse:2.72238
#[99]	validation_0-rmse:1.43333	validation_1-rmse:2.7233

#Run prediction
predY = reg.predict(testX)

#Print the MSE
print(mean_squared_error(testY, predY))
#7.4163707577050655

Specifying the parameters a little more carefully looks like this.

regression1-2.py


reg = xgb.XGBRegressor(#Objective function (squared error is also the default)
                       objective='reg:squarederror',
                       #Number of boosting rounds; set a large value because early stopping is used
                       n_estimators=50000,
                       #Which booster to use (gbtree is also the default)
                       booster='gbtree',
                       #Learning rate
                       learning_rate=0.01,
                       #Maximum depth of the tree
                       max_depth=6,
                       #Seed value
                       random_state=2525)

#Prepare a variable to record the learning process
evals_result = {}
reg.fit(trainX, trainY,
        eval_set=[(trainX, trainY),(testX, testY)],
        #Evaluation metric used during training
        eval_metric='rmse',
        #Stop training if the metric does not improve for this many rounds
        early_stopping_rounds=15,
        #Record the learning history into the variable above via the callback API
        callbacks=[xgb.callback.record_evaluation(evals_result)])

#[1]	validation_0-rmse:19.5646	validation_1-rmse:19.7128
#[2]	validation_0-rmse:17.7365	validation_1-rmse:17.9048
#[3]	validation_0-rmse:16.0894	validation_1-rmse:16.2733
#(omitted)
#[93]	validation_0-rmse:0.368592	validation_1-rmse:2.47429
#[94]	validation_0-rmse:0.3632	validation_1-rmse:2.47945
#[95]	validation_0-rmse:0.356932	validation_1-rmse:2.48028
#Stopping. Best iteration:
#[80]	validation_0-rmse:0.474086	validation_1-rmse:2.46597

predY = reg.predict(testX)
print(mean_squared_error(testY, predY))
#6.080995445035289

The loss improved a little compared to specifying nothing. There are other parameters that can be set as well; for details, see the **Scikit-Learn API** section of the XGBoost Python API Reference.

Confusingly, however, there are parameters that are not documented in this API reference. For example, if you print the model with nothing specified,

ex.py


reg = xgb.XGBRegressor()
print(reg)
#XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#             colsample_bynode=1, colsample_bytree=1, gamma=0,
#             importance_type='gain', learning_rate=0.1, max_delta_step=0,
#             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
#             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
#             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
#             silent=None, subsample=1, verbosity=1)

you can check the model's default values like this. But near the bottom sits

silent=None

a parameter that, if you check the API reference, does not exist there in the first place. Some sites describe it as a flag for toggling output during training, but specifying it did not change anything in particular for me.
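
Incidentally, the printed defaults above do include a verbosity parameter, which is documented. My guess (an assumption on my part) is that silent is a deprecated leftover that verbosity superseded. To actually control output, the following two knobs do exist; this is a minimal sketch, not a definitive recipe.

ex_verbosity.py


#A sketch: verbosity sets the library's log level
#(0 = silent, 1 = warning, 2 = info, 3 = debug)
reg = xgb.XGBRegressor(verbosity=0)

#verbose=False in fit() suppresses the per-round eval output
reg.fit(trainX, trainY,
        eval_set=[(trainX, trainY),(testX, testY)],
        verbose=False)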

Method 2

Next, the second method. This one uses the xgboost library's **original API**, so the handling of datasets differs slightly.

regression2-1.py


#Convert with xgb.DMatrix so the data can be used with the original API
#feature_names does not have to be specified, but it is worth adding because it is convenient later
dtrain = xgb.DMatrix(trainX, label=trainY, feature_names = x.columns)
dtest = xgb.DMatrix(testX, label=testY, feature_names = x.columns)

reg = xgb.train(#Deliberately pass an empty list so training runs with the default parameters
                params=[],
                dtrain=dtrain,
                #Set the validation data in evals
                evals=[(dtrain, 'train'), (dtest, 'eval')])
#[0]	train-rmse:17.1273	eval-rmse:17.3433
#[1]	train-rmse:12.3964	eval-rmse:12.7432
#[2]	train-rmse:9.07831	eval-rmse:9.44546
#[3]	train-rmse:6.6861	eval-rmse:7.16429
#[4]	train-rmse:5.03358	eval-rmse:5.70227
#[5]	train-rmse:3.88521	eval-rmse:4.7088
#[6]	train-rmse:3.03311	eval-rmse:4.09655
#[7]	train-rmse:2.44077	eval-rmse:3.6657
#[8]	train-rmse:2.0368	eval-rmse:3.40768
#[9]	train-rmse:1.72258	eval-rmse:3.29363

predY = reg.predict(dtest)
print(mean_squared_error(testY, predY))
#10.847961069710934

You can see that the default number of boosting rounds is quite small (10 here), and the result is worse than Method 1. Let's set the same parameters as before and run it again.

regression2-2.py


#Convert with xgb.DMatrix so the data can be used with the original API
dtrain = xgb.DMatrix(trainX, label=trainY, feature_names = x.columns)
dtest = xgb.DMatrix(testX, label=testY, feature_names = x.columns)

#First, set the parameters as xgb_params
xgb_params = {#Objective function
              'objective': 'reg:squarederror',
              #Evaluation metric used during training
              'eval_metric': 'rmse',
              #Which booster to use
              'booster': 'gbtree',
              #Synonymous with learning_rate
              'eta': 0.1,
              #Maximum depth of the tree
              'max_depth': 6,
              #Synonymous with random_state
              'seed': 2525}

#Prepare a variable to record the learning process
evals_result = {}
reg = xgb.train(#Use the parameters set above
                params=xgb_params,
                dtrain=dtrain,
                #Number of boosting rounds
                num_boost_round=50000,
                #Number of rounds for early stopping
                early_stopping_rounds=15,
                #Validation data
                evals=[(dtrain, 'train'), (dtest, 'eval')],
                #Record the learning history into the variable prepared above
                evals_result=evals_result)
#[1]	train-rmse:19.5646	eval-rmse:19.7128
#[2]	train-rmse:17.7365	eval-rmse:17.9048
#[3]	train-rmse:16.0894	eval-rmse:16.2733
#(omitted)
#[93]	train-rmse:0.368592	eval-rmse:2.47429
#[94]	train-rmse:0.3632	eval-rmse:2.47945
#[95]	train-rmse:0.356932	eval-rmse:2.48028
#Stopping. Best iteration:
#[80]	train-rmse:0.474086	eval-rmse:2.46597

predY = reg.predict(dtest)
print(mean_squared_error(testY, predY))
#6.151798278561384

The losses printed during training are exactly the same as in Method 1. As you can see by comparing the code, however, the parameter names and the places where they are written differ slightly, so be careful. For example, if you write the learning rate as learning_rate as in Method 1 instead of eta, the code runs but the value is not reflected. For these parameters, check XGBoost Parameters and the **Learning API** section of the XGBoost Python API Reference. One thing puzzled me, though: running predict afterwards and printing the MSE gives a value different from Method 1's, even though the training logs matched. I couldn't pin down the cause at the time.
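
One plausible cause (an assumption on my part, based on how xgboost of this era behaved): when early stopping fires, the scikit-learn wrapper's predict() automatically uses only the trees up to the best iteration, whereas the original API's predict() uses all trained trees unless told otherwise. If so, something like the following should bring Method 2 in line with Method 1 (note that recent xgboost versions replace ntree_limit with iteration_range):

ex_best_iteration.py


#A sketch under the assumption above: restrict prediction to the
#best iteration found by early stopping
predY = reg.predict(dtest, ntree_limit=reg.best_ntree_limit)
print(mean_squared_error(testY, predY))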

Implementation ~ Graph display of learning process ~

The values are stored in the **evals_result** variable prepared during training, so let's use it to graph the transition.

Method 1

plot_validation1.py


#plot the loss transition for train data
plt.plot(evals_result['validation_0']['rmse'], label='train rmse')
#plot loss transition for test data
plt.plot(evals_result['validation_1']['rmse'], label='eval rmse')
plt.grid()
plt.legend()
plt.xlabel('rounds')
plt.ylabel('rmse')
plt.show()

(Figure: RMSE transition per round for train and eval data)

Method 2

Note that the keys under which the results are stored in evals_result differ from Method 1: here they are the labels passed in evals, namely 'train' and 'eval'.

plot_validation2.py


#plot the loss transition for train data
plt.plot(evals_result['train']['rmse'], label='train rmse')
#plot loss transition for test data
plt.plot(evals_result['eval']['rmse'], label='eval rmse')
plt.grid()
plt.legend()
plt.xlabel('rounds')
plt.ylabel('rmse')
plt.savefig("img.png ", bbox_inches='tight')
plt.show()

(Figure: RMSE transition per round for train and eval data, Method 2)

Implementation ~ Display of Feature Importance ~

xgboost provides a function, **xgb.plot_importance()**, that plots feature importance, and it can be used with both Method 1 and Method 2.

plot_importance.py


xgb.plot_importance(reg)

(Figure: feature importance plot, importance_type='weight' (default))

You can also pass importance_type as an argument. Referring to the API description and the article on importance_type listed in the references, three types are available: weight, gain, and cover. (Reading the API, it seems the totals rather than the averages can also be used for gain and cover.) The default appears to be weight; specifying gain, for example, gives the following.

plot_importance.py


xgb.plot_importance(reg,importance_type='gain')

(Figure: feature importance plot, importance_type='gain')

There is also a method, **get_score()**, that returns the values behind the graph as a dictionary. However, it is only available on the Booster object of **Method 2**, and at first I could not find a way to do the same thing in Method 1; the inconsistencies around here bother me. (A workaround for Method 1 is sketched after the output below.)

print_importance.py


print(reg.get_score(importance_type='weight'))
#{'LSTAT': 251,
# 'RM': 363,
# 'CRIM': 555,
# 'DIS': 295,
# 'B': 204,
# 'INDUS': 81,
# 'NOX': 153,
# 'AGE': 290,
# 'PTRATIO': 91,
# 'RAD': 41,
# 'ZN': 36,
# 'TAX': 91,
# 'CHAS': 13}

print(reg.get_score(importance_type='gain'))
#{'LSTAT': 345.9503342748026,
# 'RM': 67.2338906183525,
# 'CRIM': 9.066383988597524,
# 'DIS': 20.52948739887609,
# 'B': 5.704856272869067,
# 'INDUS': 6.271976581219753,
# 'NOX': 17.48982672038596,
# 'AGE': 3.396609941187381,
# 'PTRATIO': 15.018738197646142,
# 'RAD': 5.182013825021951,
# 'ZN': 2.7426182845938896,
# 'TAX': 12.025571026275834,
# 'CHAS': 1.172155851074923}
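
For Method 1, the following should work (an assumption on my part: the scikit-learn wrapper exposes its underlying Booster via get_booster(), and get_score() can then be called on it exactly as in Method 2):

print_importance_sklearn.py


#A sketch for Method 1: take the underlying Booster out of the
#scikit-learn wrapper (here reg is the XGBRegressor from Method 1)
#and call get_score() on it; total_gain and total_cover are also
#accepted as importance_type
booster = reg.get_booster()
print(booster.get_score(importance_type='gain'))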

Recall that feature_names was specified when creating the DMatrix. If you omit it, feature names are not shown on the graphs and so on (you get generic names such as f0, f1, ...), which is very hard to read, so it is worth setting.

Other features of each method

Method 1

The most convenient point of Method 1, I think, is that you can run a parameter search with sklearn's GridSearchCV. Every article I found that did a parameter search with XGBoost used Method 1. For reference, here is code using RandomizedSearchCV.

randomized_search.py


from sklearn.model_selection import RandomizedSearchCV

params = {
          'n_estimators':[50000],
          'objective':['reg:squarederror'],
          'eval_metric': ['rmse'],
          'booster': ['gbtree'],
          'learning_rate':[0.1,0.05,0.01],
          'max_depth':[5,7,10,15],
          'random_state':[2525]
         }

mod = xgb.XGBRegressor()
#Run a random search with n_iter trials
rds = RandomizedSearchCV(mod, params, random_state=2525, scoring='r2', n_jobs=1, n_iter=50)
rds.fit(trainX,
        trainY,
        eval_metric='rmse',
        early_stopping_rounds=15,
        eval_set=[(testX, testY)])
print(rds.best_params_)
#{'random_state': 2525,
# 'objective': 'reg:squarederror',
# 'n_estimators': 50000,
# 'max_depth': 5,
# 'learning_rate': 0.1,
# 'eval_metric': 'rmse',
# 'booster': 'gbtree'}

Method 2

Honestly, the original API feels unfamiliar, and I don't think you need to go out of your way to use it. That said, as we saw above, some methods are only available through it, and there may be other such cases. My conclusion: implement with Method 1 as the default, and fall back to Method 2 for the things Method 1 cannot do.

In conclusion

In this article I have tried to summarize, as plainly as I could, the points that confused me while studying XGBoost because of the mixture of coding styles. I hope it reaches someone in a similar situation and helps them out. The xgboost library has many other useful functions, such as one that draws the generated trees as diagrams, so I recommend skimming through the API and trying things out. This is the first Qiita article of my life and I am sure it falls short in places, but thank you for reading this far.
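
As an example of the tree-drawing function just mentioned, here is a minimal sketch (my assumption: the graphviz Python package and the Graphviz binaries it wraps are installed):

plot_tree.py


#A sketch: draw the first tree (num_trees=0) of the trained model
xgb.plot_tree(reg, num_trees=0)
plt.show()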

Reference article

Python: Try using XGBoost
How to use xgboost: Multi-class classification by iris data
Using XGBoost with Python
Xgboost: How to calculate importance_type of feature_importance
xgboost: Effective machine learning model for table data
