Gradient boosting libraries such as XGBoost and LightGBM are often used in competitions like Kaggle. However, I felt there were few articles and sites I could use as a reference, and I had a lot of trouble when implementing it myself, so this time I would like to describe what I tried with XGBoost and what each of its parameters means.
Features of gradient boosting:
- No need to impute missing values (see the small example below).
- Redundant features are not a problem (explanatory variables with high correlation can be used as they are).
- Unlike Random Forest, the trees are built in series (each tree corrects the errors of the previous ones).

Because of these features, gradient boosting seems to be used often.
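For example, the point about missing values means xgboost can take NaN input directly; a small illustrative sketch (not part of the original walkthrough, and separate from the House Prices code below):

import numpy as np
import xgboost as xgb
# xgboost treats np.nan as a missing value by default, so no imputation step is needed
X_demo = np.array([[1.0, np.nan],
                   [2.0, 3.0]])
y_demo = np.array([10.0, 20.0])
d_demo = xgb.DMatrix(X_demo, label=y_demo, missing=np.nan)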
This time, I will implement it using Kaggle's House Prices competition.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
import numpy as np
import pandas as pd
#For data division
from sklearn.model_selection import train_test_split
#XGBoost
import xgboost as xgb
#Data read
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
#Data join
df_train["TrainFlag"] = True
df_test["TrainFlag"] = False
df_all = pd.concat([df_train, df_test])  # DataFrame.append is deprecated in recent pandas, so concat is used here
df_all.index = df_all["Id"]
df_all.drop("Id", axis = 1, inplace = True)
df_all = pd.get_dummies(df_all, drop_first=True)
#Split df_all back into training data and test data
df_train = df_all[df_all["TrainFlag"] == True]
df_train = df_train.drop(["TrainFlag"], axis = 1)
df_test = df_all[df_all["TrainFlag"] == False]
df_test = df_test.drop(["TrainFlag"], axis = 1)
df_test = df_test.drop(["SalePrice"], axis = 1)
#Data split
y = df_train["SalePrice"].values
X = df_train.drop("SalePrice", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_test, label=y_test)
dtest = xgb.DMatrix(df_test.values)
To use xgboost's native API, the data must be converted to xgboost's own format with "xgb.DMatrix".
Note that the third line uses df_test.values. This is because df_test is in DataFrame format, so it is converted to a NumPy array with .values to match X_train and X_test; if you do not, an error can occur later when the model is used.
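If you want to double-check the conversion, the DMatrix objects expose their shape (a quick optional check, not in the original code):

print(dtrain.num_row(), dtrain.num_col())  # rows and columns of the training DMatrix
print(dvalid.num_row(), dvalid.num_col())  # rows and columns of the validation DMatrix
print(dtest.num_row(), dtest.num_col())    # rows and columns of the test DMatrix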
params = {
    'objective': 'reg:squarederror',
    'silent': 1,
    'random_state': 1234,
    # Evaluation metric for learning (RMSE)
    'eval_metric': 'rmse',
}
num_round = 500
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]  # Use dtrain as training data and dvalid as evaluation data
- objective: specifies the loss function to minimize. The default is reg:linear (renamed reg:squarederror in recent versions).
- silent: specifies how the execution log is recorded. The default is 0 (output the log).
- eval_metric: the evaluation metric for the data. Options include rmse and logloss.
- num_round: the maximum number of boosting rounds (learning iterations).
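Besides these, XGBoost has many other tuning parameters. The values below are illustrative assumptions (not tuned for this data), just to show where commonly adjusted parameters would go:

# Illustrative only: commonly tuned parameters with assumed (untuned) values
params_tuned = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'eta': 0.05,              # learning rate (default 0.3); smaller values usually need more rounds
    'max_depth': 6,           # maximum depth of each tree
    'subsample': 0.8,         # fraction of rows sampled for each tree
    'colsample_bytree': 0.8,  # fraction of columns sampled for each tree
}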
model = xgb.train(params,
                  dtrain,  # Training data
                  num_round,  # Maximum number of boosting rounds
                  early_stopping_rounds=20,
                  evals=watchlist,
                  )
- early_stopping_rounds: training stops if the evaluation metric does not improve for 20 consecutive rounds. The maximum number of rounds is set by num_round, but if the accuracy has not improved within the number of rounds set by early_stopping_rounds, training stops before reaching it.
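When early stopping actually triggers, the trained model records which round was best; you can inspect it like this (attribute availability may depend on the xgboost version):

print(model.best_iteration)    # round that gave the best validation score
print(model.best_score)        # best validation RMSE
print(model.best_ntree_limit)  # number of trees used for prediction below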
#Forecast
prediction_XG = model.predict(dtest, ntree_limit = model.best_ntree_limit)
#Rounding decimals
prediction_XG = np.round(prediction_XG)
- About ntree_limit: by setting ntree_limit = model.best_ntree_limit, prediction uses the number of trees that gave the best validation accuracy.
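Note that ntree_limit has been deprecated in more recent xgboost releases; there, the equivalent (as far as I know, please check against your version) is iteration_range:

# In newer xgboost versions, iteration_range replaces the deprecated ntree_limit
prediction_XG = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))
prediction_XG = np.round(prediction_XG)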
submission = pd.DataFrame({"Id": df_test.index, "SalePrice": prediction_XG})
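To actually submit, the DataFrame still needs to be written to a CSV file; a minimal example (the file name is just an assumption):

# index=False keeps only the Id and SalePrice columns in the output file
submission.to_csv("submission.csv", index=False)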
That is all!
What did you think? You can implement this just by copying the code, but I feel it is very important to understand, at least roughly, what each piece of code means.
I hope this helps you deepen your understanding.