Gradient boosting libraries such as XGBoost and LightGBM are often used in competitions like Kaggle. However, I felt there were few articles and sites I could use as a reference, and I had a lot of trouble when implementing it myself, so this time I would like to describe what I tried with XGBoost and what each of its parameters means.
Features of gradient boosting:
- No need to impute missing values (see the small example below).
- Redundant features are not a problem (explanatory variables with high correlation can be used as they are).
- Unlike Random Forest, the trees are built in series (each tree corrects the errors of the previous ones).

Because of these features, gradient boosting seems to be used often.
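For example, the point about missing values means xgboost can take NaN input directly; a small illustrative sketch (not part of the original walkthrough, and separate from the House Prices code below):

import numpy as np
import xgboost as xgb
# xgboost treats np.nan as a missing value by default, so no imputation step is needed
X_demo = np.array([[1.0, np.nan],
                   [2.0, 3.0]])
y_demo = np.array([10.0, 20.0])
d_demo = xgb.DMatrix(X_demo, label=y_demo, missing=np.nan)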
This time, I will implement it using Kaggle's House Prices competition.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
import numpy as np
import pandas as pd
#For data division
from sklearn.model_selection import train_test_split
#XGBoost
import xgboost as xgb
#Data read
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
#Data join
df_train["TrainFlag"] = True
df_test["TrainFlag"] = False
df_all = pd.concat([df_train, df_test])  # DataFrame.append is deprecated in recent pandas, so concat is used here
df_all.index = df_all["Id"]
df_all.drop("Id", axis = 1, inplace = True)
df_all = pd.get_dummies(df_all, drop_first=True)
#Split df_all back into training data and test data
df_train = df_all[df_all["TrainFlag"] == True]
df_train = df_train.drop(["TrainFlag"], axis = 1)
df_test = df_all[df_all["TrainFlag"] == False]
df_test = df_test.drop(["TrainFlag"], axis = 1)
df_test = df_test.drop(["SalePrice"], axis = 1)
#Data split
y = df_train["SalePrice"].values
X = df_train.drop("SalePrice", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_test, label=y_test)
dtest = xgb.DMatrix(df_test.values)
To use xgboost's native API, the data must be converted to xgboost's own format with "xgb.DMatrix".
Note that the third line uses df_test.values. This is because df_test is in DataFrame format, so it is converted to a NumPy array with .values to match X_train and X_test; if you do not, an error can occur later when the model is used.
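If you want to double-check the conversion, the DMatrix objects expose their shape (a quick optional check, not in the original code):

print(dtrain.num_row(), dtrain.num_col())  # rows and columns of the training DMatrix
print(dvalid.num_row(), dvalid.num_col())  # rows and columns of the validation DMatrix
print(dtest.num_row(), dtest.num_col())    # rows and columns of the test DMatrix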
params = {
    'objective': 'reg:squarederror',
    'silent': 1,
    'random_state': 1234,
    # Evaluation metric for learning (RMSE)
    'eval_metric': 'rmse',
}
num_round = 500
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]  # Use dtrain as training data and dvalid as evaluation data
- objective: specifies the loss function to minimize. The default is reg:linear (renamed reg:squarederror in recent versions).
- silent: specifies how the execution log is recorded. The default is 0 (output the log).
- eval_metric: the evaluation metric for the data. Options include rmse and logloss.
- num_round: the maximum number of boosting rounds (learning iterations).
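Besides these, XGBoost has many other tuning parameters. The values below are illustrative assumptions (not tuned for this data), just to show where commonly adjusted parameters would go:

# Illustrative only: commonly tuned parameters with assumed (untuned) values
params_tuned = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'eta': 0.05,              # learning rate (default 0.3); smaller values usually need more rounds
    'max_depth': 6,           # maximum depth of each tree
    'subsample': 0.8,         # fraction of rows sampled for each tree
    'colsample_bytree': 0.8,  # fraction of columns sampled for each tree
}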
model = xgb.train(params,
                  dtrain,  # Training data
                  num_round,  # Maximum number of boosting rounds
                  early_stopping_rounds=20,
                  evals=watchlist,
                  )
- early_stopping_rounds: training stops if the evaluation metric does not improve for 20 consecutive rounds. The maximum number of rounds is set by num_round, but if the accuracy has not improved within the number of rounds set by early_stopping_rounds, training stops before reaching it.
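When early stopping actually triggers, the trained model records which round was best; you can inspect it like this (attribute availability may depend on the xgboost version):

print(model.best_iteration)    # round that gave the best validation score
print(model.best_score)        # best validation RMSE
print(model.best_ntree_limit)  # number of trees used for prediction below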
#Forecast
prediction_XG = model.predict(dtest, ntree_limit = model.best_ntree_limit)
#Rounding decimals
prediction_XG = np.round(prediction_XG)
- About ntree_limit: by setting ntree_limit = model.best_ntree_limit, prediction uses the number of trees that gave the best validation accuracy.
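Note that ntree_limit has been deprecated in more recent xgboost releases; there, the equivalent (as far as I know, please check against your version) is iteration_range:

# In newer xgboost versions, iteration_range replaces the deprecated ntree_limit
prediction_XG = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))
prediction_XG = np.round(prediction_XG)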
submission = pd.DataFrame({"Id": df_test.index, "SalePrice": prediction_XG})
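To actually submit, the DataFrame still needs to be written to a CSV file; a minimal example (the file name is just an assumption):

# index=False keeps only the Id and SalePrice columns in the output file
submission.to_csv("submission.csv", index=False)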
That is all!
What did you think? You can implement this just by copying the code, but I feel it is very important to understand, at least roughly, what each piece of code means.
I hope this helps you deepen your understanding.