Purpose
Gradient boosting libraries such as xgboost and LightGBM (LGBM) are often used in competitions like Kaggle. However, I felt there were few articles and sites I could refer to, and I had a lot of trouble when implementing it myself, so this time I would like to write up how I tried LGBM and what each parameter means.
Reference book
This time, the implementation basically follows this book: "Winning with Kaggle: Data Analysis Techniques" https://www.amazon.co.jp/Kaggle%E3%81%A7%E5%8B%9D%E3%81%A4%E3%83%87%E3%83%BC%E3%82%BF%E5%88%86%E6%9E%90%E3%81%AE%E6%8A%80%E8%A1%93-%E9%96%80%E8%84%87-%E5%A4%A7%E8%BC%94/dp/4297108437
In the previous article I implemented xgboost; this article is the LGBM version of it: [Kaggle] Try using xgboost
The main features of gradient boosting are:
・No need to impute missing values
・Redundant features are not a problem (explanatory variables with high correlation can be used as they are)
・The difference from Random Forest is that the trees are built in series (sequentially), not independently
Because of these features, gradient boosting seems to be used often.
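Since "built in series" may be hard to picture, here is a minimal sketch of the boosting idea using plain decision trees. This is a simplified illustration on made-up data (squared error, fixed learning rate), not LightGBM's actual implementation.

#Rough sketch of boosting: each tree is fit to the residuals left by the
#trees before it, which is what "building trees in series" means.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, 200)

learning_rate = 0.1
ensemble_pred = np.zeros_like(y_demo)
trees = []
for _ in range(50):
    residual = y_demo - ensemble_pred                      #what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X_demo, residual)
    ensemble_pred += learning_rate * tree.predict(X_demo)  #add the new tree's correction
    trees.append(tree)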
This time, I will again implement it using Kaggle's House Prices competition.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
import numpy as np
import pandas as pd
#For data division
from sklearn.model_selection import train_test_split
#LightGBM
import lightgbm as lgb
#Data read
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
#Data join
df_train["TrainFlag"] = True
df_test["TrainFlag"] = False
df_all = pd.concat([df_train, df_test])  #DataFrame.append was removed in newer pandas, so pd.concat is used here
df_all.index = df_all["Id"]
df_all.drop("Id", axis = 1, inplace = True)
df_all = pd.get_dummies(df_all, drop_first=True)
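As a quick illustration of what this line does, here is a toy example (the values are made up) of pd.get_dummies with drop_first=True:

#Toy illustration: get_dummies one-hot encodes object (categorical) columns,
#and drop_first=True drops one level per column to avoid a redundant dummy.
demo = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
demo_encoded = pd.get_dummies(demo, drop_first=True)
#demo_encoded now has a single column "Street_Pave"; "Grvl" is the dropped baseline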
#Divide df_all into training data and test data again
df_train = df_all[df_all["TrainFlag"] == True]
df_train = df_train.drop(["TrainFlag"], axis = 1)
df_test = df_all[df_all["TrainFlag"] == False]
df_test = df_test.drop(["TrainFlag"], axis = 1)
df_test = df_test.drop(["SalePrice"], axis = 1)
#Data split
y = df_train["SalePrice"].values
X = df_train.drop("SalePrice", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test)
・To use LGBM, the training data needs to be converted with lgb.Dataset.
・In xgboost, df_test (the original test data) also had to be converted (to xgb.DMatrix), but in LGBM it does not. Note that the way data is handled here differs slightly between xgboost and LGBM.
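For comparison, the rough difference in how the two libraries take data looks like this (a schematic only; xgboost is not imported in this article's code):

#Schematic comparison, not part of this article's flow:
#  xgboost: both training data and prediction data are wrapped
#      dtrain = xgb.DMatrix(X_train, label=y_train)
#      dtest  = xgb.DMatrix(df_test)
#      bst    = xgb.train(params, dtrain)
#      pred   = bst.predict(dtest)
#  LightGBM: only the training/validation data goes through lgb.Dataset
#      lgb_train = lgb.Dataset(X_train, y_train)
#      model     = lgb.train(params, lgb_train)
#      pred      = model.predict(df_test)   #a raw DataFrame/ndarray is fine here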
params = {
    #Regression problem
    'objective': 'regression',
    'random_state': 1234, 'verbose': 0,
    #Evaluation metric for training (RMSE)
    'metrics': 'rmse',
}
num_round = 100
See below for details: https://lightgbm.readthedocs.io/en/latest/Parameters.html
・verbose (verbosity): how much information is displayed during training. The default is 1.
・metrics (metric): which metric is used to measure the error.
・num_round: the maximum number of boosting rounds (passed as num_boost_round).
model = lgb.train(params, lgb_train, num_boost_round = num_round)
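Note that the call above does not actually use lgb_eval. If you want to monitor the validation RMSE and stop early, a variant like the following can be used; this is a sketch, not the article's original code, and the callback API assumes a reasonably recent LightGBM version.

#Alternative sketch: pass lgb_eval as a validation set and stop training
#when RMSE has not improved for 10 rounds, logging the score every 10 rounds.
model_with_eval = lgb.train(
    params,
    lgb_train,
    num_boost_round=num_round,
    valid_sets=[lgb_eval],
    callbacks=[lgb.early_stopping(stopping_rounds=10),
               lgb.log_evaluation(period=10)],
)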
#Forecast
prediction_LG = model.predict(df_test)
#Rounding decimals
prediction_LG = np.round(prediction_LG)
submission = pd.DataFrame({"Id": df_test.index, "SalePrice": prediction_LG})  #the competition expects the column name "Id"
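Finally, to actually create the submission file, write the DataFrame to CSV (the filename below is arbitrary):

#Write the submission file; index=False keeps only the Id and SalePrice columns
submission.to_csv("submission_lgbm.csv", index=False)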
That is all!
What did you think? Although LGBM is well known, it seems to take beginners some time to implement it.
I have introduced simple code so that you can understand how to implement it as easily as possible. You can also get it working just by copying the code, but I feel it is very important to understand, even roughly, what each piece of code means.
I hope it will help you to deepen your understanding.