[PYTHON] [Kaggle] Try using LGBM

1. Purpose

Gradient boosting libraries such as XGBoost and LightGBM (LGBM) are often used in competitions like Kaggle. However, I found few articles and sites I could use as a reference, and I had a lot of trouble implementing it myself. So this time I will describe what I tried with LGBM and what each parameter means.

2. Benefits of gradient boosting

- There is no need to impute missing values.
- Redundant features are not a problem (even explanatory variables with high correlation can be used as they are).
- The difference from Random Forest is that the trees are built in series, with each new tree correcting the errors of the ones before it.

Because of these features, gradient boosting seems to be used very often. The first point can be verified directly, as the sketch below shows.
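As a quick check of the missing-value claim, here is a minimal sketch (toy data invented for illustration, not from the House Prices dataset) showing LightGBM training on a matrix containing NaN without any imputation:

import numpy as np
import lightgbm as lgb

#Toy data with deliberate NaNs; no imputation is performed
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.2] = np.nan  #knock out ~20% of the values
y = np.nan_to_num(X[:, 0]) * 2.0 + rng.normal(scale=0.1, size=200)

train_data = lgb.Dataset(X, label=y)
booster = lgb.train({"objective": "regression", "verbose": -1},
                    train_data, num_boost_round=20)
print(booster.predict(X[:3]))  #trains and predicts without errors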

3. Try using LGBM

I will try implementing it, again using Kaggle's House Prices competition.

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

(1) Preprocessing

(i) Import

import numpy as np
import pandas as pd

#For data splitting
from sklearn.model_selection import train_test_split

#LightGBM
import lightgbm as lgb

(ii) Reading and combining the data

#Read the data
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

#Combine the data, flagging which rows came from train
df_train["TrainFlag"] = True
df_test["TrainFlag"] = False

df_all = pd.concat([df_train, df_test])  #DataFrame.append is removed in recent pandas
df_all.index = df_all["Id"]
df_all.drop("Id", axis = 1, inplace = True)

(iii) Dummy variables

#One-hot encode the categorical columns (drop_first avoids redundant dummy columns)
df_all = pd.get_dummies(df_all, drop_first=True)

(iv) Splitting the data

#Split df_all back into training data and test data
df_train = df_all[df_all["TrainFlag"] == True]
df_train = df_train.drop(["TrainFlag"], axis = 1)

df_test = df_all[df_all["TrainFlag"] == False]
df_test = df_test.drop(["TrainFlag"], axis = 1)
df_test = df_test.drop(["SalePrice"], axis = 1)

#Separate the target from the features and create a holdout set
y = df_train["SalePrice"].values
X = df_train.drop("SalePrice", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

(2) Try using LGBM

(i) Creating the LGBM datasets

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test)

※ Important points

- To use LGBM, the data must first be wrapped with lgb.Dataset.
- In xgboost, df_test (the original test data) also had to be wrapped (with xgb.DMatrix) before prediction, but LGBM does not require this. Note that the way data is handled differs slightly between xgboost and LGBM, as the sketch below shows.
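To make the difference concrete, here is a minimal side-by-side sketch (assuming both libraries are installed; variable names follow this article):

#xgboost: even the data you predict on must be wrapped in a DMatrix
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(df_test)  #wrapping required before predict()
bst = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=100)
pred_xgb = bst.predict(dtest)

#LightGBM: lgb.Dataset is only for training/evaluation data;
#predict() takes the raw DataFrame or array directly
lgb_train = lgb.Dataset(X_train, y_train)
model = lgb.train({"objective": "regression"}, lgb_train, num_boost_round=100)
pred_lgb = model.predict(df_test)  #no wrapping needed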

(ii) Parameter settings

params = {
        #Regression problem
        'objective': 'regression',
        'random_state':1234, 'verbose':0,
        #Evaluation metric for training (RMSE)
        'metrics': 'rmse',
    }
num_round = 100

* Brief explanation of the parameters

See the official documentation for details: https://lightgbm.readthedocs.io/en/latest/Parameters.html

- verbose: how much information is displayed during training. The default is 1.
- metrics: the evaluation metric, i.e. how the model's error is measured (here, RMSE).
- num_round: the maximum number of boosting rounds (trees).
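Only a few parameters are set in this article. As an illustration (the name params_example and the values below are just common starting points, not values from the original article), a fuller dictionary often looks like this:

#Illustrative starting values only; tune them for your own data
params_example = {
    'objective': 'regression',    #squared-error regression
    'metric': 'rmse',             #evaluation metric
    'learning_rate': 0.05,        #smaller = slower but usually more accurate
    'num_leaves': 31,             #main complexity control of each tree
    'feature_fraction': 0.8,      #use a random 80% of features per tree
    'bagging_fraction': 0.8,      #use a random 80% of rows per iteration
    'bagging_freq': 1,            #re-sample the rows every iteration
    'random_state': 1234,
    'verbose': -1,                #silence per-iteration logs
}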

(iii) Model training

model = lgb.train(params, lgb_train, num_boost_round = num_round)
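Note that lgb_eval created in (i) is not actually used in the call above. If you pass it as a validation set, you can also enable early stopping. A sketch, assuming a recent LightGBM version (older versions used an early_stopping_rounds argument instead of the callback):

#Monitor RMSE on the validation set and stop when it stops improving
model = lgb.train(
    params,
    lgb_train,
    num_boost_round=1000,  #upper bound; early stopping picks the best point
    valid_sets=[lgb_eval],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print("Best iteration:", model.best_iteration)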

(iv) Prediction

#Predict on the test data
prediction_LG = model.predict(df_test)

#Round off the decimal part
prediction_LG = np.round(prediction_LG)
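Before submitting, it is worth checking the error on the 30% holdout set created with train_test_split (a quick sketch; mean_squared_error comes from scikit-learn, which is already a dependency here):

from sklearn.metrics import mean_squared_error

#RMSE on the holdout set gives a rough idea of how good the model is
pred_holdout = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred_holdout))
print("Holdout RMSE:", rmse)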

(v) Creating the submission file

submission = pd.DataFrame({"Id": df_test.index, "SalePrice": prediction_LG})
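The DataFrame still needs to be written to disk; one line does it (the file name is arbitrary):

#index=False keeps only the two required columns, Id and SalePrice
submission.to_csv("submission.csv", index=False)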

That is all!

4. Conclusion

How was it? Although LGBM is well known, it seems to take beginners some time to implement.

I have introduced simple code so that you can understand how to implement it as easily as possible. You could also get it working just by copying the code, but I feel it is very important to understand, at least roughly, what each line of code means.

I hope this article helps you deepen your understanding.
