[PYTHON] LightGBM (Implementation and automatic parameter tuning with Optuna)

Introduction

This article summarizes how to implement LightGBM and how to tune its parameters automatically with Optuna.

What is LightGBM

LightGBM is a gradient boosting framework for machine learning that combines decision trees with boosting, a form of ensemble learning. (It is a framework that improves on XGBoost.)

XGBoost Release: 2014 LightGBM Release: 2016

Features of LightGBM

① High prediction accuracy: together with XGBoost, it generally achieves the highest prediction accuracy among machine learning methods other than deep learning.

② Relatively short training time: training costs less than XGBoost while offering comparable prediction accuracy. (This is why LightGBM is called "Light".)

③ Prone to overfitting: because the decision trees it builds are complex, overfitting is likely to occur if the parameters are not tuned appropriately (typical parameters to adjust are sketched below).
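
As a point of reference (these are not settings from this article, just commonly used knobs), here is a minimal sketch of the LightGBM parameters that are typically tightened to control overfitting; the values are illustrative starting points only.

python.py


#A sketch of parameters commonly adjusted to curb overfitting
#(illustrative values, not the settings used later in this article)
params_against_overfitting = {
    'num_leaves': 31,        #Fewer leaves -> simpler trees
    'max_depth': 7,          #Limit tree depth
    'min_data_in_leaf': 20,  #Require more samples per leaf
    'feature_fraction': 0.8, #Random subset of features per tree
    'bagging_fraction': 0.8, #Random subset of rows
    'bagging_freq': 1,       #Apply bagging every iteration
    'lambda_l1': 0.1,        #L1 regularization
    'lambda_l2': 0.1         #L2 regularization
}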

Implementation

This time, we will work on the SIGNATE car evaluation competition. Link below: https://signate.jp/competitions/122

Data preprocessing

Read the data and convert the string values to numeric values.

python.py


import pandas as pd
import numpy as np

#Data reading
df = pd.read_csv('train.tsv', delimiter = '\t')
df = df.drop('id', axis = 1)

#Explanatory variable
df = df.replace({'buying': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'maint': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'doors': {'2': 2, '3': 3, '4': 4, '5': 5, '5more': 6}})
df = df.replace({'persons': {'2': 2, '4': 4, 'more': 6}})
df = df.replace({'lug_boot': {'small': 1, 'med': 2, 'big': 3}})
df = df.replace({'safety': {'low': 1, 'med': 2, 'high': 3}})

#Objective variable
df = df.replace({'class': {'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3}}) 

Split into training data and evaluation data

python.py


from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, random_state = 0)

#Split the training data into explanatory variable data (X_train) and objective variable data (y_train)
X_train = train_set.drop('class', axis=1)
y_train = train_set['class']

#Split the evaluation data into explanatory variable data (X_test) and objective variable data (y_test)
X_test = test_set.drop('class', axis=1)
y_test = test_set['class']

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(691, 6)
(173, 6)
(691,)
(173,)

LightGBM implementation

Convert to LightGBM dataset

python.py


import lightgbm as lgb

#Training data
lgb_train = lgb.Dataset(X_train, y_train)
#Evaluation data
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

Model learning

python.py


#Parameter setting
params = {
    'task': 'train', #For training
    'boosting': 'gbdt', #Gradient boosting decision tree
    'objective': 'multiclass', #Purpose: multi-class classification
    'num_class': 4, #Number of classes to classify
    'metric': 'multi_error', #Evaluation metric: multi-class error rate
    'num_iterations': 1000, #Train for up to 1000 iterations
    'verbose': -1 #Suppress training log output
}

#Model learning
model = lgb.train(params,
                  #Training data
                  train_set=lgb_train,
                  #Evaluation data
                  valid_sets=lgb_eval,
                  early_stopping_rounds=100)

Check the result

python.py


#Predict on the evaluation data
y_pred = model.predict(X_test)
#Convert predicted probabilities to class labels
y_pred = np.argmax(y_pred, axis=1)

from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))


#result

precision    recall  f1-score   support

           0       1.00      0.99      1.00       114
           1       0.93      0.98      0.95        42
           2       0.75      0.67      0.71         9
           3       1.00      1.00      1.00         8

    accuracy                           0.97       173
   macro avg       0.92      0.91      0.91       173
weighted avg       0.97      0.97      0.97       173

Accuracy: 97%

Parameter adjustment with Optuna

Next, use "Optuna" to optimize the parameters.

What is Optuna

Optuna is a software framework for automating hyperparameter optimization. It automatically carries out trial and error over parameter values and discovers values that give good performance. (It uses a Bayesian optimization algorithm called the Tree-structured Parzen Estimator.)

For details, see: ① Homepage https://preferred.jp/ja/projects/optuna/ ② Documentation https://optuna.readthedocs.io/en/stable/index.html
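
Before moving on, here is a minimal sketch of Optuna's generic API (optuna.create_study with an objective function), as opposed to the LightGBM integration used below; the search ranges and trial count are illustrative assumptions, not settings from this article.

python.py


#A minimal sketch of Optuna's generic API (this article itself uses the LightGBM integration below)
import optuna
import lightgbm as lgb
import numpy as np
from sklearn.metrics import accuracy_score

def objective(trial):
    #Search ranges here are illustrative assumptions
    params = {
        'objective': 'multiclass',
        'num_class': 4,
        'metric': 'multi_error',
        'verbosity': -1,
        'num_leaves': trial.suggest_int('num_leaves', 8, 64),
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
    }
    booster = lgb.train(params, lgb_train, num_boost_round=100)
    y_pred = np.argmax(booster.predict(X_test), axis=1)
    return accuracy_score(y_test, y_pred)

#Maximize accuracy over 50 trials (the default sampler is TPE)
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)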

Implementation of Optuna

The following seven parameters are automatically optimized: lambda_l1, lambda_l2, num_leaves, feature_fraction, bagging_fraction, bagging_freq, min_child_samples.

Let's implement it.

python.py


#Import LightGBM via Optuna (its train() performs stepwise parameter tuning)
from optuna.integration import lightgbm as lgb

#Parameters to fix
params = {
    "boosting_type": "gbdt",
    'objective': 'multiclass',
    'num_class': 4,
    'metric': 'multi_error',
    "verbosity": -1,
}

#Parameter search in Optuna
model = lgb.train(params, lgb_train, 
                  valid_sets=[lgb_train, lgb_eval],
                  verbose_eval=100,
                  early_stopping_rounds=100,
                 )

#Display of optimal parameters
best_params = model.params
print("Best params:", best_params)


Best params: {
'objective': 'multiclass','num_class': 4, 'metric': 'multi_error', 
'verbosity': -1, 'boosting_type': 'gbdt', 'feature_pre_filter': False,
'lambda_l1': 0.0, 'lambda_l2': 0.0, 'num_leaves': 31, 'feature_fraction': 
0.8999999999999999, 'bagging_fraction': 1.0, 'bagging_freq': 0, 
'min_child_samples': 20, 'num_iterations': 1000, 'early_stopping_round': 100
}

Check the result

python.py


#Predict using the best iteration found during training
Y_pred = model.predict(X_test, num_iteration=model.best_iteration)
#Convert predicted probabilities to class labels
y_pred = np.argmax(Y_pred, axis=1)

from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))


precision    recall  f1-score   support

           0       1.00      0.99      1.00       114
           1       0.95      0.98      0.96        42
           2       0.78      0.78      0.78         9
           3       1.00      1.00      1.00         8

    accuracy                           0.98       173
   macro avg       0.93      0.94      0.93       173
weighted avg       0.98      0.98      0.98       173

In conclusion

Accuracy improved from 97% to 98%!

The prediction accuracy is also higher than the Random Forest from the previous article!
