This article summarizes how to implement LightGBM and how to tune its parameters automatically with Optuna.
LightGBM is a gradient boosting framework: a machine learning method that combines decision trees with boosting, a form of ensemble learning. (It is a framework that improves on XGBoost.)
XGBoost release: 2014 / LightGBM release: 2016
Its main characteristics:
① High prediction accuracy: together with XGBoost, it generally achieves the highest prediction accuracy among machine learning methods other than deep learning.
② Relatively short training time: it trains faster than XGBoost while offering comparable prediction accuracy. (This is why it is called "Light".)
③ Prone to overfitting: because its decision trees can become complex, overfitting is likely if the parameters are not tuned appropriately.
This time, we will work on the SIGNATE practice competition on evaluating automobiles. Link below. https://signate.jp/competitions/122
Read the data and convert the string values to numeric values.
python.py
import pandas as pd
import numpy as np
#Data reading
df = pd.read_csv('train.tsv', delimiter = '\t')
df = df.drop('id', axis = 1)
#Explanatory variable
df = df.replace({'buying': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'maint': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'doors': {'2': 2, '3': 3, '4': 4, '5': 5, '5more': 6}})
df = df.replace({'persons': {'2': 2, '4': 4, 'more': 6}})
df = df.replace({'lug_boot': {'small': 1, 'med': 2, 'big': 3}})
df = df.replace({'safety': {'low': 1, 'med': 2, 'high': 3}})
#Objective variable
df = df.replace({'class': {'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3}})
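As a quick sanity check of the conversion, the sketch below simply prints the column dtypes and the class balance (illustrative only; it is not required for the rest of the article):
python.py
#Confirm that every column is now numeric
print(df.dtypes)
#Class balance of the objective variable
print(df['class'].value_counts())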
Split into training data and evaluation data
python.py
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state = 0)
#Split the training data into explanatory variables (X_train) and the objective variable (y_train)
X_train = train_set.drop('class', axis=1)
y_train = train_set['class']
#Split the evaluation data into explanatory variables (X_test) and the objective variable (y_test)
X_test = test_set.drop('class', axis=1)
y_test = test_set['class']
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(691, 6)
(173, 6)
(691,)
(173,)
Convert to LightGBM dataset
python.py
import lightgbm as lgb
#Training data
lgb_train = lgb.Dataset(X_train, y_train)
#Evaluation data
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
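As an aside, LightGBM can also treat integer-encoded columns as categorical rather than ordered values via categorical_feature. Whether that helps here is a modelling choice, since buying, maint, etc. are arguably ordinal; a minimal sketch (the _cat variable names are just for illustration):
python.py
#Optional: flag the integer-encoded columns as categorical so LightGBM
#splits on category membership instead of numeric order
cat_cols = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
lgb_train_cat = lgb.Dataset(X_train, y_train, categorical_feature=cat_cols)
lgb_eval_cat = lgb.Dataset(X_test, y_test, reference=lgb_train_cat,
                           categorical_feature=cat_cols)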
Model learning
For binary classification: 'objective': 'binary', 'metric': 'binary_error' (the evaluation metric is the error rate, i.e. 1 - accuracy)
For regression: 'objective': 'regression', 'metric': 'rmse'
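For reference, here is a minimal sketch of those two parameter dictionaries (illustrative only; the multi-class settings actually used in this article follow below):
python.py
#Binary classification (sketch)
params_binary = {'objective': 'binary', 'metric': 'binary_error'}
#Regression (sketch)
params_regression = {'objective': 'regression', 'metric': 'rmse'}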
python.py
#Parameter setting
parms = {
'task': 'train', #For training
'boosting': 'gbdt', #Gradient boosting decision tree
'objective': 'multiclass', #Purpose: Multi-value classification
'num_class': 4, #Number of classes to classify
'metric': 'multi_error', #Evaluation metric: multi-class error rate
'num_iterations': 1000, #Up to 1000 boosting rounds
'verbose': -1 #Hide learning information
}
#Model learning
model = lgb.train(parms,
#Training data
train_set=lgb_train,
#Evaluation data
valid_sets=lgb_eval,
early_stopping_rounds=100)
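Note: the call above follows the older LightGBM API. In LightGBM 4.x the early_stopping_rounds argument was removed from lgb.train() in favour of callbacks; a minimal sketch of the equivalent call (assuming LightGBM 4.x):
python.py
#Equivalent call for LightGBM 4.x: early stopping is passed as a callback
model = lgb.train(parms,
                  train_set=lgb_train,
                  valid_sets=[lgb_eval],
                  callbacks=[lgb.early_stopping(stopping_rounds=100)])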
Check the result
python.py
#Predicting results
y_pred = model.predict(X_test)
#Predicted probability to integer
y_pred = np.argmax(y_pred, axis=1)
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))
#result
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       114
           1       0.93      0.98      0.95        42
           2       0.75      0.67      0.71         9
           3       1.00      1.00      1.00         8

    accuracy                           0.97       173
   macro avg       0.92      0.91      0.91       173
weighted avg       0.97      0.97      0.97       173
Accuracy: 97%
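To see where the remaining errors occur, a confusion matrix is also useful. A minimal sketch using scikit-learn on the same y_test / y_pred as above:
python.py
from sklearn.metrics import confusion_matrix
#Rows: true class, columns: predicted class (0=unacc, 1=acc, 2=good, 3=vgood)
print(confusion_matrix(y_test, y_pred))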
Next, we use "Optuna" to optimize the parameters.
Optuna is a software framework for automating hyperparameter optimization. It searches for parameter values that give good performance by automated trial and error. (It uses a Bayesian optimization algorithm called the Tree-structured Parzen Estimator.)
For details: ① Homepage https://preferred.jp/ja/projects/optuna/ ② Documentation https://optuna.readthedocs.io/en/stable/index.html
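Before using the LightGBM integration, it helps to see Optuna's basic workflow: you define an objective function that receives a trial object, sample candidate parameters from it, and let a study minimise (or maximise) the returned score. A minimal sketch on the data prepared above (the search ranges here are illustrative, not the ones the LightGBM tuner uses):
python.py
import optuna

def objective(trial):
    #Sample candidate hyperparameters for this trial
    params = {
        'objective': 'multiclass',
        'num_class': 4,
        'metric': 'multi_error',
        'verbosity': -1,
        'num_leaves': trial.suggest_int('num_leaves', 8, 64),
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
    }
    booster = lgb.train(params, lgb_train, num_boost_round=100)
    preds = np.argmax(booster.predict(X_test), axis=1)
    #Return the error rate; Optuna searches for parameters that minimise it
    return float(np.mean(preds != y_test))

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)
print(study.best_params)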
With the LightGBM Tuner integration, the following 7 parameters are automatically optimized: lambda_l1, lambda_l2, num_leaves, feature_fraction, bagging_fraction, bagging_freq, min_child_samples
Let's implement it.
python.py
#Import the LightGBM Tuner integration from Optuna
#(this replaces the plain "lgb" imported above, so lgb.train below runs the stepwise parameter search)
from optuna.integration import lightgbm as lgb
#Parameters to fix
params = {
"boosting_type": "gbdt",
'objective': 'multiclass',
'num_class': 4,
'metric': 'multi_error',
"verbosity": -1,
}
#Parameter search in Optuna
model = lgb.train(params, lgb_train,
valid_sets=[lgb_train, lgb_eval],
verbose_eval=100,
early_stopping_rounds=100,
)
#Display of optimal parameters
best_params = model.params
print("Best params:", best_params)
Best params: {
'objective': 'multiclass','num_class': 4, 'metric': 'multi_error',
'verbosity': -1, 'boosting_type': 'gbdt', 'feature_pre_filter': False,
'lambda_l1': 0.0, 'lambda_l2': 0.0, 'num_leaves': 31, 'feature_fraction':
0.8999999999999999, 'bagging_fraction': 1.0, 'bagging_freq': 0,
'min_child_samples': 20, 'num_iterations': 1000, 'early_stopping_round': 100
}
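The dictionary returned in model.params can also be reused outside the tuner, for example to retrain with the plain LightGBM package later. A minimal sketch, assuming the best_params shown above:
python.py
#Retrain with the tuned parameters using the plain LightGBM package
#(imported explicitly here because "lgb" now refers to the Optuna integration)
import lightgbm
retrained = lightgbm.train(best_params, lgb_train, valid_sets=[lgb_eval])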
Check the result
python.py
#Predict with the best iteration found during tuning
y_pred = model.predict(X_test, num_iteration=model.best_iteration)
#Convert predicted probabilities to class labels
y_pred = np.argmax(y_pred, axis=1)
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       114
           1       0.95      0.98      0.96        42
           2       0.78      0.78      0.78         9
           3       1.00      1.00      1.00         8

    accuracy                           0.98       173
   macro avg       0.93      0.94      0.93       173
weighted avg       0.98      0.98      0.98       173
Accuracy improved from 97% to 98%!
The prediction accuracy is also higher than the Random Forest from the previous article!