[PYTHON] I tried learning LightGBM with Yellowbrick

Introduction

I tried Yellowbrick briefly before [^1], but I only ran its sample code, so this time I wanted to find out what Yellowbrick can actually do. In this article I build a LightGBM model, which is often used on Kaggle, with the help of Yellowbrick, and go as far as saving the model. Preprocessing such as feature engineering and detailed evaluation of model accuracy are not covered, because Yellowbrick cannot handle some of those tasks.

Environment

The execution environment is as follows.

$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.13.6
BuildVersion:	17G8037
$ python3 --version
Python 3.7.4

Installing Yellowbrick is covered in [^1], so it is omitted here. For installing LightGBM, see [^2].

Model building

Preparation

Import the libraries used in this article.

import pandas as pd
import numpy as np

import yellowbrick
from yellowbrick.datasets import load_bikeshare
from yellowbrick.model_selection import LearningCurve, ValidationCurve, FeatureImportances
from yellowbrick.regressor import ResidualsPlot


import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from joblib import dump, load

For the data, use the bikeshare dataset that Yellowbrick provides through load_bikeshare.

# Load data
X, y = load_bikeshare()
print(X.head())

There are 12 explanatory variables, all of them numeric. The target variable is the number of shared bikes rented. This time, I feed the data into LightGBM as is to build the model.

   season  year  month  hour  holiday  weekday  workingday  weather  temp  \
0       1     0      1     0        0        6           0        1  0.24   
1       1     0      1     1        0        6           0        1  0.22   
2       1     0      1     2        0        6           0        1  0.22   
3       1     0      1     3        0        6           0        1  0.24   
4       1     0      1     4        0        6           0        1  0.24   

   feelslike  humidity  windspeed  
0     0.2879      0.81        0.0  
1     0.2727      0.80        0.0  
2     0.2727      0.80        0.0  
3     0.2879      0.75        0.0  
4     0.2879      0.75        0.0  

Split the data into a training set and a test set before training. The split ratio is set, somewhat arbitrarily, to 8:2.

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
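Because the split above does not fix a seed, different rows land in each set on every run, so the scores reported later will vary slightly between runs. The following is a minimal sketch, assuming you want a reproducible split: it passes random_state to train_test_split. The toy arrays and the seed value 1234 are illustrative choices, not part of the original run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the bikeshare data (hypothetical values, 10 rows)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Fixing random_state makes the 8:2 split identical on every run
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1234)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1234)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Both calls return exactly the same four arrays
identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(identical)  # True
```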

As mentioned above, the model is LightGBM. Because Yellowbrick works like an extension of scikit-learn, LightGBM is used through its scikit-learn API [^3].

# Model
model = lgb.LGBMRegressor()

Try tuning

Now let's use Yellowbrick's ValidationCurve to choose the hyperparameters. Here I examine how the values of max_depth, n_estimators, and num_leaves relate to accuracy. See [^5] for the API specification of ValidationCurve.

Pass the model, the name of the parameter to examine, and the parameter range to ValidationCurve as follows. cv accepts either the number of cross-validation folds or a cross-validation generator; here it is set to 5 folds. Finally, scoring specifies the metric used to measure accuracy; of the metrics defined by scikit-learn [^4], neg_mean_squared_error is used.

visualizer = ValidationCurve(
    model, param_name="max_depth",
    param_range=np.arange(1, 11), cv=5, scoring='neg_mean_squared_error'
)
visualizer.fit(X_train, y_train)
visualizer.show()

The output is shown in the figure below; the vertical axis is neg_mean_squared_error. As the name suggests, this metric is the mean squared error multiplied by -1, so points higher in the figure (closer to 0) indicate better accuracy. Looking at the Cross Validation Score, the accuracy barely changes once max_depth is 6 or more, so set max_depth to 6.

out01.png
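Since neg_mean_squared_error is just the negated MSE, the values on the vertical axis can be turned into an RMSE by flipping the sign and taking the square root. The sketch below illustrates that conversion; the three scores are made-up numbers, not values read from the figure.

```python
import math

# Hypothetical cross-validation scores as produced with
# scoring='neg_mean_squared_error' (always <= 0)
neg_mse_scores = [-2500.0, -1600.0, -1444.0]

# Flip the sign to get the MSE, then take the square root for the RMSE,
# which is in the same unit as the target (number of rentals)
rmse_scores = [math.sqrt(-s) for s in neg_mse_scores]
print(rmse_scores)  # [50.0, 40.0, 38.0]

# "Greater is better" holds for the negated score: the largest
# neg-MSE corresponds to the smallest RMSE
best = max(range(len(neg_mse_scores)), key=lambda i: neg_mse_scores[i])
print(best, rmse_scores[best])  # 2 38.0
```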

Next, let's examine n_estimators in the same way. The program is as follows.

visualizer = ValidationCurve(
    model, param_name="n_estimators",
    param_range=np.arange(100, 1100, 100), cv=5, scoring='neg_mean_squared_error'
)
visualizer.fit(X_train, y_train)
visualizer.show()

The output is shown in the figure below. Looking at the Cross Validation Score, the accuracy is almost the same once n_estimators is 600 or higher, so set n_estimators to 600.

out02.png

Finally, check num_leaves in the same way. The program is as follows.

visualizer = ValidationCurve(
    model, param_name="num_leaves",
    param_range=np.arange(2, 54, 4), cv=5, scoring='neg_mean_squared_error'
)
visualizer.fit(X_train, y_train)
visualizer.show()

The output is shown in the figure below. Looking at the Cross Validation Score, the accuracy barely changes once num_leaves is 20 or more, so set num_leaves to 20.

out03.png
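In all three checks the value was chosen by eye: the smallest setting after which the cross-validation score stops improving. That rule can also be written down as a small helper. This is only a sketch under assumptions: the function name, the 1% tolerance, and the scores are all hypothetical, and the scores are invented rather than taken from the figures.

```python
def smallest_value_on_plateau(param_values, scores, tolerance=0.01):
    """Return the smallest parameter value whose score lies within
    `tolerance` (relative to the best score) of the best score.

    Scores must be "higher is better", e.g. neg_mean_squared_error.
    """
    best = max(scores)
    margin = abs(best) * tolerance
    candidates = [v for v, s in zip(param_values, scores) if s >= best - margin]
    return min(candidates)


# Hypothetical mean cross-validation scores for max_depth = 1..10
depths = list(range(1, 11))
mean_scores = [-9000, -6000, -4000, -3000, -2400,
               -2010, -2005, -2000, -2002, -2001]

print(smallest_value_on_plateau(depths, mean_scores))  # 6
```

As far as I know, a fitted ValidationCurve keeps the mean validation scores in its test_scores_mean_ attribute, which could be passed as scores here; treat that attribute name as an assumption and check it against the Yellowbrick version you use.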

As shown above, ValidationCurve made it easy to tune the parameters. Define the model again with the chosen values.

# Model
model = lgb.LGBMRegressor(
    boosting_type='gbdt', 
    num_leaves=20, 
    max_depth=6, 
    n_estimators=600, 
    random_state=1234, 
    importance_type='gain')

Try to draw a learning curve

To see whether the model is underfitting or overfitting, let's look at its accuracy while varying the amount of training data. LearningCurve makes this easy to visualize.

visualizer = LearningCurve(model, cv=5, scoring='neg_mean_squared_error')
visualizer.fit(X_train, y_train)
visualizer.show()

The result is shown in the figure below. The Cross Validation Score improves as the amount of data increases. Even if more data could be collected, however, the accuracy seems unlikely to improve dramatically.

out04.png

Importance of features

It is also easy to display explanatory variables in order of importance when predicting the number of shared bikes rented.

visualizer = FeatureImportances(model)
visualizer.fit(X_train, y_train)
visualizer.show()

The result is shown in the figure below. The most important feature was the hour of the day (hour). The other variables are displayed relative to this importance, which is scaled to 100.

out05.png
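The relative scaling in the figure is simple arithmetic: each importance is divided by the largest one and multiplied by 100. The sketch below reproduces that calculation on made-up numbers; the raw importances are hypothetical and are not the values from the trained model. To my understanding, FeatureImportances applies this scaling because its relative option defaults to True.

```python
# Hypothetical raw importances (for example, total gain per feature)
raw = {"hour": 800.0, "temp": 200.0, "workingday": 120.0, "humidity": 40.0}

# Scale so that the largest importance becomes 100
largest = max(raw.values())
relative = {name: 100.0 * value / largest for name, value in raw.items()}

# Print from most to least important
for name, value in sorted(relative.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>10}: {value:5.1f}")
```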

Model accuracy

As in the previous article [^1], check the accuracy by looking at the distribution of the residuals.

visualizer = ResidualsPlot(model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()

The output is shown in the figure below. The scatter plot shows some points where the prediction is off, but the $R^2$ value is 0.9 or more. The histogram of the residuals has its peak around 0, so the accuracy seems good.

out06.png

It seems that the score displayed in the figure cannot be changed to a metric other than $R^2$, so I computed the RMSE separately; it was about 38.8. Now that we have a reasonable model, model building ends here.

model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
# The rmse of prediction is: 38.82245025441572
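The two numbers answer different questions: $R^2$ is unitless and compares the model with always predicting the mean, while the RMSE is expressed in rentals. The following sketch computes both from their definitions on a tiny made-up example; the function name and the data are hypothetical and unrelated to the bikeshare results.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Compute R^2 and RMSE from their textbook definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares and total sum of squares
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Hypothetical targets and predictions
y_true_demo = [10.0, 20.0, 30.0, 40.0]
y_pred_demo = [12.0, 18.0, 33.0, 37.0]

r2, rmse = r2_and_rmse(y_true_demo, y_pred_demo)
print(round(r2, 3), round(rmse, 3))  # 0.948 2.55
```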

Save model

Save the trained model. The scikit-learn documentation describes how to save a model with joblib [^6], so I save it the same way.

dump(model, 'lightgbm.joblib') 
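The load function imported at the beginning is the counterpart of dump and restores a saved model. The sketch below shows the round trip. It uses a small LinearRegression as a stand-in so that it runs quickly; the assumption is that a fitted LGBMRegressor round-trips the same way. The file name is hypothetical, and the file is written to a temporary directory.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Stand-in model (assumption: a fitted LGBMRegressor behaves the same way)
demo_model = LinearRegression().fit([[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    dump(demo_model, path)   # save to disk
    restored = load(path)    # read it back into a new object

# The restored model gives the same prediction as the original
print(restored.predict([[3.0]]))
same = bool((restored.predict([[3.0]]) == demo_model.predict([[3.0]])).all())
print(same)  # True
```

joblib files are pickle-based, so only load files from sources you trust, and load them with library versions compatible with the ones used for saving.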

Conclusion

The code above also draws on the LightGBM sample code [^7]. My impression is that Yellowbrick is convenient because it makes it easy to check model accuracy and to produce plots that are useful for validation. On the other hand, it feels odd to have to call fit on a Visualizer every time...
