Introduction

I tried Yellowbrick briefly before [^1], but back then I only ran its sample code, so this time I wanted to find out what Yellowbrick can actually do. In this article I use Yellowbrick to build a LightGBM model, which is often used on Kaggle, and go as far as saving the model. Preprocessing such as feature creation and detailed evaluation of model accuracy are not covered, since there are cases where Yellowbrick cannot handle them.

Environment

The execution environment is as follows.

$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.13.6
BuildVersion:	17G8037

$ python3 --version
Python 3.7.4

Installing Yellowbrick is described in [^1], so it is omitted here. For installing LightGBM, see [^2].

Model building

Preparation

Import the libraries used in this article.

import pandas as pd
import numpy as np

import yellowbrick
from yellowbrick.datasets import load_bikeshare
from yellowbrick.model_selection import LearningCurve, ValidationCurve, FeatureImportances
from yellowbrick.regressor import ResidualsPlot


import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from joblib import dump, load

For the data, I use load_bikeshare, a dataset loader provided by Yellowbrick.

# Load data
X, y = load_bikeshare()
print(X.head())

There are 12 explanatory variables, all of them numeric. The target variable is the number of shared bikes rented. This time, I feed the data into LightGBM as is and build a model.

   season  year  month  hour  holiday  weekday  workingday  weather  temp  \
0       1     0      1     0        0        6           0        1  0.24   
1       1     0      1     1        0        6           0        1  0.22   
2       1     0      1     2        0        6           0        1  0.22   
3       1     0      1     3        0        6           0        1  0.24   
4       1     0      1     4        0        6           0        1  0.24   

   feelslike  humidity  windspeed  
0     0.2879      0.81        0.0  
1     0.2727      0.80        0.0  
2     0.2727      0.80        0.0  
3     0.2879      0.75        0.0  
4     0.2879      0.75        0.0

Before training, split the data into a training set and a test set. The ratio of training data to test data is set to 8:2.

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
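Because the split is random, the rows that end up in each set, and every score computed later, change from run to run unless a seed is fixed. The following is a minimal stdlib-only sketch of a seeded 8:2 split; the helper name `seeded_split` and the seed value are my own assumptions for illustration. In scikit-learn, the corresponding control is the `random_state` argument of `train_test_split`.

```python
import random


def seeded_split(n_samples, test_ratio=0.2, seed=1234):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_samples).

    Hypothetical helper: it only illustrates the idea of a reproducible split.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = seeded_split(10)
print(len(train_idx), len(test_idx))  # 8 2
```

Calling `seeded_split(10)` twice returns the same indices, which is what makes results comparable between runs.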

As mentioned above, the model is LightGBM. Since Yellowbrick works like an extension of scikit-learn, I use LightGBM's scikit-learn API [^3].

# Model
model = lgb.LGBMRegressor()

Try tuning

Now, let's use Yellowbrick's ValidationCurve to choose the hyperparameters. This time, I examine how the values of max_depth, n_estimators, and num_leaves relate to accuracy. See [^5] for the API specification of ValidationCurve.

Pass the model, the name of the parameter to examine, and the parameter range to ValidationCurve as follows. cv accepts either the number of cross-validation folds or a generator; this time the number of folds is set to 5. The last argument, scoring, specifies the metric used to judge accuracy; from the metrics defined by scikit-learn [^4], I chose neg_mean_squared_error.

visualizer = ValidationCurve(
    model, param_name="max_depth",
    param_range=np.arange(1, 11), cv=5, scoring='neg_mean_squared_error'
)
visualizer.fit(X_train, y_train)
visualizer.show()

The output is shown in the figure below; the vertical axis is neg_mean_squared_error. As the name suggests, this metric is the mean squared error multiplied by -1, so the higher a point is in the figure (the closer to 0), the better the accuracy. Looking at the Cross Validation Score, the accuracy hardly changes once max_depth is 6 or more, so I set max_depth to 6.
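To make the vertical axis concrete, here is a minimal stdlib-only sketch of how a single 5-fold cross-validated neg_mean_squared_error value could be formed. The stand-in model (it always predicts the mean of the training targets), the helper names, and the data are all assumptions for illustration; this is not LightGBM and not Yellowbrick's own code, and the fold layout only mimics an unshuffled K-fold split.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def neg_mean_squared_error(y_true, y_pred):
    """Mean squared error multiplied by -1, so closer to 0 is better."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -mse


def cross_val_neg_mse(y, n_splits=5):
    """Score the mean-predicting stand-in model on each validation fold."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(y), n_splits):
        mean_pred = sum(y[i] for i in train_idx) / len(train_idx)
        y_val = [y[i] for i in val_idx]
        scores.append(neg_mean_squared_error(y_val, [mean_pred] * len(y_val)))
    return scores


y = [3, 5, 4, 6, 8, 7, 9, 12, 10, 11]  # made-up rental counts
fold_scores = cross_val_neg_mse(y, n_splits=5)
cv_score = sum(fold_scores) / len(fold_scores)  # one point on the curve
print(fold_scores)
print(cv_score)
```

Every fold score is zero or negative, and the average of the five is what one point of the Cross Validation Score line represents in this simplified picture.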

Next, let's examine n_estimators in the same way. The program is as follows.

visualizer = ValidationCurve(
    model, param_name="n_estimators",
    param_range=np.arange(100, 1100, 100), cv=5, scoring='neg_mean_squared_error'
)
visualizer.fit(X_train, y_train)
visualizer.show()

The output is shown in the figure below. Looking at the Cross Validation Score, the accuracy is almost the same once n_estimators is 600 or higher, so I set n_estimators to 600.

Finally, check num_leaves in the same way. The program is as follows.

visualizer = ValidationCurve(
    model, param_name="num_leaves",
    param_range=np.arange(2, 54, 4), cv=5, scoring='neg_mean_squared_error'
)
visualizer.fit(X_train, y_train)
visualizer.show()

The output is shown in the figure below. Looking at the Cross Validation Score, the accuracy hardly changes once num_leaves is 20 or more, so I set it to 20.
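The same rule of thumb was used for all three parameters: take the smallest value beyond which the score barely improves. One possible way to write that rule down is sketched below. The function name, the tolerance, and the scores are made up for illustration and are not the values from the figures; it also assumes the parameter values are sorted in ascending order and that a higher score is better.

```python
def smallest_plateau_value(param_values, scores, tolerance):
    """Return the smallest parameter value whose score is within
    `tolerance` of the best score (higher score = better).

    Hypothetical helper; assumes param_values is sorted ascending.
    """
    best = max(scores)
    for value, score in zip(param_values, scores):
        if best - score <= tolerance:
            return value


# Made-up neg_mean_squared_error values for max_depth = 1..10
depths = list(range(1, 11))
cv_scores = [-9000, -5200, -3600, -2700, -2100, -1800, -1790, -1785, -1780, -1782]

print(smallest_plateau_value(depths, cv_scores, tolerance=50))  # 6
```

How large the tolerance should be is a judgment call, just as it is when reading the curve by eye.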

As shown above, ValidationCurve made it easy to tune the parameters. Define the model again with the chosen values.

# Model
model = lgb.LGBMRegressor(
    boosting_type='gbdt', 
    num_leaves=20, 
    max_depth=6, 
    n_estimators=600, 
    random_state=1234, 
    importance_type='gain')

Try drawing a learning curve

To see whether the model is underfitting or overfitting, let's look at how its accuracy changes with the amount of training data. This is easy to visualize with LearningCurve.

visualizer = LearningCurve(model, cv=5, scoring='neg_mean_squared_error')
visualizer.fit(X_train, y_train)
visualizer.show()

The result is shown in the figure below. The Cross Validation Score improves as the amount of data increases. Even so, the curve suggests that accuracy would not improve dramatically even if more data could be collected.

Importance of features

It is also easy to display the explanatory variables in order of their importance for predicting the number of shared bikes rented.

visualizer = FeatureImportances(model)
visualizer.fit(X_train, y_train)
visualizer.show()

The result is shown in the figure below. The most important variable was the hour of the day (hour). The other variables are shown relative to it, with its importance set to 100.
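The scaling to 100 can be sketched in a few lines. The raw importance values below are made up, not the ones from the figure, and the helper name is my own. As far as I know, this display corresponds to the `relative` option of FeatureImportances, which is on by default.

```python
def relative_importances(importances):
    """Scale importances so the largest becomes 100, sorted in descending order."""
    top = max(importances.values())
    scaled = {name: 100.0 * value / top for name, value in importances.items()}
    return sorted(scaled.items(), key=lambda item: item[1], reverse=True)


# Made-up raw importances for illustration only
raw = {"temp": 200.0, "hour": 800.0, "humidity": 40.0, "workingday": 100.0}

ranked = relative_importances(raw)
for name, pct in ranked:
    print(f"{name:>10}: {pct:5.1f}")
```

Because only the ratio to the top variable is shown, the plot tells you the ranking and relative weight, not the raw gain values.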

Model accuracy

As in the previous article [^1], check the accuracy with the residual plot.

visualizer = ResidualsPlot(model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()

The output is shown in the figure below. In the scatter plot there are places where the predicted values are off, but the $R^2$ value is 0.9 or higher. In the histogram of the residuals there is a peak around 0, so the accuracy seems good.

It seems that the score displayed in the figure cannot be changed to a metric other than $R^2$, so I calculated the RMSE separately; it was about 38.8. Now that we have a reasonable model, I will end model construction here.

model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
# The rmse of prediction is: 38.82245025441572
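The two numbers reported above measure different things: $R^2$ is unitless, while RMSE is in the same units as the target (here, rentals). A small stdlib-only sketch of both definitions follows; the function names and the data are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared residual."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

print(rmse(y_true, y_pred))
print(r2(y_true, y_pred))
```

With these numbers the squared residuals sum to 26, so the MSE is 6.5 and $R^2$ is $1 - 26/500 = 0.948$.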

Save model

Save the built model. When I looked it up, the scikit-learn documentation described saving models with joblib [^6], so I save it the same way.

dump(model, 'lightgbm.joblib')

Conclusion

The program above also draws on the LightGBM sample code [^7]. My impression is that Yellowbrick is convenient because it makes it easy to check the accuracy of a model and to produce plots that are useful for validation. On the other hand, it feels strange to have to "fit" the Visualizer every time...

[PYTHON] I tried learning LightGBM with Yellowbrick