Introduction

When I ran scikit-learn's `plot_partial_dependence` on a model built with LightGBM, I got a NotFittedError and the call failed. This is a memo on how I worked around it.

Environment

The environment is as follows.

$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.13.6
BuildVersion:	17G14042

Since I was working in JupyterLab (version 0.35.4), I will also list the version of the Python kernel.

Python 3.7.3 (default, Mar 27 2019, 16:54:48) 
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.

What I did

Model building

Prepare the data and the model for prediction. The data used is the Boston housing dataset provided by scikit-learn.

import pandas as pd
import sklearn.datasets as skd

data = skd.load_boston()

df_X = pd.DataFrame(data.data, columns=data.feature_names)
df_y = pd.DataFrame(data.target, columns=['y'])

As shown below, the data has 506 rows and 13 columns, all of them non-null float64, so we will build the model on it as is.

df_X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB

In LightGBM, build the model with the hyperparameters left almost at their defaults.

Using the Training API

A Booster is returned when the model is built with the `train` function provided by the Training API. To state the conclusion first, passing a Booster as is to scikit-learn's `plot_partial_dependence` results in an error.

Let's see what kind of error occurs. First, train the model.

import lightgbm as lgb
from sklearn.model_selection import train_test_split

df_X_train, df_X_test, df_y_train, df_y_test = train_test_split(df_X, df_y, test_size=0.2, random_state=4)

lgb_train = lgb.Dataset(df_X_train, df_y_train)
lgb_eval = lgb.Dataset(df_X_test, df_y_test)

# Booster

params = {
    'seed':4,
    'objective': 'regression',
    'metric':'rmse'}

lgbm_booster = lgb.train(params,
                lgb_train,
                valid_sets=lgb_eval,
                num_boost_round=200,
                early_stopping_rounds=20,
                verbose_eval=50)

(Output abridged)

Training until validation scores don't improve for 20 rounds
[50]	valid_0's rmse: 3.58803
[100]	valid_0's rmse: 3.39545
[150]	valid_0's rmse: 3.31867
[200]	valid_0's rmse: 3.28222
Did not meet early stopping. Best iteration is:
[192]	valid_0's rmse: 3.27283

Pass the trained model `lgbm_booster` to `plot_partial_dependence`.

from sklearn.inspection import plot_partial_dependence
plot_partial_dependence(lgbm_booster, df_X_train, ['CRIM', 'ZN'])

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-7-9a44e58d2f50> in <module>
      1 from sklearn.inspection import plot_partial_dependence
----> 2 plot_partial_dependence(lgbm, df_X_train, ['CRIM', 'ZN'])

(Omitted)

/anaconda3/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py in partial_dependence(estimator, X, features, response_method, percentiles, grid_resolution, method)
    305     if not (is_classifier(estimator) or is_regressor(estimator)):
    306         raise ValueError(
--> 307             "'estimator' must be a fitted regressor or classifier."
    308         )
    309 


ValueError: 'estimator' must be a fitted regressor or classifier.

This confirms that a ValueError is raised: a Booster is neither a scikit-learn regressor nor a classifier, so it fails the check at line 305 of the traceback.
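The guard in the traceback above can be imitated without scikit-learn. The following is a minimal sketch, assuming the check boils down to reading an `_estimator_type` attribute, which is how I understand `is_regressor` and `is_classifier` to have worked in scikit-learn versions of that period. `BoosterLike`, `RegressorLike`, `validate` and the `*_like` helpers are hypothetical names made up for illustration; they are not LightGBM or scikit-learn code.

```python
# Simplified stand-ins for scikit-learn's is_regressor / is_classifier.
# Assumption: the real checks compare the `_estimator_type` attribute.
def is_regressor_like(estimator):
    return getattr(estimator, "_estimator_type", None) == "regressor"


def is_classifier_like(estimator):
    return getattr(estimator, "_estimator_type", None) == "classifier"


class BoosterLike:
    """Hypothetical object that can predict but declares no estimator type."""

    def predict(self, X):
        return [0.0 for _ in X]


class RegressorLike(BoosterLike):
    """Hypothetical wrapper that declares itself a regressor."""

    _estimator_type = "regressor"


def validate(estimator):
    # Mirrors the guard shown in the traceback.
    if not (is_classifier_like(estimator) or is_regressor_like(estimator)):
        raise ValueError("'estimator' must be a fitted regressor or classifier.")
    return True


print(validate(RegressorLike()))  # passes the guard
try:
    validate(BoosterLike())
except ValueError as err:
    print("rejected:", err)
```

In other words, being able to call `predict` is not enough; the object has to identify itself as a regressor or classifier, which is what the scikit-learn API wrapper in the next section does.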

Using the scikit-learn API

Now let's train the model with the scikit-learn API. Use LGBMRegressor for regression problems.

lgbm_sk = lgb.LGBMRegressor(objective=params['objective'],
                           random_state=params['seed'],
                           metric=params['metric'])
lgbm_sk.fit(df_X_train, df_y_train)

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              metric='rmse', min_child_samples=20, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31,
              objective='regression', random_state=4, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0)

Pass the trained model `lgbm_sk` to `plot_partial_dependence`.

plot_partial_dependence(lgbm_sk, df_X_train, ['CRIM', 'ZN'])

---------------------------------------------------------------------------

NotFittedError                            Traceback (most recent call last)

<ipython-input-9-d0724528e406> in <module>
----> 1 plot_partial_dependence(lgbm_sk, df_X_train, ['CRIM', 'ZN'])

(Omitted)

/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
    965 
    966     if not attrs:
--> 967         raise NotFittedError(msg % {'name': type(estimator).__name__})
    968 
    969 


NotFittedError: This LGBMRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

This is how the NotFittedError occurred. The versions of scikit-learn and lightgbm used here are `0.22.2.post1` and `2.3.0`, respectively.

from sklearn import __version__ as sk_ver
print(lgb.__version__)
print(sk_ver)

2.3.0
0.22.2.post1

When I looked into this NotFittedError, I found that an issue on GitHub had reported the same error. scikit-learn uses its `check_is_fitted` function to check whether the model is fitted, and the error seems to occur because the attribute that `check_is_fitted` looks for is not set by lightgbm's `fit`. It seems that upgrading resolves the problem, so I upgraded lightgbm. As of January 14, 2021, the latest version was `3.1.1`, so I specified that version.

conda install -c conda-forge lightgbm=3.1.1
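To make the explanation above more concrete, here is a minimal sketch of why a fitted model can still look unfitted. It assumes, as I understand the scikit-learn behaviour of that period, that `check_is_fitted` called without explicit attribute names scans the instance attributes for names ending in an underscore. `looks_fitted`, `OldStyleModel` and `NewStyleModel` are hypothetical names for illustration only; this is not the actual scikit-learn or LightGBM code.

```python
def looks_fitted(estimator):
    """Simplified stand-in for the attribute scan (not scikit-learn's code).

    Assumption: an estimator counts as fitted if it has at least one
    instance attribute whose name ends with an underscore.
    """
    return any(
        name.endswith("_") and not name.startswith("__")
        for name in vars(estimator)
    )


class OldStyleModel:
    """Hypothetical model that keeps its fitted state in a private attribute."""

    def __init__(self):
        self._booster = None

    def fit(self, X, y):
        self._booster = "trained"
        return self

    @property
    def booster_(self):
        # A property lives on the class, so it does not appear in vars(self).
        return self._booster


class NewStyleModel(OldStyleModel):
    """Hypothetical model whose fit() also sets a trailing-underscore attribute."""

    def fit(self, X, y):
        super().fit(X, y)
        self.fitted_ = True
        return self


X, y = [[0.0], [1.0]], [0.0, 1.0]
print(looks_fitted(OldStyleModel().fit(X, y)))  # False
print(looks_fitted(NewStyleModel().fit(X, y)))  # True
```

Both hypothetical models are trained, but only the second one leaves behind an instance attribute that such a scan can find.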

Check the versions again. Note that the scikit-learn version has also changed, to 0.23.1.

from sklearn import __version__ as sk_ver
print(lgb.__version__)
print(sk_ver)

3.1.1
0.23.1

Train the model again.

lgbm_sk = lgb.LGBMRegressor(objective=params['objective'],
                           random_state=params['seed'],
                           metric=params['metric'])
lgbm_sk.fit(df_X_train, df_y_train)

LGBMRegressor(metric='rmse', objective='regression', random_state=4)

Pass `lgbm_sk` to `plot_partial_dependence`.

plot_partial_dependence(lgbm_sk, df_X_train, ['CRIM', 'ZN'])

This time it ran without any error, and the result shown below was obtained.

Next, besides scikit-learn, use `pdpbox` to plot the partial dependence.

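import matplotlib.pyplot as plt  # needed for plt.figure / plt.show below
from pdpbox import pdp, get_dataset, info_plots
fig = plt.figure(figsize=(14,5))

# Compute the partial dependence of the prediction on LSTAT
pdp_goals = pdp.pdp_isolate(model=lgbm_sk, dataset=df_X_train, model_features=df_X_train.columns, feature='LSTAT')
# Plot it
pdp.pdp_plot(pdp_goals, 'LSTAT')
plt.show()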

I got the output shown below. It looks better than the scikit-learn plot.
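For reference, what both plots show is, in its brute-force form, the average prediction when one feature is forced to a fixed value for every row while the other features keep their observed values. The following is a dependency-free sketch of that idea only; it is not how scikit-learn or pdpbox implement it, and `partial_dependence_1d` and `toy_predict` are hypothetical names.

```python
def partial_dependence_1d(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    averages = []
    for value in grid:
        # Copy every row, overwriting only the feature of interest.
        modified = [
            row[:feature_index] + [value] + row[feature_index + 1:]
            for row in rows
        ]
        predictions = predict(modified)
        averages.append(sum(predictions) / len(predictions))
    return averages


def toy_predict(rows):
    # Hypothetical model standing in for a trained regressor: y = 2 * x0 + x1
    return [2.0 * row[0] + row[1] for row in rows]


rows = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
result = partial_dependence_1d(toy_predict, rows, feature_index=0, grid=[0.0, 1.0, 2.0])
print(result)  # [3.0, 5.0, 7.0]
```

With the toy model, the curve rises by 2 per unit of the first feature, which matches its coefficient; the offset of 3 is the mean of the second feature.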

Summary

The NotFittedError that occurred when passing a LightGBM model to scikit-learn's `plot_partial_dependence` was resolved by upgrading LightGBM.

[PYTHON] NotFittedError handling memo that appeared when executing Partial dependence in LightGBM