When I ran scikit-learn's `plot_partial_dependence` on a model built with LightGBM, I got a NotFittedError and it did not work. This is a memo on how I worked around it.
The environment is as follows.
$sw_vers
ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G14042
Since I was working in JupyterLab (version 0.35.4), I will also list the version of the Python kernel.
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.
Prepare the data and the model. The data is the Boston housing dataset provided by scikit-learn.
import pandas as pd
import sklearn.datasets as skd
data = skd.load_boston()
df_X = pd.DataFrame(data.data, columns=data.feature_names)
df_y = pd.DataFrame(data.target, columns=['y'])
As shown below, the data has 506 rows and 13 columns, and every column is a non-null float, so we will build the model on it as is.
df_X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
In LightGBM, build the model with hyperparameters that are almost all defaults.
Building the model with the `train` function of the Training API returns a Booster. To state the conclusion first, passing a Booster as is to scikit-learn's `plot_partial_dependence` results in an error.
Let's see what kind of error occurs. First, train the model.
import lightgbm as lgb
from sklearn.model_selection import train_test_split
df_X_train, df_X_test, df_y_train, df_y_test = train_test_split(df_X, df_y, test_size=0.2, random_state=4)
lgb_train = lgb.Dataset(df_X_train, df_y_train)
lgb_eval = lgb.Dataset(df_X_test, df_y_test)
# Training API: lgb.train returns a Booster
params = {
    'seed': 4,
    'objective': 'regression',
    'metric': 'rmse'}
lgbm_booster = lgb.train(params,
                         lgb_train,
                         valid_sets=lgb_eval,
                         num_boost_round=200,
                         early_stopping_rounds=20,
                         verbose_eval=50)
(Omitted)
Training until validation scores don't improve for 20 rounds
[50] valid_0's rmse: 3.58803
[100] valid_0's rmse: 3.39545
[150] valid_0's rmse: 3.31867
[200] valid_0's rmse: 3.28222
Did not meet early stopping. Best iteration is:
[192] valid_0's rmse: 3.27283
Pass the trained model `lgbm_booster` to `plot_partial_dependence`.
from sklearn.inspection import plot_partial_dependence
plot_partial_dependence(lgbm_booster, df_X_train, ['CRIM', 'ZN'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-9a44e58d2f50> in <module>
1 from sklearn.inspection import plot_partial_dependence
----> 2 plot_partial_dependence(lgbm, df_X_train, ['CRIM', 'ZN'])
(Omitted)
/anaconda3/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py in partial_dependence(estimator, X, features, response_method, percentiles, grid_resolution, method)
305 if not (is_classifier(estimator) or is_regressor(estimator)):
306 raise ValueError(
--> 307 "'estimator' must be a fitted regressor or classifier."
308 )
309
ValueError: 'estimator' must be a fitted regressor or classifier.
This confirms that a ValueError is raised when a Booster is passed.
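The check at line 305 of the traceback relies on `is_classifier` and `is_regressor`. As far as I understand, in scikit-learn 0.22 these helpers look at an `_estimator_type` attribute on the object, which a raw Booster does not declare. The following is a simplified sketch of that idea, not scikit-learn's actual code; `looks_like_regressor`, `FakeBooster`, and `FakeRegressor` are hypothetical names used only for illustration.

```python
def looks_like_regressor(estimator):
    """Simplified stand-in for the estimator-type test (an assumption,
    not scikit-learn's real implementation)."""
    return getattr(estimator, "_estimator_type", None) == "regressor"


class FakeBooster:
    """Hypothetical stand-in for a Booster: it can predict,
    but it does not declare an estimator type."""

    def predict(self, X):
        return [0.0 for _ in X]


class FakeRegressor:
    """Hypothetical stand-in for an sklearn-style regressor."""

    _estimator_type = "regressor"

    def predict(self, X):
        return [0.0 for _ in X]


print(looks_like_regressor(FakeBooster()))    # False
print(looks_like_regressor(FakeRegressor()))  # True
```

Under this assumption, having a `predict` method is not enough: the object also has to identify itself as a regressor or classifier, which is why the scikit-learn API wrapper is tried next.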
Now let's train the model with the scikit-learn API instead. Use LGBMRegressor for regression problems.
lgbm_sk = lgb.LGBMRegressor(objective=params['objective'],
                            random_state=params['seed'],
                            metric=params['metric'])
lgbm_sk.fit(df_X_train, df_y_train)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
metric='rmse', min_child_samples=20, min_child_weight=0.001,
min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31,
objective='regression', random_state=4, reg_alpha=0.0,
reg_lambda=0.0, silent=True, subsample=1.0,
subsample_for_bin=200000, subsample_freq=0)
Pass the trained model `lgbm_sk` to `plot_partial_dependence`.
plot_partial_dependence(lgbm_sk, df_X_train, ['CRIM', 'ZN'])
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
<ipython-input-9-d0724528e406> in <module>
----> 1 plot_partial_dependence(lgbm_sk, df_X_train, ['CRIM', 'ZN'])
(Omitted)
/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
965
966 if not attrs:
--> 967 raise NotFittedError(msg % {'name': type(estimator).__name__})
968
969
NotFittedError: This LGBMRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
This time, a NotFittedError occurred even though `fit` had been called.
The versions of scikit-learn and lightgbm used here are `0.22.2.post1` and `2.3.0`, respectively.
from sklearn import __version__ as sk_ver
print(lgb.__version__)
print(sk_ver)
2.3.0
0.22.2.post1
When I looked into this NotFittedError, I found that it had been reported in an issue on GitHub.
scikit-learn uses its `check_is_fitted` function to check whether the model is fitted, and the error seems to occur because lightgbm's `fit` does not set the attributes that `check_is_fitted` looks for.
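To illustrate the mechanism, here is a simplified sketch. It assumes the fitted check looks for instance attributes whose names end with an underscore, which is my understanding of scikit-learn 0.22.x; it is not the library's actual code. `check_is_fitted_sketch`, `PrivateStateModel`, and `PublicStateModel` are hypothetical names, and `NotFittedError` here is a local stand-in for the real exception.

```python
class NotFittedError(Exception):
    """Local stand-in for sklearn.exceptions.NotFittedError."""


def check_is_fitted_sketch(estimator):
    # Assumption: "fitted" means at least one instance attribute
    # ends with "_" (for example coef_).
    fitted_attrs = [name for name in vars(estimator)
                    if name.endswith("_") and not name.startswith("__")]
    if not fitted_attrs:
        raise NotFittedError(
            "This %s instance is not fitted yet." % type(estimator).__name__)


class PrivateStateModel:
    """Hypothetical model that stores learned state only in
    leading-underscore attributes."""

    def fit(self, X, y):
        self._booster = "trained"
        return self


class PublicStateModel:
    """Hypothetical model that sets a trailing-underscore attribute in fit."""

    def fit(self, X, y):
        self.coef_ = [1.0]
        return self


for model in (PrivateStateModel(), PublicStateModel()):
    model.fit([[0.0]], [0.0])
    try:
        check_is_fitted_sketch(model)
        print(type(model).__name__, "passes the check")
    except NotFittedError as err:
        print(type(model).__name__, "fails:", err)
```

In this sketch the first model fails the check even after `fit` has run, which mirrors the symptom seen with `LGBMRegressor` above.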
It seems that upgrading solves the problem, so we will upgrade lightgbm.
As of January 14, 2021, the latest version was `3.1.1`, so I specified that version.
conda install -c conda-forge lightgbm=3.1.1
Check the version.
from sklearn import __version__ as sk_ver
print(lgb.__version__)
print(sk_ver)
3.1.1
0.23.1
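If you want a notebook to check the version explicitly instead of reading it by eye, a small helper like the following could be used. This is only a sketch: `version_tuple`, `is_verified_version`, and `VERIFIED_LIGHTGBM` are hypothetical names, and 3.1.1 is simply the version confirmed to work in this memo, not necessarily the first release that contains the fix.

```python
import re


def version_tuple(version):
    """Turn a string such as '0.22.2.post1' into (0, 22, 2),
    keeping only the leading numeric components."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


# Assumption: the version verified in this memo, used as the threshold.
VERIFIED_LIGHTGBM = (3, 1, 1)


def is_verified_version(version):
    """Return True if the version is at least the one verified here."""
    return version_tuple(version) >= VERIFIED_LIGHTGBM


print(is_verified_version("2.3.0"))  # False
print(is_verified_version("3.1.1"))  # True
```

In the notebook above, this would be called as `is_verified_version(lgb.__version__)`.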
Train the model again.
lgbm_sk = lgb.LGBMRegressor(objective=params['objective'],
                            random_state=params['seed'],
                            metric=params['metric'])
lgbm_sk.fit(df_X_train, df_y_train)
LGBMRegressor(metric='rmse', objective='regression', random_state=4)
Pass `lgbm_sk` to `plot_partial_dependence`.
plot_partial_dependence(lgbm_sk, df_X_train, ['CRIM', 'ZN'])
This time it ran without any error, and the partial dependence plots for CRIM and ZN were obtained.
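For reference, what such a plot represents can be sketched with the brute-force approach to partial dependence: force one feature to each grid value for every row, predict, and average the predictions. The code below is a minimal illustration on a toy model, not what scikit-learn runs internally; `partial_dependence_brute` and `toy_predict` are hypothetical names.

```python
def partial_dependence_brute(predict, rows, feature_index, grid):
    """Average prediction over all rows, with one feature
    forced to each grid value in turn."""
    averages = []
    for value in grid:
        # Copy every row, replacing only the chosen feature.
        modified = [row[:feature_index] + [value] + row[feature_index + 1:]
                    for row in rows]
        predictions = predict(modified)
        averages.append(sum(predictions) / len(predictions))
    return averages


def toy_predict(rows):
    """Hypothetical toy model: y = 2 * x0 + x1."""
    return [2 * r[0] + r[1] for r in rows]


rows = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
print(partial_dependence_brute(toy_predict, rows, 0, [0.0, 1.0]))  # [3.0, 5.0]
```

Because the sketch only needs a callable that returns predictions, the same idea should in principle apply to any model with a `predict` method, although the rows would need converting to whatever input type that model expects.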
Next, in addition to scikit-learn, plot partial dependence with `pdpbox`.
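import matplotlib.pyplot as plt  # needed for plt.figure and plt.show below
from pdpbox import pdp, get_dataset, info_plots
fig = plt.figure(figsize=(14,5))
# compute the partial dependence of the prediction on LSTAT
pdp_goals = pdp.pdp_isolate(model=lgbm_sk, dataset=df_X_train, model_features=df_X_train.columns, feature='LSTAT')
# plot it
pdp.pdp_plot(pdp_goals,'LSTAT')
plt.show()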
This produced the partial dependence plot for LSTAT. It looks nicer than the scikit-learn one.
The NotFittedError that occurred when passing a LightGBM model to scikit-learn's `plot_partial_dependence` was solved by upgrading LightGBM.