[Python] Getting contributions to the predicted value with LightGBM's predict

# Introduction

When predicting with a model built with LightGBM, I had been using the `predict` function like this:

```python
pred = model.predict(data)
```


Then, while looking at the [official documentation](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.predict), I noticed that `predict` has a `pred_contrib` parameter and that it can return each feature's contribution to the prediction using SHAP, so I tried it out.

# Environment

 The environment is as follows.

```bash
$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.13.6
BuildVersion:   17G14042
```

Since I was working in JupyterLab (version 0.35.4), I will also list the version of the Python kernel.

```
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.
```

# Model building

Prepare the data and model used for prediction. The data is the Boston housing dataset provided by scikit-learn.

```python
import pandas as pd
import sklearn.datasets as skd

data = skd.load_boston()

df_X = pd.DataFrame(data.data, columns=data.feature_names)
df_y = pd.DataFrame(data.target, columns=['y'])
```

As shown below, the data has 506 rows and 13 columns, and every column is a non-null float, so we will build the model on it as is.

```python
df_X.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
```

We build the LightGBM model with the hyperparameters mostly at their defaults. One note: SHAP, which we use later, looks up the `objective` parameter stored in the model. The default would pose no problem here, but `objective` is stated explicitly in `params` (otherwise an error occurs later in `explainer.shap_values`).

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

df_X_train, df_X_test, df_y_train, df_y_test = train_test_split(df_X, df_y, test_size=0.2, random_state=4)

lgb_train = lgb.Dataset(df_X_train, df_y_train)
lgb_eval = lgb.Dataset(df_X_test, df_y_test)

params = {
    'seed': 4,
    'objective': 'regression',
    'metric': 'rmse'}

lgbm = lgb.train(params,
                 lgb_train,
                 valid_sets=lgb_eval,
                 num_boost_round=200,
                 early_stopping_rounds=20,
                 verbose_eval=50)
```

```
Training until validation scores don't improve for 20 rounds
[50]	valid_0's rmse: 3.58803
[100]	valid_0's rmse: 3.39545
[150]	valid_0's rmse: 3.31867
[200]	valid_0's rmse: 3.28222
Did not meet early stopping. Best iteration is:
[192]	valid_0's rmse: 3.27283
```

We also prepare the data to predict, which will be used below.

```python
# Data to predict
data_for_pred = pd.DataFrame([df_X_test.iloc[0, :]])
print(data_for_pred)
```

```
      CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS  RAD    TAX  \
8  0.21124  12.5   7.87   0.0  0.524  5.631  100.0  6.0821  5.0  311.0   

   PTRATIO       B  LSTAT  
8     15.2  386.63  29.93  
```

# Predicting the contributions to the predicted value with predict

You can predict by passing the data to `predict`.

```python
# Ordinary prediction
print(lgbm.predict(data_for_pred))
```

```
[16.12018486]
```

Now run `predict` with `pred_contrib=True` as an argument.

```python
# pred_contrib=True
print(lgbm.predict(data=data_for_pred, pred_contrib=True))
```

```
[[ 8.11013815e-01  1.62335755e-03 -6.90242856e-02  9.22244470e-03
   4.92616768e-01 -3.16444968e+00 -1.22276730e+00 -1.11934703e-01
   2.56615903e-02 -1.99428008e-01  1.25166390e+00  3.43507676e-02
  -4.03663118e+00  2.22982674e+01]]
```

For this data, then, we get a two-dimensional array with 1 row and 14 columns. The official documentation notes:

> Note that unlike the shap package, with pred_contrib we return a matrix with an extra column, where the last column is the expected value.

So columns 1 through 13 hold the contribution of each feature, and the 14th column holds the expected value. Let's verify this with SHAP.
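Before moving on to SHAP, a quick sanity check (a minimal sketch reusing `lgbm` and `data_for_pred` from above): the per-feature contributions plus the expected value in the last column should add up to the ordinary prediction.

```python
import numpy as np

pred = lgbm.predict(data_for_pred)
contrib = lgbm.predict(data_for_pred, pred_contrib=True)

# Per-feature contributions (all columns but the last) plus the
# expected value (last column) should reconstruct the prediction.
reconstructed = contrib[:, :-1].sum(axis=1) + contrib[:, -1]
print(np.allclose(pred, reconstructed))  # should print True
```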

# Confirm with SHAP

If you pass the trained model to `TreeExplainer` and output `shap_values` and `expected_value`, they indeed match the values output by `predict`. In other words, the same information used in the figure drawn by `force_plot` can be obtained with `predict`.

```python
import shap

explainer = shap.TreeExplainer(lgbm)
shap_values = explainer.shap_values(data_for_pred)
print('shap_values: ', shap_values)
print('expected_value: ', explainer.expected_value)

shap.force_plot(base_value=explainer.expected_value, shap_values=shap_values[0,:], features=data_for_pred.iloc[0,:], matplotlib=True)
```

```
shap_values:  [[ 8.11013815e-01  1.62335755e-03 -6.90242856e-02  9.22244470e-03
   4.92616768e-01 -3.16444968e+00 -1.22276730e+00 -1.11934703e-01
   2.56615903e-02 -1.99428008e-01  1.25166390e+00  3.43507676e-02
  -4.03663118e+00]]
expected_value:  22.29826737657883
```

![force_plot output](output_7_1.png)
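To make the match explicit, here is a small cross-check (a sketch, assuming the variables defined above are still in scope):

```python
import numpy as np

contrib = lgbm.predict(data_for_pred, pred_contrib=True)

# All columns but the last should equal shap_values, and the last
# column should equal the explainer's expected_value.
print(np.allclose(contrib[:, :-1], shap_values))              # should print True
print(np.allclose(contrib[0, -1], explainer.expected_value))  # should print True
```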

By the way, what is this `expected_value`? Looking at SHAP's [documentation](https://github.com/slundberg/shap), `expected_value` is described as

> the average model output over the training dataset we passed

Hmm, maybe it is the mean of the target variable of the training data? When I computed the mean, it indeed matched.

```python
print(df_y_train.mean())
```

```
y    22.298267
dtype: float64
```
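Strictly speaking, the documentation says it is the average *model output* over the training data, not the target mean; the two just coincide for this regression model. A quick check (sketch):

```python
# Average model output over the training data; for this regression
# model it should approximately match the target mean above (~22.30).
print(lgbm.predict(df_X_train).mean())
```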

# Summary

So, we found that LightGBM's `predict` can output the `shap_values` and `expected_value`. Note, however, that the predicted value itself is not returned when `pred_contrib=True` is set. So I suppose it should be used along these lines:

```python
pred = model.predict(data)
contrib = model.predict(data, pred_contrib=True)

# When pred is larger than the mean of the training target:
#   use contrib to explain what pushed the prediction up
# When pred is smaller than the mean of the training target:
#   use contrib to explain what pushed the prediction down
```
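As a concrete sketch of that usage (hypothetical code, reusing `lgbm`, `df_X`, and `data_for_pred` from this article), you could sort the contributions to see which features pushed the prediction up or down:

```python
contrib = lgbm.predict(data_for_pred, pred_contrib=True)

# Drop the expected value in the last column, label the rest with the
# feature names, and sort by signed contribution.
feature_contrib = pd.Series(contrib[0, :-1], index=df_X.columns)
print(feature_contrib.sort_values(ascending=False))
# Positive values pushed this prediction above the expected value
# (~22.30); negative values (LSTAT and RM here) pushed it below.
```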

(By the way, I didn't know Jupyter could convert notebook files to Markdown. I'd like to make use of that from now on.)
