[PYTHON] I tried NGBoost, a new gradient boosting method that can handle uncertainty

This article is written by a data scientist working in the manufacturing industry.
This time, I tried NGBoost, a variant of the gradient boosting that is widely used in data analysis, which can make predictions that take uncertainty into account.

Introduction

I covered building machine learning prediction models with scikit-learn in a past article, so please refer to that. This time I will write only about NGBoost.

What is NGBoost?

NGBoost (Natural Gradient Boosting) is a gradient boosting method that can handle prediction uncertainty. An ordinary regression model outputs a single predicted value, whereas NGBoost predicts a probability distribution over the output, so you also get a measure of how uncertain each prediction is. The basic idea is to represent the distribution of the output for each input by a set of parameters (for example, the mean and standard deviation of a normal distribution) and to learn those parameters with gradient boosting. NGBoost can be used for classification problems as well as regression problems. Please read the original paper for details. I do not yet understand all of the theory, so I plan to deepen my understanding while using it in practice.
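
To make the idea of "learning distribution parameters by gradient steps" a little more concrete, here is a minimal sketch for a single observation. It assumes a normal distribution parameterized by the mean and the log of the standard deviation, and it uses the negative log-likelihood as the loss. The function names and the learning rate are my own choices for illustration; this is a simplification and not the library's implementation, which fits base learners to such gradients over many boosting iterations.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under Normal(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi) + math.log(sigma) + (y - mu) ** 2 / (2 * sigma ** 2)

def natural_gradient(y, mu, sigma):
    """Natural gradient of the NLL with respect to (mu, log sigma).

    The ordinary gradient is multiplied by the inverse Fisher information,
    which for this parameterization is diag(1 / sigma^2, 2).
    """
    grad_mu = -(y - mu) / sigma ** 2
    grad_log_sigma = 1.0 - (y - mu) ** 2 / sigma ** 2
    return grad_mu * sigma ** 2, grad_log_sigma / 2.0

# One observation and one update step on the distribution parameters
# (illustrative values, not taken from the Boston data)
y, mu, log_sigma, lr = 3.0, 0.0, 0.0, 0.1
before = gaussian_nll(y, mu, math.exp(log_sigma))
g_mu, g_log_sigma = natural_gradient(y, mu, math.exp(log_sigma))
mu, log_sigma = mu - lr * g_mu, log_sigma - lr * g_log_sigma
after = gaussian_nll(y, mu, math.exp(log_sigma))

print("NLL before: %.3f" % before)
print("NLL after : %.3f" % after)
```

The step moves the mean toward the observation and widens the standard deviation, so the loss decreases. Both parameters are updated together, which is what lets the model express uncertainty and not only a point estimate.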

Implementation of NGBoost

This time, I will build a prediction model using the Boston house-price data published in the UCI Machine Learning Repository.

Item	Overview
Dataset	Boston house-price
Number of samples	506
Number of columns	14

First, install the library. Please refer to the user guide for detailed usage.

pip install --upgrade git+https://github.com/stanfordmlgroup/ngboost.git

The Python code is below.

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston

#Load the dataset
#Note: load_boston was removed in scikit-learn 1.2, so this code needs an older version
boston = load_boston()

#Create a data frame
#Store the explanatory variables
df = pd.DataFrame(boston.data, columns = boston.feature_names)

#Add the objective variable
df['MEDV'] = boston.target

#Check the contents of the data
df.head()

The model is trained as follows.

#Library import
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#Create training data and evaluation data
x_train, x_test, y_train, y_test = train_test_split(df.iloc[:, 0:13], df.iloc[:, 13], test_size=0.2, random_state=2)

#Standardize data
sc = StandardScaler()
sc.fit(x_train) #Fit the scaler on the training data only
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)

#Library for score calculation
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

#Library import
from ngboost import NGBRegressor

#Model learning
ngb = NGBRegressor()
ngb.fit(x_train_std, y_train)

#Forecast
pred_ngb = ngb.predict(x_test_std)

#Evaluation
#Coefficient of determination(R2)
r2_ngb = r2_score(y_test, pred_ngb)

#Mean absolute error(MAE)
mae_ngb = mean_absolute_error(y_test, pred_ngb)

print("R2 : %.3f" % r2_ngb)
print("MAE : %.3f" % mae_ngb)

#Variable importance (first row: loc parameter, second row: scale parameter)
print("feature_importances = ", ngb.feature_importances_)

The output is as follows.

[iter 0] loss=3.6377 val_loss=0.0000 scale=1.0000 norm=6.6433
[iter 100] loss=2.7355 val_loss=0.0000 scale=2.0000 norm=5.1141
[iter 200] loss=2.1841 val_loss=0.0000 scale=2.0000 norm=3.4826
[iter 300] loss=1.9234 val_loss=0.0000 scale=1.0000 norm=1.5236
[iter 400] loss=1.7831 val_loss=0.0000 scale=1.0000 norm=1.4034
R2 : 0.907
MAE : 2.066
feature_importances =  [[0.07639064 0.00286589 0.03962475 0.01478072 0.05049657 0.20370851
  0.06774932 0.14828321 0.02071867 0.06616878 0.06283506 0.07555015
  0.17082773]
 [0.0834246  0.00451744 0.04685921 0.00659447 0.04612649 0.22176486
  0.05597659 0.14181822 0.0302414  0.0725739  0.07465938 0.08480453
  0.1306389 ]]
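
The code above only uses the point prediction, so here is a small sketch of how the predicted uncertainty could be turned into prediction intervals. It assumes you already have a predicted mean and standard deviation for each sample. As far as I understand the user guide, these can be obtained from the fitted model with ngb.pred_dist(x_test_std) and its params dictionary (keys "loc" and "scale"), but please check this against your installed version. The helper names and all the numbers below are made up for illustration and are not results from the model above.

```python
from statistics import NormalDist

def prediction_intervals(means, stds, level=0.95):
    """Central prediction interval for each (mean, std) pair of a normal distribution."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return [(m - z * s, m + z * s) for m, s in zip(means, stds)]

def coverage(y_true, intervals):
    """Fraction of observed values that fall inside their interval."""
    hits = sum(lo <= y <= hi for y, (lo, hi) in zip(y_true, intervals))
    return hits / len(y_true)

# Made-up values for illustration only.
# With the fitted model they would come from something like:
#   dist = ngb.pred_dist(x_test_std)
#   means, stds = dist.params["loc"], dist.params["scale"]
means = [22.0, 30.5, 15.2]
stds = [2.0, 3.5, 1.5]
observed = [23.1, 20.0, 15.0]

intervals = prediction_intervals(means, stds)
for (lo, hi), y in zip(intervals, observed):
    print("interval = [%.2f, %.2f], observed = %.1f" % (lo, hi, y))
print("coverage : %.3f" % coverage(observed, intervals))
```

If the predicted distributions are well calibrated, the coverage of the 95% intervals on the evaluation data should be close to 0.95. This is one simple way to check whether the uncertainty estimates can be trusted.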

Next, I plot the predicted values against the measured values in a scatter plot.

#Library import
import matplotlib.pyplot as plt
#Jupyter Notebook magic; remove this line when running as a plain script
%matplotlib inline

plt.xlabel("pred_ngb")
plt.ylabel("y_test")
plt.scatter(pred_ngb, y_test)

plt.show()

Since the same data is used, I compare the MAE with that of the models I built with scikit-learn in a past article.

Method	MAE
NGBoost	2.066
SVR	2.904
GBDT	2.097
RF	2.122
ElasticNet	3.080
Lasso	3.071
Ridge	3.093

I have not done any parameter tuning, so I cannot claim that NGBoost is the best, but its MAE is the lowest of the methods compared here.
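
For reference, the two scores used in this article can be written out directly. This is a plain-Python sketch of the usual definitions of MAE and the coefficient of determination, with function names of my own choosing. In practice the scikit-learn functions used earlier are the better option. The numbers are made up for illustration.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only
y_true = [20.0, 25.0, 30.0]
y_pred = [21.0, 24.0, 33.0]

print("MAE : %.3f" % mae(y_true, y_pred))
print("R2 : %.3f" % r2(y_true, y_pred))
```

MAE is in the same unit as the objective variable, so a difference of 0.03 between two methods in the table corresponds to a very small difference in the predicted price.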

Conclusion

Thank you for reading to the end. This time, I only ran NGBoost using the library, so I would like to try it on real problems in the future.

If you find anything that needs correcting, I would appreciate it if you could let me know.