I tried to implement various methods for machine learning (prediction model) using scikit-learn.

Introduction

This time, I implemented machine learning prediction models using scikit-learn and summarized the points to note when using each method.

A series of steps for building a predictive model

The flow of building a prediction model is summarized below. Each phase has its own important points, but those details will be organized separately.

  1. Framing the problem: clarify the business issue to be solved.
  2. Data collection: organize the available data and assess whether the goal can be achieved.
  3. Basic data aggregation: visualize the characteristics of the data to be analyzed and compute basic summary statistics.
  4. Data preprocessing: clean the data by removing the dust hidden in it.
  5. Feature extraction: remove unnecessary features and keep only the necessary explanatory variables.
  6. Data normalization: normalize the data so that the features share the same scale.
  7. Method selection: choose an appropriate method for the data.
  8. Model training: learn the patterns in the data with the selected method.
  9. Model validation/evaluation: check the prediction accuracy of the trained model and evaluate its validity.

About scikit-learn

scikit-learn is a Python machine learning library.

Data collection

This time, we will build a prediction model using the Boston house price data published in the UCI Machine Learning Repository.

| Item | Overview |
|---|---|
| Dataset | Boston house-price |
| Number of samples | 506 |
| Number of columns | 14 |

The Python code is below.

#Import required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston

#Data set reading
boston = load_boston()

#Creating a data frame
#Storage of explanatory variables
df = pd.DataFrame(boston.data, columns = boston.feature_names)

#Add objective variable
df['MEDV'] = boston.target

#Check the contents of the data
df.head()

(Screenshot: output of df.head())

The explanation of each column name is omitted here.

- Explanatory variables: 13
- Objective variable: 1 (MEDV)
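
Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions, the same data can be assembled from the original source. The sketch below is adapted from scikit-learn's deprecation notice (it requires internet access; the feature names are the standard thirteen columns of this dataset):

#Alternative loading for scikit-learn >= 1.2, where load_boston is unavailable
import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
df = pd.DataFrame(data, columns=feature_names)
df['MEDV'] = target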

Basic aggregation of data

Since there are 13 explanatory variables this time, we will use a pair plot (multivariate association diagram) to efficiently see the relationships between each explanatory variable and the objective variable. I will use a library called seaborn for visualization. First, create the pair plot.

#Import required libraries
import seaborn as sns

#Multivariate association diagram
sns.pairplot(df, height=1.0) #'size' was renamed to 'height' in seaborn >= 0.9

(Screenshot: pair plot of all variables)

At first glance, RM (average number of rooms per dwelling unit) and MEDV (house price) seem to have a positive correlation. Let's narrow it down to these two and look a little more closely.

#Relationship between RM (average number of rooms per dwelling unit) and MEDV (house price)
sns.regplot(x='RM', y='MEDV', data=df)

(Screenshot: regression plot of RM vs. MEDV)

Looking at the relationship in detail in this way, it seems that there is a correlation between RM (average number of rooms per dwelling unit) and MEDV (house price).

Next, I would like to find the correlation coefficient matrix.

#Calculate the correlation coefficient matrix
df.corr()

(Screenshot: correlation coefficient matrix)
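
Since a 14 x 14 table of numbers is hard to scan, a heatmap makes the strong correlations easier to spot. A minimal sketch reusing seaborn (the figure size and color map are arbitrary choices):

#Visualize the correlation coefficient matrix as a heatmap
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()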

Data preprocessing

In preprocessing, we need to remove the dust hidden in the data (outliers, anomalies, missing values). Preprocessing is important in data analysis, but this time we will only check for missing values.

#Confirmation of missing values
df.isnull().sum()
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

Since the Boston house price data has no missing values, we will analyze it as is.
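
Had missing values been found, they would need to be handled before modeling. A minimal sketch of two common options (the median strategy here is just one illustrative choice):

#Option 1: drop rows that contain missing values
df_dropped = df.dropna()

#Option 2: impute missing values, e.g. with the per-column median
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)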

Data normalization

Feature engineering is skipped this time (though it normally should be done). Next, before building the models, the data is split into training data and evaluation data. After that, standardization is performed so that the explanatory variables share the same scale.

#Library import
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#Create training data and evaluation data
x_train, x_test, y_train, y_test = train_test_split(df.iloc[:, 0:13], df.iloc[:, 13],
                                                    test_size=0.2, random_state=1)

#Standardize data
sc = StandardScaler()
sc.fit(x_train) #Standardized with training data
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)
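
For reference, StandardScaler transforms each feature $x$ as follows, where $\mu$ and $\sigma$ are the mean and standard deviation computed on the training data only (which is why fit is called on x_train alone):

\begin{eqnarray}
z = \frac{x - \mu}{\sigma}
\end{eqnarray}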

Once this is done, we select methods and build prediction models. This time, I decided to implement the following methods.

  1. Linear regression (multiple regression)
  2. Ridge regression
  3. Lasso regression
  4. Elastic Net regression
  5. Random Forest regression
  6. GBDT (gradient boosting tree)
  7. SVR (Support Vector Regression)

1. About linear regression

The general prediction formula for linear regression is as follows.

\begin{eqnarray}
y = \sum_{i=1}^{n}(w_{i}x_{i})+b=w_{1}x_{1}+w_{2}x_{2}+\cdots+w_{n}x_{n}+b
\end{eqnarray}

- $w_i$: weight for explanatory variable $x_i$ (regression coefficient)
- $b$: bias (intercept)

#Library import
from sklearn.linear_model import LinearRegression

#Library for score calculation
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

#Model learning
lr = LinearRegression()
lr.fit(x_train_std, y_train)

#Forecast
pred_lr = lr.predict(x_test_std)

#Evaluation
#Coefficient of determination(R2)
r2_lr = r2_score(y_test, pred_lr)

#Average absolute error(MAE)
mae_lr = mean_absolute_error(y_test, pred_lr)

print("R2 : %.3f" % r2_lr)
print("MAE : %.3f" % mae_lr)

#Regression coefficient
print("Coef = ", lr.coef_)
#Intercept
print("Intercept =", lr.intercept_)

The output result is as follows.

R2 : 0.779
MAE : 3.113
Coef =  [-0.93451207  0.85487686 -0.10446819  0.81541757 -1.90731862  2.54650028
  0.25941464 -2.92654009  2.80505451 -1.95699832 -2.15881929  1.09153332
 -3.91941941]
Intercept = 22.44133663366339
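
For reference, the two evaluation metrics used throughout this post are defined as follows, where $\hat{y}_{i}$ is the predicted value and $\bar{y}$ is the mean of the measured values:

\begin{eqnarray}
R^{2} = 1 - \frac{\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n}(y_{i} - \bar{y})^{2}},\quad
MAE = \frac{1}{n}\sum_{i=1}^{n}|y_{i} - \hat{y}_{i}|
\end{eqnarray}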

Judging by the evaluation metrics alone is risky, so let's also plot the predicted values against the measured values in a scatter plot.

#Library import
import matplotlib.pyplot as plt
%matplotlib inline

plt.xlabel("pred_lr")
plt.ylabel("y_test")
plt.scatter(pred_lr, y_test)

plt.show()

(Screenshot: scatter plot of predicted vs. measured values, linear regression)

Looking at this result, the predictions don't seem unreasonable. Normally we would dig into the details here to improve accuracy, but this time let's move on to other methods.

2. About Ridge regression

Ridge regression adds a regularization term to the loss function of linear regression. The loss function of ordinary linear regression is as follows.

\begin{eqnarray}
L = (\boldsymbol{y} - X\boldsymbol{w})^{T}(\boldsymbol{y}-X\boldsymbol{w})
\end{eqnarray}

- $\boldsymbol{y}$: vector of the measured values of the objective variable
- $\boldsymbol{w}$: vector of regression coefficients
- $X$: matrix of measured values with $n$ samples and $m$ explanatory variables

In Ridge regression, the loss function changes as follows.

\begin{eqnarray}
L = (\boldsymbol{y} - X\boldsymbol{w})^{T}(\boldsymbol{y}-X\boldsymbol{w}) + \lambda\|\boldsymbol{w}\|_{2}^{2}
\end{eqnarray}

Ridge regression regularizes the model by adding the squared L2 norm of the weights $\boldsymbol{w}$, as described above.
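
For reference, this loss function has a closed-form minimizer, where $I$ is the identity matrix:

\begin{eqnarray}
\boldsymbol{w} = (X^{T}X + \lambda I)^{-1}X^{T}\boldsymbol{y}
\end{eqnarray}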

The Python code is below. It's easy with scikit-learn.

#Library import
from sklearn.linear_model import Ridge

#Model learning
ridge = Ridge(alpha=10)
ridge.fit(x_train_std, y_train)

#Forecast
pred_ridge = ridge.predict(x_test_std)

#Evaluation
#Coefficient of determination(R2)
r2_ridge = r2_score(y_test, pred_ridge)

#Average absolute error(MAE)
mae_ridge = mean_absolute_error(y_test, pred_ridge)

print("R2 : %.3f" % r2_ridge)
print("MAE : %.3f" % mae_ridge)

#Regression coefficient
print("Coef = ", ridge.coef_)

The regularization parameter was set somewhat arbitrarily here (scikit-learn's default is alpha=1.0). The output result is as follows.

R2 : 0.780
MAE : 3.093
Coef =  [-0.86329633  0.7285083  -0.27135102  0.85108307 -1.63780795  2.6270911
  0.18222203 -2.64613645  2.17038535 -1.42056563 -2.05032997  1.07266175
 -3.76668388]

I would like to show the predicted value and the measured value in a scatter plot.

plt.xlabel("pred_ridge")
plt.ylabel("y_test")
plt.scatter(pred_ridge, y_test)

plt.show()

(Screenshot: scatter plot of predicted vs. measured values, Ridge regression)

It's not much different from linear regression because we haven't tuned or selected variables.
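
In practice, α should be chosen by cross-validation rather than by hand. A minimal sketch using scikit-learn's RidgeCV (the alphas grid is an arbitrary assumption):

#Choose alpha by cross-validation
from sklearn.linear_model import RidgeCV

ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(x_train_std, y_train)
print("best alpha =", ridge_cv.alpha_)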

3. About Lasso regression

The Lasso regression and the Ridge regression have different regularization terms. In Lasso regression, the loss function changes as follows.

\begin{eqnarray}
L = \frac{1}{2}(\boldsymbol{y} - X\boldsymbol{w})^{T}(\boldsymbol{y}-X\boldsymbol{w}) + \lambda\|\boldsymbol{w}\|_{1}
\end{eqnarray}

Lasso regression differs from Ridge regression in that the regularization term is the L1 norm. I will omit the details this time.

The Python code is below.

#Library import
from sklearn.linear_model import Lasso

#Model learning
lasso = Lasso(alpha=0.05)
lasso.fit(x_train_std, y_train)

#Forecast
pred_lasso = lasso.predict(x_test_std)

#Evaluation
#Coefficient of determination(R2)
r2_lasso = r2_score(y_test, pred_lasso)

#Average absolute error(MAE)
mae_lasso = mean_absolute_error(y_test, pred_lasso)

print("R2 : %.3f" % r2_lasso)
print("MAE : %.3f" % mae_lasso)

#Regression coefficient
print("Coef = ", lasso.coef_)

The regularization parameter was set somewhat arbitrarily here (scikit-learn's default is alpha=1.0). The output result is as follows.

R2 : 0.782
MAE : 3.071
Coef =  [-0.80179157  0.66308749 -0.144492    0.81447322 -1.61462819  2.63721307
  0.05772041 -2.64430158  2.11051544 -1.40028941 -2.06766744  1.04882786
 -3.85778379]

I would like to show the predicted value and the measured value in a scatter plot.

plt.xlabel("pred_lasso")
plt.ylabel("y_test")
plt.scatter(pred_lasso, y_test)

plt.show()

(Screenshot: scatter plot of predicted vs. measured values, Lasso regression)

The Lasso regression doesn't change much either.
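
One useful property of the L1 penalty is that it drives some coefficients exactly to zero, so Lasso performs implicit feature selection. This becomes visible with a larger α; a minimal sketch (alpha=0.5 is an arbitrary value chosen only for illustration):

#Check how many coefficients Lasso drives to exactly zero
import numpy as np
from sklearn.linear_model import Lasso

lasso_strong = Lasso(alpha=0.5)
lasso_strong.fit(x_train_std, y_train)
print("non-zero coefficients:", np.sum(lasso_strong.coef_ != 0))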

4. About Elastic Net regression

Elastic Net regression is a method that combines L1 regularization and L2 regularization.
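
For reference, the Elastic Net loss function can be written in the notation above as follows (scikit-learn's ElasticNet controls the balance between the two penalties through its alpha and l1_ratio parameters):

\begin{eqnarray}
L = (\boldsymbol{y} - X\boldsymbol{w})^{T}(\boldsymbol{y}-X\boldsymbol{w}) + \lambda_{1}\|\boldsymbol{w}\|_{1} + \lambda_{2}\|\boldsymbol{w}\|_{2}^{2}
\end{eqnarray}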

The Python code is below.

#Library import
from sklearn.linear_model import ElasticNet

#Model learning
elasticnet = ElasticNet(alpha=0.05)
elasticnet.fit(x_train_std, y_train)

#Forecast
pred_elasticnet = elasticnet.predict(x_test_std)

#Evaluation
#Coefficient of determination(R2)
r2_elasticnet = r2_score(y_test, pred_elasticnet)

#Average absolute error(MAE)
mae_elasticnet = mean_absolute_error(y_test, pred_elasticnet)

print("R2 : %.3f" % r2_elasticnet)
print("MAE : %.3f" % mae_elasticnet)

#Regression coefficient
print("Coef = ", elasticnet.coef_)

The regularization parameter was set somewhat arbitrarily here (scikit-learn's default is alpha=1.0). The output result is as follows.

R2 : 0.781
MAE : 3.080
Coef =  [-0.80547228  0.64625644 -0.27082019  0.84654972 -1.51126947  2.66279832
  0.09096052 -2.51833347  1.89798734 -1.21656705 -2.01097151  1.05199894
 -3.73854124]

I would like to show the predicted value and the measured value in a scatter plot.

plt.xlabel("pred_elasticnet")
plt.ylabel("y_test")
plt.scatter(pred_elasticnet, y_test)

plt.show()

(Screenshot: scatter plot of predicted vs. measured values, Elastic Net regression)

Elastic Net regression hasn't changed much either.

5. About Random Forest regression

Next, we will build prediction models from the decision tree family. First is Random Forest regression.

Random Forest is an ensemble method that combines many different decision trees via bagging. Decision trees on their own tend to overfit; Random Forest is one way to mitigate that weakness.

The Python code is below.

#Library import
from sklearn.ensemble import RandomForestRegressor

#Model learning
RF = RandomForestRegressor()
RF.fit(x_train_std, y_train)

#Forecast
pred_RF = RF.predict(x_test_std)

#Evaluation
#Coefficient of determination(R2)
r2_RF = r2_score(y_test, pred_RF)

#Average absolute error(MAE)
mae_RF = mean_absolute_error(y_test, pred_RF)

print("R2 : %.3f" % r2_RF)
print("MAE : %.3f" % mae_RF)

#Variable importance
print("feature_importances = ", RF.feature_importances_)

The parameters are left at their defaults. The output result is as follows.

R2 : 0.899
MAE : 2.122
feature_importances =  [0.04563176 0.00106449 0.00575792 0.00071877 0.01683655 0.31050293
 0.01897821 0.07745557 0.00452725 0.01415068 0.0167309  0.01329619
 0.47434878]

I would like to show the predicted value and the measured value in a scatter plot.

plt.xlabel("pred_RF")
plt.ylabel("y_test")
plt.scatter(pred_RF, y_test)

plt.show()

(Screenshot: scatter plot of predicted vs. measured values, Random Forest regression)

This looks better than the linear models (linear regression, Ridge regression, Lasso regression, Elastic Net regression). It is worth knowing that Random Forest can be used for regression as well. Also, since Random Forest has no regression coefficients, we assess the validity of the model by looking at the feature importances instead, as plotted below.
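
To make the importances easier to read, they can be plotted against the feature names. A minimal sketch, assuming the RF model and the boston object defined above are still in scope:

#Plot feature importances sorted by value
import pandas as pd
import matplotlib.pyplot as plt

importances = pd.Series(RF.feature_importances_, index=boston.feature_names)
importances.sort_values().plot.barh()
plt.xlabel("feature importance")
plt.show()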

6. About GBDT (gradient boosting tree)

Next is GBDT (gradient boosting tree).

GBDT is another ensemble method: it builds decision trees sequentially, each new tree correcting the mistakes of the previous ones, with the aim of improving generalization performance.

The Python code is below.

#Library import
from sklearn.ensemble import GradientBoostingRegressor

#Model learning
GBDT = GradientBoostingRegressor()
GBDT.fit(x_train_std, y_train)

#Forecast
pred_GBDT = GBDT.predict(x_test_std)

#Evaluation
#Coefficient of determination(R2)
r2_GBDT = r2_score(y_test, pred_GBDT)

#Average absolute error(MAE)
mae_GBDT = mean_absolute_error(y_test, pred_GBDT)

print("R2 : %.3f" % r2_GBDT)
print("MAE : %.3f" % mae_GBDT)

#Variable importance
print("feature_importances = ", GBDT.feature_importances_)

The parameters are left at their defaults. The output result is as follows.

R2 : 0.905
MAE : 2.097
feature_importances =  [0.03411472 0.00042674 0.00241657 0.00070636 0.03040394 0.34353116
 0.00627447 0.10042527 0.0014266  0.0165308  0.03114765 0.01129208
 0.42130366]

I would like to show the predicted value and the measured value in a scatter plot.

plt.xlabel("pred_GBDT")
plt.ylabel("y_test")
plt.scatter(pred_GBDT, y_test)

plt.show()

(Screenshot: scatter plot of predicted vs. measured values, GBDT)

This is the most accurate method so far. Be aware, however, that GBDT can easily overfit if its parameters are not set properly; the main ones to watch are shown in the sketch below.
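
The main parameters to watch are n_estimators, learning_rate, max_depth, and subsample. A minimal sketch with hand-picked illustrative values (these are assumptions, not tuned settings):

#Illustrative (untuned) GBDT parameter settings
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

GBDT_tuned = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                       max_depth=3, subsample=0.8)
GBDT_tuned.fit(x_train_std, y_train)
print("R2 : %.3f" % r2_score(y_test, GBDT_tuned.predict(x_test_std)))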

7. About SVR (Support Vector Regression)

The last method is SVR (Support Vector Regression). The Support Vector Machine (SVM) is an algorithm originally developed to solve binary classification problems, so some people may assume it can only be used for classification. In fact, SVR extends SVM to continuous objective variables so that it can handle regression problems, and it is characterized by solving nonlinear regression problems with relatively high accuracy.

The Python code is below.

#Library import
from sklearn.svm import SVR

#Model learning
svr = SVR(kernel='linear', C=1, epsilon=0.1, gamma='auto') #Use a lowercase name to avoid shadowing the SVR class
svr.fit(x_train_std, y_train)

#Forecast
pred_SVR = svr.predict(x_test_std)

#Evaluation
#Coefficient of determination(R2)
r2_SVR = r2_score(y_test, pred_SVR)

#Average absolute error(MAE)
mae_SVR = mean_absolute_error(y_test, pred_SVR)

print("R2 : %.3f" % r2_SVR)
print("MAE : %.3f" % mae_SVR)

#Regression coefficient
print("Coef = ", SVR.coef_)

This time a linear kernel was used. Four other kernel functions are available ('poly', 'rbf', 'sigmoid', 'precomputed'), so kernel selection and parameter tuning are required.

The output result is as follows.

R2 : 0.780
MAE : 2.904
Coef =  [[-1.18218512  0.62268229  0.09081358  0.4148341  -1.04510071  3.50961979
  -0.40316769 -1.78305137  1.58605612 -1.78749695 -1.54742196  1.01255493
  -2.35263548]]

I would like to show the predicted value and the measured value in a scatter plot.

plt.xlabel("pred_SVR")
plt.ylabel("y_test")
plt.scatter(pred_SVR, y_test)

plt.show()

(Screenshot: scatter plot of predicted vs. measured values, SVR)

SVR is not as accurate as Random Forest and GBDT here, at least without tuning; a cross-validated search is sketched below.
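
Kernel and parameter choice for SVR are typically handled with a grid search. A minimal sketch using GridSearchCV (the parameter grid is an arbitrary assumption, not a tuned recommendation):

#Search kernel and parameters by cross-validation
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {"kernel": ["linear", "rbf"],
              "C": [0.1, 1, 10],
              "epsilon": [0.05, 0.1, 0.5]}
grid = GridSearchCV(SVR(gamma="auto"), param_grid, cv=5,
                    scoring="neg_mean_absolute_error")
grid.fit(x_train_std, y_train)
print("best params =", grid.best_params_)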

Script summary

The scripts so far are organized into one place below.

#Import required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

def preprocess_sc(df):
    """Divide the data into training data and evaluation data and standardize

    Parameters
    ----------
    df : pd.DataFrame
Data set (explanatory variable + objective variable)

    Returns
    -------
    x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
    y_train : pd.DataFrame
Training data (objective variable)
    x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
    y_test : pd.DataFrame
Evaluation data (objective variable)
    """
    x_train, x_test, y_train, y_test = train_test_split(df.iloc[:, 0:13], df.iloc[:, 13],
                                                        test_size=0.2, random_state=1)

    #Standardize data
    sc = StandardScaler()
    sc.fit(x_train) #Standardized with training data
    x_train_std = sc.transform(x_train)
    x_test_std = sc.transform(x_test)

    return x_train_std, x_test_std, y_train, y_test

def Linear_Regression(x_train_std, y_train, x_test_std):  
    """Predict by linear regression

    Parameters
    ----------
    x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
    y_train : pd.DataFrame
Training data (objective variable)
    x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)

    Returns
    -------
    pred_lr : pd.DataFrame
Prediction results of linear regression
    """
    lr = LinearRegression()
    lr.fit(x_train_std, y_train)

    pred_lr = lr.predict(x_test_std)

    return pred_lr

def Ridge_Regression(x_train_std, y_train, x_test_std, ALPHA=10.0):  
    """Predict with Ridge regression

    Parameters
    ----------
    x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
    y_train : pd.DataFrame
Training data (objective variable)
    x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
    ALPHA : float
Regularization parameter α

    Returns
    -------
    pred_ridge : pd.DataFrame
Ridge regression prediction results
    """
    ridge = Ridge(alpha=ALPHA)
    ridge.fit(x_train_std, y_train)

    pred_ridge = ridge.predict(x_test_std)

    return pred_ridge

def Lasso_Regression(x_train_std, y_train, x_test_std, ALPHA=0.05):  
    """Predict by Lasso regression

    Parameters
    ----------
    x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
    y_train : pd.DataFrame
Training data (objective variable)
    x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
    ALPHA : float
Regularization parameter α

    Returns
    -------
    pred_lasso : pd.DataFrame
Lasso regression prediction results
    """
    lasso = Lasso(alpha=ALPHA)
    lasso.fit(x_train_std, y_train)

    pred_lasso = lasso.predict(x_test_std)

    return pred_lasso

def ElasticNet_Regression(x_train_std, y_train, x_test_std, ALPHA=0.05):  
    """Predict with Elastic Net regression

    Parameters
    ----------
    x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
    y_train : pd.DataFrame
Training data (objective variable)
    x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
    ALPHA : float
Regularization parameter α

    Returns
    -------
    pred_elasticnet : pd.DataFrame
Elastic Net regression prediction results
    """
    elasticnet = ElasticNet(alpha=ALPHA)
    elasticnet.fit(x_train_std, y_train)

    pred_elasticnet = elasticnet.predict(x_test_std)

    return pred_elasticnet

def RandomForest_Regressor(x_train_std, y_train, x_test_std):  
    """Predict with Random Forest regression

    Parameters
    ----------
    x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
    y_train : pd.DataFrame
Training data (objective variable)
    x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)

    Returns
    -------
    pred_RF : pd.DataFrame
Predicted results of Random Forest regression
    """
    RF = RandomForestRegressor()
    RF.fit(x_train_std, y_train)

    pred_RF = RF.predict(x_test_std)

    return pred_RF

def GradientBoosting_Regressor(x_train_std, y_train, x_test_std):  
    """Predict with GBDT

    Parameters
    ----------
    x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
    y_train : pd.DataFrame
Training data (objective variable)
    x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)

    Returns
    -------
    pred_GBDT : pd.DataFrame
GBDT prediction results
    """
    GBDT = GradientBoostingRegressor()
    GBDT.fit(x_train_std, y_train)

    pred_GBDT = GBDT.predict(x_test_std)

    return pred_GBDT

def SVR_Regression(x_train_std, y_train, x_test_std):  
    """Predict with SVR

    Parameters
    ----------
    x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
    y_train : pd.DataFrame
Training data (objective variable)
    x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)

    Returns
    -------
    pred_SVR : pd.DataFrame
GBDT prediction results
    """
    svr = SVR(kernel='linear', C=1, epsilon=0.1, gamma='auto') #Same settings as in section 7
    svr.fit(x_train_std, y_train)

    pred_SVR = svr.predict(x_test_std)

    return pred_SVR

def main():
    #Data set reading
    boston = load_boston()

    #Creating a data frame
    #Storage of explanatory variables
    df = pd.DataFrame(boston.data, columns = boston.feature_names)

    #Add objective variable
    df['MEDV'] = boston.target

    #Data preprocessing
    x_train_std, x_test_std, y_train, y_test = preprocess_sc(df)

    pred_lr = pd.DataFrame(Linear_Regression(x_train_std, y_train, x_test_std))
    pred_ridge = pd.DataFrame(Ridge_Regression(x_train_std, y_train, x_test_std, ALPHA=10.0))
    pred_lasso = pd.DataFrame(Lasso_Regression(x_train_std, y_train, x_test_std, ALPHA=0.05))
    pred_elasticnet = pd.DataFrame(ElasticNet_Regression(x_train_std, y_train, x_test_std, ALPHA=0.05))
    pred_RF = pd.DataFrame(RandomForest_Regressor(x_train_std, y_train, x_test_std))
    pred_GBDT = pd.DataFrame(GradientBoosting_Regressor(x_train_std, y_train, x_test_std))
    pred_SVR = pd.DataFrame(SVR_Regression(x_train_std, y_train, x_test_std))
    pred_all = pd.concat([pred_lr, pred_ridge, pred_lasso, pred_elasticnet, pred_RF, pred_GBDT, pred_SVR], axis=1, sort=False)
    pred_all.columns = ["pred_lr", "pred_ridge", "pred_lasso", "pred_elasticnet", "pred_RF", "pred_GBDT", "pred_SVR"]

    return pred_all

if __name__ == "__main__":
    pred_all = main()

Finally

Thank you for reading to the end. I hope this conveyed that there are various methods for building a prediction model, and that any of them can be implemented easily with scikit-learn.

In practice, the work continues from here: evaluating each model, tuning parameters, engineering features, and so on, in order to improve accuracy.

If you spot anything that needs correcting, I would appreciate it if you could let me know.
