This time, I will evaluate the performance of machine learning regression models while writing the code.
The dataset used is the Boston housing prices dataset that comes with sklearn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
# --------Data set loading---------
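# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so running this code as-is requires scikit-learn < 1.2.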
dataset = load_boston()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.Series(dataset.target, name='y')
print('X.shape = ', X.shape)
print(X.join(y).head())
There are 506 samples in total with 13 features, and y is the target house price.
Five regression models are used this time. They are bundled into pipelines for ease of use later. Hyperparameters are left at their defaults.
# ----------Pipeline settings-----------
pipelines = {
    '1.Linear': Pipeline([('std', StandardScaler()),
                          ('est', LinearRegression())]),
    '2.Ridge' : Pipeline([('std', StandardScaler()),
                          ('est', Ridge(random_state=0))]),
    '3.Tree'  : Pipeline([('std', StandardScaler()),
                          ('est', DecisionTreeRegressor(random_state=0))]),
    '4.Random': Pipeline([('std', StandardScaler()),
                          ('est', RandomForestRegressor(random_state=0, n_estimators=100))]),
    '5.GBoost': Pipeline([('std', StandardScaler()),
                          ('est', GradientBoostingRegressor(random_state=0))])
}
1.Linear: A **linear regression model (Linear)** fitted by the least squares method.
2.Ridge: A **ridge regression model (Ridge)** that suppresses overfitting by adding an L2 regularization term to the linear regression model.
3.Tree: A regression model based on a **decision tree (Tree)**.
4.Random: A **random forest (Random Forest)** that builds multiple decision trees from randomly selected features and outputs the average of all the trees' predictions.
5.GBoost: **Gradient boosting (Gradient Boosting)**, which improves prediction accuracy by having subsequent trees explain the information (residuals) that the existing trees could not explain; a minimal sketch of this idea follows below.
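To make the residual-fitting idea behind gradient boosting concrete, here is a minimal two-stage sketch (my own illustration, not the article's code: no learning rate or shrinkage, toy 1-D data, and names like tree1/residual are purely illustrative):
# ------- Gradient boosting idea: fit residuals (illustrative) -------
rng = np.random.RandomState(0)
X_demo = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)      # toy 1-D inputs
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, 80)    # noisy targets
tree1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)
residual = y_demo - tree1.predict(X_demo)                   # what tree 1 failed to explain
tree2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
y_pred = tree1.predict(X_demo) + tree2.predict(X_demo)      # sum of both stages
print('stage-1 MSE:', np.mean((y_demo - tree1.predict(X_demo)) ** 2))
print('stage-2 MSE:', np.mean((y_demo - y_pred) ** 2))      # smaller after stage 2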
Use **R2_score** (the coefficient of determination) as the evaluation index. It expresses how much smaller the squared error between predictions and actual values, Σ(y_i − ŷ_i)², is than the squared error between the actual values and their mean, Σ(y_i − ȳ)²:
R² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²
If the predictions all agree with the actual values the index is 1, and if the predictions are bad enough the index can even be negative.
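As a quick check, this definition can be computed by hand and compared with sklearn's r2_score (the toy arrays below are illustrative, not from the Boston data):
# ------- R2 by hand vs. r2_score (illustrative) -------
y_true = np.array([3.0, 2.5, 4.0, 5.1])        # actual values (toy data)
y_hat  = np.array([2.8, 2.9, 3.8, 5.0])        # predicted values (toy data)
ss_res = np.sum((y_true - y_hat) ** 2)          # squared error of predictions
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean
print('manual R2 :', 1 - ss_res / ss_tot)
print('sklearn R2:', r2_score(y_true, y_hat))   # same value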
First, the holdout method, a standard way to evaluate model performance, is performed. Generalization performance is examined by splitting the data into **training data : test data = 8 : 2**, training on the training data, and then evaluating on the unseen test data.
# -----------Holdout method-----------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
scores = {}
for pipe_name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    scores[(pipe_name, 'train')] = r2_score(y_train, pipeline.predict(X_train))
    scores[(pipe_name, 'test')] = r2_score(y_test, pipeline.predict(X_test))
print(pd.Series(scores).unstack())
As a result of the evaluation on the unseen test data, **5.GBoost** was the most accurate (0.924750).
**3.Tree**, which was the most accurate on the training data (1.0000), drops sharply to 0.821282 on the test data, which indicates overfitting.
**2.Ridge** should be an improved version of **1.Linear**, yet on the test data their accuracies are slightly reversed. This is because accuracy measured by the holdout method has some variation (illustrated by the sketch below); later we will compare the two more rigorously with the k-fold method.
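To see this variation concretely, here is a small check (my own addition, not part of the original experiment) that repeats the 8:2 holdout split with several random_state values; clone() refits fresh copies so the pipelines fitted above stay untouched:
from sklearn.base import clone
# ------- Holdout variation check (illustrative) -------
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=seed)
    lin = clone(pipelines['1.Linear']).fit(X_tr, y_tr).score(X_te, y_te)  # test R2
    rid = clone(pipelines['2.Ridge']).fit(X_tr, y_tr).score(X_te, y_te)   # test R2
    print(f'seed={seed}  Linear={lin:.4f}  Ridge={rid:.4f}')
The scores shift from seed to seed, which is exactly why a single holdout score is not a reliable comparison.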
To visualize model performance, residual plots are drawn. These plot the training data and the test data with the **predicted value** on the horizontal axis and the **difference between the predicted value and the actual value** on the vertical axis.
# -------------Residual plot------------
for pipe_name, est in pipelines.items():
    y_train_pred = est.predict(X_train)
    y_test_pred = est.predict(X_test)
    plt.scatter(y_train_pred, y_train_pred - y_train, c='blue', alpha=0.5, marker='o', label='train')
    plt.scatter(y_test_pred, y_test_pred - y_test, c='red', marker='x', label='test')
    plt.hlines(y=0, xmin=0, xmax=50, color='black')
    plt.ylim([-20, 20])
    plt.xlabel('Predicted values')
    plt.ylabel('Residuals')
    plt.title(pipe_name)
    plt.legend()
    plt.show()
The code outputs five residual plots, from 1.Linear to 5.GBoost; only three typical ones are shown here.
**Linear regression model**: the training data and the test data show roughly the same spread of residuals.
**Decision tree**: while the residuals on the training data are exactly zero (100% accuracy), the residuals on the test data vary widely. This is typical overfitting.
**Gradient boosting**: the spread of residuals is kept small for both the training data and the test data.
The **k-fold method** (k-fold cross-validation) enables a stricter model evaluation than the holdout method.
The procedure is to divide the data into k parts, select each part in turn as the test data, and use the remaining k−1 parts as the training data.
Training on the training data and measuring accuracy on the test data is repeated k times, and the average of the k accuracies is taken as the accuracy of the model. Here, k = 5 (specified by cv=5). (A manual version of this k-fold loop is sketched at the end of the article.)
Note that **cross_val_score** does not shuffle the data automatically the way **train_test_split** does, so the data are first shuffled with the **shuffle** utility before processing.
# -------------- k-fold method--------------
X_shuffle, y_shuffle = shuffle(X, y, random_state=1)  # shuffle the data
scores = {}
for pipe_name, est in pipelines.items():
    cv_results = cross_val_score(est, X_shuffle, y_shuffle, cv=5, scoring='r2')
    scores[(pipe_name, 'avg')] = cv_results.mean()
    scores[(pipe_name, 'score')] = np.round(cv_results, 5)  # round for display
print(pd.Series(scores).unstack())
The five numbers in **score**, enclosed in [] next to **avg**, are the accuracies computed on each fold. Looking at **1.Linear**, they vary from a **minimum of 0.64681 to a maximum of 0.76342**; averaging these numbers gives a more rigorous model evaluation.
You can see that **2.Ridge** is indeed slightly more accurate than **1.Linear**.
Ultimately, the best model for the Boston housing prices was **5.GBoost**.
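For reference, cross_val_score is equivalent to looping over the folds yourself. Below is a minimal manual version for 1.Linear (my own sketch: KFold with shuffle=True replaces the separate shuffle step, so the per-fold numbers will not match the table above exactly):
from sklearn.base import clone
from sklearn.model_selection import KFold
# ------- Manual k-fold loop (equivalent in spirit to cross_val_score) -------
kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    est = clone(pipelines['1.Linear'])                # fresh, unfitted copy
    est.fit(X.iloc[train_idx], y.iloc[train_idx])     # train on k-1 parts
    fold_scores.append(est.score(X.iloc[test_idx], y.iloc[test_idx]))  # R2 per fold
print('per-fold R2:', np.round(fold_scores, 5))
print('average R2 :', np.mean(fold_scores))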