[PYTHON] Try to evaluate the performance of a machine learning regression model

1. Introduction

In this post, I will evaluate the performance of regression models used in machine learning while writing the code.

2. Data set

The dataset used is the Boston house prices dataset that comes with sklearn.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle

# --------Data set loading---------
dataset = load_boston()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.Series(dataset.target, name='y')
print('X.shape = ', X.shape)
print(X.join(y).head())

There are 506 samples in total and 13 features, and y is the target house price.

3. Regression model

Five regression models are used this time. For ease of use, each one is wrapped in a pipeline together with a StandardScaler. Hyperparameters are left at their defaults.

# ----------Pipeline settings-----------
pipelines = {
  '1.Linear': Pipeline([('std',StandardScaler()),
                        ('est',LinearRegression())]),
     
  '2.Ridge' : Pipeline([('std',StandardScaler()),
                        ('est',Ridge(random_state=0))]),

  '3.Tree'  : Pipeline([('std',StandardScaler()),
                        ('est',DecisionTreeRegressor(random_state=0))]),

  '4.Random': Pipeline([('std',StandardScaler()),
                        ('est',RandomForestRegressor(random_state=0, n_estimators=100))]),  
     
  '5.GBoost': Pipeline([('std',StandardScaler()),
                        ('est',GradientBoostingRegressor(random_state=0))])
}

**1.Linear**: a **linear regression model** fitted by ordinary least squares.

**2.Ridge**: a **ridge regression model**, which suppresses overfitting by adding an L2 regularization term to the linear regression objective (a sketch of the objective follows this list).

**3.Tree**: a regression model based on a single **decision tree**.

**4.Random**: a **random forest**, which builds many decision trees from randomly selected features and averages the predictions of all the trees.

**5.GBoost**: **gradient boosting**, which improves prediction accuracy by fitting each subsequent tree to the information (the residuals) that the existing trees cannot yet explain.
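As a quick reference for 2.Ridge, the objective minimized by scikit-learn's Ridge is the least-squares loss plus an L2 penalty on the coefficient vector w, weighted by alpha (alpha defaults to 1.0):

\min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2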

4. Evaluation metric

**R2_score** (the coefficient of determination) is used as the evaluation metric. It measures how much smaller the squared error between the predictions and the measured values is than the squared error between the measured values and their mean:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

If the predictions all agree with the measured values the score is 1, and if the predictions are bad enough the score can even become negative.
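As a minimal check of this definition (with made-up numbers, not the Boston data), R2 computed by hand matches sklearn's r2_score:

# ------Minimal check of the R2 definition (toy numbers)------
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.9, 4.2, 5.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the predictions
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error around the mean
print(1 - ss_res / ss_tot)                       # same value as the line below
print(r2_score(y_true, y_pred))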

5. Holdout method

First, the holdout method, a common way to evaluate the performance of a model, is performed. The data is split into **training data : test data = 8 : 2**, the model is trained on the training data, and generalization performance is then evaluated on the unseen test data.

# -----------Holdout method-----------
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.20, random_state=1)  

scores = {}
for pipe_name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    scores[(pipe_name,'train')] = r2_score(y_train, pipeline.predict(X_train))
    scores[(pipe_name,'test')] = r2_score(y_test, pipeline.predict(X_test))
print(pd.Series(scores).unstack())

As a result of the evaluation on the unseen test data, **5.GBoost** was the most accurate (0.924750).

**3.Tree**, which is the most accurate on the training data (1.0000), drops sharply to 0.821282 on the test data, indicating that it is overfitting.

**2.Ridge** should be an improvement over **1.Linear**, yet on the test data their accuracies are slightly reversed. This is because scores measured with the holdout method vary depending on how the data happens to be split, so the two will be compared again later with the stricter k-fold method.
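As a minimal sketch of that variation (not part of the original procedure, reusing the objects defined above), repeating the holdout split with different random_state values shows how much the test R2 of the linear model moves around:

# ------Holdout scores vary with the split (illustrative sketch)------
for seed in [0, 1, 2, 3, 4]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=seed)
    pipe = Pipeline([('std', StandardScaler()), ('est', LinearRegression())])
    pipe.fit(X_tr, y_tr)
    print(seed, r2_score(y_te, pipe.predict(X_te)))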

6. Residual plot

Residual plots are drawn to visualize model performance. Training data and test data are plotted with the **predicted value** on the horizontal axis and the **difference between the predicted value and the actual value** (the residual) on the vertical axis.

# -------------Residual plot------------
for pipe_name, est in pipelines.items():
    y_train_pred = est.predict(X_train)
    y_test_pred = est.predict(X_test)
    plt.scatter(y_train_pred, y_train_pred - y_train, c = 'blue', alpha=0.5, marker = 'o', label = 'train')
    plt.scatter(y_test_pred, y_test_pred - y_test, c = 'red', marker ='x', label= 'test' )
    plt.hlines(y = 0, xmin = 0, xmax = 50, color = 'black')
    plt.ylim([-20, 20])
    plt.xlabel('Predicted values')
    plt.ylabel('Residuals')        
    plt.title(pipe_name)
    plt.legend()
    plt.show()

The code outputs five residual plots, from 1.Linear to 5.GBoost, but only three representative ones are shown here.

**Linear regression model**: the residuals vary to roughly the same extent for the training data and the test data.

**Decision tree**: the residuals on the training data are exactly zero (100% accuracy), while the residuals on the test data vary widely. This is typical overfitting.

**Gradient boosting**: the variation of the residuals is kept small for both the training data and the test data.

7. k-fold method

The **k-fold method** (k-fold cross-validation) enables stricter model evaluation than the holdout method.

The specific procedure is to first divide the data into k parts, take each part in turn as the test data, and use the remaining k-1 parts as the training data.

The model is trained on the training data and scored on the test data, this is repeated k times, and the average of the k scores is taken as the accuracy of the model. Here k = 5 (specified by cv=5).

Note that **cross_val_score** does not shuffle the data automatically the way **train_test_split** does, so the utility **shuffle** is used to shuffle the data first.

# -------------- k-fold method--------------
X_shuffle, y_shuffle = shuffle(X, y, random_state=1)  # shuffle the data

scores = {}
for pipe_name, est in pipelines.items():
    cv_results = cross_val_score(est, X_shuffle, y_shuffle, cv=5, scoring='r2')
    scores[(pipe_name,'avg')] = cv_results.mean()
    scores[(pipe_name,'score')] = np.round(cv_results, 5)  # round to 5 decimal places
print(pd.Series(scores).unstack())

The five numbers in **score**, enclosed in [ ] next to **avg**, are the scores obtained in each of the five rounds. Looking at **1.Linear**, they vary from a **minimum of 0.64681 to a maximum of 0.76342**. Averaging these numbers gives a more rigorous evaluation of the model.

You can now see that **2.Ridge** is slightly more accurate than **1.Linear**.

Ultimately, the best model for the Boston house prices was **5.GBoost**.
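As a closing aside (not in the original procedure), the pre-shuffling step can also be avoided by passing a KFold splitter with shuffle=True as the cv argument of cross_val_score; a minimal sketch, reusing the pipelines defined above:

# ------Alternative: let KFold do the shuffling (illustrative sketch)------
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=1)
for pipe_name, est in pipelines.items():
    cv_results = cross_val_score(est, X, y, cv=kf, scoring='r2')
    print(pipe_name, np.round(cv_results.mean(), 5))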
