[PYTHON] Ridge Regression (for beginners) -Code Edition-

This time, I will summarize the implementation (code) of Ridge regression.

■ Ridge regression procedure

We will proceed through the following 7 steps.

  1. Preparation of module
  2. Data preparation
  3. Parameter search
  4. Creating a model
  5. Calculation of predicted value
  6. Residual plot
  7. Model evaluation

1. Preparation of module

First, import the required modules.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Module to read the dataset (note: load_boston is removed in scikit-learn 1.2 and later)
from sklearn.datasets import load_boston

#Module that separates training data and test data
from sklearn.model_selection import train_test_split

#Module for standardization (scaling each feature to zero mean and unit variance)
from sklearn.preprocessing import StandardScaler

#Module to search for the parameter (alpha)
from sklearn.linear_model import RidgeCV

#Module to plot the parameter (alpha) search results
from yellowbrick.regressor import AlphaSelection

#Module that performs Ridge regression (least squares method + L2 regularization term)
from sklearn.linear_model import Ridge

2. Data preparation

After acquiring the data, we divide it up to make it easier to work with.

#Loading Boston dataset
boston = load_boston()

#Divide into explanatory variables (X) and the objective variable (y)
X, y = boston.data, boston.target

#Standardization (zero mean, unit variance)
SS = StandardScaler()
X = SS.fit_transform(X)

#Divide into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=123)
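Note: load_boston is removed in scikit-learn 1.2 and later because of ethical concerns about the dataset. If you are on a newer version, the loader below, which is the replacement suggested in scikit-learn's own deprecation notice (and which requires network access), can stand in for the load_boston call above:

#Replacement loader for scikit-learn >= 1.2 (from the library's deprecation notice)
data_url = 'http://lib.stat.cmu.edu/datasets/boston'
raw_df = pd.read_csv(data_url, sep=r'\s+', skiprows=22, header=None)

#The raw file stores each record across two rows; stitch them back together
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]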

Standardization matters because, for example, when one feature (explanatory variable) takes 2-digit values and another takes 4-digit values, the latter dominates the model. Setting every feature's mean to 0 and variance to 1 aligns the scales.
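Concretely, StandardScaler transforms each feature value $x$ using that feature's mean $\mu$ and standard deviation $\sigma$:

$x' = \dfrac{x - \mu}{\sigma}$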

Setting random_state fixes the seed so that the train/test split comes out the same every run.

3. Parameter search

Ridge regression adds a regularization term to the least squares equation to avoid overfitting. Increasing alpha makes regularization stronger, and decreasing alpha makes it weaker.
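Written out, Ridge regression minimizes the least squares error plus an L2 penalty on the weights (the intercept $w_0$ is not penalized; $\hat{y}_i$ is the prediction for sample $i$):

$L(w) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{13} w_j^2$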

Perform grid search and cross-validation on the training data to find the optimal alpha.


#Set the search interval of the parameter (alpha)
alphas = np.logspace(-10, 1, 500)

#Cross-validate training data to find optimal alpha
ridgeCV = RidgeCV(alphas = alphas)

#Plot alpha
visualizer = AlphaSelection(ridgeCV)
visualizer.fit(X_train, y_train)

visualizer.show()
plt.show()


Output result
[Figure: AlphaSelection plot]

From the plot above, the optimal alpha = 8.588 was found.
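As a side note, if you do not need the plot, RidgeCV also exposes the selected value directly after fitting; a minimal sketch using the same alphas grid as above:

#Fit RidgeCV by itself and read off the selected alpha
ridgeCV.fit(X_train, y_train)
print(ridgeCV.alpha_)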

4. Creating a model

We will create a model of Ridge regression using the parameter (alpha) obtained earlier.


#Create an instance of Ridge regression
ridge = Ridge(alpha = 8.588)

#Generate a model from training data (least squares method + regularization term)
ridge.fit(X_train, y_train)

#Output intercept
print(ridge.intercept_)

#Output regression coefficient (slope)
print(ridge.coef_)


Output result


ridge.intercept_: 22.564747201001634

ridge.coef_: [-0.80818323  0.81261982  0.24268597  0.10593523 -1.39093785  3.4266411
 -0.23114806 -2.53519513  1.7685398  -1.62416829 -1.99056814  0.57450373
 -3.35123162]

ridge.intercept_: intercept (weight $w_0$)
ridge.coef_: regression coefficients / slopes (weights $w_1$ to $w_{13}$)

Therefore, a concrete numerical value in the model formula (regression formula) was obtained.

$y = w_0 + w_1x_1 + w_2x_2 + \cdots + w_{12}x_{12} + w_{13}x_{13}$
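As a sanity check (a small sketch of mine, not part of the original article), the fitted formula can be assembled from the attributes printed above:

#Assemble the regression formula from the fitted model's attributes
terms = ' + '.join(f'({w:.3f})*x{j}' for j, w in enumerate(ridge.coef_, start=1))
print(f'y = {ridge.intercept_:.3f} + {terms}')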

5. Calculation of predicted value

Feed the test data (X_test) into the fitted model to obtain the predicted values (y_pred).


y_pred = ridge.predict(X_test)
y_pred


Output result


y_pred: [15.25513373 27.80625237 39.25737057 17.59408487 30.55171547 37.48819278
 25.35202855 ..... 17.59870574 27.10848827 19.12778747 16.60377079 22.13542152]

6. Residual plot

Before evaluating the model, let's look at the residual plot.

Residual: the difference between the predicted value and the true value (y_pred - y_test)

#Creating drawing objects and subplots
fig, ax = plt.subplots()

#Residual plot
ax.scatter(y_pred, y_pred - y_test, marker = 'o')

#Plot a red horizontal line at y = 0
ax.hlines(y = 0, xmin = -10, xmax = 50, linewidth = 2, color = 'red')
 
#Set the axis label
ax.set_xlabel('y_pred')
ax.set_ylabel('y_pred - y_test')

#Set the graph title
ax.set_title('Residual Plot')

plt.show()


Output result
[Figure: residual plot]

The data points are balanced above and below the red line (y_pred - y_test = 0), which confirms that there is no large bias in the predicted values.
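As an additional numeric check (my own sketch, not in the original article), the mean residual should sit near 0 when the predictions are unbiased:

#Numeric check: an unbiased model has a mean residual close to 0
residuals = y_pred - y_test
print(residuals.mean(), residuals.std())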

7. Model evaluation

This time, we will evaluate using the coefficient of determination.
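score() returns the coefficient of determination $R^2$, which compares the model's squared error against that of always predicting the mean $\bar{y}$ (1 is a perfect fit, 0 is no better than predicting the mean):

$R^2 = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$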


#Score on the training data
print(ridge.score(X_train, y_train))

#Score on the test data
print(ridge.score(X_test, y_test))


Output result


Train Score: 0.763674626990198
Test Score: 0.6462122981958535
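For reference, one way to see what the regularization buys is to fit plain least squares on the same split; the sketch below is my own addition (LinearRegression does not appear in the original article):

#Comparison sketch: unregularized least squares on the same split
from sklearn.linear_model import LinearRegression

lin = LinearRegression()
lin.fit(X_train, y_train)
print(lin.score(X_train, y_train))
print(lin.score(X_test, y_test))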

■ Finally

As described above, we performed Ridge regression by following steps 1 through 7.

This time, for beginners, I have summarized only the implementation (code). When the timing is right, I would like to write a follow-up article on the theory (the mathematical formulas).

Thank you for reading.

References: A new textbook for data analysis using Python (Python 3 engineer certification data analysis test main teaching material)
