[PYTHON] Ridge Regression (for beginners) -Code Edition-

This time, I will summarize the implementation (code) of Ridge regression.

■ Ridge regression procedure

We will proceed through the following 7 steps.

  1. Preparation of module
  2. Data preparation
  3. Parameter search
  4. Creating a model
  5. Calculation of predicted value
  6. Residual plot
  7. Model evaluation

1. Preparation of module

First, import the required modules.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Module to read the dataset (note: load_boston is removed in scikit-learn 1.2 and later)
from sklearn.datasets import load_boston

#Module that separates training data and test data
from sklearn.model_selection import train_test_split

#Module for standardization (scaling each feature to zero mean and unit variance)
from sklearn.preprocessing import StandardScaler

#Module to search for the parameter (alpha)
from sklearn.linear_model import RidgeCV

#Module to plot the parameter (alpha) search results
from yellowbrick.regressor import AlphaSelection

#Module that performs Ridge regression (least squares method + L2 regularization term)
from sklearn.linear_model import Ridge

2. Data preparation

After acquiring the data, we divide it up to make it easier to work with.

#Loading Boston dataset
boston = load_boston()

#Divide into explanatory variables (X) and the objective variable (y)
X, y = boston.data, boston.target

#Standardization (zero mean, unit variance)
SS = StandardScaler()
X = SS.fit_transform(X)

#Divide into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=123)
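Note: load_boston is removed in scikit-learn 1.2 and later because of ethical concerns about the dataset. If you are on a newer version, the loader below, which is the replacement suggested in scikit-learn's own deprecation notice (and which requires network access), can stand in for the load_boston call above:

#Replacement loader for scikit-learn >= 1.2 (from the library's deprecation notice)
data_url = 'http://lib.stat.cmu.edu/datasets/boston'
raw_df = pd.read_csv(data_url, sep=r'\s+', skiprows=22, header=None)

#The raw file stores each record across two rows; stitch them back together
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]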

Standardization matters because, for example, when one feature (explanatory variable) takes 2-digit values and another takes 4-digit values, the latter dominates the model. Setting every feature's mean to 0 and variance to 1 aligns the scales.
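Concretely, StandardScaler transforms each feature value $x$ using that feature's mean $\mu$ and standard deviation $\sigma$:

$x' = \dfrac{x - \mu}{\sigma}$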

Setting random_state fixes the seed so that the train/test split comes out the same every run.

3. Parameter search

Ridge regression adds a regularization term to the least squares equation to avoid overfitting. Increasing alpha makes regularization stronger, and decreasing alpha makes it weaker.
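Written out, Ridge regression minimizes the least squares error plus an L2 penalty on the weights (the intercept $w_0$ is not penalized; $\hat{y}_i$ is the prediction for sample $i$):

$L(w) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{13} w_j^2$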

Perform grid search and cross-validation on the training data to find the optimal alpha.


#Set the search interval of the parameter (alpha)
alphas = np.logspace(-10, 1, 500)

#Cross-validate training data to find optimal alpha
ridgeCV = RidgeCV(alphas = alphas)

#Plot alpha
visualizer = AlphaSelection(ridgeCV)
visualizer.fit(X_train, y_train)

visualizer.show()
plt.show()


Output result
[Figure: AlphaSelection plot]

From the plot above, the optimal alpha = 8.588 was found.
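As a side note, if you do not need the plot, RidgeCV also exposes the selected value directly after fitting; a minimal sketch using the same alphas grid as above:

#Fit RidgeCV by itself and read off the selected alpha
ridgeCV.fit(X_train, y_train)
print(ridgeCV.alpha_)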

4. Creating a model

We will create a model of Ridge regression using the parameter (alpha) obtained earlier.


#Create an instance of Ridge regression
ridge = Ridge(alpha = 8.588)

#Generate a model from training data (least squares method + regularization term)
ridge.fit(X_train, y_train)

#Output intercept
print(ridge.intercept_)

#Output regression coefficient (slope)
print(ridge.coef_)


Output result


ridge.intercept_: 22.564747201001634

ridge.coef_: [-0.80818323  0.81261982  0.24268597  0.10593523 -1.39093785  3.4266411
 -0.23114806 -2.53519513  1.7685398  -1.62416829 -1.99056814  0.57450373
 -3.35123162]

ridge.intercept_: intercept (weight $w_0$)
ridge.coef_: regression coefficients / slopes (weights $w_1$ to $w_{13}$)

Therefore, a concrete numerical value in the model formula (regression formula) was obtained.

$y = w_0 + w_1x_1 + w_2x_2 + \cdots + w_{12}x_{12} + w_{13}x_{13}$
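As a sanity check (a small sketch of mine, not part of the original article), the fitted formula can be assembled from the attributes printed above:

#Assemble the regression formula from the fitted model's attributes
terms = ' + '.join(f'({w:.3f})*x{j}' for j, w in enumerate(ridge.coef_, start=1))
print(f'y = {ridge.intercept_:.3f} + {terms}')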

5. Calculation of predicted value

Feed the test data (X_test) into the fitted model to obtain the predicted values (y_pred).


y_pred = ridge.predict(X_test)
y_pred


Output result


y_pred: [15.25513373 27.80625237 39.25737057 17.59408487 30.55171547 37.48819278
 25.35202855 ..... 17.59870574 27.10848827 19.12778747 16.60377079 22.13542152]

6. Residual plot

Before evaluating the model, let's look at the residual plot.

Residual: the difference between the predicted value and the true value (y_pred - y_test)

#Creating drawing objects and subplots
fig, ax = plt.subplots()

#Residual plot
ax.scatter(y_pred, y_pred - y_test, marker = 'o')

#Plot a red horizontal line at y = 0
ax.hlines(y = 0, xmin = -10, xmax = 50, linewidth = 2, color = 'red')
 
#Set the axis label
ax.set_xlabel('y_pred')
ax.set_ylabel('y_pred - y_test')

#Set the graph title
ax.set_title('Residual Plot')

plt.show()


Output result
[Figure: residual plot]

The data points are balanced above and below the red line (y_pred - y_test = 0), which confirms that there is no large bias in the predicted values.
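As an additional numeric check (my own sketch, not in the original article), the mean residual should sit near 0 when the predictions are unbiased:

#Numeric check: an unbiased model has a mean residual close to 0
residuals = y_pred - y_test
print(residuals.mean(), residuals.std())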

7. Model evaluation

This time, we will evaluate using the coefficient of determination.
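score() returns the coefficient of determination $R^2$, which compares the model's squared error against that of always predicting the mean $\bar{y}$ (1 is a perfect fit, 0 is no better than predicting the mean):

$R^2 = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$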


#Score on the training data
print(ridge.score(X_train, y_train))

#Score on the test data
print(ridge.score(X_test, y_test))


Output result


Train Score: 0.763674626990198
Test Score: 0.6462122981958535
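For reference, one way to see what the regularization buys is to fit plain least squares on the same split; the sketch below is my own addition (LinearRegression does not appear in the original article):

#Comparison sketch: unregularized least squares on the same split
from sklearn.linear_model import LinearRegression

lin = LinearRegression()
lin.fit(X_train, y_train)
print(lin.score(X_train, y_train))
print(lin.score(X_test, y_test))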

■ Finally

As described above, we performed Ridge regression by following steps 1 through 7.

This time, for beginners, I have summarized only the implementation (code). When the timing is right, I would like to write a follow-up article on the theory (the mathematical formulas).

Thank you for reading.

References: A new textbook for data analysis using Python (Python 3 engineer certification data analysis test main teaching material)
