[PYTHON] Machine learning algorithm (linear regression summary & regularization)

Introduction

This is part of a series in which I work step by step through the theory, a Python implementation, and an analysis with scikit-learn of the algorithms covered earlier in "Classification of Machine Learning". I'm writing it for my own learning, so please forgive any mistakes.

This time, to wrap up the linear regression part, I summarize approximation with the Gaussian kernel, overfitting, and the L1 (Lasso) and L2 (Ridge) regularization that suppress it.

I referred to the following sites this time. Thank you very much.

About kernel approximation

Earlier, when I wrote about the generalization of linear regression, I said that "the basis function can be anything". Real-world regression problems cannot always be approximated by a straight line; sometimes a wiggly fit such as a polynomial or a trigonometric function is needed.

For example, consider $y = -x\sin(x) + \epsilon$, that is, $y = -x\sin(x)$ with noise added.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

fig, ax = plt.subplots()

# true function: y = -x * sin(x), drawn as a smooth curve
t = np.linspace(0, 2*np.pi, 100)
y1 = - t * np.sin(t)

# noisy samples: 40 random points with Gaussian noise added
n = 40
np.random.seed(seed=2020)
x = 2*np.pi * np.random.rand(n)
y2 = - x * np.sin(x) + 0.4 * np.random.normal(loc=0.0, scale=1.0, size=n)

ax.scatter(x, y2, color='blue', label="sample(with noise)")
ax.plot(t, y1, color='green', label="-x * sin(x)")
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=11)

plt.show()
gaussian_regression_1.png

The blue dots are the noisy samples, and the green solid line shows the assumed function. The noise comes from random numbers, but the seed is fixed so that the same result is obtained every time.

Here we happen to know the true function, but in practice we have to estimate it from the samples alone, without knowing whether the underlying function is a polynomial or something else.

About Gaussian kernel

The Gaussian kernel is defined as follows:

k(\boldsymbol{x}, \boldsymbol{x'})=\exp(-\beta||\boldsymbol{x}-\boldsymbol{x'}||^2)

$||\boldsymbol{x}-\boldsymbol{x'}||^2$ is the squared distance between the two vectors, and can also be written as $\sum_{j=1}^{n}(x_j-x'_j)^2$.

This function has the bell shape shown below; by moving its center and changing its width, combinations of such kernels can be fitted to the target data.

(Figure: Gaussian kernel)
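A shape like the one in the figure can be reproduced with something like the following sketch; the β values here are my own choice for illustration and are not from the original.

import numpy as np
import matplotlib.pyplot as plt

u = np.linspace(-4, 4, 200)   # u plays the role of x - x'

fig, ax = plt.subplots()
for beta in [0.5, 1.0, 5.0]:
  # larger beta -> narrower bell around the center x'
  ax.plot(u, np.exp(-beta * u**2), label="beta={}".format(beta))

ax.set_xlabel("x - x'")
ax.set_ylabel("k(x, x')")
ax.legend()
plt.show()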

Linear regression with Gaussian kernel

Now, write the function we want to find as

f({\bf x})=\sum_{i=1}^{N} \alpha_i k({\bf x}^{(i)}, {\bf x})

Writing the predicted value for an input $\bf x$ as $\hat{y}$, this becomes

\hat{y} =
\left(\begin{matrix}
k({\bf x}^{(1)}, {\bf x}) & k({\bf x}^{(2)}, {\bf x}) & \cdots & k({\bf x}^{(N)}, {\bf x})
\end{matrix}\right)
\left(\begin{matrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_N
\end{matrix}\right)

Next,

{\bf y} = 
\left(\begin{matrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(N)}
\end{matrix}\right)
,\;
K=\left(\begin{matrix}
k({\bf x}^{(1)}, {\bf x}^{(1)}) & k({\bf x}^{(2)}, {\bf x}^{(1)}) & \cdots & k({\bf x}^{(N)}, {\bf x}^{(1)}) \\
k({\bf x}^{(1)}, {\bf x}^{(2)}) & k({\bf x}^{(2)}, {\bf x}^{(2)}) & & \vdots \\
\vdots & & \ddots & \vdots \\
k({\bf x}^{(1)}, {\bf x}^{(N)}) & \cdots & \cdots & k({\bf x}^{(N)}, {\bf x}^{(N)})
\end{matrix}\right) 
,\;
{\bf \alpha}
=
\left(\begin{matrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_N
\end{matrix}\right)

Expanding it like this, $K$ is called the **Gram matrix**. The squared error $L$ between the observed values $\bf y$ and the predictions $K{\bf \alpha}$ is

L=({\bf y}- K {\bf \alpha})^{\mathrm{T}}({\bf y}- K {\bf \alpha})

Solving this for ${\bf \alpha}$ gives ${\bf \alpha} = (K^{\mathrm{T}} K)^{-1} K^{\mathrm{T}} {\bf y} = K^{-1} {\bf y}$. In other words, as long as the inverse of the Gram matrix can be computed, ${\bf \alpha}$ can be found. If the number of observed samples becomes very large, computing this inverse gets difficult, and another method has to be chosen.
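For reference, here is the differentiation step behind that solution (my own addition; it is just the usual least-squares argument):

\frac{\partial L}{\partial {\bf \alpha}} = -2K^{\mathrm{T}}({\bf y}- K {\bf \alpha}) = {\bf 0}
\;\Rightarrow\;
K^{\mathrm{T}}K{\bf \alpha} = K^{\mathrm{T}}{\bf y}
\;\Rightarrow\;
{\bf \alpha} = (K^{\mathrm{T}}K)^{-1}K^{\mathrm{T}}{\bf y}

and when $K$ is invertible this reduces to $K^{-1}{\bf y}$.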

Implementation in Python

I implemented a KernelRegression class directly from the formulas above.

class KernelRegression: # no regularization
  def __init__(self, beta=1):
    self.alpha = np.array([])
    self.beta = beta

  def _kernel(self, xi, xj):
    # Gaussian kernel k(xi, xj) = exp(-beta * ||xi - xj||^2)
    return np.exp(- self.beta * np.sum((xi - xj)**2))

  def gram_matrix(self, X):
    # Gram matrix K with K[i][j] = k(X[i], X[j]) (symmetric)
    N = X.shape[0]
    K = np.zeros((N, N))
    for i in range(N):
      for j in range(N):
        K[i][j] = self._kernel(X[i], X[j])
        K[j][i] = K[i][j]

    return K

  def fit(self, X, Y):
    # alpha = K^-1 y
    K = self.gram_matrix(X)
    self.alpha = np.linalg.inv(K)@Y

  def predict(self, X, x):
    # f(x) = sum_i alpha_i k(X[i], x)
    Y = 0
    for i in range(len(X)):
      Y += self.alpha[i] * self._kernel(X[i], x)
    return Y

Let's approximate the earlier sample with this class.

model = KernelRegression(beta=20)
model.fit(x, y2)

xp = np.linspace(0, 2*np.pi, 100)
yp = np.zeros(len(xp))

for i in range(len(xp)):
  yp[i] = model.predict(x, xp[i])


fig, ax = plt.subplots()

plt.ylim(-3,6)
ax.plot(xp, yp, color='purple', label="estimate curve")
ax.scatter(x, y2, color='blue', label="sample(with noise)")
ax.plot(t, y1, color='green', label="-x*sin(x)")
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=11)

plt.show()
gaussian_regression_2.png

The curve does pass through the given samples, but it oscillates wildly and looks nothing like the green curve.

About overfitting

As shown above, a model whose output is faithful to the training data but far from reality is said to be **overfitted**. It is a bit like a math test where you can solve a problem identical to one in the textbook, but not one that has been rearranged slightly.

Overfitted models are also described as having poor **generalization performance**: their prediction accuracy on new data is low. To prevent this, we need the concept of **regularization**.

Regularization

In the example above, the parameters ${\bf \alpha}$ became extremely large, producing a violently oscillating curve in order to force the fit through every sample point. Apparently this happens easily when the number of data points is very small or when there are many variables.

To prevent this, we can restrict the range of $\alpha$ when computing it by adding a **regularization term** to the squared error of the kernel regression. L1 regularization (Lasso) and L2 regularization (Ridge) are the best-known methods in machine learning; mixing the two is called Elastic Net regularization.

L2 regularization

Let's start with L2 rather than L1. Taking the squared error $L$ between the target function and the samples defined earlier, and adding $\lambda||{\bf \alpha}||_2^2$, we get

L=({\bf y}- K {\bf \alpha})^{\mathrm{T}}({\bf y}- K {\bf \alpha})+\lambda||{\bf \alpha}||_2^2

which is what we optimize. The larger $\lambda$ is, the more $\alpha$ is restricted, and the gentler the resulting curve becomes. Taking the partial derivative with respect to $\alpha$, setting it to 0, and solving for $\alpha$ gives

{\bf \alpha} = (K+\lambda I_N)^{-1}{\bf y}

This is an easy change: all we have to do is add $\lambda$ times the identity matrix to the Gram matrix before computing the inverse.

Implementation in Python

I changed the part of the KernelRegression class that computes $\alpha$; the rest is almost the same.

class KernelRegression: #L2 regularization
  def __init__(self, beta=1, lam=0.5):
    self.alpha = np.array([])
    self.beta = beta
    self.lam = lam

  def _kernel(self, xi, xj):
    return np.exp(- self.beta * np.sum((xi - xj)**2))

  def gram_matrix(self, X):
    N = X.shape[0]
    K = np.zeros((N, N))
    for i in range(N):
      for j in range(N):
        K[i][j] = self._kernel(X[i], X[j])
        K[j][i] = K[i][j]

    return K

  def fit(self, X, Y):
    # alpha = (K + lambda * I)^-1 y
    K = self.gram_matrix(X)
    self.alpha = np.linalg.inv(K + self.lam * np.eye(X.shape[0]))@Y

  def predict(self, X, x):
    Y = 0
    for i in range(len(X)):
      Y += self.alpha[i] * self._kernel(X[i], x)
    return Y

Let's approximate the same sample as in the unregularized case. I chose the values of $\beta$ and $\lambda$ more or less arbitrarily.

model = KernelRegression(beta=0.3, lam=0.1)
model.fit(x, y2)

xp = np.linspace(0, 2*np.pi, 100)
yp = np.zeros(len(xp))

for i in range(len(xp)):
  yp[i] = model.predict(x, xp[i])


fig, ax = plt.subplots()

plt.ylim(-3,6)
ax.plot(xp, yp, color='purple', label="estimate curve")
ax.scatter(x, y2, color='blue', label="sample(with noise)")
ax.plot(t, y1, color='green', label="-x*sin(x)")
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=11)

plt.show()
gaussian_regression_3.png

This time the curve matches well. The shape actually changes outside the displayed range, but you can see that within this range a fairly close approximation curve can be drawn.

$\beta$ and $\lambda$ are called **hyperparameters**; in practice you need to tune them, changing them little by little to minimize the loss.
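As a rough illustration of how such tuning might look, here is a minimal grid-search sketch (my own addition, not in the original); it scores each $(\beta, \lambda)$ pair on ten held-out noisy samples rather than on the training loss itself, which is one common way to do it.

# A minimal grid-search sketch: hold out the last 10 noisy samples and pick
# the (beta, lambda) pair with the smallest squared error on those points.
from itertools import product

x_tr, y_tr = x[:30], y2[:30]
x_va, y_va = x[30:], y2[30:]

best = None
for beta, lam in product([0.1, 0.3, 1.0, 3.0], [0.01, 0.1, 1.0]):
  m = KernelRegression(beta=beta, lam=lam)
  m.fit(x_tr, y_tr)
  pred = np.array([m.predict(x_tr, xi) for xi in x_va])
  err = np.mean((pred - y_va)**2)
  if best is None or err < best[0]:
    best = (err, beta, lam)

print("best (mse, beta, lambda):", best)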

L1 regularization

In L2 regularization we added $\lambda ||{\bf \alpha}||^2_{2}$ as the regularization term; in L1 regularization we add $\lambda||{\bf \alpha}||_1$ instead.
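Written out in full, the objective to minimize becomes (added here for symmetry with the L2 case):

L=({\bf y}- K {\bf \alpha})^{\mathrm{T}}({\bf y}- K {\bf \alpha})+\lambda||{\bf \alpha}||_1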

Finding the solution is a bit more involved, because the L1 regularization term cannot be differentiated. scikit-learn apparently implements it with a method called coordinate descent, which I don't fully understand yet, so in this article I will simply rely on scikit-learn.

L1 regularization can drive a significant portion of the parameters $\alpha$ to exactly 0 (i.e. yield a sparse solution), which helps remove unnecessary parameters.
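I won't implement coordinate descent here, but the operation at its core that produces these exact zeros is soft-thresholding. A tiny sketch of just that operator (my own addition):

# Soft-thresholding: shrink toward zero and clip small values to exactly
# zero -- this is what makes L1 solutions sparse.
def soft_threshold(rho, lam):
  return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.1, 1.5]), 0.5))
# the -0.3 and 0.1 entries are clipped to exactly 0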

Implementation in Python

Here I simply implement it with scikit-learn.

from sklearn.metrics.pairwise import rbf_kernel
from sklearn.linear_model import Lasso

# Gram matrix between the training points, and kernel values between the
# prediction grid xp and the training points (rbf_kernel's gamma defaults
# to 1/n_features, i.e. 1 here)
kx = rbf_kernel(x.reshape(-1,1), x.reshape(-1,1))
KX = rbf_kernel(xp.reshape(-1,1), x.reshape(-1,1))

clf = Lasso(alpha=0.01, max_iter=10000)
clf.fit(kx, y2)

yp = clf.predict(KX)

fig, ax = plt.subplots()
plt.ylim(-3,6)
ax.plot(xp, yp, color='purple', label="estimate curve")
ax.scatter(x, y2, color='blue', label="sample(with noise)")
ax.plot(t, y1, color='green', label="-x*sin(x)")
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=11)

plt.show()

print("coef: ", clf.coef_)

coef:  [-0.          0.          0.         -0.         -0.79703436 -0.
 -0.         -0.          0.         -0.         -0.          3.35731837
  0.         -0.         -0.77644607  0.         -0.          0.
 -0.         -0.19590705  0.         -0.          0.         -0.
  0.          0.         -0.         -0.         -0.          0.
 -0.          0.         -0.         -0.          0.          0.58052561
  0.3688375   0.         -0.          1.75380012]
gaussian_regression_4.png

This is also quite close to the original function, and you can see that a large portion of the parameters (coef) are exactly 0.
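To make the sparsity explicit, the surviving coefficients could be counted like this (a small addition of mine):

# Count how many of the 40 kernel weights Lasso kept non-zero.
print("non-zero coefficients:", np.sum(clf.coef_ != 0), "/", len(clf.coef_))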

ElasticNet

Both L1 and L2 regularization add a regularization term to the loss function, and the two can be mixed, with a parameter deciding how much weight to give each; this combination is called **ElasticNet**. In fact, scikit-learn's Lasso optimizes the same objective as ElasticNet with the mixing ratio set to pure L1, and a pure L2 mix corresponds to Ridge.
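For reference, here is a minimal sketch of the same kernelized fit using scikit-learn's ElasticNet (my own addition; the alpha and l1_ratio values are arbitrary choices):

from sklearn.linear_model import ElasticNet

# Same Gram-matrix features as in the Lasso example above;
# l1_ratio mixes the L1 (1.0) and L2 (0.0) penalties.
enet = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000)
enet.fit(kx, y2)
yp_enet = enet.predict(KX)

print("non-zero coefficients:", np.sum(enet.coef_ != 0))

Setting l1_ratio=1.0 reduces this to the Lasso fit above, while l1_ratio=0.0 corresponds to a purely L2 penalty.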

Summary

In linear regression, any function can be chosen as the basis function. Because the Gaussian kernel (and not only the Gaussian kernel) gives a very flexible model, it tends to overfit; adding constraints through L1 or L2 regularization makes it possible to build a regression model with better generalization performance.

I've written up the regression series over several posts; I hope it helps in understanding the Regression branch of the scikit-learn cheat sheet.

Comments and corrections are welcome, since this is only a rough summary.
