Continuing from the algorithms taken up in "Classification of Machine Learning", I am working through the theory step by step, implementing it in Python, and checking the results with scikit-learn. I am writing this for my own learning, so please forgive any mistakes.
This time, to wrap up the linear regression series, I summarize approximation with the Gaussian kernel, overfitting, and the L1 regularization (Lasso) and L2 regularization (Ridge) that suppress it.
I referred to the following sites. Thank you very much.
Earlier, when I wrote about the generalization of linear regression, I said that "the basis function can be anything". Real-world regression problems cannot always be approximated with a straight line; sometimes a wiggly fit such as a polynomial or a trigonometric function is needed.
For example, consider $y = -x\sin(x) + \epsilon$, that is, $y = -x\sin(x)$ with noise added.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots()

# True function on a fine grid
t = np.linspace(0, 2*np.pi, 100)
y1 = - t * np.sin(t)

# Noisy samples drawn from the same function (fixed seed for reproducibility)
n = 40
np.random.seed(seed=2020)
x = 2*np.pi * np.random.rand(n)
y2 = - x * np.sin(x) + 0.4 * np.random.normal(loc=0.0, scale=1.0, size=n)

ax.scatter(x, y2, color='blue', label="sample(with noise)")
ax.plot(t, y1, color='green', label="-x * sin(x)")
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=11)
plt.show()
The blue dots are the noisy samples, and the green solid line shows the assumed function. The noise was added with random numbers, but the seed was fixed so that the same result is obtained each time.
Here we happen to know the true function, but in general we have to estimate it without knowing whether the samples came from a polynomial or from some other kind of function.
The Gaussian kernel is defined as follows:
k(\boldsymbol{x}, \boldsymbol{x'})=\exp(-\beta||\boldsymbol{x}-\boldsymbol{x'}||^2)
This function is a bell-shaped bump centered at $\boldsymbol{x'}$; by shifting its center and scaling its height, a sum of such kernels can be made to follow the target data.
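To get a feel for the shape, here is a quick plot of the kernel around a fixed center for a few values of $\beta$ (the values are arbitrary); the larger $\beta$ is, the narrower the bump.
import numpy as np
import matplotlib.pyplot as plt

# Plot exp(-beta * (x - c)^2) around a fixed center c for a few beta values
xs = np.linspace(-3, 3, 200)
c = 0.0  # kernel center x'
for beta in [0.5, 2.0, 10.0]:
    plt.plot(xs, np.exp(-beta * (xs - c)**2), label=f"beta={beta}")
plt.legend()
plt.show()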
Now, write the function we want to estimate as
f({\bf x})=\sum_{i=1}^{N} \alpha_i k({\bf x}^{(i)}, {\bf x})
Writing the predicted value at an input ${\bf x}$ as $\hat{y}$, this becomes
\hat{y} =
\left(\begin{matrix}
k({\bf x}^{(1)}, {\bf x}) & k({\bf x}^{(2)}, {\bf x}) & \cdots & k({\bf x}^{(N)}, {\bf x})
\end{matrix}\right)
\left(\begin{matrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_N
\end{matrix}\right)
Next,
{\bf y} =
\left(\begin{matrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(N)}
\end{matrix}\right)
,\;
K=\left(\begin{matrix}
k({\bf x}^{(1)}, {\bf x}^{(1)}) & k({\bf x}^{(2)}, {\bf x}^{(1)}) & \cdots & k({\bf x}^{(N)}, {\bf x}^{(1)}) \\
k({\bf x}^{(1)}, {\bf x}^{(2)}) & k({\bf x}^{(2)}, {\bf x}^{(2)}) & & \vdots \\
\vdots & & \ddots & \vdots \\
k({\bf x}^{(1)}, {\bf x}^{(N)}) & \cdots & \cdots & k({\bf x}^{(N)}, {\bf x}^{(N)})
\end{matrix}\right)
,\;
{\bf \alpha}
=
\left(\begin{matrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_N
\end{matrix}\right)
Then the predictions for all the sample points can be written compactly as $K{\bf \alpha}$. Here, $K$ is called the **Gram matrix**. The squared error $L$ between the observed values ${\bf y}$ and the predictions $K{\bf \alpha}$ is
L=({\bf y}- K {\bf \alpha})^{\mathrm{T}}({\bf y}- K {\bf \alpha})
Solving this for $\boldsymbol{\alpha}$ gives the coefficients directly.
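In detail: partially differentiating $L$ with respect to ${\bf \alpha}$, setting the result to ${\bf 0}$, and using the symmetry of $K$,
\frac{\partial L}{\partial {\bf \alpha}} = -2K({\bf y}- K {\bf \alpha}) = {\bf 0}
\;\Rightarrow\; K{\bf \alpha} = {\bf y}
\;\Rightarrow\; {\bf \alpha} = K^{-1}{\bf y}
assuming that $K$ is invertible.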
I implemented a KernelRegression class that uses this formula as-is.
class KernelRegression: # no regularization
    def __init__(self, beta=1):
        self.alpha = np.array([])
        self.beta = beta

    def _kernel(self, xi, xj):
        # Gaussian kernel k(xi, xj) = exp(-beta * ||xi - xj||^2)
        return np.exp(- self.beta * np.sum((xi - xj)**2))

    def gram_matrix(self, X):
        # Gram matrix K with K[i, j] = k(X[i], X[j])
        N = X.shape[0]
        K = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                K[i][j] = self._kernel(X[i], X[j])
                K[j][i] = K[i][j]
        return K

    def fit(self, X, Y):
        # alpha = K^{-1} y
        K = self.gram_matrix(X)
        self.alpha = np.linalg.inv(K)@Y

    def predict(self, X, x):
        # f(x) = sum_i alpha_i * k(X[i], x)
        Y = 0
        for i in range(len(X)):
            Y += self.alpha[i] * self._kernel(X[i], x)
        return Y
Let's approximate the sample from earlier using this class.
model = KernelRegression(beta=20)
model.fit(x, y2)
xp = np.linspace(0, 2*np.pi, 100)
yp = np.zeros(len(xp))
for i in range(len(xp)):
yp[i] = model.predict(x, xp[i])
fig, ax = plt.subplots()
plt.ylim(-3,6)
ax.plot(xp, yp, color='purple', label="estimate curve")
ax.scatter(x, y2, color='blue', label="sample(with noise)")
ax.plot(t, y1, color='green', label="-x*sin(x)")
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=11)
plt.show()
The curve does pass through the given sample points, but it is a mess and looks nothing like the green curve.
As shown above, a model that is faithful to the training data but far from reality is said to be **overfitted**. It is a bit like a math test: you can solve a problem when it appears exactly as in the textbook, but not once it is rearranged a little.
Overfitted models are also said to have poor **generalization performance**: their prediction accuracy on new data is low. To prevent this, we need the concept of **regularization**.
In the example above, the parameters $\boldsymbol{\alpha}$ became extremely large so that the curve could be forced through every sample point, which is what produced the wildly oscillating curve. This apparently tends to happen when the number of data points is very small or the number of variables is large.
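As a quick check, you can look at how large the coefficients of the unregularized model actually are (the exact numbers depend on the sample):
# Inspect the magnitude of the coefficients of the unregularized model
print("max |alpha| :", np.abs(model.alpha).max())
print("mean |alpha|:", np.abs(model.alpha).mean())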
To prevent this, we can restrict the range of $\alpha$ when computing it by adding a **regularization term** to the squared error of the kernel regression. L1 regularization (Lasso) and L2 regularization (Ridge) are the best known in machine learning, and a mix of the two is called Elastic Net regularization.
Let's start with L2 rather than L1, since its solution is easier to derive.
To the squared error from earlier we add a penalty term and minimize
L=({\bf y}- K {\bf \alpha})^{\mathrm{T}}({\bf y}- K {\bf \alpha})+\lambda||{\bf \alpha}||_2^2
The larger $\lambda$ is, the more strongly $\boldsymbol{\alpha}$ is constrained and the gentler the resulting curve becomes. Partially differentiating this with respect to $\boldsymbol{\alpha}$, setting it to 0, and solving for $\boldsymbol{\alpha}$ gives
{\bf \alpha} = (K+\lambda I_N)^{-1}{\bf y}
This is an easy change to make: we just add $\lambda$ times the identity matrix to the Gram matrix before computing the inverse.
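As a small aside, explicitly forming the inverse is fine at this scale, but solving the linear system is usually more stable numerically. A tiny standalone helper to illustrate (the function name is just for illustration and is not part of the class below):
def ridge_kernel_weights(K, y, lam):
    # Solve (K + lam*I) alpha = y; same result as inv(K + lam*I) @ y but more stable
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)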
I changed only the part of the KernelRegression class that computes $\alpha$; the rest is almost the same.
class KernelRegression: # L2 regularization
    def __init__(self, beta=1, lam=0.5):
        self.alpha = np.array([])
        self.beta = beta
        self.lam = lam

    def _kernel(self, xi, xj):
        # Gaussian kernel k(xi, xj) = exp(-beta * ||xi - xj||^2)
        return np.exp(- self.beta * np.sum((xi - xj)**2))

    def gram_matrix(self, X):
        # Gram matrix K with K[i, j] = k(X[i], X[j])
        N = X.shape[0]
        K = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                K[i][j] = self._kernel(X[i], X[j])
                K[j][i] = K[i][j]
        return K

    def fit(self, X, Y):
        # alpha = (K + lam * I)^{-1} y
        K = self.gram_matrix(X)
        self.alpha = np.linalg.inv(K + self.lam * np.eye(X.shape[0]))@Y

    def predict(self, X, x):
        # f(x) = sum_i alpha_i * k(X[i], x)
        Y = 0
        for i in range(len(X)):
            Y += self.alpha[i] * self._kernel(X[i], x)
        return Y
Let's approximate the same sample as in the unregularized case. I chose the values of $\beta$ and $\lambda$ by hand.
model = KernelRegression(beta=0.3, lam=0.1)
model.fit(x, y2)
xp = np.linspace(0, 2*np.pi, 100)
yp = np.zeros(len(xp))
for i in range(len(xp)):
yp[i] = model.predict(x, xp[i])
fig, ax = plt.subplots()
plt.ylim(-3,6)
ax.plot(xp, yp, color='purple', label="estimate curve")
ax.scatter(x, y2, color='blue', label="sample(with noise)")
ax.plot(t, y1, color='green', label="-x*sin(x)")
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=11)
plt.show()
This time the curve matches very well. The shape actually changes outside the displayed range, but you can see that a fairly close approximation curve is obtained within this range.
$\beta$ and $\lambda$ are called **hyperparameters**, and in practice you need to tune them, for example by varying them little by little and checking which values give the smallest error.
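As a rough sketch of such tuning, here is a crude grid search over $\beta$ and $\lambda$ using the L2-regularized KernelRegression class above and a simple hold-out split (the grid values and the split are arbitrary choices):
# Crude grid search over beta and lam using a hold-out split of the sample
from itertools import product

rng = np.random.default_rng(0)
idx = rng.permutation(len(x))
train_idx, val_idx = idx[:30], idx[30:]

best = None
for beta, lam in product([0.1, 0.3, 1.0, 3.0], [0.01, 0.1, 1.0]):
    m = KernelRegression(beta=beta, lam=lam)
    m.fit(x[train_idx], y2[train_idx])
    preds = np.array([m.predict(x[train_idx], xv) for xv in x[val_idx]])
    mse = np.mean((preds - y2[val_idx])**2)
    if best is None or mse < best[0]:
        best = (mse, beta, lam)

print("best (MSE, beta, lam):", best)
With so few points, cross-validation would give a more reliable estimate than a single split, but the idea is the same.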
While L2 regularization uses $\lambda ||{\bf \alpha}||^2_{2}$ as the regularization term, L1 regularization uses the L1 norm $\lambda ||{\bf \alpha}||_{1}$ (the sum of absolute values) instead.
Finding the solution is a bit more involved because the L1 regularization term is not differentiable. scikit-learn apparently solves it with a method called coordinate descent, which I have not studied yet, so in this article I will simply rely on scikit-learn.
L1 regularization can shrink a large portion of the parameters $\alpha$ exactly to 0 (a sparse solution), which helps remove unneeded parameters.
So let's implement it with scikit-learn.
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.linear_model import Lasso

# Kernel features: Gram matrix between training points, and between the plot grid and the training points
# (rbf_kernel's default gamma = 1/n_features = 1 here)
kx = rbf_kernel(x.reshape(-1,1), x.reshape(-1,1))
KX = rbf_kernel(xp.reshape(-1,1), x.reshape(-1,1))

# Lasso = linear regression with an L1 penalty on the coefficients
clf = Lasso(alpha=0.01, max_iter=10000)
clf.fit(kx, y2)
yp = clf.predict(KX)
fig, ax = plt.subplots()
plt.ylim(-3,6)
ax.plot(xp, yp, color='purple', label="estimate curve")
ax.scatter(x, y2, color='blue', label="sample(with noise)")
ax.plot(t, y1, color='green', label="-x*sin(x)")
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=11)
plt.show()
print("coef: ", clf.coef_)
coef: [-0. 0. 0. -0. -0.79703436 -0.
-0. -0. 0. -0. -0. 3.35731837
0. -0. -0.77644607 0. -0. 0.
-0. -0.19590705 0. -0. 0. -0.
0. 0. -0. -0. -0. 0.
-0. 0. -0. -0. 0. 0.58052561
0.3688375 0. -0. 1.75380012]
This is also very close to the original function, and you can see that most of the coefficients (coef) are exactly 0.
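To count them instead of eyeballing the printout:
# Count how many of the kernel coefficients Lasso kept non-zero
print("non-zero coefficients:", np.sum(clf.coef_ != 0), "/", len(clf.coef_))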
ElasticNet
Both L1 regularization and L2 regularization add a regularization term to the loss function, but the two can also be combined, choosing how much of each to apply; this combination is called **ElasticNet**. In fact, Lasso and Ridge can be seen as the special cases of the ElasticNet objective in which only the L1 or only the L2 term is used (scikit-learn's Lasso, for example, optimizes the same objective as ElasticNet with the mixing ratio set to pure L1).
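As a sketch, the same kernel features can be passed to scikit-learn's ElasticNet; `l1_ratio` controls the mix between the L1 and L2 terms (the `alpha` and `l1_ratio` values below are arbitrary, not tuned):
from sklearn.linear_model import ElasticNet

# l1_ratio close to 1 behaves like Lasso, close to 0 like Ridge
enet = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000)
enet.fit(kx, y2)
yp_enet = enet.predict(KX)
print("non-zero coefficients:", np.sum(enet.coef_ != 0))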
Any function can be chosen as the basis function in linear regression. Because the Gaussian kernel (and not only the Gaussian kernel) has a high degree of freedom and tends to overfit, adding constraints through L1 or L2 regularization makes it possible to build regression models with better generalization performance.
I have covered the regression series over several posts; I hope it helps in understanding the Regression branch of the scikit-learn cheat sheet.
Since this is a rough summary, comments and corrections are welcome.