In this article, we work through linear regression, Ridge regression, LASSO, and Elastic-Net with their mathematical formulas and implement them with scikit-learn.
It is assumed that you have already learned calculus, linear algebra, and statistics.
Regression predicts the value of one or more target variables $y$ from an input consisting of one or more explanatory variables $x$.
First, consider simple regression with a single explanatory variable $x$ and a single target variable $y$. With parameters $a$ and $b$, the estimate $\hat{y}$ can be expressed by the following linear function.
\hat{y} = a x + b
In a regression problem we adjust $a$ and $b$ to predict $y$, so we want the difference between $\hat{y}$ and $y$ to be as close to zero as possible.
The first thing that comes to mind is the difference itself, but positive and negative differences would cancel when summed, so we use the sum of squared differences instead.
The quantity to be minimized can therefore be expressed as follows, where $n$ is the number of data points.
L = \sum^n_{i=1} \{y_i - (ax_i + b) \}^2
Since we want the minimum of $L$, we differentiate it with respect to $a$ and $b$.
\begin{align}
\frac{\partial L}{\partial a} &= \frac{\partial}{\partial a} \sum^n_{i=1} (y_i - ax_i - b)^2 \\
&= \sum^n_{i=1} \{ 2 \cdot (-x_i) \cdot (y_i - ax_i - b) \} \\
&= 2 \left\{ a \sum^n_{i=1} x^2_i + b \sum^n_{i=1} x_i - \sum^n_{i=1} x_iy_i \right\}\\
\frac{\partial L}{\partial b} &= \frac{\partial}{\partial b} \sum^n_{i=1} (y_i - ax_i - b)^2 \\
&= \sum^n_{i=1} \{ 2 \cdot (-1) \cdot (y_i - ax_i - b) \} \\
&= 2 \left\{ a \sum^n_{i=1} x_i + nb - \sum^n_{i=1} y_i \right\} \\
\end{align}
Setting both derivatives equal to zero gives the following simultaneous equations.
a \sum^n_{i=1} x^2_i + b \sum^n_{i=1} x_i - \sum^n_{i=1} x_iy_i = 0 \\
a \sum^n_{i=1} x_i + nb - \sum^n_{i=1} y_i = 0
Let us solve this system. Solving the second equation for $b$ gives
b = \frac{1}{n} \sum^n_{i=1}y_i - a \frac{1}{n} \sum^n_{i=1} x_i
Here, defining
\frac{1}{n} \sum^n_{i=1}y_i = \bar{y}, \frac{1}{n} \sum^n_{i=1} x_i = \bar{x}
we obtain
b = \bar{y} - a \bar{x}
Next, dividing the first equation by $n$ and substituting this expression for $b$ gives
a = \frac{ \frac{1}{n} \sum^n_{i=1} x_iy_i - \bar{x} \bar{y} }{ \frac{1}{n} \sum^n_{i=1} x^2_i - \bar{x}^2 }
Here, the denominator and numerator can be transformed as follows.
\begin{align}
\frac{1}{n} \sum^n_{i=1} x^2_i - \bar{x}^2 &= \frac{1}{n} \sum^n_{i=1} x^2_i - 2\bar{x}^2 + \bar{x}^2 \\
&= \frac{1}{n} \sum^n_{i=1} x^2_i - 2 \bar{x} \frac{1}{n} \sum^n_{i=1} x_i + \bar{x}^2 \\
&= \frac{1}{n} \left\{ \sum^n_{i=1} x^2_i - 2 \bar{x} \sum^n_{i=1} x_i + n \bar{x}^2 \right\} \\
&= \frac{1}{n} \sum^n_{i=1} \left( x^2_i - 2 \bar{x}x_i + \bar{x}^2 \right) \\
&= \frac{1}{n} \sum^n_{i=1} \left( x_i - \bar{x} \right)^2 \\
\frac{1}{n} \sum^n_{i=1} x_iy_i - \bar{x}\bar{y} &= \frac{1}{n} \sum^n_{i=1} x_iy_i - 2\bar{x}\bar{y} + \bar{x}\bar{y} \\
&= \frac{1}{n} \left( \sum^n_{i=1}x_iy_i - 2n \bar{x}\bar{y} + n \bar{x}\bar{y} \right) \\
&= \frac{1}{n} \left( \sum^n_{i=1} x_iy_i - \bar{y} \sum^n_{i=1} x_i - \bar{x} \sum^n_{i=1} y_i + n\bar{x}\bar{y} \right) \\
&= \frac{1}{n} \sum^n_{i=1} \left( x_iy_i - \bar{y}x_i - \bar{x}y_i + \bar{x}\bar{y} \right) \\
&= \frac{1}{n} \sum^n_{i=1} \left\{ (x_i - \bar{x} )( y_i - \bar{y} ) \right\} \\
\end{align}
Therefore,
a = \frac{\frac{1}{n} \sum^n_{i=1} \left\{ (x_i - \bar{x} )( y_i - \bar{y} ) \right\}}{\frac{1}{n} \sum^n_{i=1} \left( x_i - \bar{x} \right)^2} = \frac{Cov[x, y]}{\mathbb{V}[x]}
Here $Cov[x, y]$ denotes the covariance and $\mathbb{V}[x]$ the variance. Thus the simple-regression line is given by the following formula.
y = \frac{Cov[x, y]}{\mathbb{V}[x]} x + \left( \bar{y} - \frac{Cov[x, y]}{\mathbb{V}[x]} \bar{x} \right)
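As a quick check of this result, here is a minimal sketch (not the author's published script) that estimates $a$ and $b$ from the covariance/variance formula on synthetic data and compares them with scikit-learn's LinearRegression.

```python
# Minimal sketch: simple regression via a = Cov[x, y] / V[x], b = y_bar - a * x_bar,
# checked against scikit-learn's LinearRegression on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=100)  # y = 3x + 2 + noise

a = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # population covariance / population variance
b = y.mean() - a * x.mean()
print(a, b)  # close to 3 and 2

model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.coef_[0], model.intercept_)  # should match a and b
```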
Next, consider multiple regression with two or more explanatory variables. With explanatory variables $x = (x_1, x_2, ..., x_m)$, target variable $y$, and parameters $w = (w_0, w_1, ..., w_m)$, the estimate $\hat{y}$ can be expressed as follows.
\hat{y}(w, x) = w_0 + w_1x_1 + \cdots + w_mx_m
$w_0$ plays the role of the intercept $b$ in simple regression, and by introducing a pseudo-explanatory variable $x_0 = 1$ corresponding to $w_0$, the above equation can be written in vector form.
\hat{y}(w, x) = w_0x_0 + w_1x_1 + \cdots + w_mx_m = {\bf x}^T {\bf w} \\
{\bf x}^T = (1, x_1, ..., x_m),
{\bf w} =
\begin{pmatrix}
w_0 \\
w_1 \\
\vdots \\
w_m
\end{pmatrix}
Since there are $n$ data points $x$, we collect them into a matrix ${\bf X}$ whose rows correspond to data points and whose columns correspond to explanatory variables.
\hat{y} = {\bf X} {\bf w} \\
{\bf X} =
\begin{pmatrix}
1 & x_{11} & \cdots & x_{1m} \\
\vdots & & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{nm}
\end{pmatrix}
As in simple regression, we find the parameter $w$ that minimizes the squared error (least squares). Bold notation is dropped below for simplicity.
\min_w || Xw - y ||^2
Expanding this expression gives
\begin{align}
L &= || Xw - y ||^2 \\
&= (Xw - y)^T(Xw - y) \\
&= (w^TX^T - y^T)(Xw - y) \\
&= w^TX^TXw - w^TX^Ty - y^TXw + y^Ty \\
&= w^TX^TXw - 2w^TX^Ty + y^Ty
\end{align}
Differentiating with respect to $w$ and setting the result to zero gives
\frac{\partial L}{\partial w} = 2X^TXw - 2X^Ty = 0 \\
w = (X^TX)^{-1}X^Ty
Thus the parameter $w$ can be computed directly from the data. However, if $X^TX$ is not invertible, the solution cannot be determined.
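As an illustration, the normal equation can be solved directly with NumPy. This is a sketch on synthetic data (the variable names and shapes here are my own, not from the article's script); `np.linalg.solve` is used rather than forming the inverse explicitly, and it raises an error when $X^TX$ is singular.

```python
# Minimal sketch: multiple regression via the normal equation w = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])  # first column is the pseudo-variable x0 = 1
true_w = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=n)

# Solve (X^T X) w = X^T y; fails if X^T X has no inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # close to true_w
```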
Ridge regression is linear regression with an L2 penalty term on the parameter $w$. The formula to be solved is as follows.
\min_w ||Xw - y||^2 + \alpha||w||^2_2
Differentiating this with respect to $w$ and setting it to zero gives
\frac{\partial L}{\partial w} = 2X^TXw - 2X^Ty + 2 \alpha w = 0 \\
w = (X^TX + \alpha I)^{-1} X^T y
Here $I$ denotes the identity matrix.
In linear regression, no solution could be found when $X^TX$ had no inverse, but in Ridge regression adding the L2 term guarantees that the matrix is invertible.
A matrix that has an inverse is called regular (non-singular), and the technique is called regularization because adding the L2 term makes the matrix always regular.
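The closed-form Ridge solution can be written down in the same way. The sketch below (my own example, not the published script) leaves the intercept entry of $\alpha I$ at 0 so the intercept is not regularized, and compares the result with scikit-learn's Ridge, which also leaves the intercept unpenalized.

```python
# Minimal sketch: Ridge closed form w = (X^T X + alpha I)^{-1} X^T y.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, m = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])  # column of ones for the intercept
y = X @ np.array([2.0, 1.0, -3.0, 0.5]) + rng.normal(0, 0.1, size=n)

alpha = 1.0
penalty = alpha * np.eye(X.shape[1])
penalty[0, 0] = 0.0  # do not regularize the intercept (pseudo-variable) weight
w = np.linalg.solve(X.T @ X + penalty, X.T @ y)
print(w)

model = Ridge(alpha=alpha).fit(X[:, 1:], y)  # scikit-learn adds its own intercept
print(model.intercept_, model.coef_)         # should closely match w
```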
Adding an L2 term on the parameter $w$ is also known as weight decay. To see this effect, let's work through a concrete example. Since the L2 term affects only the $(X^TX + \alpha I)^{-1}$ part, consider the following matrix of explanatory variables for three data points; the first column is the pseudo-explanatory variable corresponding to the intercept.
X =
\begin{pmatrix}
1 & 2 & 3 \\
1 & 4 & 5 \\
1 & 3 & 2 \\
\end{pmatrix}
Then $X^TX$ and its inverse are as follows.
X^TX =
\begin{pmatrix}
3 & 9 & 10 \\
9 & 29 & 32 \\
10 & 32 & 38 \\
\end{pmatrix}
,
(X^TX)^{-1} =
\begin{pmatrix}
4.875 & -1.375 & -0.125 \\
-1.375 & 0.875 & -0.375 \\
-0.125 & -0.375 & 0.375
\end{pmatrix}
Now take the L2 coefficient $\alpha = 1$, with the first diagonal entry set to 0 because the first column is the pseudo-explanatory variable for the intercept, which is not regularized.
\alpha I =
\begin{pmatrix}
0 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
\end{pmatrix}
Adding this to $X^TX$ gives
X^TX + \alpha I =
\begin{pmatrix}
3 & 9 & 10 \\
9 & 30 & 32 \\
10 & 32 & 39 \\
\end{pmatrix}
,
(X^TX + \alpha I)^{-1} =
\begin{pmatrix}
3.744 & -0.795 & -0.308 \\
-0.795 & 0.436 & -0.154 \\
-0.308 & -0.154 & 0.231
\end{pmatrix}
The entries of the inverse are indeed smaller, which shows the weight-decay effect.
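This worked example is easy to verify with NumPy (a few lines of my own, just to reproduce the matrices above):

```python
# Verify the worked example: X^T X, its inverse, and the ridge-modified inverse.
import numpy as np

X = np.array([[1, 2, 3],
              [1, 4, 5],
              [1, 3, 2]], dtype=float)
A = X.T @ X
print(A)                 # [[ 3  9 10] [ 9 29 32] [10 32 38]]
print(np.linalg.inv(A))  # matches the inverse shown above

alpha_I = np.diag([0.0, 1.0, 1.0])  # alpha = 1, intercept entry left at 0
print(np.linalg.inv(A + alpha_I))   # entries shrink: the weight-decay effect
```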
Linear regression with an L1 penalty term on the parameter $w$ is called LASSO (Least Absolute Shrinkage and Selection Operator) [1]. The formula to be solved is as follows.
\min_w \frac{1}{2}|| Xw - y ||^2 + \alpha ||w||_1
Since the formula above contains an absolute value, the cases must be treated separately. Let us first consider $w_0$.
Regularization is not applied to the intercept $w_0$, so the relevant loss is
L_0 = \frac{1}{2} \sum^n_{i=1} \left( w_0 + \sum^m_{j=1} x_{ij} w_j - y_i \right)^2
As before, differentiating $L_0$ with respect to $w_0$ and setting it to zero gives
\frac{\partial L_0}{\partial w_0} = \sum^n_{i=1} \left( w_0 + \sum^m_{j=1} x_{ij} w_j - y_i \right) = 0 \\
w_0 = \frac{1}{n} \sum^n_{i=1} \left( y_i - \sum^m_{j=1} x_{ij} w_j \right)
Next, consider $w_1, ..., w_m$. Because the L1 term contains an absolute value, we take the right and left derivatives with respect to $w_k$ separately.
\frac{\partial L}{\partial w^+_k} = \sum^n_{i=1} \left( w_0 + \sum^m_{j=1} x_{ij} w_j - y_i \right)x_{ik} + \alpha \\
\frac{\partial L}{\partial w^-_k} = \sum^n_{i=1} \left( w_0 + \sum^m_{j=1} x_{ij} w_j - y_i \right)x_{ik} - \alpha
Setting each of these to zero and solving as before gives
w_k = \left\{
\begin{array}{ll}
\frac{\sum^n_{i=1} \left( y_i - w_0 - \sum^m_{j=1, (j \neq k)} x_{ij} w_j \right)x_{ik} - \alpha}{\sum^n_{i=1} x^2_{ik}} & (w_k > 0) \\
\frac{\sum^n_{i=1} \left( y_i - w_0 - \sum^m_{j=1, (j \neq k)} x_{ij} w_j \right)x_{ik} + \alpha}{\sum^n_{i=1} x^2_{ik}} & (w_k < 0)
\end{array}
\right.
However, each case in the formula above holds only under a condition; when neither condition is satisfied, $w_k$ is set to 0. Combining the cases gives
w_k = \left\{
\begin{array}{ll}
\frac{\sum^n_{i=1} \left( y_i - w_0 - \sum^m_{j=1, (j \neq k)} x_{ij} w_j \right)x_{ik} - \alpha}{\sum^n_{i=1} x^2_{ik}} & \left(\sum^n_{i=1} \left( y_i - w_0 - \sum^m_{j=1, (j \neq k)} x_{ij} w_j \right)x_{ik} > \alpha \right) \\
0 & \left( -\alpha \leq \sum^n_{i=1} \left( y_i - w_0 - \sum^m_{j=1, (j \neq k)} x_{ij} w_j \right)x_{ik} \leq \alpha \right) \\
\frac{\sum^n_{i=1} \left( y_i - w_0 - \sum^m_{j=1, (j \neq k)} x_{ij} w_j \right)x_{ik} + \alpha}{\sum^n_{i=1} x^2_{ik}} & \left( \sum^n_{i=1} \left( y_i - w_0 - \sum^m_{j=1, (j \neq k)} x_{ij} w_j \right)x_{ik} < - \alpha \right)
\end{array}
\right.
This can also be written using a soft-thresholding function:
S(z, \alpha) = \left\{
\begin{array}{ll}
z - \alpha & (z > \alpha) \\
0 & (-\alpha \leq z \leq \alpha) \\
z + \alpha & (z < -\alpha)
\end{array}
\right. \\
z = \sum^n_{i=1} \left( y_i - w_0 - \sum^m_{j=1, (j \neq k)} x_{ij} w_j \right)x_{ik} \\
LASSO in scikit-learn is implemented with the coordinate descent method. In coordinate descent, each parameter $w_k$ is updated in turn: $w_k$ is temporarily set to 0, the residual is computed, and the soft-thresholding function is applied; sweeping over all explanatory variables repeatedly yields an approximate solution.
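The update rules above translate almost directly into code. The following is a minimal coordinate-descent sketch of LASSO that I wrote to illustrate the algorithm; it is not scikit-learn's implementation (note also that scikit-learn's Lasso divides the squared-error term by $2n$, so its alpha is scaled differently).

```python
# Minimal coordinate-descent LASSO following the update rules above.
# X does not include the column of ones: the intercept w0 is handled separately.
import numpy as np

def soft_threshold(z, alpha):
    """S(z, alpha) as defined above."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    n, m = X.shape
    w = np.zeros(m)
    w0 = 0.0
    for _ in range(n_iter):
        w0 = np.mean(y - X @ w)          # intercept: w0 = (1/n) sum_i (y_i - sum_j x_ij w_j)
        for k in range(m):
            w[k] = 0.0                   # temporarily set w_k to 0
            r = y - w0 - X @ w           # residual without the contribution of w_k
            z = r @ X[:, k]              # z as defined above
            w[k] = soft_threshold(z, alpha) / (X[:, k] @ X[:, k])
    return w0, w

# Usage on small synthetic data: many coefficients become exactly 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ np.array([3, -2, 0, 0, 1, 0, 0, 0, 0, 0]) + rng.normal(0, 0.1, 50)
print(lasso_coordinate_descent(X, y, alpha=5.0))
```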
LASSO yields a sparse representation and performs feature selection by driving many of the parameters to 0, but note that when the number of data points $n$ is smaller than the number of explanatory variables $m$ ($n < m$), at most $n$ features can be selected.
Also, when explanatory variables are highly correlated, only one of them tends to be selected, so it is advisable to decorrelate the explanatory variables and to standardize the input features to mean 0 and variance 1, as sketched below.
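One way to standardize the inputs before fitting, using scikit-learn's StandardScaler in a pipeline (a sketch with synthetic data; the alpha value here is arbitrary):

```python
# Standardize features to mean 0, variance 1 before LASSO.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * np.array([1.0, 10.0, 100.0, 1.0, 5.0])  # very different scales
y = X[:, 0] - 0.05 * X[:, 2] + rng.normal(0, 0.1, size=50)

model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)
print(model.named_steps["lasso"].coef_)  # coefficients on the standardized scale
```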
Linear regression with both L1 and L2 penalty terms on the parameters is called Elastic-Net [2]. The formula to be solved is as follows.
\min_w \frac{1}{2}||Xw - y||^2 + \alpha \rho ||w||_1 + \frac{\alpha(1 - \rho)}{2} ||w||^2_2
Elastic-Net improves on the shortcomings of LASSO by adding an L2 term as well as an L1 term.
Elastic-Net in scikit-learn is implemented by the coordinate descent method like LASSO.
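A minimal usage sketch of scikit-learn's ElasticNet on synthetic data (in scikit-learn, `l1_ratio` plays the role of $\rho$ above, and the squared-error term is additionally divided by $2n$, so the alpha scale differs from the formula):

```python
# Elastic-Net with scikit-learn: alpha is the overall strength, l1_ratio = rho.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ np.array([3, -2, 0, 0, 1, 0, 0, 0, 0, 0]) + rng.normal(0, 0.1, size=100)

model = ElasticNet(alpha=0.1, l1_ratio=0.5)  # equal mix of L1 and L2
model.fit(X, y)
print(model.coef_)       # small/irrelevant coefficients are shrunk toward or to 0
print(model.intercept_)
```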
The figure below illustrates the case of two explanatory variables, with the contours of the squared error drawn as orange ellipses.
Under the L1 constraint the solution lies at a corner where $w_1 = 0$; in this way many parameters become exactly 0 and a sparse solution is obtained.
scikit-learn provides the diabetes dataset for regression problems, so we will use it.
| Number of data | Number of features | Target range |
| --- | --- | --- |
| 442 | 10 | [25, 346] |
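The dataset can be loaded as follows (the train/test split ratio and random_state below are my own choices, not necessarily those of the published script):

```python
# Load the diabetes dataset and split it into training and test sets.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
print(X.shape, y.min(), y.max())  # (442, 10) 25.0 346.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```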
Quantitative evaluation metrics for regression models include the Mean Squared Error and the R2 Score.
Mean Squared Error (MSE) is the mean of the squared errors between the test targets $y$ and the predictions $\hat{y}$ made with parameters estimated from the training data.
MSE = \frac{1}{n} \sum^n_{i=1} (y_i - \hat{y}_i)^2
The smaller the MSE, the more accurate the model.
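For reference, MSE computed from the formula agrees with scikit-learn's mean_squared_error (the values of y_true and y_pred here are just illustrative numbers):

```python
# MSE by hand vs. sklearn.metrics.mean_squared_error.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse_manual = np.mean((y_true - y_pred) ** 2)
print(mse_manual, mean_squared_error(y_true, y_pred))  # both 0.375
```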
The R2 Score (coefficient of determination) indicates how well the explanatory variables explain the target variable. With $\hat{y}$ the values predicted using parameters estimated from the training data and $y$ the test targets, the R2 Score is calculated from the following formula.
R^2 = 1 - \frac{SS_{residual}}{SS_{total}} \\
SS_{residual} = \sum^n_{i=1} (y_i - \hat{y}_i)^2,
SS_{total} = \sum^n_{i=1} (y_i - \bar{y})^2
The R2 Score takes values up to 1 (it can be negative for a model that performs worse than simply predicting the mean), and the closer it is to 1, the better the model.
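Likewise, the R2 Score computed from the formula agrees with scikit-learn's r2_score (same illustrative values as the MSE example):

```python
# R2 Score by hand vs. sklearn.metrics.r2_score.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_residual = np.sum((y_true - y_pred) ** 2)
ss_total = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_residual / ss_total, r2_score(y_true, y_pred))  # both about 0.949
```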
- CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
- OS: Windows 10 Pro 1909
- Python 3.6.6
- scikit-learn 0.23.0
The implemented program is published on GitHub.
linear_regression.py
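For reference, the overall flow of such a script might look like the sketch below; the hyperparameters and split are my own assumptions, and the results shown next were produced by the published linear_regression.py, not by this sketch.

```python
# Sketch of the overall experiment: fit four models on the diabetes data,
# then report MSE and R2 on a held-out test set.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "LASSO": Lasso(alpha=0.1),
    "Elastic-Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name)
    print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")
    print(f"R2 score: {r2_score(y_test, y_pred):.2f}")
```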
The execution result is as follows.
Linear Regression
Mean Squared Error: 3424.32
R2 score: 0.33
Ridge Regression
Mean Squared Error: 3427.67
R2 score: 0.33
LASSO
Mean Squared Error: 3434.59
R2 score: 0.33
Elastic-Net
Mean Squared Error: 3295.57
R2 score: 0.36
Among these, Elastic-Net appears to be the most accurate model.
- 1.1.1. Ordinary Least Squares
- 1.1.2. Ridge Regression and Classification
- 1.1.3. Lasso
- 1.1.5. Elastic-Net