We compared the effects of L1 regularization and LeakyReLU on error backpropagation. L1 regularization is usually explained as a technique for dimensionality reduction, stripping unnecessary explanatory variables out of a machine learning model. LeakyReLU, on the other hand, is an activation function that keeps learning from stalling even in deep networks by giving the output a slight slope when x is negative. I am not entirely sure myself why I ended up comparing two things that look completely unrelated.
```python
from keras.layers import Input, Dense, Activation
from keras.models import Model

x1 = Input(shape=(124,))
x2 = Dense(100)(x1)
x3 = Activation('relu')(x2)
y = Dense(10)(x3)
model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```
\begin{align}
x_1 &= input\\
x_2 &= Ax_1 + B\\
x_3 &= relu(x_2)\\
y &= Cx_3+D\\
L_{mse} &= \frac{1}{2}(t-y)^2\\
\end{align}
Assume that the fully connected layers, the activation function, and the loss function are computed as above. Each partial derivative is then
\begin{align}
\frac{\partial L_{mse}}{\partial y} &= -(t - y)\\
\frac{\partial y}{\partial C} &= x_3\\
\frac{\partial y}{\partial D} &= 1\\
\frac{\partial y}{\partial x_3} &= C\\
\frac{\partial x_3}{\partial x_2} &= H(x_2)\\
H(x)&=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.\\
\frac{\partial x_2}{\partial A} &= x_1\\
\frac{\partial x_2}{\partial B} &= 1\\
\end{align}
The gradients used to update the model weights $A, B, C, D$ follow from the chain rule:
\begin{align}
\frac{\partial L_{mse}}{\partial A} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial x_3}\frac{\partial x_3}{\partial x_2}\frac{\partial x_2}{\partial A}\\
&=-(t - y)\cdot C \cdot H(x_2) \cdot x_1\\
\frac{\partial L_{mse}}{\partial B} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial x_3}\frac{\partial x_3}{\partial x_2}\frac{\partial x_2}{\partial B}\\
&=-(t - y)\cdot C \cdot H(x_2) \\
\frac{\partial L_{mse}}{\partial C} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial C}\\
&=-(t - y)\cdot x_3 \\
\frac{\partial L_{mse}}{\partial D} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial D}\\
&=-(t - y)\\
\end{align}
These are the ordinary backpropagation gradients for this model.
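As a quick sanity check, the following is a minimal NumPy sketch of these formulas (my own illustration, not part of the original article; the dimensions and random data are arbitrary):

```python
import numpy as np

# Arbitrary toy dimensions and data, matching the shapes of the Keras model above
rng = np.random.default_rng(0)
x1 = rng.normal(size=124)                # input
t = rng.normal(size=10)                  # target
A = rng.normal(size=(100, 124)) * 0.1
B = np.zeros(100)
C = rng.normal(size=(10, 100)) * 0.1
D = np.zeros(10)

# Forward pass: x2 = A x1 + B, x3 = relu(x2), y = C x3 + D
x2 = A @ x1 + B
H = (x2 >= 0).astype(float)              # step function H(x2)
x3 = H * x2                              # relu(x2)
y = C @ x3 + D

# Backward pass, following the chain-rule expressions above
dL_dy = -(t - y)                         # dL/dy
dL_dD = dL_dy                            # -(t - y)
dL_dC = np.outer(dL_dy, x3)              # -(t - y) * x3
dL_dx2 = (C.T @ dL_dy) * H               # -(t - y) * C * H(x2)
dL_dB = dL_dx2                           # -(t - y) * C * H(x2)
dL_dA = np.outer(dL_dx2, x1)             # -(t - y) * C * H(x2) * x1
```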
Keras provides three kinds of regularizers (https://keras.io/ja/regularizers/):

- kernel_regularizer
- bias_regularizer
- activity_regularizer
In terms of the first fully connected layer $x_2 = Ax_1 + B$ above, kernel_regularizer regularizes the weight $A$, bias_regularizer regularizes the bias $B$, and activity_regularizer regularizes the output $x_2$.
L1 regularization adds the absolute value of the regularized quantity, multiplied by a small coefficient $\lambda$, to the loss function. For the activity_regularizer case of L1 regularization,
\begin{align}
L_{L1} &= \lambda |x_2| \\
\frac{\partial |x|}{\partial x} &= \left\{
\begin{array}{ll}
1 & (x \geq 0) \\
-1 & (x \lt 0)
\end{array}
\right.\\
H(x)&=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.\\
\frac{\partial L_{L1}}{\partial x_2} &= \lambda(2H(x_2)-1)
\end{align}
Therefore, the change in the updates of the model weights $A, B, C, D$ caused by the activity_regularizer of L1 regularization is
\begin{align}
\frac{\partial L_{L1}}{\partial A} &= \frac{\partial L_{L1}}{\partial x_2}\frac{\partial x_2}{\partial A}\\
&=\lambda(2H(x_2)-1) \cdot x_1\\
\frac{\partial L_{L1}}{\partial B} &= \frac{\partial L_{L1}}{\partial x_2}\frac{\partial x_2}{\partial B}\\
&=\lambda(2H(x_2)-1) \\
\frac{\partial L_{L1}}{\partial C} &= 0\\
\frac{\partial L_{L1}}{\partial D} &= 0\\
\end{align}
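In NumPy terms, the extra gradient contributed by the L1 activity_regularizer would look roughly like this (again my own sketch under the same toy setup as before):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=124)
A = rng.normal(size=(100, 124)) * 0.1
B = np.zeros(100)
x2 = A @ x1 + B
H = (x2 >= 0).astype(float)              # step function H(x2)

lam = 0.01                               # the small coefficient lambda
# L_L1 = lambda * |x2|  =>  dL_L1/dx2 = lambda * (2 H(x2) - 1)
dL1_dx2 = lam * (2 * H - 1)
dL1_dB = dL1_dx2                         # lambda (2 H(x2) - 1)
dL1_dA = np.outer(dL1_dx2, x1)           # lambda (2 H(x2) - 1) * x1
# C and D receive no contribution from this regularization term
```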
Incidentally, the usual explanation that L1 regularization is used for dimensionality reduction refers to kernel_regularizer rather than the activity_regularizer discussed here.
```python
from keras.layers import Input, Dense, Activation
from keras.models import Model
from keras import regularizers

x1 = Input(shape=(124,))
x2 = Dense(100, activity_regularizer=regularizers.l1(0.01))(x1)
x3 = Activation('relu')(x2)
y = Dense(10)(x3)
model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```
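For comparison, the kernel_regularizer form mentioned above, the one usually associated with dimensionality reduction, would be written as follows (a reference sketch, not part of the derivation in this article):

```python
from keras.layers import Input, Dense, Activation
from keras.models import Model
from keras import regularizers

x1 = Input(shape=(124,))
# L1 penalty on the weight matrix A itself, which drives individual weights
# to zero and effectively prunes unnecessary input features
x2 = Dense(100, kernel_regularizer=regularizers.l1(0.01))(x1)
x3 = Activation('relu')(x2)
y = Dense(10)(x3)
model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```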
LeakyReLU is the following activation function, where generally $\alpha \ll 1$:
\begin{align}
LeakyRelu(x)&=\left\{
\begin{array}{ll}
x & (x \geq 0) \\
\alpha x & (x \lt 0)
\end{array}
\right.\\
\frac{\partial LeakyRelu(x)}{\partial x}
&=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
\alpha & (x \lt 0)
\end{array}
\right.
\end{align}
Using the step function
\begin{align}
H(x)=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.
\end{align}
the gradient of LeakyReLU and its difference from the ordinary ReLU gradient can be written as
\begin{align}
\frac{\partial LeakyRelu(x)}{\partial x} &=(1-\alpha )H(x) + \alpha \\
\frac{\partial LeakyRelu(x)}{\partial x}-\frac{\partial Relu(x)}{\partial x}
&=((1-\alpha )H(x) + \alpha )-H(x)\\
&=-\alpha (H(x) -1)\\
\end{align}
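Both identities are easy to check numerically; a minimal sketch (my own check, with an arbitrary $\alpha$):

```python
import numpy as np

alpha = 0.01
x = np.linspace(-3, 3, 101)
H = (x >= 0).astype(float)

leaky_grad = np.where(x >= 0, 1.0, alpha)    # d LeakyRelu(x) / dx
relu_grad = H                                # d Relu(x) / dx
assert np.allclose(leaky_grad, (1 - alpha) * H + alpha)
assert np.allclose(leaky_grad - relu_grad, -alpha * (H - 1))
```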
Therefore, the change in the updates of the model weights $A, B, C, D$ due to ReLU => LeakyReLU is
\begin{align}
\frac{\partial L_{mse}}{\partial A} &=(t - y)\cdot C \cdot \alpha (H(x_2)-1) \cdot x_1\\
\frac{\partial L_{mse}}{\partial B} &=(t - y)\cdot C \cdot \alpha (H(x_2)-1) \\
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}
Rewriting this slightly, it can also be expressed as follows. Here the second term (in square brackets) is simply the ordinary backpropagation gradient multiplied by $\alpha$, so it is ignored for now.
\begin{align}
\frac{\partial L_{mse}}{\partial A} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1) \cdot x_1-[\alpha (t - y)\cdot C \cdot H(x_2) \cdot x_1]\\
\frac{\partial L_{mse}}{\partial B} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1)-[\alpha (t - y)\cdot C \cdot H(x_2)] \\
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}
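The split above can be checked in the same way; this small sketch (my own check) confirms that the two terms add back up to the original change $\alpha(H(x_2)-1)$:

```python
import numpy as np

alpha = 0.01
x2 = np.linspace(-2, 2, 9)                   # arbitrary pre-activation values
H = (x2 >= 0).astype(float)

change = alpha * (H - 1)                     # original change from Relu -> LeakyRelu
split = alpha * (2 * H - 1) - alpha * H      # first term minus the bracketed second term
assert np.allclose(change, split)
```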
```python
from keras.layers import Input, Dense, LeakyReLU
from keras.models import Model

x1 = Input(shape=(124,))
x2 = Dense(100)(x1)
x3 = LeakyReLU(alpha=0.01)(x2)
y = Dense(10)(x3)
model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```
Let us compare the two changes relative to ordinary backpropagation. Here both $\lambda \ll 1$ and $\alpha \ll 1$.
- Change due to the activity_regularizer of L1 regularization
\begin{align}
\frac{\partial L_{L1}}{\partial A} &=\lambda(2H(x_2)-1) \cdot x_1\\
\frac{\partial L_{L1}}{\partial B} &=\lambda(2H(x_2)-1) \\
\frac{\partial L_{L1}}{\partial C} &= 0\\
\frac{\partial L_{L1}}{\partial D} &= 0\\
\end{align}
- Change due to ReLU => LeakyReLU
\begin{align}
\frac{\partial L_{mse}}{\partial A} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1) \cdot x_1\\
\frac{\partial L_{mse}}{\partial B} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1) \\
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}
Here, in the early stage of training, where $(t - y)\cdot C$ cannot be regarded as zero, the change caused by ReLU => LeakyReLU can be regarded as close to the change caused by the activity_regularizer of L1 regularization. Conversely, once training has progressed far enough that $(t - y)\cdot C$ approaches zero, can it be viewed as that L1 activity_regularizer effect gradually weakening? That was my thought. Note, however, that the change due to the L1 activity_regularizer affects only the layer to which the regularization is applied, whereas the change due to ReLU => LeakyReLU also propagates to the layers before the activation function.
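To get a feel for the magnitudes, here is a rough scalar sketch (my own illustration with made-up numbers; err_times_C stands for $(t - y)\cdot C$ for a single unit):

```python
alpha, lam = 0.01, 0.01
H = 1.0                                      # assume x2 >= 0 for this unit

l1_change = lam * (2 * H - 1)                # lambda(2H - 1): constant throughout training
for err_times_C in (5.0, 0.01):              # early training (large error) vs late (small error)
    leaky_change = err_times_C * alpha * (2 * H - 1)
    print(f"err*C={err_times_C}: L1 change={l1_change}, LeakyReLU change={leaky_change}")
```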
Also, looking only at the second term of the ReLU => LeakyReLU change, it is simply the ordinary backpropagation gradient multiplied by $\alpha$.
- Second term of the ReLU => LeakyReLU change
\begin{align}
\frac{\partial L_{mse}}{\partial A} &=-\alpha (t - y)\cdot C \cdot H(x_2) \cdot x_1\\
\frac{\partial L_{mse}}{\partial B} &=-\alpha (t - y)\cdot C \cdot H(x_2) \\
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}
Added back to the original gradient, this second term means that every gradient flowing upstream through the LeakyReLU activation becomes $(1 + \alpha)$ times larger. This suggests that in a deep model, replacing every ReLU with LeakyReLU can increase the gradient of the shallow layers near the input exponentially, by roughly a factor of $(1 + \alpha)^n$.
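As a rough illustration of that claim (my own sketch, treating each LeakyReLU between a shallow layer and the output as contributing one extra $(1+\alpha)$ factor):

```python
alpha = 0.01
for n in (1, 10, 50, 100):                   # number of LeakyReLU layers above a given layer
    print(n, (1 + alpha) ** n)               # relative gradient magnification vs plain ReLU
```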
To summarize, we compared L1 regularization and LeakyReLU. My interpretation is that the change due to ReLU => LeakyReLU amounts to the change due to the activity_regularizer of L1 regularization, plus a $(1 + \alpha)$-fold scaling of the gradients before the activation function. This interpretation corresponds to viewing LeakyReLU as the following sum of functions.
\begin{align}
LeakyRelu(x)&=\left\{
\begin{array}{ll}
x & (x \geq 0) \\
\alpha x & (x \lt 0)
\end{array}
\right.\\
&=\left\{
\begin{array}{ll}
(1+\alpha )x & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right. - \alpha|x|\\
&=(1+\alpha )Relu(x)- \alpha|x|
\end{align}