We compared the effects of L1 regularization and LeakyReLU on error backpropagation. L1 regularization is usually explained as a technique for dimensionality reduction, stripping unnecessary explanatory variables out of a machine learning model. LeakyReLU, on the other hand, is an activation function that keeps learning from stalling even in deep networks by giving the output a slight slope when x is negative. I am not entirely sure myself why I ended up comparing two things that look completely unrelated.
```python
from keras.layers import Input, Dense, Activation
from keras.models import Model

x1 = Input(shape=(124,))
x2 = Dense(100)(x1)
x3 = Activation('relu')(x2)
y = Dense(10)(x3)
model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```
\begin{align}
x_1 &= input\\
x_2 &= Ax_1 + B\\
x_3 &= relu(x_2)\\
y &= Cx_3+D\\
L_{mse} &= \frac{1}{2}(t-y)^2\\
\end{align}
Assume that the fully connected layers, the activation function, and the loss function are computed as above. Each partial derivative is then
\begin{align}
\frac{\partial L_{mse}}{\partial y} &= -(t - y)\\
\frac{\partial y}{\partial C} &= x_3\\
\frac{\partial y}{\partial D} &= 1\\
\frac{\partial y}{\partial x_3} &= C\\
\frac{\partial x_3}{\partial x_2} &= H(x_2)\\
H(x)&=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.\\
\frac{\partial x_2}{\partial A} &= x_1\\
\frac{\partial x_2}{\partial B} &= 1\\
\end{align}
The gradients used to update the model weights $A, B, C, D$ follow from the chain rule:
\begin{align}
\frac{\partial L_{mse}}{\partial A} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial x_3}\frac{\partial x_3}{\partial x_2}\frac{\partial x_2}{\partial A}\\
&=-(t - y)\cdot C \cdot H(x_2) \cdot x_1\\
\frac{\partial L_{mse}}{\partial B} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial x_3}\frac{\partial x_3}{\partial x_2}\frac{\partial x_2}{\partial B}\\
&=-(t - y)\cdot C \cdot H(x_2) \\
\frac{\partial L_{mse}}{\partial C} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial C}\\
&=-(t - y)\cdot x_3 \\
\frac{\partial L_{mse}}{\partial D} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial D}\\
&=-(t - y)\\
\end{align}
These are the ordinary backpropagation gradients for this model.
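As a quick sanity check, the following is a minimal NumPy sketch of these formulas (my own illustration, not part of the original article; the dimensions and random data are arbitrary):

```python
import numpy as np

# Arbitrary toy dimensions and data, matching the shapes of the Keras model above
rng = np.random.default_rng(0)
x1 = rng.normal(size=124)                # input
t = rng.normal(size=10)                  # target
A = rng.normal(size=(100, 124)) * 0.1
B = np.zeros(100)
C = rng.normal(size=(10, 100)) * 0.1
D = np.zeros(10)

# Forward pass: x2 = A x1 + B, x3 = relu(x2), y = C x3 + D
x2 = A @ x1 + B
H = (x2 >= 0).astype(float)              # step function H(x2)
x3 = H * x2                              # relu(x2)
y = C @ x3 + D

# Backward pass, following the chain-rule expressions above
dL_dy = -(t - y)                         # dL/dy
dL_dD = dL_dy                            # -(t - y)
dL_dC = np.outer(dL_dy, x3)              # -(t - y) * x3
dL_dx2 = (C.T @ dL_dy) * H               # -(t - y) * C * H(x2)
dL_dB = dL_dx2                           # -(t - y) * C * H(x2)
dL_dA = np.outer(dL_dx2, x1)             # -(t - y) * C * H(x2) * x1
```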
Keras provides three kinds of regularizers (https://keras.io/ja/regularizers/):

- kernel_regularizer
- bias_regularizer
- activity_regularizer
In terms of the first fully connected layer $x_2 = Ax_1 + B$ above, kernel_regularizer regularizes the weight $A$, bias_regularizer regularizes the bias $B$, and activity_regularizer regularizes the output $x_2$.
L1 regularization adds the absolute value of the regularized quantity, multiplied by a small coefficient $\lambda$, to the loss function. For the activity_regularizer case of L1 regularization,
\begin{align}
L_{L1} &= \lambda |x_2| \\
\frac{\partial |x|}{\partial x} &= \left\{
\begin{array}{ll}
1 & (x \geq 0) \\
-1 & (x \lt 0)
\end{array}
\right.\\
H(x)&=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.\\
\frac{\partial L_{L1}}{\partial x_2} &= \lambda(2H(x_2)-1)
\end{align}
Therefore, the change in the updates of the model weights $A, B, C, D$ caused by the activity_regularizer of L1 regularization is
\begin{align}
\frac{\partial L_{L1}}{\partial A} &= \frac{\partial L_{L1}}{\partial x_2}\frac{\partial x_2}{\partial A}\\
&=\lambda(2H(x_2)-1) \cdot x_1\\
\frac{\partial L_{L1}}{\partial B} &= \frac{\partial L_{L1}}{\partial x_2}\frac{\partial x_2}{\partial B}\\
&=\lambda(2H(x_2)-1) \\
\frac{\partial L_{L1}}{\partial C} &= 0\\
\frac{\partial L_{L1}}{\partial D} &= 0\\
\end{align}
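In NumPy terms, the extra gradient contributed by the L1 activity_regularizer would look roughly like this (again my own sketch under the same toy setup as before):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=124)
A = rng.normal(size=(100, 124)) * 0.1
B = np.zeros(100)
x2 = A @ x1 + B
H = (x2 >= 0).astype(float)              # step function H(x2)

lam = 0.01                               # the small coefficient lambda
# L_L1 = lambda * |x2|  =>  dL_L1/dx2 = lambda * (2 H(x2) - 1)
dL1_dx2 = lam * (2 * H - 1)
dL1_dB = dL1_dx2                         # lambda (2 H(x2) - 1)
dL1_dA = np.outer(dL1_dx2, x1)           # lambda (2 H(x2) - 1) * x1
# C and D receive no contribution from this regularization term
```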
Incidentally, the usual explanation that L1 regularization is used for dimensionality reduction refers to kernel_regularizer rather than the activity_regularizer discussed here.
```python
from keras.layers import Input, Dense, Activation
from keras.models import Model
from keras import regularizers

x1 = Input(shape=(124,))
x2 = Dense(100, activity_regularizer=regularizers.l1(0.01))(x1)
x3 = Activation('relu')(x2)
y = Dense(10)(x3)
model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```
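For comparison, the kernel_regularizer form mentioned above, the one usually associated with dimensionality reduction, would be written as follows (a reference sketch, not part of the derivation in this article):

```python
from keras.layers import Input, Dense, Activation
from keras.models import Model
from keras import regularizers

x1 = Input(shape=(124,))
# L1 penalty on the weight matrix A itself, which drives individual weights
# to zero and effectively prunes unnecessary input features
x2 = Dense(100, kernel_regularizer=regularizers.l1(0.01))(x1)
x3 = Activation('relu')(x2)
y = Dense(10)(x3)
model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```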
LeakyReLU is the following activation function, where generally $\alpha \ll 1$:
\begin{align}
LeakyRelu(x)&=\left\{
\begin{array}{ll}
x & (x \geq 0) \\
\alpha x & (x \lt 0)
\end{array}
\right.\\
\frac{\partial LeakyRelu(x)}{\partial x}
&=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
\alpha & (x \lt 0)
\end{array}
\right.
\end{align}
Using the step function
\begin{align}
H(x)=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.
\end{align}
the gradient of LeakyReLU and its difference from the ordinary ReLU gradient can be written as
\begin{align}
\frac{\partial LeakyRelu(x)}{\partial x} &=(1-\alpha )H(x) + \alpha \\
\frac{\partial LeakyRelu(x)}{\partial x}-\frac{\partial Relu(x)}{\partial x}
&=((1-\alpha )H(x) + \alpha )-H(x)\\
&=-\alpha (H(x) -1)\\
\end{align}
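Both identities are easy to check numerically; a minimal sketch (my own check, with an arbitrary $\alpha$):

```python
import numpy as np

alpha = 0.01
x = np.linspace(-3, 3, 101)
H = (x >= 0).astype(float)

leaky_grad = np.where(x >= 0, 1.0, alpha)    # d LeakyRelu(x) / dx
relu_grad = H                                # d Relu(x) / dx
assert np.allclose(leaky_grad, (1 - alpha) * H + alpha)
assert np.allclose(leaky_grad - relu_grad, -alpha * (H - 1))
```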
Therefore, the change in the updates of the model weights $A, B, C, D$ due to ReLU => LeakyReLU is
\begin{align}
\frac{\partial L_{mse}}{\partial A} &=(t - y)\cdot C \cdot \alpha (H(x_2)-1) \cdot x_1\\
\frac{\partial L_{mse}}{\partial B} &=(t - y)\cdot C \cdot \alpha (H(x_2)-1) \\
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}
Rewriting this slightly, it can also be expressed as follows. Here the second term (in square brackets) is simply the ordinary backpropagation gradient multiplied by $\alpha$, so it is ignored for now.
\begin{align}
\frac{\partial L_{mse}}{\partial A} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1) \cdot x_1-[\alpha (t - y)\cdot C \cdot H(x_2) \cdot x_1]\\
\frac{\partial L_{mse}}{\partial B} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1)-[\alpha (t - y)\cdot C \cdot H(x_2)] \\
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}
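The split above can be checked in the same way; this small sketch (my own check) confirms that the two terms add back up to the original change $\alpha(H(x_2)-1)$:

```python
import numpy as np

alpha = 0.01
x2 = np.linspace(-2, 2, 9)                   # arbitrary pre-activation values
H = (x2 >= 0).astype(float)

change = alpha * (H - 1)                     # original change from Relu -> LeakyRelu
split = alpha * (2 * H - 1) - alpha * H      # first term minus the bracketed second term
assert np.allclose(change, split)
```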
```python
from keras.layers import Input, Dense, LeakyReLU
from keras.models import Model

x1 = Input(shape=(124,))
x2 = Dense(100)(x1)
x3 = LeakyReLU(alpha=0.01)(x2)
y = Dense(10)(x3)
model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```
Let us compare the two changes relative to ordinary backpropagation. Here both $\lambda \ll 1$ and $\alpha \ll 1$.
- Change due to the activity_regularizer of L1 regularization
\begin{align}
\frac{\partial L_{L1}}{\partial A} &=\lambda(2H(x_2)-1) \cdot x_1\\
\frac{\partial L_{L1}}{\partial B} &=\lambda(2H(x_2)-1) \\
\frac{\partial L_{L1}}{\partial C} &= 0\\
\frac{\partial L_{L1}}{\partial D} &= 0\\
\end{align}
- Change due to ReLU => LeakyReLU
\begin{align}
\frac{\partial L_{mse}}{\partial A} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1) \cdot x_1\\
\frac{\partial L_{mse}}{\partial B} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1) \\
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}
Here, in the early stage of training, where $(t - y)\cdot C$ cannot be regarded as zero, the change caused by ReLU => LeakyReLU can be regarded as close to the change caused by the activity_regularizer of L1 regularization. Conversely, once training has progressed far enough that $(t - y)\cdot C$ approaches zero, can it be viewed as that L1 activity_regularizer effect gradually weakening? That was my thought. Note, however, that the change due to the L1 activity_regularizer affects only the layer to which the regularization is applied, whereas the change due to ReLU => LeakyReLU also propagates to the layers before the activation function.
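To get a feel for the magnitudes, here is a rough scalar sketch (my own illustration with made-up numbers; err_times_C stands for $(t - y)\cdot C$ for a single unit):

```python
alpha, lam = 0.01, 0.01
H = 1.0                                      # assume x2 >= 0 for this unit

l1_change = lam * (2 * H - 1)                # lambda(2H - 1): constant throughout training
for err_times_C in (5.0, 0.01):              # early training (large error) vs late (small error)
    leaky_change = err_times_C * alpha * (2 * H - 1)
    print(f"err*C={err_times_C}: L1 change={l1_change}, LeakyReLU change={leaky_change}")
```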
Also, looking only at the second term of the ReLU => LeakyReLU change, it is simply the ordinary backpropagation gradient multiplied by $\alpha$.
- Second term of the ReLU => LeakyReLU change
\begin{align}
\frac{\partial L_{mse}}{\partial A} &=-\alpha (t - y)\cdot C \cdot H(x_2) \cdot x_1\\
\frac{\partial L_{mse}}{\partial B} &=-\alpha (t - y)\cdot C \cdot H(x_2) \\
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}
Added back to the original gradient, this second term means that every gradient flowing upstream through the LeakyReLU activation becomes $(1 + \alpha)$ times larger. This suggests that in a deep model, replacing every ReLU with LeakyReLU can increase the gradient of the shallow layers near the input exponentially, by roughly a factor of $(1 + \alpha)^n$.
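As a rough illustration of that claim (my own sketch, treating each LeakyReLU between a shallow layer and the output as contributing one extra $(1+\alpha)$ factor):

```python
alpha = 0.01
for n in (1, 10, 50, 100):                   # number of LeakyReLU layers above a given layer
    print(n, (1 + alpha) ** n)               # relative gradient magnification vs plain ReLU
```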
To summarize, we compared L1 regularization and LeakyReLU. My interpretation is that the change due to ReLU => LeakyReLU amounts to the change due to the activity_regularizer of L1 regularization, plus a $(1 + \alpha)$-fold scaling of the gradients before the activation function. This interpretation corresponds to viewing LeakyReLU as the following sum of functions.
\begin{align}
LeakyRelu(x)&=\left\{
\begin{array}{ll}
x & (x \geq 0) \\
\alpha x & (x \lt 0)
\end{array}
\right.\\
&=\left\{
\begin{array}{ll}
(1+\alpha )x & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right. - \alpha|x|\\
&=(1+\alpha )Relu(x)- \alpha|x|
\end{align}