Chapter 7 [Error back propagation method] P275 ~ (Middle) [Learn by moving with Python! New machine learning textbook]

Error backpropagation method (backpropagation)

The error backpropagation method uses the information of the error (the difference from the teacher signal) that appears at the output of the network to update the weights, from the output-layer weights $ v_{kj} $ back to the intermediate-layer weights $ w_{ji} $. The name comes from the fact that the weights are updated in the direction opposite to the flow of the input. That said, the error backpropagation method is nothing other than the gradient method: when the gradient method is applied to a feedforward neural network, the error backpropagation method is derived naturally.

Since we are doing classification, we use the average cross-entropy error of Equation 7-18 (p. 266) as the error function. $ E(w,v)=-\frac{1}{N}\sum_{n=0}^{N-1}\sum_{k=0}^{K-1}t_{nk}\log(y_{nk})\hspace{40pt}(7-18) $
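As a quick illustration (a minimal sketch of my own, not the textbook's code), Equation 7-18 can be computed with NumPy as follows, where `y` holds the N×K softmax outputs and `t` the one-hot teacher signals:

```python
import numpy as np

def cross_entropy_error(y, t):
    # y: (N, K) softmax outputs, t: (N, K) one-hot teacher signals
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.sum(t * np.log(y + eps), axis=1))

# Example with N = 2, K = 3
y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
print(cross_entropy_error(y, t))  # -(log 0.7 + log 0.6) / 2 ≈ 0.434
```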

・What we want to find is $ \partial{E}/\partial{w_{ji}} $ (the derivative of the average cross-entropy error).
・If we consider the cross entropy $ E_n $ for a single data point, its derivative is $ \partial{E_n}/\partial{w_{ji}} $, so the desired $ \partial{E}/\partial{w_{ji}} $ is obtained by computing $ \partial{E_n}/\partial{w_{ji}} $ for each of the N data points and averaging.
・Here t is the teacher signal (a one-hot vector such as t = [0, 0, 1], indicating which class).

That is ↓ $ \frac{\partial{E}}{\partial{w_{ji}}}=\frac{\partial}{\partial{w_{ji}}}\frac{1}{N} \sum_{n=0}^{N-1}E_n =\frac{1}{N}\sum_{n=0}^{N-1} \frac{\partial{E_n}}{\partial{w_{ji}}} $

This time, consider the case of D = 2, M = 2, K = 3.

D: number of input dimensions, M: number of intermediate-layer (hidden-layer) neurons, K: number of outputs, w, v: weights
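To make these dimensions concrete, here is a minimal forward-pass sketch for the D = 2, M = 2, K = 3 case (my own code, not the book's; following the first half of Chapter 7, b and z are the intermediate layer's total inputs and outputs, a and y are the output layer's total inputs and outputs, and the dummy variable is appended as a constant 1):

```python
import numpy as np

D, M, K = 2, 2, 3                  # inputs, intermediate-layer neurons, outputs
w = np.random.randn(M, D + 1)      # intermediate-layer weights; the +1 column is for the dummy input
v = np.random.randn(K, M + 1)      # output-layer weights; the +1 column is for the dummy hidden unit

x = np.array([0.5, -1.2])          # one input sample
b = w @ np.append(x, 1)            # total input to the intermediate layer, b_j
z = 1 / (1 + np.exp(-b))           # sigmoid output of the intermediate layer, z_j
a = v @ np.append(z, 1)            # total input to the output layer, a_k
y = np.exp(a) / np.sum(np.exp(a))  # softmax output, y_k
print(y, y.sum())                  # the outputs sum to 1
```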

Since there are two sets of weights, w and v, we first find the expression obtained by partially differentiating $ E_n $ with respect to $ v_{kj} $, and then the expression obtained by partially differentiating $ E_n $ with respect to $ w_{ji} $. (The order itself does not seem to matter, but in the error backpropagation method the expansion of the expressions propagates backward, starting from the error, so we work from the output side.)

The $ E_n $ part of $ \frac{\partial{E}}{\partial{v_{kj}}} $ is Equation 7-22, $ E_n(w,v)=-\sum_{k=0}^{K-1}t_{nk}\log(y_{nk}) $, so $ \frac{\partial{E}}{\partial{v_{kj}}} $ can be obtained with the chain rule.

Chain rule ↓ $ \frac{\partial{E}}{\partial{v_{kj}}}= \frac{\partial{E}}{\partial{a_k}}\frac{\partial{a_k}}{\partial{v_{kj}}}\hspace{40pt}(7-25) $

** $ a_k $ is the total input to output neuron k: the weighted sum of the intermediate-layer outputs, including the dummy variable **

Here, first consider the factor $ \frac{\partial{E}}{\partial{a_k}} $ on the left. The E here stands for $ E_n $ (the subscript n is omitted), so replacing it with Equation 7-22 and considering the case k = 0 (k is the index of the output node, not of the data):

$ \frac{\partial{E}}{\partial{a_0}}= \frac{\partial}{\partial{a_0}}(-t_0\log y_0-t_1\log y_1-t_2\log y_2) $

Here we use the derivative formula for the logarithm. As explained above, t is the teacher signal (a one-hot vector such as t = [0, 0, 1], indicating which class) and each y is the softmax of the total inputs, so every $ y_k $ depends on $ a_0 $.

Logarithmic derivative: $ \begin{align*} (\log x)' = \frac{1}{x} \end{align*} $

Can be expressed as ↓ $ \frac{\partial{E}}{\partial{a_0}}=\ -t_0\frac{1}{y_0}\frac{\partial{y_0}}{\partial{a_0}} -t_1\frac{1}{y_1}\frac{\partial{y_1}}{\partial{a_0}} -t_2\frac{1}{y_2}\frac{\partial{y_2}}{\partial{a_0}} \hspace{40pt}(7-27) $

As explained in the first half of Chapter 7, the softmax function is used at the output, so it appears in the $ \partial{y_0}/\partial{a_0} $ part.

Therefore, following the formula for the partial derivative of the softmax function derived as Equation 4-130, $ \frac{\partial{y_j}}{\partial{x_i}}=y_j(I_{ij}-y_i) $, where i is the index of the input, j is the index of the output, and $ I_{ij} $ is 1 when $ i=j $ and 0 when $ i\neq{j} $:

we get Equation 7-28. $ \frac{\partial{y_0}}{\partial{a_0}}=y_0(1-y_0) $ ** (This time M = K = 3 including the dummy variable; here the indices match, i = j = 0, so the $ I_{ij} $ part is 1.) **

For the remaining two, the input index and the output index differ ($ i\neq j $), so they become $ \frac{\partial{y_1}}{\partial{a_0}}=-y_0y_1 $ and $ \frac{\partial{y_2}}{\partial{a_0}}=-y_0y_2 $.
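As a quick numerical check (my own sketch, not from the book), the softmax Jacobian of Equation 4-130 can be compared against a finite-difference approximation:

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)                  # shift for numerical stability
    return np.exp(a) / np.sum(np.exp(a))

a = np.array([0.3, -0.8, 1.1])
y = softmax(a)

# Analytic Jacobian: element [j, i] = dy_j/da_i = y_j * (I_ij - y_i)   (Eq. 4-130)
jac = np.diag(y) - np.outer(y, y)

# Numerical derivative with respect to a_0
eps = 1e-6
a_plus = a.copy()
a_plus[0] += eps
numeric = (softmax(a_plus) - y) / eps

print(jac[:, 0])   # [y0(1 - y0), -y1*y0, -y2*y0]
print(numeric)     # should be close to the line above
```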

Substituting these three into Equation 7-27 gives Equation 7-31:

\begin{align}
\frac{\partial{E}}{\partial{a_0}}&=
-t_0\frac{1}{y_0}\frac{\partial{y_0}}{\partial{a_0}}
-t_1\frac{1}{y_1}\frac{\partial{y_1}}{\partial{a_0}}
-t_2\frac{1}{y_2}\frac{\partial{y_2}}{\partial{a_0}}\\
&=-t_0(1-y_0)+t_1y_0+t_2y_0\\
&=(t_0+t_1+t_2)y_0-t_0\\
&=y_0-t_0
\end{align}

At the end, I used $ t_0 + t_1 + t_2 = 1 $.

Since $ y_0 $ is the output of the first output-layer neuron and $ t_0 $ is the teacher signal for it, $ y_0-t_0 $ represents the error.
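Here is a small numerical check (again my own sketch, with made-up values) that $ \partial{E_n}/\partial{a_k}=y_k-t_k $ for the softmax + cross-entropy combination:

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)
    return np.exp(a) / np.sum(np.exp(a))

def cee(a, t):
    # cross entropy for one sample: E_n = -sum_k t_k log y_k
    return -np.sum(t * np.log(softmax(a) + 1e-12))

a = np.array([0.2, -0.5, 1.0])
t = np.array([0.0, 0.0, 1.0])      # one-hot teacher signal

analytic = softmax(a) - t          # y_k - t_k

eps = 1e-6
numeric = np.zeros(3)
for k in range(3):
    a_p = a.copy()
    a_p[k] += eps
    numeric[k] = (cee(a_p, t) - cee(a, t)) / eps

print(analytic)
print(numeric)                     # should agree with the analytic gradient
```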

** Similarly, considering the kth output node (k = 1, 2), we obtain Equation 7-32: $ \frac{\partial{E}}{\partial{a_k}}=y_k-t_k $ **

Then the left factor of the chain rule of Equation 7-25 can be expressed as Equation 7-33 (this output-layer error $ y_k-t_k $ is what is written as $ \delta_k^{(2)} $ below).

** Next, consider the $ \partial{a_k}/\partial{v_{kj}} $ part of Equation 7-25. ** Considering the case k = 0, the total input $ a_0 $ is the weighted sum of the intermediate-layer outputs z with the weights v from the intermediate layer to the output layer: $ a_0=v_{00}z_0+v_{01}z_1+v_{02}z_2 $

Partially differentiating $ a_0 $ with respect to each weight ($ v_{00}, v_{01}, v_{02} $) gives Equation 7-37.

Writing the results of 7-37 together gives Equation 7-38.

** Similar results can be obtained even when k = 1 and k = 2, so all of them are summarized as Equation 7-39 ** $ \frac{\partial{a_k}}{\partial{v_{kj}}}=z_j\hspace{40pt}(7-39) $

Now that both factors of the partial differentiation are in hand, combining them gives Equation 7-40: $ \frac{\partial{E}}{\partial{v_{kj}}}=\frac{\partial{E}}{\partial{a_k}}\frac{\partial{a_k}}{\partial{v_{kj}}}=(y_k-t_k)z_j=\delta_k^{(2)}z_j $

(Figure: Equation 7-40 and the update rule of Equation 7-41, $ v_{kj} \leftarrow v_{kj}-\alpha\delta_k^{(2)}z_j $.)

What Equation 7-41 says is that the quantity we want is an appropriate value for the weight $ v_{kj} $, and the figure shows the idea behind adjusting that value.
・Since z passes through the sigmoid and takes a value between 0 and 1 (like a probability), and we want a v that eliminates the error between the output y and the teacher signal t, v is changed in proportion to the error $ \delta_k^{(2)} $.
・When the error $ \delta_k^{(2)} $ is 0, that is, when the output $ y_k $ and the target data (teacher signal) $ t_k $ match ($ y_k-t_k=0 $), the change $ -\alpha\delta_k^{(2)}z_j $ is 0 and v is not updated.

(Important) P281:

If the target data $ t_k $ is 0 but the output $ y_k $ is greater than 0, the error $ \delta_k^{(2)}=(y_k-t_k) $ will be positive. Since $ z_j $ is always positive, the change $ -\alpha\delta_k^{(2)}z_j $ is then a negative number, and $ v_{kj} $ is modified so that it decreases. In other words, the output was too large and produced an error, so the weight is changed in the direction that narrows down the influence from the neuron $ z_j $. Also, if the input $ z_j $ is large, the contribution of that connection to the output is large, so the amount of change of $ v_{kj} $ is increased accordingly.

This part is very important. What it says is:
-If the output y is larger than the target data t, v is changed so as to reduce the output y.
-If there is no error (y - t = 0), v does not change.
-If the input value z (which behaves like a probability) is large, the amount of change in v becomes correspondingly large.
A small numerical sketch of this update follows.
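Below is a minimal numerical sketch of the update of Equation 7-41 (my own code; the values of y, t, z, v, and alpha are made up for illustration):

```python
import numpy as np

alpha = 0.5                          # learning rate
y = np.array([0.6, 0.1, 0.3])        # network outputs (softmax)
t = np.array([0.0, 0.0, 1.0])        # teacher signal
z = np.array([0.8, 0.2, 1.0])        # intermediate-layer outputs; the last element is the dummy (= 1)

delta2 = y - t                       # output-layer error delta_k^(2)
grad_v = np.outer(delta2, z)         # dE/dv_kj = delta_k^(2) * z_j          (Eq. 7-40)

v = np.zeros((3, 3))                 # some current value of v (zeros just for illustration)
v_new = v - alpha * grad_v           # v_kj <- v_kj - alpha * delta_k^(2) * z_j   (Eq. 7-41)

print(grad_v[0])   # positive because y_0 > t_0, so v_0j is decreased
print(v_new)
```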

Finding the derivative of E with respect to w

** Find $ \partial{E}/\partial{w_{ji}} $ **

・Equation 7-43 has the same form as 7-34.
・Equation 7-44 has the same form as 7-39.
・Therefore Equation 7-45 can be processed in the same way as 7-41.

** We set $ \delta_{j}^{(1)} $ aside for the time being, and now work out what it is. **

(Figure: Equations 7-43 to 7-45, the chain-rule decomposition of $ \partial{E}/\partial{w_{ji}} $.)

First, partially differentiating Equation 7-43 with the chain rule gives: $ \delta_j^{(1)}=\frac{\partial{E}}{\partial{b_j}}= \biggl( \sum_{k=0}^{K-1}\frac{\partial{E}}{\partial{a_k}} \frac{\partial{a_k}}{\partial{z_j}} \biggr) \frac{\partial{z_j}}{\partial{b_j}} $ To understand this, we need the multivariable chain rule (Equation 4-62).

・In Equation 4-62, $ g_0 $ and $ g_1 $ are functions of $ w_0 $ and $ w_1 $, and f is a function of $ g_0 $ and $ g_1 $. Substituting our variables into this pattern, E is a function of the form

$ E(a_0(z_0,z_1),a_1(z_0,z_1),a_2(z_0,z_1)) $

Therefore, Equation 4-62 can be applied:

$ \frac{\partial}{\partial{z_j}}E(a_0(z_0,z_1),a_1(z_0,z_1),a_2(z_0,z_1))=\sum_{k=0}^{K-1}\frac{\partial{E}}{\partial{a_k}}\frac{\partial{a_k}}{\partial{z_j}} $

Combining these, it can be expanded as in Equation 7-46.

** ~~I didn't understand why Equation 7-47 becomes $ v_{kj} $~~ — it is explained on p. 283 as follows, and $ \delta_{j}^{(1)} $ becomes: **
** ・Equation 7-47 is a partial derivative with respect to $ z_j $, so the terms containing the other z's are constants and drop out; only the term $ v_{kj}z_j $ survives, and since the derivative of $ z_j $ with respect to itself is 1, $ v_{kj} $ remains (that is, $ \partial{a_k}/\partial{z_j}=v_{kj} $). **

Put them together ↓

$ \delta_{j}^{(1)}=h'(b_j)\sum_{k=0}^{K-1}v_{kj}\delta_{k}^{(2)} $
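As a small sketch of this formula (my own code, not the book's), $ \delta^{(1)} $ can be computed from $ \delta^{(2)} $ as follows, assuming the sigmoid h for the intermediate layer and that the dummy hidden unit (the last column of v) carries no delta:

```python
import numpy as np

def sigmoid(b):
    return 1 / (1 + np.exp(-b))

b = np.array([0.4, -1.0])              # total inputs to the intermediate layer (M = 2, no dummy)
delta2 = np.array([0.6, 0.1, -0.7])    # output-layer errors y_k - t_k (K = 3)
v = np.array([[ 0.2, -0.5,  0.1],
              [ 0.7,  0.3, -0.4],
              [-0.6,  0.1,  0.8]])     # v[k, j]; the last column connects to the dummy unit

# delta_j^(1) = h'(b_j) * sum_k v_kj * delta_k^(2)
h_prime = sigmoid(b) * (1 - sigmoid(b))    # derivative of the sigmoid
delta1 = h_prime * (v[:, :2].T @ delta2)   # only the real hidden units (j = 0, 1) get a delta
print(delta1)                              # shape (2,)
```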

P284 ~ Summary of error back propagation method

** What we want to do with the error backpropagation method is to optimize the weight parameters w and v, and this is made possible by propagating the error in the backward direction. **

-Since h is the sigmoid function, $ h(b_j) $ takes a value between 0 and 1, and its derivative, the first factor $ h'(b_j) $, always takes a positive value.
-Since $ \frac{\partial{E}}{\partial{a_k}}=\delta_{k}^{(2)}=(y_k-t_k) $, the output-layer errors are collected by multiplying each of them by the weight v and summing. (If the error at the output layer is large, the term becomes v × (large value), and v acts strongly.)

① First, set w and v appropriately (initialize them) and compute the output y. ② Since there is a teacher signal t, compare that t with y. ③ Propagate the error in the backward direction using the equations derived so far. Specifically, multiply each error (y - t) by the corresponding second-layer weight v and take the sum, then multiply that sum by $ h'(b_0) $, the derivative of the intermediate layer's sigmoid function. (That is the formula in the figure for ③.) ④ Update w and v so that the error shrinks toward 0. (The error is known, so the weights are moved in the direction that reduces it.) A minimal end-to-end sketch of these four steps is given below.
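Putting steps ① to ④ together, here is a minimal end-to-end sketch for one training sample under the D = 2, M = 2, K = 3 setting (my own code, not the textbook's implementation; the update of w uses $ \delta_j^{(1)} $ and the input x, analogously to the update of v in Equation 7-41):

```python
import numpy as np

def sigmoid(b):
    return 1 / (1 + np.exp(-b))

def softmax(a):
    a = a - np.max(a)
    return np.exp(a) / np.sum(np.exp(a))

D, M, K, alpha = 2, 2, 3, 0.5
w = np.random.randn(M, D + 1)          # (1) set w and v appropriately (random initialization)
v = np.random.randn(K, M + 1)

x = np.array([0.5, -1.2])              # one input sample
t = np.array([0.0, 0.0, 1.0])          # (2) teacher signal to compare against

# Forward pass
b = w @ np.append(x, 1)                # intermediate-layer total input
z = sigmoid(b)                         # intermediate-layer output
a = v @ np.append(z, 1)                # output-layer total input
y = softmax(a)                         # network output

# (3) Propagate the error backwards
delta2 = y - t                                                   # output-layer error
delta1 = sigmoid(b) * (1 - sigmoid(b)) * (v[:, :M].T @ delta2)   # intermediate-layer error

# (4) Update w and v in the direction that reduces the error
v = v - alpha * np.outer(delta2, np.append(z, 1))
w = w - alpha * np.outer(delta1, np.append(x, 1))
```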


Summary of error back propagation method

(Summary figure: the formulas for $ \delta_k^{(2)} $, $ \delta_j^{(1)} $, and the weight updates.)
