This is a continuation of [Previous article](# https://qiita.com/kuroitu/items/d22c8750e34d5d75fb6c). In this article, we will briefly explain backpropagation using computational graphs. The chain rule is briefly mentioned in [here](# https://qiita.com/kuroitu/items/221e8c477ffdd0774b6b#chain rule), so I will omit it. Backpropagation tends to have many mathematical formulas, but I would like to explain it with implementation so that it can be understood as intuitively as possible.
-[Backpropagation with scalar](#backpropagation with scalar)
-[Backpropagation theory in scalar](#Backpropagation theory in scalar)
-[Backpropagation implementation in scalar](#backpropagation implementation in scalar)
-[Backpropagation in matrix](#Backpropagation in matrix)
-[Backpropagation theory in matrix](#Backpropagation theory in matrix)
-[Backpropagation implementation in matrix](#Backpropagation implementation in matrix)
-[Backpropagation theory batch consideration version in matrix](#Backpropagation theory batch consideration version in matrix)
-[Implementation of __init__ method](Implementation of #init method)
-Conclusion
As in the example, let's start with backpropagation with scalars. However, backpropagation is quite difficult to follow theoretically even with scalars. However, I will explain it very briefly.
First, let's briefly consider the following calculation graph.
 If you write the forward propagation in the figure carefully
If you write the forward propagation in the figure carefully
\begin{align}
  m &= wx \\
  n &= m + b \\
  y &= \sigma(n)
\end{align}
It looks like this. Consider this backpropagation. Consider each partial derivative.
** << Caution >> ** ** Because the heights are forcibly aligned, "$ \ frac {} {} $" is mixed in some places. Please let me know if there is any good way. Also, there are other ways to add color to the formula ... **
It will be. If the error $ E $ from the upstream is passed through this, the back propagation will be as shown in the figure. After all, there are many mathematical formulas and it looks complicated, that is, it only multiplies the partial derivative when passing through the ** compute node. ** ** Even this is a scalar, so it may be easier.
Let's implement it. Let's code according to the formula. The implementation destination code is [here](https://qiita.com/kuroitu/items/884c62c48c2daa3def08#layer module code preparation).
baselayer.py
    def backward(self, grad):
        """
Implementation of backpropagation
        """
        dact = grad*self.act.backward(self.x, self.y)
        self.dw = dact*self.x
        self.db = dact
        self.dx = dact*self.w
        return self.dx
The middle layer is OK as it is, but it is necessary to add some processing for the output layer. I will touch on it later.
Next, let's consider backpropagation in a matrix. This is a lot of formulas and difficult, but you can see it if you follow it properly. As before, I'll add color to the formula so that you can see it at a glance. ~~ I feel heavy ... ~~
Consider the following calculation graph.
 It's the same as the one used in [Forward propagation](https://qiita.com/kuroitu/items/d22c8750e34d5d75fb6c# Forward propagation in matrix). It is a layer model with only 2 neurons, but as you can see, there are quite a lot of formulas.
First, let's find the partial differential of each part as in the case of scalar.
It's the same as the one used in [Forward propagation](https://qiita.com/kuroitu/items/d22c8750e34d5d75fb6c# Forward propagation in matrix). It is a layer model with only 2 neurons, but as you can see, there are quite a lot of formulas.
First, let's find the partial differential of each part as in the case of scalar.
\Rightarrow
\boldsymbol{dW} =
\left(
  \begin{array}{cc}
    \cfrac{\partial y_1}{\partial w_{1, 1}} & \cfrac{\partial y_2}{\partial w_{1, 2}} \\
    \cfrac{\partial y_1}{\partial w_{2, 1}} & \cfrac{\partial y_2}{\partial w_{2, 2}}
  \end{array}
\right)
=
\left(
  \begin{array}{c}
    x_1 \\
    x_2
  \end{array}
\right)
\left(
  \begin{array}{cc}
    \sigma'_1(m) & \sigma'_2(q)
  \end{array}
\right)
=
\boldsymbol{X}^{\top}\boldsymbol{\sigma'}
\Rightarrow
\boldsymbol{dB} =
\left(
  \begin{array}{cc}
    \cfrac{\partial y_1}{\partial b_1} & \cfrac{\partial y_2}{\partial b_2}
  \end{array}
\right)
=
\left(
  \begin{array}{cc}
     \sigma'_1(m) & \sigma'_2(q)
  \end{array}
\right)
\Rightarrow
\boldsymbol{dX} =
\left(
  \begin{array}{cc}
    \cfrac{\partial y_1}{\partial x_1} + \cfrac{\partial y_2}{\partial x_1} & \cfrac{\partial y_1}{\partial x_2} + \cfrac{\partial y_2}{\partial x_2}
  \end{array}
\right)
=
\left(
  \begin{array}{cc}
    \sigma'_1(m) & \sigma'_2(q)
  \end{array}
\right)
\left(
  \begin{array}{cc}
    w_{1, 1} & w_{2, 1} \\
    w_{1, 2} & w_{2, 2}
  \end{array}
\right)
=
\boldsymbol{\sigma'}\boldsymbol{W}^{\top}
It looks like. Add the error from the upstream to these [element product](https://qiita.com/kuroitu/items/d22c8750e34d5d75fb6c#%E8%A1%8C%E5%88%97%E3%81%AE%E8%A6%81 By multiplying by% E7% B4% A0% E7% A9% 8D), the error as shown in the above figure is propagated. As you can see from the fact that $ x_1 $ and $ x_2 $ are branched and the values are sent to other calculation nodes, the partial differentials that have flowed need to be added and calculated. So </ font>
\left(
  \begin{array}{cc}
    \cfrac{\partial y_1}{\partial x_1} + \cfrac{\partial y_2}{\partial x_1} & \cfrac{\partial y_1}{\partial x_2} + \cfrac{\partial y_2}{\partial x_2}
  \end{array}
\right)
It's . This formula intuitively (and theoretically) represents the effect of input $ x_1 $ and $ x_2 $ on output $ y_1 $ and $ y_2 $. </ font>
By the way, up to this point, the batch size $ N = 1 $ is implicitly assumed. So what happens when it's $ N \ ne 1 $? The answer is simple. Only the error with respect to the bias $ B $ changes. Let's look at this with a mathematical formula. Input $ N \ times L $ and output $ N \ times M $ </ font>
\begin{align}
  \underbrace{\boldsymbol{dW}}_{L \times M} &= \underbrace{\boldsymbol{X}^{\top}}_{L \times N} \underbrace{\boldsymbol{\sigma'}}_{N \times M} \\
  \underbrace{\boldsymbol{dB}}_{1 \times M} &= \underbrace{\boldsymbol{\sigma'}}_{N\times M} \\
  \underbrace{\boldsymbol{dX}}_{N \times L} &= \underbrace{\boldsymbol{\sigma'}}_{N \times M} \underbrace{\boldsymbol{W}^{\top}}_{M \times L}
\end{align}
It will be . Since the flowing gradient must be the same as each shape at the time of input, $ \ underbrace {\ boldsymbol {dW}} \ _ {L \ times M} $ and $ \ underbrace {\ boldsymbol {dB}} \ _ {1 \ times M} $, $ \ underbrace {\ boldsymbol {dX}} \ _ {N \ times L} $, of which $ \ underbrace {\ boldsymbol {dB}} \ _ {1 \ The shapes do not match by times M} $. So what to do is that when the bias $ \ underbrace {\ boldsymbol {B}} _ {1 \ times M} $ propagates forward, the ** broadcast function automatically $ \ underbrace {\ boldsymbol {B}} \ Recall that ** was supposed to be _ {N \ times M} $. In other words, applying the exact same bias to all batch data is synonymous with ** branching the same bias into $ N $ pieces **, so if you take the sum, It will be good **. </ font>
\underbrace{\boldsymbol{dB}}_{1 \times M} = \sum_{i=1}^{N}{\underbrace{\boldsymbol{\sigma'}_i}_{1\times M}} \xrightarrow{\textrm{coding}} \textrm{sum}(\boldsymbol{\sigma'}, \textrm{axis}=0)
This concludes the theory for implementation.
Let's implement it. Rewrite the implementation in scalar.
baselayer.py
    def backward(self, grad):
        """
Implementation of backpropagation
        """
        dact = grad*self.act.backward(self.x, self.y)
        self.dw = self.x.T@dact
        self.db = np.sum(dact, axis=0)
        self.dx = [email protected]
        return self.dx
Each has changed to matrix operation. Also note that the numpy.sum function will return the result of adding all the elements unless you specify ʻaxis = 0`.
A little processing needs to be added to the back propagation of the output layer.
outputlayer.py
    def backward(self, t):
        """
Implementation of backpropagation
        """
        #When the activation function of the output layer is the softmax function and the loss function is the cross entropy error
        #Separate cases of error propagation
        if isinstance(self.act, Softmax) \
        and isinstance(self.errfunc, CrossEmtropy):
            dact = self.y - t
            self.dw = self.x.T@dact
            self.db = np.sum(dact, axis=0)
            self.dx = [email protected]
            
            return self.dx
        else:
            grad = self.errfunc.backward(self.y, t)
            return super().__init__(grad)
The content of the override itself is easy, isn't it? If the activation function of the output layer is the softmax function and the loss function uses the cross entropy error, the error that backpropagates the activation function can be skipped like a code. In other cases, it is necessary to calculate the derivative of the loss function and pass the error.
__init__ methodNow, let's have the members have the ʻerrfunc` mentioned above.
baselayer.py
    def __init__(self, *, prev=1, n=1, 
                 name="", wb_width=1,
                 act="ReLU", err_func="square",
                 **kwds):
        self.prev = prev  #Number of outputs of the previous layer=Number of inputs to this layer
        self.n = n        #Number of outputs in this layer=Number of inputs to the next layer
        self.name = name  #The name of this layer
        #Set weight and bias
        self.w = wb_width*np.random.randn(prev, n)
        self.b = wb_width*np.random.randn(n)
        #Activation function(class)Get
        self.act = get_act(act)
        #Loss function(class)Get
        self.errfunc = get_errfunc(err_func)
Eventually I will write an article about the loss function.
I wonder if the difficulty of coloring mathematical formulas can be solved ...
-Introduction to Deep Learning ~ Basics ~ -Introduction to Deep Learning ~ Coding Preparation ~ -Introduction to Deep Learning ~ Forward Propagation ~ -Introduction to Deep Learning ~ Backpropagation ~ -List of activation functions (2020) -Thorough understanding of im2col -Complete understanding of numpy.pad function
Recommended Posts