1.3 Neural network learning

1.3.1 Loss function

To make good inferences with a neural network, its parameters must be set to optimal values.

Neural network learning needs a metric that tells how well learning is going → **Loss**

A **loss function** is used to compute the loss of a neural network.

- Loss functions
  - Squared error (used in Deep Learning from scratch ❶); @ohakutsu: is this the one used for regression?
  - Cross entropy error: often used for multi-class classification

In this section, the loss is computed with the following layer structure.

Combining the Softmax layer and the Cross Entropy Error layer into a single layer gives the Softmax with Loss layer.

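What is Softmax? → **Softmax function**

y_k = \frac{\exp(s_k)}{\displaystyle \sum_{i=1}^{n} \exp(s_i)}

Here s_k is the score of class k and y_k is the resulting probability, so the n outputs sum to 1.

What is Cross Entropy Error? → **Cross entropy error**

L = - \sum_{k} t_k \log y_k

Here t_k is the teacher label, given as a one-hot vector.

With mini-batch processing, the loss is averaged over the N samples in the batch:

L = - \frac{1}{N} \sum_{n}\sum_{k} t_{nk} \log y_{nk}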
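The two formulas above could be written in NumPy roughly as follows. This is only a sketch: the max subtraction and the small constant `1e-7` are numerical-stability choices added here, the labels are assumed to be integer class indices rather than one-hot vectors, and the sample values are made up.

```python
import numpy as np

def softmax(x):
    # x: (N, K) scores. Subtracting the row-wise max avoids overflow
    # in exp() and does not change the result.
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_error(y, t):
    # y: (N, K) probabilities, t: (N,) integer class labels.
    # Picking y[n, t[n]] is equivalent to sum_k t_nk * log(y_nk)
    # when t is one-hot.
    n = y.shape[0]
    return -np.sum(np.log(y[np.arange(n), t] + 1e-7)) / n

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])   # made-up scores for 2 samples, 3 classes
labels = np.array([0, 1])

probs = softmax(scores)
loss = cross_entropy_error(probs, labels)
print(probs.sum(axis=1))  # each row sums to 1
print(loss)
```

Helper functions with these names are what the `SoftmaxWithLoss` layer in 1.3.5 calls.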


1.3.2 Derivatives and gradients

The goal of neural network learning is to find the parameters that minimize the loss. What matters here are **derivatives** and **gradients**.

Derivative → the amount of change at a given instant (@ohakutsu, reference: Introduction to Mathematics for AI (Artificial Intelligence) Starting from Junior High School Mathematics, YouTube)

y = f(x)

The derivative of y with respect to x can be written as

\frac{dy}{dx}

Derivatives can also be taken when there are multiple variables. With x as a vector:

L = f(x)
\frac{\partial L}{\partial x} = \left( \frac{\partial L}{\partial x_1}, \frac{\partial L}{\partial x_2}, ..., \frac{\partial L}{\partial x_n} \right)

The vector that collects the partial derivatives with respect to each element of x is called the **gradient**.

In the case of a matrix, the gradient can be considered in the same way. Let W be an m × n matrix

L = g(W)
\frac{\partial L}{\partial W} = \begin{pmatrix}
    \frac{\partial L}{\partial w_{11}} & \cdots & \frac{\partial L}{\partial w_{1n}} \\
    \vdots & \ddots & \vdots \\
    \frac{\partial L}{\partial w_{m1}} & \cdots & \frac{\partial L}{\partial w_{mn}}
\end{pmatrix}
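One way to get a feel for the gradient is to approximate each partial derivative numerically. The sketch below uses central differences; the helper name `numerical_gradient`, the step size `h`, and the test function are assumptions made for illustration. The result has the same shape as the input, whether that input is a vector or a matrix.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    # Approximate dL/dx element by element with central differences.
    grad = np.zeros_like(x)
    for i in range(x.size):
        orig = x.flat[i]
        x.flat[i] = orig + h
        f_plus = f(x)
        x.flat[i] = orig - h
        f_minus = f(x)
        x.flat[i] = orig          # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

def f(x):
    # L = sum of squares, whose exact gradient is 2 * x
    return np.sum(x ** 2)

x = np.array([3.0, 4.0, -1.0])
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

grad_x = numerical_gradient(f, x)   # same shape as x
grad_W = numerical_gradient(f, W)   # same shape as W
print(grad_x)
print(grad_W.shape)
```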

1.3.3 Chain rule

During training, a neural network outputs a loss when it is given training data. Once the gradient of the loss with respect to each parameter is obtained, it can be used to update the parameters.

How to find the gradients of a neural network → **Error backpropagation**

The key to understanding error backpropagation is the **chain rule**.

- Chain rule
  - The rule for differentiating composite functions

For example, given

y = f(x) \\
z = g(y)

z can be written as the composite function

z = g(f(x))

The derivative of z with respect to x is

\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x}

No matter how complex a function is, as long as it is composed of simpler functions, its derivative can be computed from the derivatives of those individual functions.
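The chain rule can be checked on a small example. In the sketch below, f(x) = x² and g(y) = sin(y) are arbitrary choices made for illustration; the product of the two local derivatives is compared with a numerical derivative of the composite function.

```python
import math

def f(x):
    return x ** 2

def g(y):
    return math.sin(y)

def df(x):
    return 2 * x          # dy/dx

def dg(y):
    return math.cos(y)    # dz/dy

x = 1.5
y = f(x)

# Chain rule: dz/dx = dz/dy * dy/dx
dz_dx_chain = dg(y) * df(x)

# Numerical derivative of z = g(f(x)) by central differences
h = 1e-6
dz_dx_numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)

print(dz_dx_chain, dz_dx_numeric)
```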

1.3.4 Calculation graph

A calculation graph (also called a computational graph) is a visual representation of a calculation.


z = x + y

(Figures: calculation graphs for z = x + y)

Propagation in the reverse direction, from the output back toward the inputs, is called "backpropagation".


Typical operation nodes are listed below.

- Addition node
- Multiplication node
- Branch node
- Repeat node
- Sum node
- MatMul node
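The backward pass of the Repeat, Sum, and MatMul nodes might be sketched in NumPy as follows. The sizes `N`, `D`, `H` and the random inputs are made up for illustration. For reference, an addition node passes the upstream gradient through unchanged, a multiplication node multiplies it by the other input, and a branch node adds up the gradients coming back from its copies.

```python
import numpy as np

N, D, H = 4, 3, 2
rng = np.random.default_rng(0)

# Repeat node: copy a (1, D) array N times.
x_rep = rng.standard_normal((1, D))
y_rep = np.repeat(x_rep, N, axis=0)                    # forward: (N, D)
dy_rep = rng.standard_normal((N, D))                   # upstream gradient
dx_rep = np.sum(dy_rep, axis=0, keepdims=True)         # backward: sum of the copies

# Sum node: the reverse of Repeat.
x_sum = rng.standard_normal((N, D))
y_sum = np.sum(x_sum, axis=0, keepdims=True)           # forward: (1, D)
dy_sum = rng.standard_normal((1, D))
dx_sum = np.repeat(dy_sum, N, axis=0)                  # backward: copy the gradient

# MatMul node: y = xW.
x = rng.standard_normal((N, D))
W = rng.standard_normal((D, H))
y = np.dot(x, W)                                       # forward: (N, H)
dy = rng.standard_normal((N, H))
dx = np.dot(dy, W.T)                                   # same shape as x
dW = np.dot(x.T, dy)                                   # same shape as W

print(dx_rep.shape, dx_sum.shape, dx.shape, dW.shape)
```

In each case the gradient has the same shape as the corresponding forward input, which is a handy check when writing a backward pass.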

1.3.5 Gradient derivation and backpropagation implementation

Implement each layer


Sigmoid layer

The sigmoid function is

y = \frac{1}{1 + \exp(-x)}

The derivative of the sigmoid function is

\frac{\partial y}{\partial x} = y(1 - y)
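The formula y(1 - y) can be sanity-checked against a numerical derivative. This is only a quick check; the sample inputs and the step size `h` are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
y = sigmoid(x)

# Analytic derivative from the formula above
analytic = y * (1.0 - y)

# Numerical derivative by central differences
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(analytic)
print(numeric)
```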

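In the calculation graph of the Sigmoid layer, backpropagation multiplies the upstream gradient by y(1 - y).

When implemented in Python

import numpy as np

class Sigmoid:
  def __init__(self):
    # No learnable parameters, so both lists are empty
    self.params, self.grads = [], []
    self.out = None  # Forward output, cached for backward

  def forward(self, x):
    out = 1 / (1 + np.exp(-x))
    self.out = out
    return out

  def backward(self, dout):
    # Upstream gradient times the local derivative y * (1 - y)
    dx = dout * (1.0 - self.out) * self.out
    return dx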

Affine layer

The forward propagation of the Affine layer is

y = np.dot(x, W) + b

The bias b is broadcast, so it is added to every row of the mini-batch.
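A small example of the broadcast, with made-up values: adding `b` directly gives the same result as explicitly repeating it once per row. Because the forward pass behaves like a Repeat node, the gradient of `b` is the sum of the upstream gradient over the batch axis.

```python
import numpy as np

x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # mini-batch of N=3 samples, 2 features
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 1.0, 2.0]])     # (2, 3) weights
b = np.array([10.0, 20.0, 30.0])    # (3,) bias

y = np.dot(x, W) + b                # b is broadcast to all 3 rows

# Same thing with the repeat written out explicitly
b_repeated = np.repeat(b[np.newaxis, :], x.shape[0], axis=0)
y_explicit = np.dot(x, W) + b_repeated

# Backward of the broadcast: sum the upstream gradient over the batch axis
dout = np.ones_like(y)
db = np.sum(dout, axis=0)

print(y)
print(db)
```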


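When implemented in Python

import numpy as np

class Affine:
  def __init__(self, W, b):
    self.params = [W, b]
    self.grads = [np.zeros_like(W), np.zeros_like(b)]
    self.x = None  # Forward input, cached for backward

  def forward(self, x):
    W, b = self.params
    out = np.dot(x, W) + b  # b is broadcast over the batch axis
    self.x = x
    return out

  def backward(self, dout):
    W, b = self.params
    dx = np.dot(dout, W.T)
    dW = np.dot(self.x.T, dout)
    db = np.sum(dout, axis=0)  # Backward of the broadcast: sum over the batch

    # Overwrite in place so the arrays held in self.grads stay the same objects
    self.grads[0][...] = dW
    self.grads[1][...] = db
    return dx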

Softmax with Loss layer


import numpy as np

# softmax and cross_entropy_error are helper functions that must be defined
# elsewhere (in the book's repository they are in common/functions.py)

class SoftmaxWithLoss:
  def __init__(self):
    self.params, self.grads = [], []
    self.y = None  # Softmax output
    self.t = None  # Teacher label

  def forward(self, x, t):
    self.t = t
    self.y = softmax(x)

    # If the teacher label is a one-hot vector, convert it to the index of the correct class
    if self.t.size == self.y.size:
      self.t = self.t.argmax(axis=1)

    loss = cross_entropy_error(self.y, self.t)
    return loss

  def backward(self, dout=1):
    batch_size = self.t.shape[0]

    # Gradient of softmax + cross entropy: (y - t) / batch_size
    dx = self.y.copy()
    dx[np.arange(batch_size), self.t] -= 1
    dx *= dout
    dx = dx / batch_size

    return dx
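The backward pass above amounts to (y - t) / batch_size. A rough gradient check, written independently of the class, compares that expression with a numerical gradient of the loss. The `softmax` and `loss_fn` helpers, the random scores, and the labels are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def loss_fn(x, t):
    # Mean cross entropy of softmax(x) against integer labels t
    y = softmax(x)
    n = x.shape[0]
    return -np.sum(np.log(y[np.arange(n), t])) / n

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))   # scores: 4 samples, 3 classes
t = np.array([0, 2, 1, 2])        # correct class indices

# Analytic gradient, same steps as SoftmaxWithLoss.backward with dout=1
n = x.shape[0]
dx = softmax(x)
dx[np.arange(n), t] -= 1
dx /= n

# Numerical gradient by central differences
h = 1e-5
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    orig = x[idx]
    x[idx] = orig + h
    loss_plus = loss_fn(x, t)
    x[idx] = orig - h
    loss_minus = loss_fn(x, t)
    x[idx] = orig                 # restore the original value
    dx_num[idx] = (loss_plus - loss_minus) / (2 * h)

print(np.max(np.abs(dx - dx_num)))  # should be very small
```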

1.3.6 Weight update

Update the neural network's parameters using the gradients obtained by backpropagation.

To train the neural network, follow the procedure below.

  1. Mini-batch: processing all of the data takes time when there is a lot of it, so a subset of the data is used as an approximation of the whole (from Deep Learning from scratch ❶)
  2. Gradient calculation: find the gradient of the loss function with respect to each weight parameter using backpropagation
  3. Parameter update
  4. Repeat steps 1 to 3

3. Parameter update

Using the gradient obtained in `2. Gradient calculation`, update the parameters in the direction opposite to the gradient (the direction that reduces the loss). → **Gradient descent method**

Here we use **SGD**, the simplest weight update method (several other methods are described in Deep Learning from scratch ❶).

W \leftarrow W - \eta \frac{\partial L}{\partial W} \\
\eta : learning rate (learning coefficient)

When implemented in Python

class SGD:
  def __init__(self, lr=0.01):
    self.lr = lr

  def update(self, params, grads):
    for i in range(len(params)):
      params[i] -= self.lr * grads[i]
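As a usage sketch, the optimizer can be tried on a toy problem before plugging it into a network. The loss L(W) = sum(W²), whose gradient is 2W, and the starting values are made up for illustration. Repeated updates should drive W toward zero.

```python
import numpy as np

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]   # in-place update of each array

W = np.array([5.0, -3.0])
params = [W]
optimizer = SGD(lr=0.1)

for step in range(100):
    grads = [2 * params[0]]        # gradient of L(W) = sum(W ** 2)
    optimizer.update(params, grads)

print(W)  # close to [0, 0]
```

Because `params[i] -= ...` modifies the NumPy array in place, the array `W` created outside the optimizer is the one that gets updated.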

The actual neural network parameter update looks like this:

model = TwoLayerNet( ... )
optimizer = SGD()

for i in range(10000):
  x_batch, t_batch = get_mini_batch( ... )  # Get a mini-batch
  loss = model.forward(x_batch, t_batch)
  model.backward()  # Compute the gradients before updating
  optimizer.update(model.params, model.grads)

The neural network is actually trained in section 1.4.

The end


- O'Reilly Japan, Deep Learning from scratch ❷
- [oreilly-japan/deep-learning-from-scratch-2: "Deep Learning from scratch ❷" (O'Reilly Japan, 2018)](https://github.com/oreilly-japan/deep-learning-from-scratch-2)

