[PYTHON] Introduction to Deep Learning ~ Learning Rules ~

Intended audience

The previous article is here. In this article we look at the implementation of learning rules. That said, the main body has already been implemented here, so please take a look.

The next article is here.

Table of contents

- [Learning rules](#learning-rules)
- [Learning rule theory](#learning-rule-theory)
- [Learning rule implementation](#learning-rule-implementation)
- [Implementation of the `__init__` method](#implementation-of-the-__init__-method)
- [Conclusion](#conclusion)

Learning rules

First of all, as usual, let us think in terms of scalars. A neuron object holds the variables $w$ and $b$:

y = \sigma(xw + b)

Here, treating the input $x$ as a constant, this can be written as

y = f(w, b) = \sigma(xw + b)

In other words, the goal of a learning rule is to change the values of $w$ and $b$ appropriately so that the output approaches the target value $y^{\star}$.
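As a minimal sketch (the helper `f`, the lambda `sigma`, and the sample values are my own, not from the article), this scalar neuron can be evaluated directly:

```python
import numpy as np

# Minimal sketch of the scalar neuron y = sigma(x*w + b).
# The sample values for w and b are arbitrary, for illustration only.
sigma = lambda u: 1/(1 + np.exp(-u))

def f(w, b, x=1.0):
    return sigma(x*w + b)

print(f(w=0.5, b=-0.2))  # the output y that learning should push toward y_star
```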

Learning rule theory

Let's look at it theoretically.

y = f(w, b) = \sigma(wx + b)

In

\begin{align}
  y &= y^{\star} = 0.5 \\
  x &= x_0 = 1
\end{align}

then, taking the sigmoid function as the activation function and plotting the loss over the parameter space of $w$ and $b$, we get the following. (Figure: Loss_space.png, the loss surface produced by the code below.)


show_loss_space.py


%matplotlib nbagg
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


x_0 = 1
y_star = 0.5
sigma = lambda x: 1/(1+np.exp(-x))

# Sample the (w, b) parameter space and compute the squared-error loss.
w = np.arange(-2, 2, 1e-2)
b = np.arange(-2, 2, 1e-2)
W, B = np.meshgrid(w, b)
y = 0.5*(sigma(x_0*W + B) - y_star)**2

# Contour levels used to draw the loss surface's contour lines.
elevation = np.arange(np.min(y), np.max(y), 1/2**8)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_zlabel("loss")
ax.view_init(60)
ax.grid()
ax.plot_surface(W, B, y, cmap="autumn", alpha=0.8)
ax.contour(W, B, y, cmap="autumn", levels=elevation, alpha=0.8)
fig.suptitle("Loss space")
fig.show()
fig.savefig("Loss_space.png")

The figure shows the loss space when the squared error is used as the loss function. The purpose of a learning rule is to move the random initial values $w_0, b_0$ gradually toward $w^{\star}, b^{\star}$, which yield the optimal value $y^{\star}$. The learning rules used for this are **gradient descent methods**, which grew out of the **steepest descent method**. Gradient descent is a method of going downhill using the **gradient (partial derivative)** of the loss at each parameter's current point. Here has formulas and code for each method, and here shows the descent through several search spaces. (Animation: optimizers descending various search spaces.)

This article deals with SGD, the simplest of these methods. The SGD formulas are as follows.

\begin{align}
  g_t &= \nabla_{w_t}\mathcal{L}(w_t, b_t) \\
  \Delta w_t &= - \eta g_t \\
  w_{t+1} &= w_t + \Delta w_t
\end{align}

These formulas are written only for $w$, but you can see that replacing $w$ with $b$ gives the same learning rule for the bias. Think of $\mathcal{L}(w_t, b_t)$ as the loss function and $\nabla_{w_t}$ as the partial derivative with respect to $w_t$ (the formulas above are written so that they also cover the matrix case). Expressed in words, the formulas say the following:

1. Find the gradient by partial differentiation
2. Calculate the amount of movement
3. Move

That's all there is to it. It's simple. Let's take a closer look at each step.

The first "finding the gradient with partial differential" uses the error backpropagation method introduced in Backpropagation. That's fine. "Move" is also literal.

Regarding the "calculate the amount of movement" step, I would like to discuss two points:

1. Why we attach a minus sign
2. Why we multiply by $\eta \ll 1$ instead of using the gradient as-is

First, regarding point 1, this is easy to understand if you think about a concrete case. (Figure: y=x^2.png, the parabola $y = x^2$.) For example, the slope of $y = x^2$ at the point $(x, y) = (1, 1)$ is $2$, but the direction we want to move is the minus direction. The reverse also holds, of course. The direction we want to move and the gradient always have opposite signs, hence the minus. As for point 2, as the graph shows, if we used the slope $2$ as-is and set $\Delta x = -2$, we would land at $x = -1$ and overshoot the optimum. (Figure: y=x^2_move.png, a full-gradient step overshooting the minimum.) Therefore we multiply by a coefficient $\eta \ll 1$ called the learning rate, limiting the amount of movement so that we descend gradually toward the optimum. The learning rate is a so-called **hyperparameter**, and with many learning rules this part must be designed by humans. In most cases the default values given in the papers work well, but depending on the problem you want to solve, you may have to experiment.
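To see both points concretely, here is a tiny sketch (my own example, not from the article) of gradient descent on $y = x^2$, whose gradient at $x$ is $2x$: with $\eta = 1$ the iterate jumps from $1$ to $-1$ and back forever, while with $\eta = 0.1$ it settles toward the optimum at $x = 0$.

```python
def descend(eta, x=1.0, steps=5):
    """Run a few gradient-descent steps on y = x^2 starting from x."""
    for _ in range(steps):
        grad = 2*x        # dy/dx of y = x^2
        x += -eta*grad    # SGD update: delta_x = -eta * grad
    return x

print(descend(eta=1.0))  # -1.0: the iterate oscillates between 1 and -1
print(descend(eta=0.1))  # 0.32768: shrinking steadily toward the optimum x = 0
```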

Learning rule implementation

Well, leaving the details aside, let's implement it for now. As usual, the implementation goes into [baselayer.py](https://qiita.com/kuroitu/items/884c62c48c2daa3def08#%E3%83%AC%E3%82%A4%E3%83%A4%E3%83%BC%E3%83%A2%E3%82%B8%E3%83%A5%E3%83%BC%E3%83%AB%E3%81%AE%E3%82%B3%E3%83%BC%E3%83%89%E6%BA%96%E5%82%99).

baselayer.py


    def update(self, **kwds):
        """
        Update the parameters (the learning step).
        """
        # Ask the optimizer for the movement amounts, then apply them.
        dw, db = self.opt.update(self.grad_w, self.grad_b, **kwds)
        
        self.w += dw
        self.b += db

The `self.opt.update(self.grad_w, self.grad_b, **kwds)` part is delegated to here. As an example, here is the SGD code.

optimizers.py


import numpy as np


class Optimizer():
    """
A superclass inherited by the optimization method.
    """
    def __init__(self, *args, **kwds):
        pass


    def update(self, *args, **kwds):
        pass


class SGD(Optimizer):
    def __init__(self, eta=1e-2, *args, **kwds):
        super().__init__(*args, **kwds)

        self.eta = eta


    def update(self, grad_w, grad_b, *args, **kwds):
        # Movement amount: minus the learning rate times each gradient.
        dw = -self.eta*grad_w
        db = -self.eta*grad_b
        return dw, db

The content of the code follows the formulas introduced above exactly. It receives the gradients of $w$ and $b$ from outside, multiplies them by $-\eta$ according to the learning rule to determine the movement amounts, and returns them. The layer object receives these movement amounts and updates its parameters.
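To make the flow concrete, here is a hypothetical end-to-end sketch (the `TinyLayer` stub and its numbers are mine, not the article's code) of a layer delegating to the `SGD` optimizer above:

```python
# Stand-in for a layer: backpropagation is assumed to have already
# filled grad_w and grad_b; update() delegates to the optimizer.
class TinyLayer:
    def __init__(self):
        self.w, self.b = 0.8, -0.3
        self.grad_w, self.grad_b = 0.5, -0.2  # pretend backprop set these
        self.opt = SGD(eta=1e-2)

    def update(self, **kwds):
        dw, db = self.opt.update(self.grad_w, self.grad_b, **kwds)
        self.w += dw
        self.b += db

layer = TinyLayer()
layer.update()
print(layer.w, layer.b)  # roughly 0.795 -0.298
```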

Well, that's all for the implementation. You might be wondering, "Wait, what about the matrix case?" In fact, the code is exactly the same for matrices. Even if you look at [optimizers.py](https://qiita.com/kuroitu/items/36a58b37690d570dc618#%E5%AE%9F%E8%A3%85%E3%82%B3%E3%83%BC%E3%83%89%E4%BE%8B), you will not find a single matrix product. The reason is natural once you think about it: even when training with a mini-batch, the gradient that flows to each parameter is unique to that parameter, and there is no need to involve the gradients of other parameters in the calculation. So if you work things out with scalars and implement accordingly, the same code handles the matrix case.
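As a quick check (my own snippet, assuming the `SGD` class above), the same `update` handles matrix gradients unchanged, because the operation is purely elementwise:

```python
import numpy as np

opt = SGD(eta=1e-2)
grad_w = np.random.randn(3, 2)   # gradient for a (3, 2) weight matrix
grad_b = np.random.randn(2)      # gradient for a length-2 bias

dw, db = opt.update(grad_w, grad_b)
print(dw.shape, db.shape)             # (3, 2) (2,)
print(np.allclose(dw, -1e-2*grad_w))  # True: each entry moves independently
```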

Implementation of the __init__ method

Finally, let's give the layer object its optimizer `opt` in the `__init__` method.

__init__.py


    def __init__(self, *, prev=1, n=1, 
                 name="", wb_width=1,
                 act="ReLU", err_func="square", opt="Adam",
                 act_dict={}, opt_dict={}, **kwds):
        self.prev = prev  # Number of outputs of the previous layer = number of inputs to this layer
        self.n = n        # Number of outputs of this layer = number of inputs to the next layer
        self.name = name  # The name of this layer
        
        # Set the weights and bias.
        self.w = wb_width*np.random.randn(prev, n)
        self.b = wb_width*np.random.randn(n)
        
        # Get the activation function (class).
        self.act = get_act(act, **act_dict)
        
        # Get the loss function (class).
        self.errfunc = get_errfunc(err_func)
        
        # Get the optimizer (class).
        self.opt = get_opt(opt, **opt_dict)
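`get_opt` (like `get_act` and `get_errfunc`) lives elsewhere in the series; as a rough sketch of the idea (this exact implementation is my assumption, not the article's code), it just maps a name string to an optimizer class and instantiates it:

```python
# Hypothetical sketch of get_opt: a simple name -> class lookup.
# The real version in the series may differ; "sgd" maps to the SGD
# class above, and other entries (e.g. "adam") would be added as the
# corresponding optimizers are implemented.
_opt_table = {"sgd": SGD}

def get_opt(name, *args, **kwds):
    try:
        return _opt_table[name.lower()](*args, **kwds)
    except KeyError:
        raise KeyError(f"unknown optimizer: {name}")
```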

Conclusion

Next time, I will introduce the activation function, the localization of the optimizer, and the loss function.

Deep learning series

- Introduction to Deep Learning ~ Basics ~
- Introduction to Deep Learning ~ Coding Preparation ~
- Introduction to Deep Learning ~ Forward Propagation ~
- Introduction to Deep Learning ~ Backpropagation ~
- Introduction to Deep Learning ~ Learning Rules ~
- Introduction to Deep Learning ~ Localization and Loss Functions ~
- List of activation functions (2020)
- Gradient descent method list (2020)
- See and understand! Comparison of optimization methods (2020)
- Thorough understanding of im2col
- Complete understanding of numpy.pad function
