Learning is a core capability of neural networks (deep learning). In this article I try to understand, from scratch, the calculations a model performs in order to improve its predictive accuracy, and I implement them without using a machine learning library.
In the previous article, I summarized what learning means in a neural network, the loss function needed to improve a model's accuracy, and the concept of differentiation. https://qiita.com/Fumio-eisan/items/c4b5b7da5b5976d09504 This time, I would like to cover the second half: the actual implementation of the neural network.
This time as well, I referred to O'Reilly's deep learning textbook. It's very easy to understand. https://www.oreilly.co.jp/books/9784873117584/
The outline is as follows.
In the previous article, we confirmed that the loss function must be minimized in order to optimize the model, and that differentiating a function gives us the means to minimize it. Now, let's think about optimizing the model's parameters by actually using the derivative of this function.
By differentiating a function, you can find the direction in which the value of that function decreases. The gradient method moves a certain distance from the current location in the direction of the gradient, computes the gradient again at the new location, and moves in that direction, repeating the process. **Moving toward the minimum value is called the gradient descent method, and moving toward the maximum value is called the gradient ascent method.**
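Written out for a two-variable function $f(x_0, x_1)$, one step of the gradient method updates each variable as follows:

$$
x_0 \leftarrow x_0 - \eta\,\frac{\partial f}{\partial x_0},\qquad
x_1 \leftarrow x_1 - \eta\,\frac{\partial f}{\partial x_1}
$$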
The above is the mathematical expression of the gradient method. Here η represents the size of each update and is called the **learning rate**: it determines how much the parameters are updated in one learning step. If the learning rate is too small, it takes a long time to approach the minimum; conversely, if it is too large, the update may overshoot the minimum. Therefore, you need to find an appropriate value for each model. Let's actually implement it. The function we use is the same one from the first half of the article.
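Concretely, it is the following two-variable function (implemented as function_2 in the code below):

$$
f(x_0, x_1) = x_0^2 + x_1^2
$$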
I would like to find the minimum value of this function.
nn.ipynb
import numpy as np
from gradient import numerical_gradient  # numerical differentiation helper (gradient.py in the repository)

def gradient_descent(f, init_x, lr=0.01, step_num=100):
    """Minimize f by repeatedly stepping against the gradient."""
    x = init_x
    for i in range(step_num):
        grad = numerical_gradient(f, x)  # gradient of f at the current point
        x -= lr * grad                   # move in the negative gradient direction
    return x

def function_2(x):
    return x[0]**2 + x[1]**2
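For reference, numerical_gradient is the numerical differentiation helper from the previous article (gradient.py in the repository). A minimal sketch of such a central-difference implementation, for a 1-D parameter array, could look like this:

import numpy as np

def numerical_gradient(f, x):
    """Central-difference approximation of the gradient of f at x (sketch)."""
    h = 1e-4
    grad = np.zeros_like(x)
    for idx in range(x.size):
        tmp = x[idx]
        x[idx] = tmp + h
        fxh1 = f(x)          # f evaluated with x[idx] + h
        x[idx] = tmp - h
        fxh2 = f(x)          # f evaluated with x[idx] - h
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp         # restore the original value
    return grad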
Now, let the initial value be (x0, x1) = (-3, 4) and use the gradient method to search for the minimum. The true minimum is attained at (0, 0).
nn.ipynb
init_x = np.array([-3.0, 4.0])
gradient_descent(function_2, init_x=init_x, lr=0.1, step_num=100)
array([-6.11110793e-10, 8.14814391e-10])
With the learning rate lr set to 0.1, the result above is obtained, and the value is almost (0, 0). In this case, the learning can be said to have succeeded.
nn.ipynb
init_x = np.array([-3.0, 4.0])
gradient_descent(function_2, init_x=init_x, lr=10, step_num=100)
array([-2.58983747e+13, -1.29524862e+12])
Next is the case where the learning rate is set to 10. The value has diverged, so you can see that learning did not go well. This shows that an appropriate learning rate must be set for each model.
Now let's apply this gradient method to a neural network. In a neural network, the gradient we need is the gradient of the loss function with respect to the weights: letting L be the loss function, we take the partial derivative of L with respect to each weight w.
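As an illustration, taking a 2 × 3 weight matrix W as an example, the gradient ∂L/∂W has the same shape as W, and each element is the partial derivative of L with respect to the corresponding weight:

$$
W = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix},\qquad
\frac{\partial L}{\partial W} = \begin{pmatrix} \dfrac{\partial L}{\partial w_{11}} & \dfrac{\partial L}{\partial w_{12}} & \dfrac{\partial L}{\partial w_{13}} \\ \dfrac{\partial L}{\partial w_{21}} & \dfrac{\partial L}{\partial w_{22}} & \dfrac{\partial L}{\partial w_{23}} \end{pmatrix}
$$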
I would like to implement a neural network that performs the following procedure.
1. Mini-batch: randomly extract some data from the training data (a mini-batch). The goal is to minimize the loss function on that mini-batch.
2. Gradient calculation: find the gradient of each weight parameter in order to reduce the loss function of the mini-batch.
3. Parameter update: update the weight parameters by a small amount in the negative gradient direction.
4. Repeat: repeat steps 1 to 3 as many times as needed (a code sketch of this loop is shown below).
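The sketch below shows how these four steps fit together. It assumes a network object that exposes numerical_gradient and loss methods (as in two_layer_net.py described later), that the MNIST data x_train / t_train has already been loaded, and that the iteration count and batch size are placeholder values for illustration:

import numpy as np

iters_num = 10000       # step 4: how many times to repeat (assumed value)
batch_size = 100        # step 1: size of each mini-batch (assumed value)
learning_rate = 0.1

train_loss_list = []
train_size = x_train.shape[0]   # x_train / t_train: MNIST images and labels, loaded beforehand

for i in range(iters_num):
    # 1. Mini-batch: pick batch_size samples at random
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # 2. Gradient calculation for the loss on this mini-batch
    grad = network.numerical_gradient(x_batch, t_batch)

    # 3. Parameter update: small step in the negative gradient direction
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # Record the loss so the learning curve can be plotted later
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)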
Now, I would like to implement a two-layer neural network that actually has a learning function. The component diagram of the model implemented this time is as follows.
The hyperparameters are mainly set in nn.ipynb. Since we use the MNIST dataset this time, it is downloaded from Yann LeCun's original URL. The calculations performed by the actual neural network are described in two_layer_net.py. The network is as shown in the figure below.
Since MNIST images are 28 × 28 pixels, the input layer has 28 × 28 = 784 dimensions. This time the hidden layer is set to 100 units, and the output layer to 10 dimensions so that it outputs the 10 digit classes. For this calculation, the sigmoid activation function and the softmax function that produces the final probabilities are read from yet another file, functions.py. The output obtained there and the correct label index of the teacher data are fed into the loss function. The numerical_gradient method described in gradient.py is then used to compute the gradient of each weight parameter. The update of the parameters with the obtained gradient is written in the original nn.ipynb, and this series of operations is repeated the specified number of times.
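To give a rough idea of what two_layer_net.py contains, here is a minimal sketch of such a 784-100-10 network. The class and method names mirror the description above but are assumptions, not the repository's exact code; sigmoid, softmax, cross_entropy_error, and numerical_gradient are assumed to come from functions.py and gradient.py:

import numpy as np
from functions import sigmoid, softmax, cross_entropy_error  # assumed helpers in functions.py
from gradient import numerical_gradient                      # numerical differentiation in gradient.py

class TwoLayerNet:
    """Sketch of a 784-100-10 network trained with numerically computed gradients."""

    def __init__(self, input_size=784, hidden_size=100, output_size=10, weight_init_std=0.01):
        self.params = {
            'W1': weight_init_std * np.random.randn(input_size, hidden_size),
            'b1': np.zeros(hidden_size),
            'W2': weight_init_std * np.random.randn(hidden_size, output_size),
            'b2': np.zeros(output_size),
        }

    def predict(self, x):
        a1 = np.dot(x, self.params['W1']) + self.params['b1']
        z1 = sigmoid(a1)                      # hidden layer activation
        a2 = np.dot(z1, self.params['W2']) + self.params['b2']
        return softmax(a2)                    # class probabilities for the 10 digits

    def loss(self, x, t):
        y = self.predict(x)
        return cross_entropy_error(y, t)      # compare prediction with the teacher labels

    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)
        return {key: numerical_gradient(loss_W, self.params[key])
                for key in ('W1', 'b1', 'W2', 'b2')}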
Even just to compute and train a two-layer neural network, this many methods have to be read and evaluated; clearly this is not an amount of calculation a human could do by hand. You also find that you need to understand the structure of the program, including its classes and methods.
Now let's take a look at the part where the weight parameters and biases are updated.
nn.ipynb
# Update the weight parameters and biases
for key in ('W1', 'b1', 'W2', 'b2'):
    network.params[key] -= learning_rate * grad[key]  # the point is that this sign is negative

loss = network.loss(x_batch, t_batch)
train_loss_list.append(loss)
The point is that for each of the weights and biases W1, W2, b1, and b2, the gradient (grad[key]) multiplied by the learning rate is **subtracted**. When the gradient obtained by differentiation is positive, moving in the negative direction moves the parameter toward the minimum. What happens if you reverse this sign and make it positive?
In the plot, the horizontal axis is the number of iterations and the vertical axis is the value of the loss function. You can see that the value rises immediately. If you write the sign as a minus instead, it looks like this.
You can see that the value decreases properly. We have now built a two-layer neural network without using an existing machine learning library.
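For reference, loss curves like the ones above can be plotted from train_loss_list with matplotlib, roughly as follows (a sketch; the original notebook may do this differently):

import matplotlib.pyplot as plt

# Plot the recorded losses against the iteration number
plt.plot(range(len(train_loss_list)), train_loss_list)
plt.xlabel('iteration')
plt.ylabel('loss')
plt.show()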
This time, we modeled the calculation and learning of a neural network without using a machine learning library. In training the model, I came to understand that the key ideas are the loss function and the gradient (the derivative of the function). I also had to combine classes and modules properly to perform the calculations, which was good practice in Python itself.
The full program is stored here. https://github.com/Fumio-eisan/nn2layer_20200321