I first encountered deep learning in my third year of undergraduate study, and there were few lectures on it in my undergraduate courses, so most of what I learned was self-taught. In this post I have written up the steps I took to understand and implement the internals of deep learning. This is also my first blog post and something of a practice article, so it may be hard to read in places. Please bear with me.
Before implementing anything, let me explain what deep learning actually does. In a nutshell, deep learning is function optimization. If you optimize a function that takes an image as input and outputs the probability that the image is a cat, you get a classifier that separates cat images from everything else; if you optimize a function that takes $x$ as input and outputs $\sin(x)$, you get a regression model. To see for ourselves that deep learning is function optimization, we will start by implementing regression.
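Before touching any library, the idea of "optimizing a function" can be shown in a few lines of plain Python. This is my own toy illustration, not the article's code: we fit a single parameter $w$ so that $f(x) = wx$ matches a target, by repeatedly stepping down the gradient of the squared error.

```python
# Toy example of function optimization: fit w so that f(x) = w * x
# reproduces y = 3 * x, using gradient descent on the squared error.
x, y = 2.0, 6.0   # one training pair; the true w is 3
w = 0.0           # initial guess
lr = 0.1          # learning rate

for _ in range(100):
    pred = w * x
    grad = 2 * (pred - y) * x  # d/dw of (w * x - y)^2
    w -= lr * grad             # step downhill along the gradient

print(round(w, 3))  # prints 3.0
```

Deep learning does exactly this, only with millions of parameters instead of one.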
If you have a computer and an internet connection, you have everything you need for deep learning. Google Colaboratory, provided by Google, comes with all the necessary libraries preinstalled, so we will use it (what a convenient world we live in). You can create a Google Colaboratory notebook from Google Drive by pressing the New button and selecting it under More. When it opens, a Jupyter notebook is launched and an interactive environment is ready immediately.
I think the discussion is easier to follow once you can see what we are implementing actually working, so let's build the regression model first. Start by entering the following code.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0,20,0.1)
y = np.sin(x)
plt.plot(x,y)
Running this should display the sine curve as a graph. matplotlib is a library for drawing plots, and we will use it frequently from here on. At this point the training data has been generated: real numbers from 0 to 20 in increments of 0.1, together with their sines.
Next, define the model.
from keras import layers
from keras import models
from keras import optimizers
model = models.Sequential()
model.add(layers.Dense(256,activation = "relu",input_shape=(1,)))
model.add(layers.Dense(256,activation = "relu"))
model.add(layers.Dense(256,activation = "relu"))
model.add(layers.Dense(256,activation = "relu"))
model.add(layers.Dense(256,activation = "relu"))
model.add(layers.Dense(256,activation = "relu"))
model.add(layers.Dense(256,activation = "relu"))
model.add(layers.Dense(1))
model.compile(loss = "mse",optimizer="adam")
Here we imported what we need from the Keras library. The model describes how the function is built this time; there are several ways to define one, but Sequential() is the easiest to understand, so we use it. You can see that we add many layers to the model. This code uses layers.Dense repeatedly, so let me explain what it is.
As the subheading suggests, layers.Dense is a fully connected layer: it receives a vector and returns a vector. Many of the calls above take 256 as an argument; this specifies how many dimensions the output should have. If the input is an $n$-dimensional vector $x$ and the output is an $m$-dimensional vector $y$, the fully connected layer is expressed by the following equation.
$y = Ax + b$
Here $A$ is an $m \times n$ matrix and $b$ is an $m$-dimensional vector. Where do this matrix and vector come from? They are held internally as variables and optimized later. In other words, a fully connected layer can reproduce an arbitrary linear transformation (plus a shift).
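A fully connected layer can be sketched in plain NumPy as follows. The sizes and random initialization here are my own illustration, not what Keras does internally in detail:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 3, 2                       # input and output dimensions
A = rng.standard_normal((m, n))   # weight matrix, m x n
b = rng.standard_normal(m)        # bias vector, m-dimensional

def dense(x):
    # y = Ax + b: a linear transformation plus a shift
    return A @ x + b

x = np.array([1.0, 2.0, 3.0])
y = dense(x)
print(y.shape)  # prints (2,)
```

In the Keras model, `layers.Dense(256)` is this same operation with an output dimension of 256, and `A` and `b` are the variables that training will adjust.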
So the model above tries to obtain $y$ by applying linear transformations many times. But a composition of linear transformations can be reproduced by a single linear transformation, so simply stacking fully connected layers would be pointless. This is where the activation function comes in: a non-linear map applied to every element of the vector. By inserting it between the fully connected layers, stacking layers actually increases the expressive power of the model. This code uses ReLU everywhere, which is $\max(0, x)$ and is certainly non-linear. ReLU is used very often in deep learning.
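The claim that stacked linear maps collapse into a single linear map can be checked directly in NumPy (dimensions here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
A1 = rng.standard_normal((4, 3))  # first linear map: 3 -> 4
A2 = rng.standard_normal((2, 4))  # second linear map: 4 -> 2
x = rng.standard_normal(3)

# Two stacked linear maps equal one linear map: A2(A1 x) == (A2 A1) x
stacked = A2 @ (A1 @ x)
collapsed = (A2 @ A1) @ x
print(np.allclose(stacked, collapsed))  # prints True

# Inserting ReLU = max(0, x) between them breaks this collapse,
# which is exactly why activations make depth meaningful.
relu = lambda v: np.maximum(0, v)
nonlinear = A2 @ relu(A1 @ x)
```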
The fully connected layers hold a large number of internal variables ($A$ and $b$) in order to represent arbitrary linear transformations, and how to optimize them is specified in model.compile. The loss is a quantity that gets smaller as the model gets better; this time we use mse (mean squared error), that is, the mean of the squared differences between the predicted and correct values. Parameters cannot be optimized just by computing the loss of the current model: we have to work out in which direction, and by how much, to move each parameter to reduce the loss. Plain gradient descent on the parameters is enough in principle, but for stability and speed, Adam, which keeps running averages of the gradient and of its square rather than using the raw gradient alone, is said to work well for deep learning.
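As a quick illustration of the mse loss, here it is computed by hand in NumPy (the numbers are made up for the example):

```python
import numpy as np

# mse: mean of squared differences between predictions and targets
y_true = np.array([0.0, 1.0, 2.0])
y_pred = np.array([0.5, 1.0, 1.0])

mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # (0.25 + 0.0 + 1.0) / 3 ≈ 0.4167
```

This is exactly what `loss="mse"` asks Keras to minimize during training.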
Let's actually train.
hist = model.fit(x, y, batch_size=20, epochs=10)
In practice the training data is often huge, so it is usually impossible to descend the gradient using all of it in a single update (a single parameter adjustment); instead, the data is split into smaller batches. Here the 200 training samples are split into batches of 20, so one epoch (one complete pass over the training set) consists of 10 gradient updates, and we train for 10 epochs, that is, 10 full passes over the data.
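To make the batch and epoch bookkeeping concrete, here is a small sketch, assuming the 200 training samples generated earlier and 10 batches per epoch as described above:

```python
import numpy as np

x = np.arange(0, 20, 0.1)                  # 200 training samples, as above
n_samples = len(x)
batch_size = 20
steps_per_epoch = n_samples // batch_size  # 10 gradient updates per epoch
epochs = 10
total_updates = steps_per_epoch * epochs   # 100 parameter updates overall

print(n_samples, steps_per_epoch, total_updates)  # prints 200 10 100
```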
Let's predict.
test_x = x + 0.05
acc_y = np.sin(test_x)
pre_y = model.predict(test_x)
plt.plot(test_x,acc_y)
plt.plot(test_x,pre_y)
plt.show()
This shows how closely the model behaves like $\sin(x)$ on data shifted from the training data by 0.05 in the x direction. Running it as-is gives a result like the following.
The values are far off for $x > 10$.
By improving the model itself and training it for longer, the fit can be improved as shown below, so do give it a try.
If your model takes a long time to train, or you need long training runs, go to Edit -> Notebook settings in Google Colaboratory and change the hardware accelerator from None to GPU. With the power of the GPU, training should finish much sooner.
This time we expressed the sine curve using only linear transformations and ReLU. Deep learning can approximate not just sine curves but arbitrary functions, from image recognition to image generation, so the possibilities are endless. Next time I would like to explain convolutional layers by implementing image recognition, the classic MNIST handwritten digit recognition, in as few lines as possible.