[Python] I implemented deep learning optimization methods (SGD, Momentum, AdaGrad) myself to find the minimum of a quadratic function, and learned about class and __init__ along the way.

Introduction

New optimization algorithms for neural networks are being created all the time. In the previous article, I summarized how these methods evolved from SGD to Adam.

https://qiita.com/Fumio-eisan/items/798351e4915e4ba396c2

In this article, I implement those optimization techniques and see how quickly they converge. This time, I apply the methods to a simple quadratic function. Textbooks usually take a set of handwritten digit images such as MNIST as the example and confirm that the value of the loss function decreases. However, since the loss function of a neural network is very complicated, it is hard to tell whether the minimum value has really been reached. Therefore, I take a quadratic function as the example instead and watch how the iterates head toward its minimum. I also defined and loaded the classes myself, and I will note what I understood from doing that.

This is an overview.

The target function and the goal of the optimization methods

Here are the quadratic function targeted this time and what the optimization methods are meant to do.

(Figure image.png: the target function $y = x^2$ and the goal of the optimization)

I decided to find the minimum of a very easy-to-understand function, $y = x^2$. Its minimum value, $y = 0$, is taken at $x = 0$. Using this function, which is easy to grasp both intuitively and visually, I want to confirm that $x$ is updated step by step so that $y$ approaches 0.

Implementation details

The optimizer classes are defined in optimizers.py, and sample.ipynb imports them, runs the calculations, and plots the results. If you want to run the code yourself, please put the two files in the same folder.
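The import cell of sample.ipynb is not shown in the snippets below; a minimal setup, assuming the file layout above, would look like this:

# sample.ipynb - assumed setup cell (not shown in the original snippets)
import numpy as np
import matplotlib.pyplot as plt

import optimizers  # optimizers.py placed in the same folder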

Implement SGD (Stochastic Gradient Descent)

First is the basic SGD. The formula is shown below.

\mathbf{x}_{t+1} \gets \mathbf{x}_{t} - \eta g_t\\
g_t = 2\mathbf{x}_{t}

$\eta$ is the learning rate and $g_t$ is the gradient of the function. Since the target is $y = x^2$, the gradient is simply $g_t = 2\mathbf{x}_{t}$, which is convenient.

Next, we will implement this technique.

optimizers.py


class SGD:
    """Stochastic gradient descent for y = x**2, whose gradient is 2*x."""

    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate (eta)

    def update(self, x):
        # x_{t+1} = x_t - lr * gradient
        x -= self.lr * 2 * x
        return x

The formula itself is simple, so I think the program is easy to follow: update computes $x$ for the next step and returns it. $\eta$ is written as lr in the program.
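As a quick sanity check of a single step (my own arithmetic, using the same starting point $x = 100$ as the runs below): with lr = 0.1, one update gives $100 - 0.1 \cdot 2 \cdot 100 = 80$.

opt = optimizers.SGD()
opt.lr = 0.1
print(opt.update(100))  # 100 - 0.1 * 2 * 100 = 80.0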

Now, here is the program that repeatedly updates $x$ to actually find the minimum value.

sample.ipynb


max_iterations = 100

# histories of the updated value of x for each learning rate
y7 = []
y8 = []
y9 = []

# histories of the learning rate used in each run
v7 = []
v8 = []
v9 = []
optimizer = optimizers.SGD()

# run 1: lr = 0.1
x = 100
optimizer.lr = 0.1
for i in range(max_iterations):
    x = optimizer.update(x)
    y7.append(x)
    v7.append(optimizer.lr)

# run 2: lr = 0.01
x = 100
optimizer.lr = 0.01
for i in range(max_iterations):
    x = optimizer.update(x)
    y8.append(x)
    v8.append(optimizer.lr)

# run 3: lr = 0.9
x = 100
optimizer.lr = 0.9
for i in range(max_iterations):
    x = optimizer.update(x)
    y9.append(x)
    v9.append(optimizer.lr)

Let's see how the convergence changes when we set $\eta$ to 0.1, 0.01, and 0.9.

sample.ipynb



x = np.arange(max_iterations)
plt.plot(x, y7, label='lr=0.1')
plt.plot(x, y8, label='lr=0.01')
plt.plot(x, y9, label='lr=0.9')
plt.xlabel("iterations")
plt.ylabel("y")
plt.ylim(-5, 120)
plt.legend()
plt.show()

(Figure 005.png: the updated value at each iteration for lr = 0.1, 0.01, and 0.9)

You can see that the closer the plotted value on the vertical axis gets to 0, the closer that run is to convergence.

When $\eta$ is 0.01, convergence takes a long time. Conversely, when $\eta$ is as large as 0.9, the iterate oscillates around the minimum on its way to convergence. For this function, the update no longer converges once $\eta$ reaches 1 or more. So $\eta$ around 0.1 is a good value: it converges quickly while staying well away from divergence.
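Because the gradient is exactly $2x_t$, this behaviour can also be checked in closed form (my own derivation, not in the original article):

x_{t+1} = x_t - \eta \cdot 2x_t = (1 - 2\eta)x_t
\quad\Rightarrow\quad
x_t = (1 - 2\eta)^t x_0

The iterate shrinks toward 0 only when $|1 - 2\eta| < 1$, i.e. $0 < \eta < 1$, and it alternates in sign when $\eta > 0.5$, which matches the slow run at 0.01, the oscillating run at 0.9, and the loss of convergence at $\eta = 1$.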

Implement Momentum

Next, we implement Momentum. Compared with the plain SGD above, this algorithm suppresses oscillation by taking the previous movement of $x$ into account: the step $h_t$ mixes the new gradient term with the previous step, weighted by the momentum coefficient $\alpha$.

\mathbf{x}_{t+1} \gets \mathbf{x}_{t} + h_t\\
h_t = \alpha h_{t-1} - \eta g_t

Define the class.

optimizers.py



class Momentum:
    """SGD with momentum for y = x**2, whose gradient is 2*x."""

    def __init__(self, lr=0.01, alpha=0.8):
        self.lr = lr        # learning rate (eta)
        self.alpha = alpha  # momentum coefficient
        self.v = 0.0        # previous step h_{t-1}

    def update(self, x):
        # h_t = alpha * h_{t-1} - lr * gradient, then x <- x + h_t
        self.v = self.alpha * self.v - self.lr * 2.0 * x
        x += self.v
        return x
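
The driving loop for Momentum is not shown in the article; a sketch analogous to the SGD loops above might look like this (the list layout, the choice of lr = 0.1, and the $\alpha$ values other than the standard 0.8 are my own assumptions):

max_iterations = 100
histories = {}

# a fresh optimizer per alpha, since each instance carries its own velocity state
for alpha in [0.5, 0.8, 0.95]:
    optimizer = optimizers.Momentum(lr=0.1, alpha=alpha)
    x = 100
    ys = []
    for i in range(max_iterations):
        x = optimizer.update(x)
        ys.append(x)
    histories[alpha] = ys

for alpha, ys in histories.items():
    plt.plot(np.arange(max_iterations), ys, label='alpha={}'.format(alpha))
plt.xlabel("iterations")
plt.ylabel("y")
plt.legend()
plt.show()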

Now let's run it. This time, we evaluate the effect of changing the hyperparameter $\alpha$.

(Figure 006.png: convergence of Momentum for several values of $\alpha$)

A standard value for $\alpha$ is said to be 0.8, and this result also suggests that 0.8 is about right. With a larger value, the iterate oscillates strongly before settling.

Implement AdaGrad

The last method this time is AdaGrad. So far, the learning rate $\eta$ has been a constant. AdaGrad's characteristic feature is that the learning rate itself gradually decreases as the number of updates grows.

h_{0} = \epsilon\\
h_{t} = h_{t-1} + g_t^{2}\\
\eta_{t} = \frac{\eta_{0}}{\sqrt{h_{t}}}\\
\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta_{t}g_t

Define the class.

optimizers.py


import numpy as np  # needed at the top of optimizers.py for np.sqrt


class AdaGrad:
    """AdaGrad for y = x**2, whose gradient is 2*x."""

    def __init__(self, h0=10.0):
        self.h0 = h0   # initial learning rate (eta_0)
        self.h = 0.0   # accumulated squared gradients h_t
        self.v = 0.0   # minus the current learning rate, -eta_t

    def update(self, x):
        grad = 2.0 * x
        self.h += grad ** 2                            # h_t = h_{t-1} + g_t^2
        self.v = -self.h0 / (np.sqrt(self.h) + 1e-7)   # -eta_t = -eta_0 / sqrt(h_t)
        x += self.v * grad                             # x_{t+1} = x_t - eta_t * g_t
        return x

This time, I varied the initial learning rate $\eta_0$ (h0 in the program) and ran the calculation; a sketch of one such run is below.
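The loop itself is not shown in the article; following the same pattern as the SGD runs (with $\eta_0 = 100$, the value the result below singles out), it might look like this:

optimizer = optimizers.AdaGrad(h0=100.0)  # a fresh instance, so h starts from 0
x = 100
history = []
for i in range(max_iterations):
    x = optimizer.update(x)
    history.append(x)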

(Figure 007.png: convergence of AdaGrad for several values of $\eta_0$)

It turned out that with $\eta_0$ around 100, the iteration converges quickly and does not diverge.
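Why such a large $\eta_0$ is harmless can be seen from the very first update (my own hand calculation, assuming the same start $x = 100$ and ignoring the tiny $\epsilon$): the accumulated $\sqrt{h_t}$ immediately scales the step down.

g = 2 \cdot 100 = 200,\quad h = 200^2,\quad \eta = \frac{100}{\sqrt{200^2}} = 0.5,\quad x_{\text{new}} = 100 - 0.5 \cdot 200 = 0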

Understanding class and __init__

Since I'm a beginner in Python, I had been using __init__ in my classes with only a vague, almost magical impression of it. This time, I understood the following.
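In short (my own minimal illustration, reusing the SGD class defined above): __init__ runs once when an instance is created, stores values on self, and every later method call can then read that per-instance state.

# my own illustration: __init__ stores values on self
opt_a = optimizers.SGD(lr=0.1)   # __init__ runs here and sets self.lr = 0.1
opt_b = optimizers.SGD(lr=0.9)   # a second, independent instance with its own lr

print(opt_a.lr, opt_b.lr)        # 0.1 0.9 - each instance keeps its own state
x = opt_a.update(100)            # update() reads that state through self.lr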

In conclusion

This time, I minimized a function with three kinds of optimization methods. You can see that they use the idea of recurrence relations learned in high school mathematics. Building everything except numpy myself also helped me deepen my understanding of programming itself. As I learned from O'Reilly's deep learning book, these optimization methods are what actually get implemented in neural networks. Taken one by one, you can understand each piece, such as defining functions and variables, inserting activation functions, computing gradients by backpropagation, and choosing initial values for the weight parameters; however, when it comes to programming all of it together, it becomes very complicated.

**Once again, it struck me that programmers implement combinations of these many rules and this structural knowledge, which must be a daunting task. Having had a taste of that hardship, I am all the more grateful to be able to simply call such functionality as a class or a method.**

The full program is stored here. https://github.com/Fumio-eisan/optimizers_20200326
