New optimization algorithms for neural networks are being created every day. In the previous article, I summarized how these methods evolved from SGD to Adam.
https://qiita.com/Fumio-eisan/items/798351e4915e4ba396c2
In this article, I'd like to implement those optimization techniques and see how quickly each one converges. This time, I will apply them to a simple quadratic function. Textbooks usually take a handwritten digit image dataset such as MNIST as the example and confirm that the value of the loss function decreases. However, since the loss function of a neural network is very complicated, it is hard to tell whether the minimum value has really been reached. Therefore, this time I will take a quadratic function as the example and watch how each method heads toward the minimum.

I also defined and loaded the classes myself, and I will note what I understood from doing that.
Here is an overview of the quadratic function and the optimization methods targeted this time.
I decided to find the minimum of a very easy-to-understand function, $y = x^2$, which takes its minimum value $y = 0$ at $x = 0$. Using this intuitively and visually simple function, I would like to confirm that $x$ is updated step by step and approaches the point where $y = 0$.
This time, the program that defines the optimizers is stored in optimizers.py, and those classes are read, run, and plotted in sample.ipynb. When running the code, please store these two files in the same folder.
First is the basic SGD. The formula is shown below.
\mathbf{x}_{t+1} \gets \mathbf{x}_{t} - \eta g_t\\
g_t = 2\mathbf{x}_{t}
$\eta$ is the learning rate and $g_t$ is the gradient of the function. Since $y = x^2$, the gradient is simply $g_t = 2x_t$, which is convenient.
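As a quick worked example, with $\eta = 0.1$ and $x_0 = 100$ the first update is

x_1 = x_0 - \eta \cdot 2x_0 = 100 - 0.1 \times 200 = 80

so every step shrinks $x$ toward 0.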
Next, we will implement this technique.
optimizers.py
```python
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate (eta in the formula)

    def update(self, x):
        # gradient of y = x**2 is 2*x
        x -= self.lr * 2 * x
        return x
```
The formula itself is simple, so I think it is easy to understand by looking at the program. `update` calculates the $x$ of the next step and returns it. $\eta$ is written here as `lr`.
Now, here is a program that repeatedly calculates $ x $ to actually find the minimum value.
sample.ipynb
```python
import numpy as np
import matplotlib.pyplot as plt
import optimizers

max_iterations = 100
y7, y8, y9 = [], [], []  # history of x for each learning rate
v7, v8, v9 = [], [], []  # learning rate used at each step

optimizer = optimizers.SGD()

# lr = 0.1
x = 100
optimizer.lr = 0.1
for i in range(max_iterations):
    x = optimizer.update(x)
    y7.append(x)
    v7.append(optimizer.lr)

# lr = 0.01
x = 100
optimizer.lr = 0.01
for i in range(max_iterations):
    x = optimizer.update(x)
    y8.append(x)
    v8.append(optimizer.lr)

# lr = 0.9
x = 100
optimizer.lr = 0.9
for i in range(max_iterations):
    x = optimizer.update(x)
    y9.append(x)
    v9.append(optimizer.lr)
```
Let's see how the convergence of the function changes when $\eta$ is set to 0.1, 0.01, and 0.9.
sample.ipynb
```python
x = np.arange(max_iterations)
plt.plot(x, y7, label='lr=0.1')
plt.plot(x, y8, label='lr=0.01')
plt.plot(x, y9, label='lr=0.9')
plt.xlabel("iterations")
plt.ylabel("y")
plt.ylim(-5, 120)
plt.legend()
plt.show()
```
You can see that the closer the value on the vertical axis gets to 0, the better the convergence.

When $\eta$ is 0.01, convergence takes a long time. Conversely, when $\eta$ is as large as 0.9, you can see that it oscillates on its way to convergence. For this function, the iteration stops converging once $\eta$ reaches 1, and diverges beyond that. Therefore, a value of $\eta$ around 0.1 is good: it converges quickly while avoiding divergence.
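To see why, note that for this function the SGD update reduces to

\mathbf{x}_{t+1} = (1 - 2\eta)\mathbf{x}_{t}

so the iteration converges exactly when $|1 - 2\eta| < 1$, that is, $0 < \eta < 1$. At $\eta = 0.9$ the factor is $-0.8$, which explains the sign-flipping oscillation seen in the plot.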
Next, we will implement Momentum. Compared to SGD, this algorithm suppresses oscillation by taking the past movement of $x$ itself into account.
\mathbf{x}_{t+1} \gets \mathbf{x}_{t} + h_t\\
h_t = \alpha h_{t-1} - \eta g_t
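Unrolling the recurrence with $h_0 = 0$ gives $h_1 = -\eta g_1$ and $h_2 = -\eta g_2 - \alpha \eta g_1$: each past gradient keeps contributing with a weight that decays by a factor of $\alpha$ per step, which is what smooths the trajectory.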
Define the function.
optimizers.py
```python
class Momentum:
    def __init__(self, lr=0.01, alpha=0.8):
        self.lr = lr        # learning rate (eta)
        self.alpha = alpha  # momentum coefficient
        self.v = 0.0        # velocity (h_t in the formula)

    def update(self, x):
        # blend the previous movement with the current gradient step
        self.v = self.alpha * self.v - self.lr * 2.0 * x
        x += self.v
        return x
```
And run it. This time, we will evaluate the effect of the hyperparameter $\alpha$ by changing its value; a sketch of the experiment follows below.
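The exact experiment code is not reproduced here, but a minimal version along the lines of the SGD loop above might look like this (the learning rate of 0.1 and the three $\alpha$ values are my illustrative choices, not necessarily those used for the original figure):

```python
import matplotlib.pyplot as plt
import optimizers

max_iterations = 100
for alpha in [0.5, 0.8, 0.95]:
    optimizer = optimizers.Momentum(lr=0.1, alpha=alpha)
    x = 100
    history = []
    for i in range(max_iterations):
        x = optimizer.update(x)
        history.append(x)
    plt.plot(range(max_iterations), history, label='alpha={}'.format(alpha))
plt.xlabel("iterations")
plt.ylabel("y")
plt.legend()
plt.show()
```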
The standard value of $\alpha$ is said to be 0.8, and this result also suggests that 0.8 is about right. With a larger value, the trajectory oscillates considerably.
The last method this time is AdaGrad. So far, the learning rate $\eta$ has been a constant. The feature of AdaGrad is that the learning rate itself gradually decreases as the calculation proceeds.
h_0 = \epsilon\\
h_t = h_{t-1} + g_t^2\\
\eta_t = \frac{\eta_0}{\sqrt{h_t}}\\
\mathbf{x}_{t+1} \gets \mathbf{x}_{t} - \eta_t g_t
Define the function.
optimizers.py
```python
import numpy as np  # needed at the top of optimizers.py for np.sqrt

class AdaGrad:
    def __init__(self, h0=10.0):
        self.h0 = h0   # initial learning rate (eta_0 in the formula)
        self.h = 0.0   # accumulated sum of squared gradients

    def update(self, x):
        grad = 2.0 * x                 # gradient of y = x**2
        self.h += grad ** 2
        # the learning rate decays as squared gradients accumulate
        lr = self.h0 / (np.sqrt(self.h) + 1e-7)
        x -= lr * grad
        return x
```
By the way, this time I varied the initial learning rate $\eta_0$ (`h0` in the program) and calculated.
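Here is a similar minimal sketch of that experiment, assuming the same loop structure as before (the three `h0` values are my illustrative choices):

```python
import matplotlib.pyplot as plt
import optimizers

max_iterations = 100
for h0 in [10.0, 100.0, 1000.0]:
    optimizer = optimizers.AdaGrad(h0=h0)
    x = 100
    history = []
    for i in range(max_iterations):
        x = optimizer.update(x)
        history.append(x)
    plt.plot(range(max_iterations), history, label='h0={}'.format(h0))
plt.xlabel("iterations")
plt.ylabel("y")
plt.legend()
plt.show()
```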
It turned out that around $\eta_0 = 100$ the calculation converges quickly and does not diverge.
Since I'm a beginner in Python, I had been using `__init__` in my classes with a vague, almost magical impression. This time, I understood the following.

A parameter received by `__init__(self, argument)` and stored as an instance attribute can be changed after the instance is created (as in `optimizer.lr = 0.1` above). Conversely, if a value is hard-coded inside the class rather than exposed as an argument, it cannot be configured when the instance is created.
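As a minimal illustration (the `Example` class here is hypothetical):

```python
class Example:
    def __init__(self, lr=0.01):
        self.lr = lr      # stored as an attribute, so it is adjustable later

ex = Example(lr=0.05)     # configured at creation time
ex.lr = 0.1               # or overridden afterwards
print(ex.lr)              # 0.1
```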
This time, the minimum of a function was found by three kinds of optimization methods. You can see that they use the recurrence-relation idea learned in high school mathematics. Also, since I implemented everything except NumPy myself, it helped me deepen my understanding of the programming itself. As I learned in O'Reilly's deep learning book, these optimization methods are what actually run inside a neural network. Each ingredient can be understood on its own: how the functions and variables are defined, where the activation function goes, gradient calculation by backpropagation, how the initial weight parameters are chosen, and so on. However, when it comes to programming all of it together, things become very complicated.
**Once again, it struck me that programmers implement combinations of these many rules and structures, which must be a daunting task. With that hardship in mind, I am grateful that we can simply call such functionality as a class or method and use it easily.**
The full program is stored here. https://github.com/Fumio-eisan/optimizers_20200326