[PYTHON] Understand the Armijo rule and convex functions

What is the gradient method?

Roughly speaking, what is the gradient method? It is a method in which, by repeatedly using the derivative of the function in question, you can find a point where the slope is close to 0, that is, a **minimum / maximum** (although it sometimes settles at a merely local minimum / maximum...).

Even after hearing that much, you might still ask, "So what?"

What makes it useful is that it lets you build a plausible function from a plot of a large amount of data, and handle functions that are difficult to differentiate analytically.

Regression analysis

Using the gradient method, a plausible function can be derived from a large amount of data.

Let's use a concrete example. Suppose there are 50 data points, where the $i$-th point is $(x_i, y_i)$. They can be plotted like this:

(Figure: scatter plot of the 50 data points)

If you were told, "Draw a straight line through the middle of these points!",

(Figure: the scatter plot with a straight line drawn through the middle of the points)

I think most people would draw a line like that. In short, finding this straight line is exactly what is meant by regression analysis.

By the way, regression analysis with a single explanatory variable, as in this example, is called **simple regression analysis**.

Next, I will explain the **least squares method**, one of the methods needed for regression analysis.

Least squares

To create a plausible function from multiple data points, take the following four steps.

  1. Create an appropriate function.
  2. For each data point, calculate the square of how far it is from the function. (This distance is called the **error**.)
  3. Sum the squared errors of all the data points.
  4. From the result of 3., determine the coefficients of the function created in 1. to obtain a plausible function.

Let's take a look at each one.

1. Create an appropriate function.

This time the data look like a straight line, so I prepared a function of one variable. The ultimate goal of the least squares method is to find $a$ and $b$ for this function.

\hat{y}_i = ax_i + b

Here, let's define the variables. $x_i$ and $y_i$ are the values of the $i$-th data point, while $\hat{y}_i$ is the value the function predicts for the $i$-th data point; it is called the **predicted value**.

The coefficients of this function then need to be adjusted so that it fits the spread of the data; a small sketch in code is shown below.
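As a small sketch of step 1 in code (the data below are synthetic and made up purely for illustration; the article's own 50 points appear only in its plots):

```python
import numpy as np

# Synthetic stand-in for the article's 50 data points (x_i, y_i).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 50)

def predict(x, a, b):
    """Step 1: the candidate function y_hat_i = a * x_i + b."""
    return a * x + b
```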

2. Calculate the square of the error for each data point.

(Figure: the fitted line with the error of each data point shown as a blue vertical segment)

These blue segments are the errors, and we need to examine each of their lengths.

Written as a formula, the error is as follows.

y_i - \hat{y_i} = y_i - (ax_i + b)

However, if this is left as it is, the sign changes depending on whether $y_i$ or $\hat{y_i}$ is larger, so it does not give a pure measure of the error. Therefore, the error is squared.

Then the square of the error is

(y_i - \hat{y_i})^2 = (y_i - (ax_i + b))^2

It can be computed in this way.

For simplicity, let's focus only on the 20th data point. (Figure: the 20th data point highlighted on the scatter plot)

This data point is roughly $(x_{20}, y_{20}) = (0.6, 4.5)$, so for $i = 20$,

\begin{align}
(y_{20} - \hat{y_{20}})^2
&= (y_{20} - (ax_{20} + b))^2\\
&= (4.5 - (a \times 0.6 + b))^2\\
&= \frac{9}{25}a^2+\frac{6}{5}ab +b^2-\frac{27}{5}a-9b+\frac{81}{4}
\end{align}

This is the squared error of the 20th data point, written out in terms of $a$ and $b$.
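This expansion can be checked with SymPy, using the exact fractions $x_{20} = 3/5$ and $y_{20} = 9/2$ from the example above:

```python
import sympy as sp

a, b = sp.symbols('a b')
# Squared error of the 20th point, (x_20, y_20) = (3/5, 9/2) = (0.6, 4.5)
squared_error = sp.expand((sp.Rational(9, 2) - (a * sp.Rational(3, 5) + b))**2)
print(squared_error)
# terms match (9/25)*a**2 + (6/5)*a*b + b**2 - (27/5)*a - 9*b + 81/4
```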

In this way, the squared error is calculated for all 50 data points.

3. Sum the squared errors of all the data points.

As the title says, we sum the quantities calculated in step 2. Written as a formula,

\begin{align}
S
&=\sum_{i}^{N}{(y_i - \hat{y_i})^2} \\
&= \sum_{i}^{N}{(y_i - (ax_i + b))^2}\\
&= a^2(\sum_{i}^{N}x_i^2)+2a\sum_{i}^{N}{(bx_i-x_iy_i)}-2b\sum_{i}^{N}{y_i}+\sum_{i}^{N}y_i^2 + Nb^2
\end{align}

Here, $N$ is the number of data points, which in this case is 50.

The values of $a$ and $b$ that minimize this function $S(a, b)$ are exactly the ones we want to find.

4. Determination of a and b

This is where the gradient method comes in. Since $S(a, b)$ is a convex function, its minimum can be found by the steepest descent method. I will explain the details in Part 2; a rough sketch is shown below.
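As a minimal sketch of what Part 2 covers (this is not the article's own code), steepest descent on $S(a, b)$ could look like the following; the step size `lr`, the iteration count, and the synthetic data are all arbitrary illustrative choices:

```python
import numpy as np

def S(a, b, x, y):
    """Sum of squared errors S(a, b)."""
    return np.sum((y - (a * x + b))**2)

def grad_S(a, b, x, y):
    """Partial derivatives of S with respect to a and b."""
    r = y - (a * x + b)                        # residuals y_i - y_hat_i
    return np.array([-2 * np.sum(r * x), -2 * np.sum(r)])

def steepest_descent(x, y, lr=0.01, n_iter=5000):
    """Minimize S(a, b) with a fixed learning rate lr."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g = grad_S(a, b, x, y)
        a -= lr * g[0]
        b -= lr * g[1]
    return a, b

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 50)
print(steepest_descent(x, y))                  # roughly recovers a = 3, b = 2
```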

From here on, the material is essential background for the gradient method, but it involves a fair amount of mathematics, so it is fine to skip ahead if you would rather not go through it.

Hessian matrix and positive semidefinite matrix

These are the concepts needed to explain convex functions in the next section.

What is the **Hessian matrix**? The definition reads:

For an $n$-variable function $f(x_1, x_2, \cdots, x_n)$, the $n \times n$ matrix whose $(i, j)$ component is $\frac{∂^2f}{∂x_i∂x_j}$ is called the Hessian matrix.

(Reference 2)

In other words

\begin{pmatrix}
\frac{∂^2f}{∂x^2_1} & \cdots &  \frac{∂^2f}{∂x_1∂x_n}\\
\vdots & \ddots &  \vdots \\
\frac{∂^2f}{∂x_n∂x_1} & \cdots &  \frac{∂^2f}{∂x^2_n}\\
\end{pmatrix}

It is a matrix like this. As you can see, the Hessian matrix is a symmetric matrix (assuming the second partial derivatives are continuous, so the order of differentiation can be swapped).

A **positive semidefinite matrix** is defined as follows (Reference 1).

In general, when a symmetric matrix $A$ satisfies "$x^⊤ Ax ≥ 0$ for any vector $x$", it is called positive semidefinite. When it satisfies "$x^⊤ Ax > 0$ for any vector $x ≠ 0$", it is called positive definite.

Each of the following conditions is equivalent to a symmetric matrix $A$ being positive semidefinite:

  1. The quadratic form $\bf{x}^⊤ A \bf{x}$ is non-negative for every $n$-dimensional real vector $\bf{x}$
  2. All eigenvalues of $A$ are non-negative
  3. There exists a real square matrix $U$ such that $A = U^⊤ U$
  4. All principal minors of $A$ are non-negative

A matrix satisfying these conditions is positive semidefinite. As I will show later, **when the Hessian matrix $\Delta{S}$ of $S$ is positive semidefinite, $S$ is a convex function.** A quick numerical check of condition 2 is sketched below.
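Condition 2 gives a quick way to check positive semidefiniteness numerically. A small sketch with NumPy (the example matrices are made up):

```python
import numpy as np

def is_positive_semidefinite(A, tol=1e-10):
    """Condition 2: all eigenvalues of the symmetric matrix A are non-negative."""
    return np.all(np.linalg.eigvalsh(A) >= -tol)   # eigvalsh assumes a symmetric matrix

print(is_positive_semidefinite(np.array([[2.0, 1.0], [1.0, 2.0]])))  # True  (eigenvalues 1, 3)
print(is_positive_semidefinite(np.array([[1.0, 2.0], [2.0, 1.0]])))  # False (eigenvalues -1, 3)
```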

Now that we are ready, let's look at convex functions.

Convex function

A **convex function** is the concept needed to reach the global solution without falling into a local solution.

Let's see what it actually looks like. (Figure: the graph of a convex function with a chord drawn between two points on it) Looking at the formula,

λf(x) + (1 − λ)f(y) ≥ f(λx + (1 − λ)y)\\
0 ≤ λ ≤ 1

A function that satisfies the above condition is called a convex function. To put it plainly: take two points $P (= f(x))$ and $Q (= f(y))$ on the function $f$; when every point $R (= f(z))$ on the function between them is smaller than the corresponding point $S$ on the straight line segment $PQ$, the function is called convex. A quick numerical check for $f(x) = x^2$ is shown below.
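As a quick numerical illustration (not a proof), the defining inequality can be checked for the convex function $f(x) = x^2$ at a few values of $\lambda$:

```python
import numpy as np

def f(x):
    return x**2                                # a convex function

x, y = -1.0, 3.0                               # x-coordinates of the two points P and Q
for lam in np.linspace(0.0, 1.0, 5):
    chord = lam * f(x) + (1 - lam) * f(y)      # the point S on the straight line PQ
    curve = f(lam * x + (1 - lam) * y)         # the point R on the function itself
    print(f"lambda = {lam:.2f}: chord >= curve is {chord >= curve}")
```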

If the function to be optimized is convex, it has the nice property that a local optimal solution is also the global optimal solution. You can see this from the figures below. (Figures omitted: illustrations of how, for a convex function, the local optimum coincides with the global optimum.)

Therefore, let's show that the least squares function $S$ defined above is a convex function. To show that $S$ is convex, it is sufficient to show that its Hessian $\Delta{S}$ is positive semidefinite. Define the following shorthand for the sums:

\sum_{i}^{N}x_i^2 = s_{xx},  \sum_{i}^{N}x_i = s_x,  \sum_{i}^{N}y_i^2 = s_{yy},  \sum_{i}^{N}y_i = s_y,  \sum_{i}^{N}x_iy_i = s_{xy}

With this notation, the least squares function $S$ becomes

\begin{align}
S
&=a^2s_{xx}+2abs_x-2as_{xy}-2bs_y+s_{yy}+Nb^2
\end{align}

Taking the partial derivatives of this with respect to $a$ and $b$, respectively, gives

\begin{align}
\frac{∂S}{∂a}&=2s_{xx}a + 2s_xb - 2s_{xy}\\
\frac{∂S}{∂b}&=2s_{x}a + 2Nb - 2s_{y}
\end{align}

and from these, the Hessian matrix of $S$ is

\begin{align}
\Delta{S}
&=
\begin{pmatrix}
\frac{∂^2S}{∂a^2} & \frac{∂^2S}{∂a∂b} \\
\frac{∂^2S}{∂b∂a} & \frac{∂^2S}{∂b^2}
\end{pmatrix}\\
&=
2
\begin{pmatrix}
s_{xx} & s_x \\
s_x & N
\end{pmatrix}
\end{align}

By condition 1 for positive semidefiniteness above, it is sufficient to show that the quadratic form of $\Delta{S}$ is non-negative. Since $\sum_{i}^{N}x_i^2 = s_{xx}$ and $\sum_{i}^{N}x_i = s_x$, for any vector $(t, k)^⊤$,

\begin{align}
2
\begin{pmatrix}
t & k
\end{pmatrix}
\begin{pmatrix}
s_{xx} & s_x \\
s_x & N
\end{pmatrix}
\begin{pmatrix}
t\\
k
\end{pmatrix}
&=
2(s_{xx}t^2+2s_xtk + Nk^2)\\
&=
2(t^2\sum_{i}^{N}{x_i^2}+2tk\sum_{i}^{N}{x_i}+Nk^2)\\
&=
2\sum_{i}^{N}{(t^2x_i^2+2tkx_i+k^2)}\\
&=
2\sum_{i}^{N}{(tx_i+k)^2} ≥0
\end{align}

Since $\Delta{S}$ is positive semidefinite, the objective function $S$ created by the least squares method is indeed a convex function, and therefore a local solution of it is also the global optimal solution. A quick numerical confirmation is sketched below.
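As a concrete check (an illustration, not a replacement for the proof above), the eigenvalues of $\Delta{S}$ can be computed for a sample of $x_i$; the sample here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)                      # any sample of x_i will do

s_xx = np.sum(x**2)                            # s_xx = sum of x_i^2
s_x = np.sum(x)                                # s_x  = sum of x_i
N = len(x)

hessian_S = 2 * np.array([[s_xx, s_x],
                          [s_x,  N]])
print(np.linalg.eigvalsh(hessian_S))           # both eigenvalues are non-negative
```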

For details, take a look at Reference 1 and Reference 5; Reference 5 was very easy to follow because it works through the proof. The figures in this section are quoted from Reference 6.

Definition of the derivative

Here is the definition of the derivative, which is used in the Armijo rule described below.

{\nabla}f(x)=\lim_{h \to 0}\frac{f(x+h)-f(x)}{h}

If we denote the error of this approximation for finite $h$ by $o(h)$, that is,

o(h) = \frac{f(x+h)-f(x)}{h} - {\nabla}f(x)

\begin{equation}
f(x+h) = f(x) + h{\nabla}f(x) + ho(h)\tag{1}
\end{equation}

and $f(x+h)$ can be expressed as in equation (1). Of course, this $o(h)$ satisfies

\lim_{h \to 0}o(h) = 0

However, when $h$ is not small enough, the value of $o(h)$ cannot be pinned down. The small sketch below shows this remainder shrinking as $h$ decreases.
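As a small illustration of this remainder, a forward-difference approximation of $f(x) = x^3$ at $x = 1$ (whose exact derivative is $3$) shows the quantity the article calls $o(h)$ shrinking as $h$ decreases:

```python
def f(x):
    return x**3

exact = 3.0                                    # f'(1) = 3 for f(x) = x^3
for h in [1.0, 0.1, 0.01, 0.001]:
    approx = (f(1 + h) - f(1)) / h             # forward-difference approximation
    remainder = approx - exact                 # what the article calls o(h)
    print(f"h = {h}: approx = {approx:.6f}, o(h) = {remainder:.6f}")
```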

Armijo rule

The Armijo rule is one of the methods for selecting the learning rate $t$.

Let the objective function be $f(x_k)$, where $k$ is the number of learning steps taken so far.

To move from $x_k$ to the next point $x_{k+1}$, suppose we take a step with learning rate $t_k$ in the descent direction $d_k$. Then the value of the objective function at the next point can be written as

\begin{equation}
f(\textbf{x}_{k+1})=f(\textbf{x}_k + t_k\textbf{d}_k)\tag{2}
\end{equation}


The descent direction $d_k$, which appears here for the first time, is a vector that satisfies

\begin{equation}
{\nabla}f(\textbf{x}_k)^\mathsf{T}\textbf{d}_k<0\tag{3}
\end{equation}

Intuitively, this means that if the slope $\nabla{f(x_k)}$ is positive we descend in the negative direction, and if the slope $\nabla{f(x_k)}$ is negative we descend in the positive direction, which makes sense when you think about it. Also, since only the orientation matters, $d_k$ can be taken to have length $1$. A small sketch of such a direction follows.
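A common choice satisfying condition (3) is the normalized negative gradient; a small sketch (the example gradient is made up):

```python
import numpy as np

def descent_direction(grad):
    """Steepest-descent direction d_k = -grad / ||grad||, normalized to length 1."""
    return -grad / np.linalg.norm(grad)

grad = np.array([2.0, -1.0])                   # a made-up gradient at x_k
d_k = descent_direction(grad)
print(grad @ d_k)                              # negative, so condition (3) holds
```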

Returning to the main thread, transforming equation (2) using equation (1) gives

f(\textbf{x}_k+t_k\textbf{d}_k) = f(\textbf{x}_k) + t_k{\nabla}f(\textbf{x}_k)^\mathsf{T}\textbf{d}_k + t_ko(t_k)

Comparing this with the value before the step,

\begin{align}
f(\textbf{x}_{k+1}) - f(\textbf{x}_k) &=
f(\textbf{x}_k+t_k\textbf{d}_k) - f(\textbf{x}_k)\\ &= t_k{\nabla}f(\textbf{x}_k)^\mathsf{T}\textbf{d}_k + t_ko(t_k)\\
&= t_k({\nabla}f(\textbf{x}_k)^\mathsf{T}\textbf{d}_k + o(t_k))\tag{4}
\end{align}

Here, $o(t_k)$ is the error term described above; its value changes as $t_k$ grows, but it can be ignored once $t_k$ becomes sufficiently small.

If $t_k$ is set to a value small enough that $o(t_k)$ can be ignored, condition (3) shows that the right-hand side of equation (4) is negative. The left-hand side of equation (4) is also negative, since $f(x_{k+1}) < f(x_k)$ is our basic requirement.

From the above, manipulating equation (4) a little leads to the condition

f(\textbf{x}_{k+1})-f(\textbf{x}_k) ≤ \gamma\beta^{l_k}{\nabla}f(\textbf{x}_k)^\mathsf{T}\textbf{d}_k\tag{5}\\

where each variable is set as follows:

\begin{cases}
\begin{align}
\gamma, \beta &\in (0, 1)\\
t_k &= \beta^{l_k}\\
l_k &= 0, 1, 2, 3 \cdots
\end{align}
\end{cases}

This conditional expression is the Armijo rule. In other words, a $t$ that satisfies this formula can be selected as the learning rate.

Now, let's break equation (5) down.

Since both sides of equation (5) are negative, the right-hand side, which is scaled by $\gamma$, is larger (less negative) than the left-hand side.

The left-hand side is unwieldy, so let's abbreviate it as

\begin{align}
p(t_k)
&= f(\textbf{x}_k+t_k\textbf{d}_k) - f(\textbf{x}_k)\tag{6}\\ 
(&= f(\textbf{x}_k+\beta^{l_k}\textbf{d}_k) - f(\textbf{x}_k))\\ 
\end{align}

Differentiating this with respect to $t_k$ gives

\frac{d}{dt_k}p(t_k) = {\nabla}f(\textbf{x}_k+t_k\textbf{d}_k)^\mathsf{T}\textbf{d}_k\\
\frac{d}{dt_k}p(0) = {\nabla}f(\textbf{x}_k)^\mathsf{T}\textbf{d}_k\tag{7}

Since this shares a term with equation (5), rewriting equation (5) using equations (6) and (7) gives

p(t_k) ≤ {\gamma}\frac{d}{dt_k}p(0)t_k\tag{8}

$\frac{d}{dt_k}p(0)$ is the slope of the tangent line to $p(t_k)$ at $t_k = 0$, so when $\gamma = 1$ the right-hand side of (8) is that tangent line itself (the red line in the figure below). On the other hand, as $\gamma$ approaches $0$, the slope of the line approaches $0$ (the blue line). (Figure: the graph of $p(t_k)$ with the tangent line at $t_k = 0$ in red and the flatter line for small $\gamma$ in blue) Also, if we set $\beta = 0.5$ here,

| $l_k$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| $t_k$ | $1$ | $0.5^{1}$ | $0.5^{2}$ | $0.5^{3}$ | $0.5^{4}$ | $0.5^{5}$ | $0.5^{6}$ | $0.5^{7}$ |

Since $l_k$ takes only integer values greater than or equal to $0$, $t_k$ decreases as $l_k$ increases, and its range is

0<t_k≤1

A larger learning rate is more efficient, so we increase $l_k$ from $0$ and adopt the first value that satisfies equation (5) as the learning rate. By using this rule, the learning rate can be selected in a principled way. However, to implement it for a given problem you need some feel for the shape of $p'(t)$ in advance, so a little ingenuity is required in practice. A minimal sketch of the backtracking procedure is shown below.
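The following is a minimal sketch of that backtracking procedure based on the Armijo rule (5), assuming the objective `f` and its gradient `grad_f` are supplied as functions; the values of `gamma`, `beta`, and `max_trials` are arbitrary illustrative choices, and the quadratic test function is made up:

```python
import numpy as np

def armijo_step(f, grad_f, x_k, d_k, gamma=1e-4, beta=0.5, max_trials=50):
    """Return a step size t_k = beta**l_k that satisfies the Armijo condition (5)."""
    g_dot_d = grad_f(x_k) @ d_k                # must be negative for a descent direction
    t_k = 1.0
    for l_k in range(max_trials):
        t_k = beta**l_k
        if f(x_k + t_k * d_k) - f(x_k) <= gamma * t_k * g_dot_d:
            return t_k                         # first (largest) step that satisfies (5)
    return t_k                                 # fallback: smallest step tried

# Usage example with f(x) = ||x||^2.
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x_k = np.array([1.0, 2.0])
d_k = -grad_f(x_k) / np.linalg.norm(grad_f(x_k))   # normalized descent direction
print(armijo_step(f, grad_f, x_k, d_k))        # t_k = 1.0 is already accepted here
```

A small $\gamma$ such as $10^{-4}$ is a common practical choice, since it makes the condition easy to satisfy while still guaranteeing a sufficient decrease.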

This section was written with reference to Reference 3 and [Reference 4](http://www.kana-lab.c.titech.ac.jp/lecture/lec.2007.1st.suurijouhou1/06.2007.05.24.UnconstrainedOpt.pdf).

Finally

This time I was able to solidify, to some extent, the theory required for the gradient method. If I get the chance, I would like to deepen my understanding by trying out the actual methods.

References

・Reference 1: Optimization and convex functions (University of Tokyo lecture materials)
・Reference 2: Extremum tests for multivariable functions and the Hessian matrix
・Reference 3: Descent methods (Kyoto University lecture materials)
・Reference 4: [Unconstrained optimization problems (Nagoya University lecture material)](http://www.kana-lab.c.titech.ac.jp/lecture/lec.2007.1st.suurijouhou1/06.2007.05.24.UnconstrainedOpt.pdf)
・Reference 5: [I thought about the merits of the stochastic gradient descent method](https://qiita.com/koshian2/items/028c457880c0ec576e27)
・Reference 6: [Continuous optimization that you should know when doing machine learning](https://qiita.com/hiyoko9t/items/6742e7dc121cf4cbef09)
