[PYTHON] I tried to fight the Local Minimum of Goldstein-Price Function

Previously, I learned that the SciPy library provides the Rosenbrock function for testing optimizers, but when I searched Wikipedia, I found that there are various other benchmark functions besides Rosenbrock.

https://en.wikipedia.org/wiki/Test_functions_for_optimization

For the simplest of them, the Sphere function, the optimal solution (the minimum point) should be easy to find intuitively, but for the other functions the optimum can be hard to find. This time I take up the "Goldstein-Price function" from this list.

Fig. Sphere Function Sphere_func0.png (Reference figure, this is easy to handle ...)

Fig. Goldstein-Price Function Goldstein_price_func2.png (This time, we will deal with this.)

Properties of the Goldstein-Price Function

First, the formula of the function is as follows.

f(x,y) = \left[ 1+(x+y+1)^2 \,(19 -14x+3x^2-14y+6xy+3y^2)\right] \\
\qquad \times \left[ 30 + (2x -3y)^2 \,(18 - 32x +12x^2 +48y -36xy+27y^2)\right]

As the 3D plot above shows, the values on the z-axis are far larger than the ranges of the x- and y-axes, so the function is strongly nonlinear. The minimum point (optimal solution) of this function is

f(0, -1) = 3

that is, z = 3 is attained at (x, y) = (0, -1), but there are several local minima around it. I drew a contour map to see the situation in more detail. GoldsteinPrice_contour1.PNG

There are groove-like valleys forming a diagonal cross, and the intersection of the grooves is the Global Minimum point (x, y) = (0, -1). Near (1, 0.2), to the upper right, there is a somewhat larger basin, so there is a risk of being caught in a Local Minimum there. In addition, small Local Minimum candidates can be seen along the grooves. This looks difficult to handle.
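
The article does not show a plain Python/NumPy version of the function, but it is straightforward to write down from the formula and is handy for checking values (the simulated-annealing code later in this article calls such a function as goldsteinprice()); a minimal sketch:

def goldsteinprice(x, y):
    # plain NumPy/Python version of the Goldstein-Price function (sketch, not from the original article)
    term1 = 19. - 14. * x + 3. * x * x - 14. * y + 6. * x * y + 3. * y * y
    term2 = 1. + (x + y + 1.) ** 2 * term1
    term3 = 18. - 32. * x + 12. * x * x + 48. * y - 36. * x * y + 27. * y * y
    term4 = 30. + (2. * x - 3. * y) ** 2 * term3
    return term2 * term4

print(goldsteinprice(0., -1.))   # -> 3.0   (the global minimum)
print(goldsteinprice(0., 0.))    # -> 600.0 (value at the origin)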

Solve with TensorFlow

Since TensorFlow provides a variety of optimizers, I decided to solve it using TensorFlow. The code is as follows.

First comes the definition of the Goldstein-Price function. It is coded directly from the formula; the squares are computed with tf.square().

import tensorflow as tf

def goldsteinprice_tf(x, y):
    # goal : (x, y) = (0., -1.0), where f = 3
    term1 = (19. - 14. * x + 3. * x * x - 14. * y + 6. * x * y + 3. * y * y)
    term2 = 1. + tf.square((x + y + 1.0)) * term1         # first factor
    term3 = 18. - 32. * x + 12. * x * x + 48. * y - 36. * x * y + 27. * y * y
    term4 = 30. + tf.square((2. * x - 3. * y)) * term3    # second factor
    z = term2 * term4
    return z
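
As a quick sanity check (a sketch using the same session-style TensorFlow 0.x API as the rest of this article), evaluating the graph at the known optimum should give 3:

with tf.Session() as sess:
    print(sess.run(goldsteinprice_tf(0., -1.)))   # -> 3.0   (global minimum)
    print(sess.run(goldsteinprice_tf(0., 0.)))    # -> 600.0 (value at the origin)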

Set the initial values and use the Goldstein-Price function as the cost (loss).

    # set initial parameter
    x0, y0 = (0., 0.)
    x = tf.Variable(x0)
    y = tf.Variable(y0)
    loss = goldsteinprice_tf(x, y)

The optimizer can be chosen from among six of TensorFlow's optimizers.

    lrate = 1.e-03    #Learning rate
    sw = 4            # Optimizer Selection
    
    # need to check sw = [2, 5]
    if sw == 1:
        optimizer = tf.train.GradientDescentOptimizer(lrate)
    elif sw == 2:
        optimizer = tf.train.AdagradOptimizer(lrate, initial_accumulator_value=0.1)
    elif sw == 3:
        optimizer = tf.train.MomentumOptimizer(lrate, momentum=0.0)
    elif sw == 4:
        optimizer = tf.train.AdamOptimizer(lrate)
    elif sw == 5:
        optimizer = tf.train.FtrlOptimizer(lrate)
    elif sw == 6:
        optimizer = tf.train.RMSPropOptimizer(lrate, decay=0.0)
    else:
        print('Error.')
    
    train_op = optimizer.minimize(loss)

All you have to do now is initialize the variables and run the session.

    init = tf.initialize_all_variables()

    with tf.Session() as sess:
        sess.run(init)
        print('Training...')
        print('initial (x, y) = (%8.4f, %8.4f) : loss = %8.4f' 
            % (sess.run(x), sess.run(y), sess.run(loss)))
        
        for i in range(10001):
            train_op.run()

            if i % 100 == 0:
                loss_ = float(sess.run(loss))
                # loss_log.append(loss_)
                x_ = sess.run(x)
                y_ = sess.run(y)
            
            if i % 1000 == 0:                # echo status on screen
                print('(x, y) = (%8.4f, %8.4f) : loss = %8.4f' 
                    % (x_, y_, loss_))

        # Check trained parameter
        print('final (x, y) = (%8.4f, %8.4f) : loss = %8.4f' 
            % (sess.run(x), sess.run(y), sess.run(loss)))

The calculation parameters that can be chosen are:

- initial value (x0, y0)
- type of optimizer
- learning rate and number of loop iterations

As a calculation example, the result of running with the settings in the listing above (initial value (0, 0), AdamOptimizer, 10,000 iterations) is as follows.

Training...
initial (x, y) = (  0.0000,   0.0000) : loss = 600.0000
(x, y) = ( -0.0010,  -0.0010) : loss = 598.5597
(x, y) = ( -0.5198,  -0.4769) : loss =  31.8792
(x, y) = ( -0.5756,  -0.4230) : loss =  30.2262
(x, y) = ( -0.5987,  -0.4012) : loss =  30.0007
(x, y) = ( -0.6000,  -0.4000) : loss =  30.0000
(x, y) = ( -0.6000,  -0.4000) : loss =  30.0000
(x, y) = ( -0.6000,  -0.4000) : loss =  30.0000
(x, y) = ( -0.6000,  -0.4000) : loss =  30.0000
(x, y) = ( -0.6000,  -0.4000) : loss =  30.0000
(x, y) = ( -0.6000,  -0.4000) : loss =  30.0000
(x, y) = ( -0.6000,  -0.4000) : loss =  30.0000
final (x, y) = ( -0.6000,  -0.4000) : loss =  30.0000

The search is neatly caught in the Local Minimum at (-0.6, -0.4), where the loss is 30. Incidentally, when the initial value was set near the Global Minimum, the run was confirmed to converge properly to the minimum point (0, -1).
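
For example, that only requires changing the starting point in the snippet above to something close to the optimum (the exact values used in the article are not given; the ones below are just for illustration):

    # start near the global minimum instead of the origin
    x0, y0 = (0.1, -0.9)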

Global Optimization method "Simulated Annealing"

Here, let us leave TensorFlow for a moment and try the Global Optimization method "Simulated Annealing" (SA). The explanation in [Wikipedia](https://ja.wikipedia.org/wiki/%E7%84%BC%E3%81%8D%E3%81%AA%E3%81%BE%E3%81%97%E6%B3%95) is as follows.

At each iteration, the SA algorithm considers a solution in a random neighborhood of the current solution; whether it moves there depends on the function value at that point and on a global parameter T (meaning temperature). T is then gradually decreased, in analogy with the physical annealing process. Because T is large at first, the solution changes boldly, and as T approaches zero the search converges. Since the algorithm can easily climb uphill in the early phase, one does not have to worry about falling into the local minima that are a problem for simple hill climbing.

Conveniently, pseudocode is also posted there, so I converted it to Python and ran it.

The main part of the code is as follows.

import numpy as np

def sim_anneal(startState, maxIter, goalE):
    state = startState
    x_, y_ = state[0], state[1]
    e = goldsteinprice(x_, y_)
    bestState = state
    bestE = e
    for iter in range(0, maxIter):
        # propose a random neighbor of the current state
        delta = np.random.rand(2) - 0.5
        nextState = state + delta * 1.
        nextE = goldsteinprice(nextState[0], nextState[1])

        # keep track of the best state found so far
        if bestE > nextE:
            bestState = nextState
            bestE = nextE
            if goalE >= bestE:
                return bestState
        # accept the move with a temperature-dependent probability
        r = np.random.rand()
        if probability(e, nextE, temperature(10.0, iter / maxIter)) >= r:
            state = nextState
            e = nextE
        if iter % 100 == 0:
            print('iter: nextE, bestE, goalE = (%5d, %9.3f, %9.3f, %9.3f)' 
               % (iter, nextE, bestE, goalE))

    return bestState
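
The helper functions probability() and temperature() (and the plain goldsteinprice() sketched earlier) are not shown in the listing. Following the Wikipedia pseudocode, they might look like the sketch below; the Metropolis acceptance rule and the simple linear cooling schedule are my assumptions, not necessarily what the article actually used:

def temperature(t0, frac):
    # simple linear cooling from t0 towards zero as frac goes from 0 to 1;
    # the small floor is only a safeguard against division by zero
    return t0 * max(1. - frac, 1.e-3)

def probability(e, nextE, t):
    # Metropolis criterion: always accept an improvement,
    # accept a worse move with probability exp(-(nextE - e) / T)
    if nextE < e:
        return 1.0
    return np.exp(-(nextE - e) / t)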

When the calculation is run, it reaches the neighborhood of the Global Minimum (0.0, -1.0).

Fig. Parameter(x,y) Path (Simulated Annealing) GoldsteinPrice_SA.png

In this calculation, too, there are various parameter choices besides the initial value. What I noticed after running it several times is that the result is strongly affected by the random numbers: if the random seed is not fixed, repeated runs can end up in completely different places. Since the results are hard to predict, it is not an all-purpose method.
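
For reproducible runs, the NumPy seed can be fixed before calling sim_anneal(). The arguments below (start point, iteration budget, and a target value just above the known minimum of 3) are illustrative guesses, since the original call is not shown:

np.random.seed(0)    # any fixed value; 0 is just an arbitrary choice
best = sim_anneal(np.array([0., 0.]), maxIter=10000, goalE=3.1)
print(best, goldsteinprice(best[0], best[1]))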

Start over with TensorFlow Optimizer

To see the effect of the initial value, calculations were run from 9 initial points: [[0., 0.], [0., 1.], [1., 0.], [1., 1.], [0., -1.], [-1., 0.], [-1., -1.], [-1., 1.], [1., -1.]].
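
The loop over the nine starting points is not shown in the article; a minimal sketch that reuses goldsteinprice_tf() with the Adam optimizer (rebuilding the small graph for each start point, which is wasteful but simple) might look like this:

    init_points = [[0., 0.], [0., 1.], [1., 0.], [1., 1.], [0., -1.],
                   [-1., 0.], [-1., -1.], [-1., 1.], [1., -1.]]

    for x0, y0 in init_points:
        x = tf.Variable(x0)
        y = tf.Variable(y0)
        loss = goldsteinprice_tf(x, y)
        train_op = tf.train.AdamOptimizer(1.e-3).minimize(loss)

        with tf.Session() as sess:
            sess.run(tf.initialize_all_variables())
            for i in range(10001):
                train_op.run()
            print('start (%5.1f, %5.1f) -> final (x, y) = (%8.4f, %8.4f) : loss = %8.4f'
                  % (x0, y0, sess.run(x), sess.run(y), sess.run(loss)))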

First, the calculation result using Adagrad Optimizer is shown in the figure below.

Fig. Parameter(x,y) path from 9 points (Adagrad) gsp_p1_Adagrad.png

None of the runs reached the minimum point, except the one that started at the Global Minimum to begin with. (This might improve with tuning of the learning rate and other parameters.) The result of using the Adam optimizer is shown below.

Fig. Parameter(x,y) path from 9 points (Adam) gsp_p1_Adam.png

Here the minimum point is reached in two cases; excluding the one whose initial value was already the minimum point, the run starting from (x0, y0) = (1.0, -1.0) also succeeds.

Given this situation, I finally tried using random numbers to set the initial values. In neural network training, random numbers are normally used to initialize the weights, the purpose being to "break the symmetry of the network so that learning can progress." Here the meaning is slightly different: the parameters are initialized randomly so that they are spread over a wide range, in view of the risk of falling into a Local Minimum.

    num_jobs = 10

    with tf.Session() as sess:
        for job in range(num_jobs):
            init_param = tf.Variable(tf.random_uniform([2], minval=-1., maxval=1.))
            x = tf.slice(init_param, [0], [1])
            y = tf.slice(init_param, [1], [1])

            loss = goldsteinprice_tf(x, y)
(Omitted)

As shown above, tf.random_uniform() generates random numbers between -1.0 and 1.0, which are used as the initial values. The calculation result is as follows.

Fig. Parameter (x, y) paths from random initial points (Adam) gsp_p2_Adam.png (the paths are drawn in different colors.)

After the results from the 9 fixed initial points I was a little pessimistic, but the runs reach the Global Minimum at (0, -1) with fairly high probability: the Global Minimum was found in 30% or more of the runs.

This time I "fought" the Goldstein-Price function with no practical purpose in mind, and encountered various difficulties in finding the optimum point. Using random numbers for the initial values avoids the Local Minimum problem to some extent, but what range should the random numbers cover? There seems to be no general answer to that.

However, in an actual classification problem where training data is fed in, I believe there can be situations where the errors in the training data themselves, together with stochastic gradient descent, keep the search from getting caught in a Local Minimum. (Of course, when you can afford the time, it is very useful to try several different initial values.)

References (web sites)

- Test functions for optimization (Wikipedia) ... beautiful 3D plots of the functions are posted. https://en.wikipedia.org/wiki/Test_functions_for_optimization
- Simulated annealing (Wikipedia, Japanese) https://ja.wikipedia.org/wiki/%E7%84%BC%E3%81%8D%E3%81%AA%E3%81%BE%E3%81%97%E6%B3%95
- TensorFlow documentation ... the version was 0.7.0 as of February 22, 2016. https://www.tensorflow.org/
- "Learn the basics of Theano once again" (Qiita) http://qiita.com/TomokIshii/items/1f483e9d4bfeb05ae231
