Adjust hyperparameters with Bayesian optimization

What is Bayesian optimization?

(Roughly speaking) A: a way of finding the best point while skipping as many experiments as possible.

It uses a method called "Gaussian process regression" to statistically estimate an unknown function $f$, observes the value of $y = f(x)$ only at points that look promising, and in this way finds the optimum of $f$.

For an easy-to-follow example of a real application, see this article: Making the best Coke highball with Bayesian optimization (it looks delicious).

I recently had an occasion to use it, so I would like to introduce what I looked into for that purpose and the calculations I did as preliminary experiments. For the detailed mathematics, see Chapter 6 of PRML and Chapter 6 of "Gaussian Processes and Machine Learning"; this article sticks to an intuitive picture and experimental results. (Links to the executable code are at the end.)

What is Gaussian process regression?

Suppose you want to predict a function like the black line in the figure below. The actual function is $y = \frac{1}{16}x^4 - x^2 + \frac{5}{16}x$, but assume we do not know its form. Checking it exhaustively from end to end is hard, so let's take 5 points (blue circles) and predict the whole function from them.

[Figure: the true function (black line) and the 5 observed points (blue circles)]

If you feed the information from these points into Gaussian process regression and make a prediction, it looks like the following [^nan].

[Figure: Gaussian process regression fitted to the 5 points, with mean and uncertainty band]

[^nan]: That is, given these 5 points as observations, what sort of function could have produced them? The probability distribution over functions is derived under assumptions such as the observation noise following a Gaussian distribution and the function being expressible in a certain form.

In Gaussian process regression, the prediction comes out as a "probability distribution over functions": information about how likely the function is to pass through each region. The blue line in the figure is the mean of this distribution, and the light-blue band shows its spread (here twice the standard deviation, $2\sigma$, so the function lies in this range with about 95% probability). The actual function $y = \frac{1}{16}x^4 - x^2 + \frac{5}{16}x$ fits nicely inside the band, and you can see that the error is small near the observed points and grows as you move away from them.
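As a reference, here is a minimal sketch of this kind of 1D Gaussian process regression using scikit-learn's GaussianProcessRegressor. This is not the author's actual code (that is linked at the end); the observation locations and kernel settings here are my own assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def f(x):
    # The true function (unknown in a real problem)
    return x**4 / 16 - x**2 + 5 * x / 16

# Five observation points (the locations here are arbitrary)
x_obs = np.array([-3.0, -1.5, 0.0, 1.0, 2.5]).reshape(-1, 1)
y_obs = f(x_obs).ravel()

# RBF kernel plus a small noise term; kernel hyperparameters are fit by maximum likelihood
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(x_obs, y_obs)

# Predictive mean and standard deviation over the whole range;
# mean +/- 2 * std gives the roughly-95% band shown in the figure
x_grid = np.linspace(-4, 4, 200).reshape(-1, 1)
mean, std = gpr.predict(x_grid, return_std=True)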

If you add a few more points, it looks like this. [Figure: regression with additional observed points] It is now possible to predict fairly accurately in the range $x = -3$ to $1$.

How to determine "a place that looks good"

Now consider the problem "I don't need to know the whole shape of the function; I only want to find its minimum." It seems wasteful to keep observing the region (around $x = -1$ to $1$) where the value is already known to be large. You want to search preferentially where the value looks small. At the same time, you would also like to add a few points where the uncertainty is still large (around $x = 3$), since that region has hardly been explored yet.

An "acquisition function" does exactly this: it scores, in a balanced way, "places likely to contain the optimum" and "places that are still uncertain." Which of the two to emphasize depends on the problem, so there are various acquisition functions corresponding to different strategies, for example Probability of Improvement (PI), Expected Improvement (EI), and confidence-bound criteria (UCB/LCB), etc.

[Figure: acquisition functions plotted under the current GP prediction] When each of them is plotted, it looks like this, and both predict that the next point to observe should be around $x = -2.7$.
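As a rough sketch of how such scores are computed (continuing the scikit-learn example above, and again not the author's actual code), two common acquisition functions for minimization can be written directly in terms of the predictive mean and standard deviation:

import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_y, xi=0.01):
    # EI for minimization: expected amount by which a candidate beats the current best
    std = np.maximum(std, 1e-12)          # avoid division by zero
    imp = best_y - mean - xi
    z = imp / std
    return imp * norm.cdf(z) + std * norm.pdf(z)

def lower_confidence_bound(mean, std, kappa=2.0):
    # LCB for minimization, negated so that "larger score = more promising"
    return -(mean - kappa * std)

# Continuing from the sketch above:
# ei = expected_improvement(mean, std, y_obs.min())
# next_x = x_grid[np.argmax(ei)]   # the highest-scoring candidate is observed next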

BayesianOptimization

Since it is tedious to write these calculations by hand every time, I use a Python package for Bayesian optimization (GPyOpt, which is also used in the code later in this article). The target has the same shape as the function above, $y = x^4 - 16x^2 + 5x$ [^stf].

[^stf]: This function is often used as an optimization benchmark because it has multiple local minima in any number of dimensions: the STYBLINSKI-TANG FUNCTION.

[Figure: the target function, observed with noise] The observed values contain noise.
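The benchmark code itself is linked at the end of the article; the following is only a plausible sketch of how the experiment described below could be set up with GPyOpt. The noise level and iteration count here are my own assumptions.

import numpy as np
import GPyOpt

def target(x):
    # GPyOpt passes an (n, 1) array; return noisy observations of y = x^4 - 16x^2 + 5x
    x = x[:, 0]
    y = x**4 - 16 * x**2 + 5 * x
    return (y + np.random.normal(0, 5, size=y.shape)).reshape(-1, 1)

bounds_1d = [{'name': 'x', 'type': 'continuous', 'domain': (-5, 5)}]
bo_1d = GPyOpt.methods.BayesianOptimization(
    f=target, domain=bounds_1d, model_type='GP', acquisition_type='EI',
    initial_design_numdata=3)        # start from 3 observed points, as in the figures below
bo_1d.run_optimization(max_iter=20)  # roughly the "about 20 cycles" described below
print(bo_1d.x_opt, bo_1d.fx_opt)     # estimated minimizer and its observed value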

First, if we take 3 points and run Gaussian process regression, the prediction and the acquisition function look like this. The red vertical line marks the point chosen to be observed next [^gp]. [Figure: GP prediction and acquisition function after 3 observations]

[^gp]: The acquisition function uses the EI (Expected Improvement) strategy.

We observe around this $x = 0.5$, add the point, and run the regression again. The uncertainty around $x = 0$ has decreased noticeably. [Figure: prediction after adding the new observation]

Repeating this cycle about 20 times gives the following. [Figure: prediction and acquisition function after about 20 cycles]

The value of $x$ giving the minimum was predicted to be $-2.59469813$. The true solution is $-2.9035...$, so it is somewhat off, but since the observations are noisy this is unavoidable to some extent.

The 2D case

The same kind of optimization search works in higher-dimensional spaces as well. [Figures: the 2D function, from STYBLINSKI-TANG FUNCTION] Let's similarly try to minimize a 2D function of this type.
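For reference, the standard 2D Styblinski-Tang function is $f(x_1, x_2) = \frac{1}{2}\sum_{i=1}^{2}(x_i^4 - 16x_i^2 + 5x_i)$ (the $\frac{1}{2}$ factor does not move the minimum). A plausible GPyOpt setup for the experiment described below might look like this; observation noise is omitted for simplicity, and the iteration count mirrors the text.

import numpy as np
import GPyOpt

def styblinski_tang_2d(x):
    # x is an (n_points, 2) array, as GPyOpt passes it; no observation noise here for simplicity
    val = 0.5 * np.sum(x**4 - 16 * x**2 + 5 * x, axis=1)
    return val.reshape(-1, 1)

bounds_2d = [{'name': 'x1', 'type': 'continuous', 'domain': (-5, 5)},
             {'name': 'x2', 'type': 'continuous', 'domain': (-5, 5)}]
bo_2d = GPyOpt.methods.BayesianOptimization(
    f=styblinski_tang_2d, domain=bounds_2d, acquisition_type='EI',
    initial_design_numdata=5)        # 5 initial points, as in the text
bo_2d.run_optimization(max_iter=55)  # the 55 cycles mentioned below
print(bo_2d.x_opt)                   # should end up near (-2.90, -2.90)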

Starting from 5 arbitrarily chosen points, Gaussian process regression gives the following mean, standard deviation, and acquisition function. [Figure: posterior mean, standard deviation, and acquisition function after 5 points] Plotted in 3D it looks like this (blue is the mean, green is the mean ± standard deviation). [Figure: 3D plot of the posterior]

Similarly, after 55 cycles of observation: [Figures: posterior and acquisition function after 55 cycles]

The predicted shape is quite close to the true function. The minimizer was predicted to be $(-2.79793531, -2.91749935)$, which looks more accurate than the 1D case. Considering that even a coarse grid search of $x$ and $y$ from $-5$ to $5$ in steps of 1 would require about $10 \times 10 = 100$ experiments, the number of experiments is substantially reduced.

Hyperparameter optimization

Bayesian optimization pays off when each observation is expensive and the search space is high-dimensional. Let's apply it to hyperparameter tuning in machine learning, where each evaluation takes longer and longer as the dataset grows.

Hyperparameters of Kernel Ridge regression

The machine-learning model called kernel ridge regression has two hyperparameters, `alpha` and `gamma` (parameters that are not determined automatically and must be supplied from outside, depending on the data) [^hyper]. `alpha` is the strength of the regularization (which works to prevent overfitting), and `gamma` is a parameter that determines the shape of the kernel function being fit.

[^hyper]: There are in fact ways to determine theoretically reasonable values, but here we simply search for them without relying on that.
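For concreteness (following scikit-learn's conventions, which the code below relies on): with kernel='rbf' the kernel is $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, and the model is fit by minimizing the squared error plus $\alpha$ times an L2 penalty on the coefficients. So a larger `alpha` means stronger regularization (a smoother fit), and a larger `gamma` means a narrower kernel (a more flexible, wigglier fit).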

Let's find the parameters that maximize the accuracy of a model predicting house prices on the Boston dataset included in sklearn.datasets.
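The data-loading step is not shown in the article (it is in the linked notebooks); a minimal setup could look like the following. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this only runs on older versions; the split ratio and random_state are arbitrary choices.

from sklearn.datasets import load_boston           # removed in scikit-learn >= 1.2
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.kernel_ridge import KernelRidge

# Split the Boston housing data into training and test sets
boston = load_boston()
train_x, test_x, train_y, test_y = train_test_split(
    boston.data, boston.target, test_size=0.2, random_state=0)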

First, let's check the accuracy with the default parameter values.

KernelRidge(kernel='rbf').fit(train_x, train_y).score(test_x, test_y)
#=> 0.4802674032751879

The score comes out to 0.48.

Since these parameters can vary by orders of magnitude, we search over their logarithms.

def get_score_KR(x):
    # Evaluate an (alpha, gamma) pair by 5-fold cross-validation on the training data
    alpha, gamma = x
    predictor = KernelRidge(kernel='rbf', alpha=alpha, gamma=gamma)
    return cross_val_score(predictor, train_x, train_y, cv=5).mean()

def get_score_KR_log(x):
    # GPyOpt passes a 2D array of candidate points; the values are log10(alpha), log10(gamma)
    print(x)
    return get_score_KR((10**x[0][0], 10**x[0][1]))

This kind of function evaluates the model's accuracy for a given `alpha` and `gamma`; each call plays the role of one "experiment" whose result is fed back to the optimizer.
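Before handing it to GPyOpt, the function can be sanity-checked by hand. GPyOpt will call it with a 2D array of candidate points, so a manual call (the values here are just an example) looks like this:

import numpy as np

# log10(alpha) = 0, log10(gamma) = -1, i.e. alpha = 1.0, gamma = 0.1
print(get_score_KR_log(np.array([[0.0, -1.0]])))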


import GPyOpt

# Search both hyperparameters on a log10 scale between 1e-4 and 1e2
bounds = [{'name': 'log alpha', 'type': 'continuous', 'domain': (-4, 2)},
          {'name': 'log gamma', 'type': 'continuous', 'domain': (-4, 2)}]
bo = GPyOpt.methods.bayesian_optimization.BayesianOptimization(
    f=get_score_KR_log, domain=bounds, model_type='GP', acquisition_type='EI',
    initial_design_numdata=5, maximize=True)
bo.run_optimization(max_iter=10)  # additional cycles; increase to keep searching (up to 51 below)
bo.plot_acquisition()

[Figures: posterior mean and acquisition function after the initial 5 points] This is how the first 5 points look.

[Figures: posterior mean and acquisition function after 11 cycles] After 11 cycles it looks like this.

[Figures: posterior mean and acquisition function after 51 cycles] After 51 cycles it looks like this [^alpha]. The optimum found is $x = [-1.97439296, -0.25720405]$, that is, alpha = 0.0106 and gamma = 0.553.

[^alpha]: alpha seems to have been pushed toward the small end, probably because the dataset is small. If alpha is too small the regularization stops working and generalization performance drops, so we stop the search here.

# Rebuild the model with the optimized hyperparameters (bo.x_opt holds the log10 values)
predictor_opt = KernelRidge(kernel='rbf', alpha=10**bo.x_opt[0], gamma=10**bo.x_opt[1])
predictor_opt.fit(train_x, train_y)

predictor_opt.score(test_x, test_y)
#=> 0.8114250068143878

Checking the accuracy again with these values gives 0.81, a considerable improvement over the score before optimization. Success!

Comparison with grid search

In general, many references use "grid search," which samples the space uniformly, for hyperparameter tuning [^gs]. Let's search the same parameter space of $10^{-4}$ to $10^{2}$ with it.

[^gs]: For example, grid search and randomized parameter optimization are introduced in the scikit-learn documentation, 3.2. Tuning the hyper-parameters of an estimator, which notes that grid search is widely used.

from sklearn.model_selection import GridSearchCV

# 24 candidate values per parameter (1, 2, 4, 8 times each power of ten from 1e-4 to 1e1)
parameters = {'alpha': [i * 10**j for j in [-4, -3, -2, -1, 0, 1] for i in [1, 2, 4, 8]],
              'gamma': [i * 10**j for j in [-4, -3, -2, -1, 0, 1] for i in [1, 2, 4, 8]]}
gcv = GridSearchCV(KernelRidge(kernel='rbf'), parameters, cv=5)
gcv.fit(train_x, train_y)

bes = gcv.best_estimator_
bes.fit(train_x, train_y)
bes.score(test_x, test_y)
#=> 0.8097198949264954

[Figure: mean CV score over the (log alpha, log gamma) grid]

The shape is almost the same as the surface predicted by the Bayesian optimization. This grid search, however, runs the "experiment" at 24 values each of alpha and gamma, 576 points in total, which is painful when the dataset is large and each evaluation is slow.
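The score surface above can be reconstructed from gcv.cv_results_; here is one possible sketch. The reshape relies on scikit-learn's ParameterGrid ordering (keys sorted, with the last key, gamma, varying fastest), which you should treat as an assumption to verify.

import numpy as np
import matplotlib.pyplot as plt

alphas = parameters['alpha']
gammas = parameters['gamma']

# One mean CV score per (alpha, gamma) pair; rows = alpha, columns = gamma
scores = np.array(gcv.cv_results_['mean_test_score']).reshape(len(alphas), len(gammas))

plt.contourf(np.log10(gammas), np.log10(alphas), scores, levels=20)
plt.colorbar(label='mean CV score')
plt.xlabel('log10 gamma')
plt.ylabel('log10 alpha')
plt.show()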

Summary

In short, Bayesian optimization found parameters with essentially the same accuracy as the grid search in roughly 1/10 the number of experiments (about 5 + 51 = 56 evaluations versus 576)!

If you find any mistakes or have questions, please leave a comment.

The executable code and intermediate results for each section are available here:

What is Bayesian optimization?: BayesianOptimization_Explain
BayesianOptimization: BayesianOptimization_Benchmark
Hyperparameter optimization: BayesianOptimization_HyperparameterSearch

References

C.M. Bishop (translated by Hiroshi Motoda et al.), "Pattern Recognition and Machine Learning, Vols. 1-2: Statistical Prediction by Bayesian Theory", Maruzen Publishing, 2012
Daichi Mochihashi, Shigeyuki Oba, "Gaussian Processes and Machine Learning", Kodansha, 2019
Bayesian optimization package GPyOpt with Python
The mathematics of Bayesian optimization
Making the best Coke highball with Bayesian optimization
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011
GPyOpt
