"Gaussian process and machine learning" Gaussian process regression implemented only with Python numpy

I recently read ["Gaussian Process and Machine Learning"](https://www.amazon.co.jp/ Gauss Process and Machine Learning-Machine Learning Professional Series-Mochihashi-Earth / dp / 4061529269), so I tried implementing it in Python. The content covered here is "3.4.2 Computing Gaussian Process Regression" from Chapter 3. The deer on the cover is the book's trademark. It is easy to understand and highly recommended.

What is a Gaussian process?

First of all, to describe a Gaussian process in one sentence: for y = f(x), it is

a tool for mathematically expressing the property that **"if the inputs X are similar, the outputs Y are also similar."**

This time, we will implement "Gaussian process regression", which is a regression analysis using this Gaussian process!

4 steps to Gaussian process regression

Here, we will explain by taking the following steps.

1. Linear regression
2. Ridge regression
3. Linear regression and ridge regression viewed through the Gaussian distribution (probability distribution)
4. Gaussian process regression

There are two reasons to take the above steps.

**First, Gaussian process regression is a non-linear model that extends ridge regression.** Therefore, understanding ridge regression is a prerequisite for understanding Gaussian process regression, and understanding linear regression is in turn a prerequisite for ridge regression. That is why the explanation starts from linear regression, the most basic premise. (It's a case of "the longest way round is the shortest way home.")

**Secondly, Gaussian process regression obtains its predicted values by sampling from a Gaussian distribution (probability distribution).** As explained in detail later, the major difference between Gaussian process regression and ridge regression (or linear regression) is that the variance of the Gaussian distribution being sampled changes according to the Gaussian process. Therefore, ridge regression (and linear regression) are also presented here as predicting by sampling from a Gaussian distribution, so that the difference from Gaussian process regression becomes clear.

1. Linear regression

As a first step in understanding Gaussian process regression, let's start with a review of linear regression.

For linear regression, let X be the input matrix, w be the coefficient vector, and y be the output vector.

y = Xw \\
s.t. \min_{w} \|Y - Xw\|^2

It can be expressed by the formulas above. The condition after s.t. states the objective of linear regression:

**Minimize the squared error between the measured values Y and the predicted values (y =) Xw.**

Specifically, the solution for the coefficients w is obtained by differentiating this objective with respect to w and setting the derivative to zero:

w=(X^TX)^{-1}X^Ty

Running linear regression on sample data produces a fit like the one in the original figure (figure omitted).

This time the prediction was a linear function, but if the input X is augmented with higher-order terms such as squares and cubes, the prediction can likewise become a higher-order function such as a quadratic or cubic. A minimal sketch of the closed-form solution is shown below.
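As a rough illustration (my own sketch, not code from the book), here is the normal-equation solution above in NumPy; the toy data, the `poly_features` helper, and the chosen degree are all assumptions made only for this example.

```python
import numpy as np

def poly_features(x, degree):
    # Stack [1, x, x^2, ..., x^degree] as columns (hypothetical helper)
    return np.vander(x, degree + 1, increasing=True)

def fit_linear(X, y):
    # Normal equation: w = (X^T X)^{-1} X^T y
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Toy data (assumed for illustration only)
x = np.linspace(0, 1, 20)
y = 1.5 * x + 0.3 + 0.1 * np.random.randn(len(x))

X = poly_features(x, 1)   # design matrix with a bias column
w = fit_linear(X, y)
y_pred = X @ w            # predicted values y = Xw
```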

2. Ridge regression

Next is ridge regression. Ridge regression is linear regression with an additional constraint term.

y = Xw \\
s.t. \min_{w} \|Y - Xw\|^2 + \alpha \|w\|^2

The condition after s.t. now also contains a squared (constant-multiplied) term of the coefficient vector w. This means:

Minimize the squared error between the measured values Y and the predicted values (y =) Xw, **and keep the coefficients w as small as possible.**

This additional constraint provides two benefits.

  1. Stabilization of calculation

When finding the coefficients w of linear regression, the calculation implicitly assumed that the following inverse matrix exists:

(X^TX)^{-1}

If this inverse matrix does not exist, the calculation simply cannot be carried out.

In ridge regression, the squared (constant-multiplied) term of the coefficient vector w is added to the objective, and differentiating with respect to w just as in linear regression gives

w=(X^TX + \alpha I)^{-1}X^Ty

Because the added term αI guarantees that the matrix being inverted is invertible, the calculation always goes through.

  2. Prevention of overfitting

In the referenced figure (omitted here), the left panel shows an overfit linear regression and the right panel shows the ridge regression result. (Graph quoted from "Let's suppress overfitting of linear regression - Ridge regression and Lasso regression -".)

As explained earlier, linear regression can also fit high-order functions, so the regression equation can become overly complicated; this phenomenon is called overfitting.

As the graph shows, ridge regression prevents such overfitting by deliberately keeping the coefficients small when computing the predicted values. A minimal sketch of the ridge solution follows.
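Here is a minimal NumPy sketch of the ridge closed-form solution, reusing the hypothetical toy data and `poly_features` helper from the linear-regression sketch above; the value of `alpha` is an arbitrary assumption.

```python
def fit_ridge(X, y, alpha):
    # w = (X^T X + alpha*I)^{-1} X^T y
    n_features = X.shape[1]
    return np.linalg.inv(X.T @ X + alpha * np.eye(n_features)) @ X.T @ y

X_poly = poly_features(x, 9)               # deliberately high-order features
w_ridge = fit_ridge(X_poly, y, alpha=1.0)
y_ridge = X_poly @ w_ridge                 # smoother than an unregularized high-order fit
```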

3. The Gaussian distribution, linear regression, and ridge regression

When the linear regression and ridge regression described so far are actually used, some error remains between the predicted values y and the measured values Y.

In other words, taking this prediction error into account, the predicted value y for a given x can be regarded as a sample from the Gaussian distribution

N(w^Tx, \sigma ^2)

Here, the coefficients w of linear regression and ridge regression have already been obtained above, so substituting them in gives:

Linear regression

N \Bigl(\bigl((X^TX)^{-1}X^Ty \bigr)^Tx, \sigma ^2 \Bigr)

Ridge regression

N \Bigl(\bigl((X^TX + \alpha I)^{-1}X^Ty \bigr)^Tx, \sigma ^2 \Bigr)

Sampling from these distributions yields the respective predicted values y.

The point to keep in mind here is:

**The mean of the Gaussian distribution differs between linear regression and ridge regression,**

**but the variance stays the same; both assume "homoscedasticity."**

A small sampling sketch is shown below.
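As a rough illustration of this view (my own sketch, not from the book), the following samples predictions from N(w^T x, σ²) using the hypothetical ridge fit from the earlier sketch; the noise level `sigma` is an arbitrary assumption.

```python
sigma = 0.1                          # assumed observation noise, not from the book

# Mean of the predictive Gaussian for each input: w^T x
mean_pred = X_poly @ w_ridge

# One sample per input from N(w^T x, sigma^2); the variance is the same everywhere
y_sampled = np.random.normal(loc=mean_pred, scale=sigma)
```

The scale (standard deviation) is a single constant here, which is exactly the "homoscedasticity" assumption that Gaussian process regression will drop.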

4. Gaussian process regression

Finally, I will explain Gaussian process regression!

Gaussian process regression is ridge regression with the defining property of the Gaussian process,

**"if the inputs X are similar, the outputs Y are also similar,"**

added as a constraint.

In other words, if we regard the predicted value y of Gaussian process regression as a value sampled from some Gaussian distribution, just as for the linear regression and ridge regression above, then the Gaussian distribution followed by Gaussian process regression

**does not assume "homoscedasticity" the way ridge regression does:**

**the variance of the predicted values y, y' changes according to the distance between the inputs x, x' (that is, according to how similar they are).**

With this as a premise, let's check the Gaussian distribution that Gaussian process regression follows.

N(k_*^TK^{-1}y, k_{**}-k_*^TK^{-1}k_*)

Here, the unfamiliar symbols k and K appear.

These are called kernels, and they are values computed from pairs of inputs x.

I will explain the kernel in detail later, but for now:

- K: the kernel (Gram) matrix computed between the training-data inputs x
- k_*: the kernel vector computed between the training-data inputs x and a test-data input x
- k_**: the kernel value computed between the test-data inputs x

Since both the mean and the variance of the Gaussian distribution followed by Gaussian process regression contain these kernels, you can see that

**the positional relationships between the various inputs x affect both the mean and the variance of that distribution.**
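For reference, this predictive distribution is the standard Gaussian conditioning identity (assuming the outputs have been centered to zero mean): writing the joint Gaussian of the training outputs y and a test output y_* and conditioning on y gives

\begin{pmatrix} y \\ y_* \end{pmatrix} \sim N\left(0,\; \begin{pmatrix} K & k_* \\ k_*^T & k_{**} \end{pmatrix}\right)
\quad\Rightarrow\quad
y_* \mid y \sim N\bigl(k_*^T K^{-1} y,\; k_{**} - k_*^T K^{-1} k_*\bigr)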

Implementation of Gaussian process regression

Data creation

This time, prepare the following data.

- Original data: an artificially created mixed signal with noise
- Training data: sample points randomly drawn from part of the original data
- Prediction target: the original noisy mixed signal

```python
import numpy as np

# Create the original data
n = 100
data_x = np.linspace(0, 4*np.pi, n)
data_y = 2*np.sin(data_x) + 3*np.cos(2*data_x) + 5*np.sin(2/3*data_x) + np.random.randn(len(data_x))

# Randomly pick a subset of points to use as training samples
missing_value_rate = 0.2
sample_index = np.sort(np.random.choice(np.arange(n), int(n*missing_value_rate), replace=False))
```

Now, let's plot the data with matplotlib's pyplot.

```python
from matplotlib import pyplot as plt
%matplotlib inline

plt.figure(figsize=(12, 5))
plt.title('signal data', fontsize=20)

# Original signal
plt.plot(data_x, data_y, 'x', color='green', label='correct signal')

# Sampled training points
plt.plot(data_x[sample_index], data_y[sample_index], 'o', color='red', label='sample dots')

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=12)
plt.show()
```

The result looks as follows (figure omitted).

The data is quite complicated; can the Gaussian process really predict it? (I was a little uneasy.)

Kernel definition

1. About the kernel

The kernel appears rather abruptly here, but roughly speaking it is

**a tool for obtaining the covariance matrix of Gaussian process regression without computing the coefficient vector w.**

Earlier, I explained that Gaussian process regression is, in effect, ridge regression whose predicted value y comes with both a mean and a variance.

Moreover, this mean and variance are computed from the Gaussian-process property that "if the inputs X are similar, the outputs Y are also similar."

For example, if the input X is one-dimensional data running from -10 to 10 at intervals of 1.0, there are 21 inputs; the predicted values y (= Xw) of the Gaussian process are derived from the similarity (positional relationship) of these inputs, so there are 21 values of y and the coefficient vector w has 21 dimensions.

If the input X is two-dimensional, the coefficient vector becomes a matrix and its elements grow to 441. Going further to 3 dimensions, 4 dimensions, ..., the number of coefficient elements grows exponentially and soon becomes far too large to compute.

Moreover, not only the dimension but also the range of the input X may grow, so no amount of computing power is enough for this direct approach.

That's where the kernel comes in.

When the kernel is used, the covariance matrix needed for Gaussian process regression can be computed directly from the similarity between the two outputs y and y' corresponding to two inputs x and x', without ever computing the coefficient vector (or matrix) w. A tiny numerical sketch of this idea follows.
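As a rough numerical illustration (my own sketch, not from the book): if we did use an explicit feature matrix Φ with a Gaussian prior on w, the covariance of y = Φw would be ΦΦᵀ (up to scaling), and a kernel produces essentially that matrix directly from the inputs.

```python
import numpy as np

x = np.linspace(-1, 1, 5)

# Explicit (hypothetical) polynomial features: covariance of y = Phi @ w
Phi = np.vander(x, 3, increasing=True)      # columns 1, x, x^2
cov_from_features = Phi @ Phi.T             # requires materializing the features

# Comparable covariance from a polynomial kernel, with no features and no w
cov_from_kernel = (1 + np.outer(x, x))**2   # k(x, x') = (1 + x x')^2
```

The two matrices differ only in how the individual feature terms are weighted; the point is that the kernel route never materializes w or Φ, which is what keeps high-dimensional inputs tractable.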

2. Kernel implementation

There are several types of kernels, but this time we will use a Gaussian kernel (RBF kernel) with an added Gaussian-error (noise) term.

The Gaussian kernel has the property that ***"its value decays exponentially, like a Gaussian distribution, with the distance between the inputs x and x'."***

The formula is expressed as follows:

k(x, x') = \theta_1 \exp\left(-\frac{(x - x')^2}{\theta_2}\right) + \theta_3\, \delta(x, x')

The three parameter values used this time are:

- θ1, θ2: parameters of the Gaussian kernel
- θ3: variance of the Gaussian error (noise)

The Gaussian error function δ branches as follows:

- δ(x, x') = 1 when the two inputs x and x' are equal
- δ(x, x') = 0 otherwise

Now let's implement the expression as a function.

```python
# The Gaussian kernel as a function
def kernel(x, x_prime, p, q, r):
    # delta is 1 only when the two inputs are equal
    if x == x_prime:
        delta = 1
    else:
        delta = 0

    return p*np.exp(-1 * (x - x_prime)**2 / q) + (r * delta)
```

Here θ1, θ2, θ3 correspond to the arguments p, q, r, and x, x' correspond to x, x_prime.
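A quick sanity check of the function (the parameter values below are just assumed for illustration): with θ1 = 1.0, θ2 = 0.4, θ3 = 0.1, identical inputs return θ1 + θ3, and far-apart inputs return almost 0.

```python
print(kernel(1.0, 1.0, 1.0, 0.4, 0.1))   # 1.1   (1.0 * exp(0) + 0.1, since delta = 1)
print(kernel(1.0, 3.0, 1.0, 0.4, 0.1))   # ~4.5e-5, decays with the squared distance
```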

Algorithm implementation

From here, we will proceed according to "Fig. 3.17 Basic algorithm for calculating Gaussian process regression" in Chapter 3.

For now, following the notation in the book, split the signal data created above into "training data" and "test data" and redefine them with clearer names.

```python
# Data definition
xtrain = np.copy(data_x[sample_index])
ytrain = np.copy(data_y[sample_index])

xtest = np.copy(data_x)
```

Now let's get into the implementation.

Here, in the Gaussian distribution followed by Gaussian process regression presented earlier,

N(k_*^TK^{-1}y, k_{**}-k_*^TK^{-1}k_*)

both the mean and the variance are calculated using the Gaussian kernel.

```python
# Means of the predictive distribution
mu = []
# Variances of the predictive distribution
var = []

# Parameter values
Theta_1 = 1.0
Theta_2 = 0.4
Theta_3 = 0.1

# Below: the basic algorithm for computing Gaussian process regression
train_length = len(xtrain)
# Prepare the (empty) kernel matrix between training data
K = np.zeros((train_length, train_length))

for x in range(train_length):
    for x_prime in range(train_length):
        K[x, x_prime] = kernel(xtrain[x], xtrain[x_prime], Theta_1, Theta_2, Theta_3)

# Invert K once and reuse it
K_inv = np.linalg.inv(K)
# Inner product computed with np.dot
yy = np.dot(K_inv, ytrain)

test_length = len(xtest)
for x_test in range(test_length):

    # Prepare the (empty) kernel vector between the test point and the training data
    k = np.zeros((train_length,))
    for x in range(train_length):
        k[x] = kernel(xtrain[x], xtest[x_test], Theta_1, Theta_2, Theta_3)

    s = kernel(xtest[x_test], xtest[x_test], Theta_1, Theta_2, Theta_3)

    # Mean: k_*^T K^{-1} y (inner product via np.dot)
    mu.append(np.dot(k, yy))
    # Variance: first the "k_*^T K^{-1}" part (inner product via np.dot)
    kK_ = np.dot(k, K_inv)
    # then the rest: k_** - k_*^T K^{-1} k_*
    var.append(s - np.dot(kK_, k.T))
```

It's a little long, so let me explain it in order.

What we want are the predicted means and variances of the signal. (Since the predictions use the same grid as the original data, 100 means and 100 variances end up in the lists.)

The parameter values are also specified at the top.

The content of the algorithm is the same as in the book, but the following changes are needed when implementing it in Python:

- Before computing K[x, x'], the kernel matrix K has to be allocated. (Here it is created as a zero matrix and its elements are then overwritten.)
- The "*" in the book denotes the matrix inner product, so it is rewritten as np.dot().
- k[x] likewise has to be allocated before it is filled in.
- The product of the three matrices in the variance calculation is split into two np.dot() calls.

If you keep the above points in mind, the Gaussian process regression itself is done! A more vectorized sketch of the same computation is shown below.
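For reference, here is a more vectorized sketch of the same computation (my own rewrite, not the book's algorithm), using np.linalg.solve instead of an explicit inverse; the results should match the loop version up to numerical precision.

```python
def kernel_matrix(xa, xb, p, q, r):
    # Pairwise squared distances via broadcasting
    sqdist = (xa[:, None] - xb[None, :])**2
    K = p * np.exp(-sqdist / q)
    # Add the noise term only where the inputs are identical
    return K + r * (sqdist == 0)

K_train = kernel_matrix(xtrain, xtrain, Theta_1, Theta_2, Theta_3)
k_star = kernel_matrix(xtrain, xtest, Theta_1, Theta_2, Theta_3)   # train x test
k_ss = np.full(len(xtest), Theta_1 + Theta_3)                      # k(x, x) for each test point

alpha_vec = np.linalg.solve(K_train, ytrain)       # K^{-1} y without forming the inverse
mu_vec = k_star.T @ alpha_vec                      # predictive means
v = np.linalg.solve(K_train, k_star)               # K^{-1} k_*
var_vec = k_ss - np.sum(k_star * v, axis=0)        # predictive variances
```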

Finally, let's plot the result with matplotlib's pyplot.

```python
plt.figure(figsize=(12, 5))
plt.title('signal prediction by Gaussian process', fontsize=20)

# Original signal
plt.plot(data_x, data_y, 'x', color='green', label='correct signal')
# Sampled training points
plt.plot(data_x[sample_index], data_y[sample_index], 'o', color='red', label='sample dots')

# Convert the variances to standard deviations
std = np.sqrt(var)

# Plot the means obtained by the Gaussian process
plt.plot(xtest, mu, color='blue', label='mean by Gaussian process')
# Plot a band of mean ± 2 standard deviations (see the note at the end of the code)
plt.fill_between(xtest, mu + 2*std, mu - 2*std, alpha=.2, color='blue', label='standard deviation by Gaussian process')

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=12)
plt.show()

# Mean ± (standard deviation x 2): the value falls in this range with probability 95.4%
```

The result looks as follows (figure omitted).

As noted in the code, the predicted variance is converted to a standard deviation, and the shaded prediction band of the Gaussian process is two standard deviations wide.

| Range | Probability that the value falls in the range |
| --- | --- |
| Mean ± standard deviation | 68.3% |
| Mean ± (standard deviation × 2) | 95.4% |
| Mean ± (standard deviation × 3) | 99.7% |
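For reference, these probabilities are simply the coverage values of a standard normal distribution, which can be verified with only the Python standard library:

```python
from math import erf, sqrt

for k in (1, 2, 3):
    # P(|Z| <= k) for a standard normal Z
    print(k, erf(k / sqrt(2)))   # 0.683, 0.954, 0.997
```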

Benefits of Gaussian process

Gaussian process regression has **two major advantages**.

1. Non-linear model is easily obtained

With linear regression and ridge regression, building a non-linear model meant preparing and fitting several nth-order feature functions. **"What order of function should be used"** therefore becomes a bottleneck, and considerable effort has to be spent on model construction.

In Gaussian process regression, however, **using a kernel makes it possible to obtain a non-linear model without preparing nth-order functions.**

So although the calculation process and the learning algorithm are more involved, a complicated model can be built with little effort.

2. Partial prediction accuracy can be confirmed

Looking at the graph produced by the algorithm above, the rough shape of the original data is captured, but the prediction is quite wrong **where the gaps between sample points are large (around 8 on the horizontal axis, for example)**.

However, exactly those mispredicted parts have **a wider variance than the rest**, as if the Gaussian process itself were confessing that it is not confident about the prediction there.

In other words, the Gaussian process is ** a model that can confirm partial prediction accuracy **.

References / Reference Links

["Gauss Process and Machine Learning"](https://www.amazon.co.jp/ Gauss Process and Machine Learning-Machine Learning Professional Series-Mochihashi-Earth / dp / 4061529269) Let's suppress overfitting of linear regression-Ridge regression and Lasso regression- to-kei.net -Statistics for all humankind-
