PRML Chapter 6 Gaussian Process Regression Python Implementation

I have been implementing the PRML algorithms almost exclusively with NumPy, and with this chapter I have reached the second volume of the Japanese edition of PRML. The kernel methods of Chapter 6 compute the inner product of data points in feature space via a kernel function (the kernel trick) and then perform linear regression or linear classification in that space. I am not very familiar with Gaussian processes, so there may well be mistakes; for now, the code below is an implementation that at least appears to work.

Gaussian process

In the linear regression we worked on in Chapter 3 and elsewhere, the parameter ${\bf w}$ was given a Gaussian prior. Then $y = {\bf w}^{\rm T}\phi(x)$ follows a one-dimensional Gaussian distribution, and ${\bf y} = {\bf \Phi}{\bf w}$ follows a multidimensional Gaussian distribution, where ${\bf \Phi}$ is the design matrix for $\{x_1, \dots, x_N\}$. In such cases $p({\bf y})$ is said to follow a Gaussian process. **A Gaussian process is Gaussian not only in one dimension ($y$ a scalar) or in finitely many dimensions (${\bf y}$ a finite-dimensional vector), but also in infinitely many dimensions; it can be interpreted as an infinite-dimensional Gaussian distribution.**

The joint distribution of ${\bf y}$ is a Gaussian distribution whose mean is ${\bf 0}$ and whose covariance is the Gram matrix $K$:

p({\bf y}) = \mathcal{N}({\bf y}|{\bf 0},K)

Here the elements of the Gram matrix are $K_{nm} = k(x_n, x_m)$, where $k(x, x')$ is the kernel function. A Gaussian kernel $k(x, x') = a\exp\left(-b(x - x')^2\right)$, with constants $a$ and $b$, is often used. For $x_n$ and $x_m$ that are similar to each other, the correlation between $y(x_n)$ and $y(x_m)$ is high; what counts as "similar" depends on which kernel function you use.
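
To get a feel for this prior, here is a minimal sketch (separate from the implementation later in this post) that draws sample functions from $\mathcal{N}({\bf 0}, K)$ with the Gaussian kernel above; the constants a and b are chosen arbitrarily for illustration.

import numpy as np
import matplotlib.pyplot as plt

a, b = 1.0, 10.0
x = np.linspace(0, 1, 100)
# Gram matrix K_nm = k(x_n, x_m) for the Gaussian kernel a * exp(-b * (x - x')^2)
K = a * np.exp(-b * (x[:, None] - x[None, :]) ** 2)
# Draw a few sample functions from the prior N(0, K); a small jitter keeps K positive definite
samples = np.random.multivariate_normal(np.zeros(len(x)), K + 1e-10 * np.eye(len(x)), size=5)
for s in samples:
    plt.plot(x, s)
plt.show()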

Gaussian process regression

Let the observed target be $t_n = y_n + \epsilon_n$, where $y_n = {\bf w}^{\rm T}\phi(x_n)$ and $\epsilon_n$ is Gaussian noise added to the $n$th observation. With precision parameter $\beta$,

p(t_n|y_n) = \mathcal{N}(t_n|y_n,\beta^{-1})

so that, with ${\bf t} = (t_1, \dots, t_N)^{\rm T}$,

p({\bf t}|{\bf y}) = \mathcal{N}({\bf t}|{\bf y},\beta^{-1}{\bf I}_N)

Since the joint distribution of ${\bf y}$ was defined above, we can marginalize it out to obtain the distribution of ${\bf t}$:

\begin{align}
p({\bf t}) &= \int p({\bf t}|{\bf y})p({\bf y})d{\bf y}\\
&= \int \mathcal{N}({\bf t}|{\bf y},\beta^{-1}{\bf I}_N)\mathcal{N}({\bf y}|{\bf 0},K)d{\bf y}\\
&= \mathcal{N}({\bf t}|{\bf 0},{\bf C}_N)
\end{align}

where ${\bf C}_N = K + \beta^{-1}{\bf I}_N$.

Let $t_{N+1}$ be the new target to be predicted in addition to the $N$ observations ${\bf t}$. From the discussion above, the joint distribution of ${\bf t}$ and $t_{N+1}$ is

p({\bf t},t_{N+1}) = \mathcal{N}({\bf t},t_{N+1}|{\bf 0}, {\bf C}_{N+1})

where, with ${\bf k} = \left(k(x_1, x_{N+1}), \dots, k(x_N, x_{N+1})\right)^{\rm T}$ and $c = k(x_{N+1}, x_{N+1}) + \beta^{-1}$, the covariance matrix is

{\bf C}_{N+1} =
\begin{bmatrix}
{\bf C}_N & {\bf k}\\
{\bf k}^{\rm T} & c
\end{bmatrix}

Since we now know the joint distribution of ${\bf t}$ and $t_{N+1}$, the conditional distribution $p(t_{N+1}|{\bf t})$ follows from the standard formulas for conditioning a Gaussian:

p(t_{N+1}|{\bf t}) = \mathcal{N}(t_{N+1}|{\bf k}^{\rm T}{\bf C}_N^{-1}{\bf t},c-{\bf k}^{\rm T}{\bf C}_N^{-1}{\bf k})
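
As a minimal sketch of how these formulas translate into NumPy (the class-based implementation below wraps the same computation; the function and variable names here are only illustrative):

import numpy as np

def gp_predict(x_train, t_train, x_new, kernel, beta):
    # C_N = K + beta^{-1} I  (PRML eq. 6.62)
    K = kernel(x_train[:, None], x_train[None, :])
    C_N = K + np.eye(len(x_train)) / beta
    # k and c for the new input  (PRML eq. 6.65)
    k = kernel(x_train, x_new)
    c = kernel(x_new, x_new) + 1 / beta
    # Predictive mean and variance  (PRML eqs. 6.66, 6.67)
    mean = k @ np.linalg.solve(C_N, t_train)
    var = c - k @ np.linalg.solve(C_N, k)
    return mean, var

Here kernel could be, for example, lambda x, y: np.exp(-0.5 * (x - y) ** 2), and beta is the noise precision.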

Learning hyperparameters

The discussion above assumed a fixed kernel function $k(x, x')$; for example, the constants $a, b$ in the kernel $k(x, x') = a\exp(-b(x - x')^2)$ were fixed. How, then, should the constants used in the kernel function be chosen? Estimating them corresponds to the hyperparameter estimation implemented in PRML Chapter 3, where the evidence function $p({\bf t}|\theta)$ was maximized. Here $\theta$ denotes the kernel parameters, for example $\theta = (a, b)^{\rm T}$. The log evidence function is

\ln p({\bf t}|\theta) = -{1\over2}\ln |{\bf C}_N| - {1\over2}{\bf t}^{\rm T}{\bf C}_N^{-1}{\bf t} - {N\over2}\ln(2\pi)

Differentiating this with respect to the parameters gives

{\partial\over\partial\theta_i}\ln p({\bf t}|\theta) = -{1\over2}{\rm Tr}({\bf C}_N^{-1}{\partial{\bf C}_N\over\partial\theta_i}) + {1\over2}{\bf t}^{\rm T}{\bf C}_N^{-1}{\partial{\bf C}_N\over\partial\theta_i}{\bf C}_N^{-1}{\bf t}

I updated the parameters with the gradient method, i.e. gradient ascent on the log evidence.
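
The implementation below uses only this gradient, but it can be useful to evaluate the log evidence itself, for example to monitor convergence. A minimal sketch of the log evidence above, assuming ${\bf C}_N$ and ${\bf t}$ have already been computed (the function name is only illustrative):

import numpy as np

def log_evidence(C_N, t):
    # ln p(t|theta) = -1/2 ln|C_N| - 1/2 t^T C_N^{-1} t - N/2 ln(2 pi)
    _, logdet = np.linalg.slogdet(C_N)      # numerically stable log-determinant
    quad = t @ np.linalg.solve(C_N, t)      # quadratic form t^T C_N^{-1} t
    return -0.5 * logdet - 0.5 * quad - 0.5 * len(t) * np.log(2 * np.pi)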

Implementation

Library

import matplotlib.pyplot as plt
import numpy as np

Two libraries are imported: matplotlib for drawing graphs and numpy for matrix calculations.

Kernel function

This time I used $a\exp\left(-{b\over2}(x - x')^2\right)$ as the kernel function.

# Kernel function class
class GaussianKernel(object):
    # Initialize the kernel function parameters a, b
    def __init__(self, params):
        assert np.shape(params) == (2,)
        self.__params = params

    # Return the kernel function parameters a, b
    def get_params(self):
        return np.copy(self.__params)

    # Evaluate the kernel function for inputs x, y  PRML eq. (6.63)
    def __call__(self, x, y):
        return self.__params[0] * np.exp(-0.5 * self.__params[1] * (x - y) ** 2)

    # Derivatives of the kernel with respect to its parameters, for inputs x, y
    def derivatives(self, x, y):
        sq_diff = (x - y) ** 2
        # Derivative with respect to parameter a
        delta_0 = np.exp(-0.5 * self.__params[1] * sq_diff)
        # Derivative with respect to parameter b
        delta_1 = -0.5 * sq_diff * delta_0 * self.__params[0]
        return (delta_0, delta_1)

    # Update the kernel function parameters
    def update_parameters(self, updates):
        assert np.shape(updates) == (2,)
        self.__params += updates

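As a quick check of this class, a Gram matrix over a few training inputs can be built by evaluating the kernel on every pair of points via np.meshgrid, which is exactly how the fit method below uses it (the input values here are only illustrative):

kernel = GaussianKernel(params=np.array([1., 1.]))
x = np.array([0.1, 0.4, 0.9])
# meshgrid enumerates all (x_n, x_m) pairs, so this is the 3x3 Gram matrix
Gram = kernel(*np.meshgrid(x, x))
print(Gram.shape)  # (3, 3)
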
Gaussian process regression

# Class that performs Gaussian process regression
class GaussianProcessRegression(object):
    # Set the kernel function and the noise precision parameter
    def __init__(self, kernel, beta=1.):
        self.kernel = kernel
        self.beta = beta

    # Regression without estimating the kernel function parameters
    def fit(self, x, t):
        self.x = x
        self.t = t
        # Gram matrix  PRML eq. (6.54)
        Gram = self.kernel(*np.meshgrid(x, x))
        # Covariance matrix C_N  PRML eq. (6.62)
        self.covariance = Gram + np.identity(len(x)) / self.beta
        # Precision matrix (inverse of the covariance matrix)
        self.precision = np.linalg.inv(self.covariance)

    # Regression including estimation of the kernel function parameters
    def fit_kernel(self, x, t, learning_rate=0.1, iter_max=10000):
        for i in range(iter_max):
            params = self.kernel.get_params()
            # Regression with the current kernel parameters
            self.fit(x, t)
            # Derivatives of the Gram matrix with respect to the kernel parameters
            gradients = self.kernel.derivatives(*np.meshgrid(x, x))
            # Parameter updates from the gradient of the log evidence  PRML eq. (6.70)
            updates = np.array(
                [-np.trace(self.precision.dot(grad)) + t.dot(self.precision.dot(grad).dot(self.precision).dot(t)) for grad in gradients])
            # Update the parameters
            self.kernel.update_parameters(learning_rate * updates)
            # Stop updating once the parameter change is small
            if np.allclose(params, self.kernel.get_params()):
                break
        else:
            # Printed if the parameters have not converged within the given number of iterations
            print("parameters may not have converged")

    # Return the predictive distribution
    def predict_dist(self, x):
        K = self.kernel(*np.meshgrid(x, self.x, indexing='ij'))
        # Mean of the predictive distribution  PRML eq. (6.66)
        mean = K.dot(self.precision).dot(self.t)
        # Variance of the predictive distribution  PRML eq. (6.67)
        var = self.kernel(x, x) + 1 / self.beta - np.sum(K.dot(self.precision) * K, axis=1)
        return mean.ravel(), np.sqrt(var.ravel())

Main function

Replacing regression.fit_kernel(...) with regression.fit(x, t) performs the regression without estimating the kernel parameters. The helper create_toy_data used here is defined in the full code below.

def main():
    # Function that the training data follows
    def func(x):
        return np.sin(2 * np.pi * x)
    # Generate the training data
    x, t = create_toy_data(func, high=0.7, std=0.1)

    # Set up the kernel function with its initial parameters
    kernel = GaussianKernel(params=np.array([1., 1.]))
    # Set up Gaussian process regression with the kernel and the noise precision parameter (true value)
    regression = GaussianProcessRegression(kernel=kernel, beta=100)
    # Regression including estimation of the kernel parameters; replace with fit(x, t) to skip the estimation
    regression.fit_kernel(x, t, learning_rate=0.1, iter_max=10000)

    # Output the predictive distribution for the test data
    x_test = np.linspace(0, 1, 100)
    y, y_std = regression.predict_dist(x_test)

    # Plot the regression results
    plt.scatter(x, t, alpha=0.5, color="blue", label="observation")
    plt.plot(x_test, func(x_test), color="blue", label="sin$(2\pi x)$")
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend(loc="lower left")
    plt.show()

Whole code

gaussian_process_regression.py


import matplotlib.pyplot as plt
import numpy as np


class GaussianKernel(object):

    def __init__(self, params):
        assert np.shape(params) == (2,)
        self.__params = params

    def get_params(self):
        return np.copy(self.__params)

    def __call__(self, x, y):
        return self.__params[0] * np.exp(-0.5 * self.__params[1] * (x - y) ** 2)

    def derivatives(self, x, y):
        sq_diff = (x - y) ** 2
        delta_0 = np.exp(-0.5 * self.__params[1] * sq_diff)
        delta_1 = -0.5 * sq_diff * delta_0 * self.__params[0]
        return (delta_0, delta_1)

    def update_parameters(self, updates):
        assert np.shape(updates) == (2,)
        self.__params += updates


class GaussianProcessRegression(object):

    def __init__(self, kernel, beta=1.):
        self.kernel = kernel
        self.beta = beta

    def fit(self, x, t):
        self.x = x
        self.t = t
        Gram = self.kernel(*np.meshgrid(x, x))
        self.covariance = Gram + np.identity(len(x)) / self.beta
        self.precision = np.linalg.inv(self.covariance)

    def fit_kernel(self, x, t, learning_rate=0.1, iter_max=10000):
        for i in range(iter_max):
            params = self.kernel.get_params()
            self.fit(x, t)
            gradients = self.kernel.derivatives(*np.meshgrid(x, x))
            updates = np.array(
                [-np.trace(self.precision.dot(grad)) + t.dot(self.precision.dot(grad).dot(self.precision).dot(t)) for grad in gradients])
            self.kernel.update_parameters(learning_rate * updates)
            if np.allclose(params, self.kernel.get_params()):
                break
        else:
            print "parameters may not have converged"

    def predict_dist(self, x):
        K = self.kernel(*np.meshgrid(x, self.x, indexing='ij'))
        mean = K.dot(self.precision).dot(self.t)
        var = self.kernel(x, x) + 1 / self.beta - np.sum(K.dot(self.precision) * K, axis=1)
        return mean.ravel(), np.sqrt(var.ravel())


def create_toy_data(func, low=0, high=1., n=10, std=1.):
    x = np.random.uniform(low, high, n)
    t = func(x) + np.random.normal(scale=std, size=n)
    return x, t


def main():

    def func(x):
        return np.sin(2 * np.pi * x)

    x, t = create_toy_data(func, high=0.7, std=0.1)

    kernel = GaussianKernel(params=np.array([1., 1.]))
    regression = GaussianProcessRegression(kernel=kernel, beta=100)
    regression.fit_kernel(x, t, learning_rate=0.1, iter_max=10000)

    x_test = np.linspace(0, 1, 100)
    y, y_std = regression.predict_dist(x_test)

    plt.scatter(x, t, alpha=0.5, color="blue", label="observation")
    plt.plot(x_test, func(x_test), color="blue", label="sin$(2\pi x)$")
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend(loc="lower left")
    plt.show()

if __name__ == '__main__':
    main()

Result

Running this produces a figure like Figure 6.8 in PRML. The variance of the predictive distribution is large in regions without training data and small in regions with training data. (result_mle.png)

For comparison, the figure below shows the result on the same training data without maximum likelihood estimation of the hyperparameters. The predictive distribution above fits the training data (and the test data) better. (result.png)

In closing

In the end, what we are doing is similar to Bayesian linear regression, except that the feature vectors never appear explicitly: the kernel trick lets us regress using only their inner products. Although it is not covered in PRML, Gaussian processes are also used in Bayesian optimization, a method that has been attracting attention recently and is used to tune parameters such as neural network learning rates. I would like to implement Bayesian optimization if I get the chance.
