PRML Chapter 6 Gaussian Process Regression Python Implementation

I have been implementing the PRML algorithms almost exclusively with NumPy, and with this chapter I have reached the second volume of the Japanese edition of PRML. The kernel methods of Chapter 6 compute the inner product of data points in feature space via a kernel function (the kernel trick) and then perform linear regression or linear classification in that space. I am not very familiar with Gaussian processes, so there may well be mistakes; for now, the code below is an implementation that at least appears to work.

Gaussian process

In the linear regression we worked on in Chapter 3 and elsewhere, the parameter ${\bf w}$ was given a Gaussian prior. Then $y = {\bf w}^{\rm T}\phi(x)$ follows a one-dimensional Gaussian distribution, and ${\bf y} = {\bf \Phi}{\bf w}$ follows a multidimensional Gaussian distribution, where ${\bf \Phi}$ is the design matrix for $\{x_1, \dots, x_N\}$. In such cases $p({\bf y})$ is said to follow a Gaussian process. **A Gaussian process is Gaussian not only in one dimension ($y$ a scalar) or in finitely many dimensions (${\bf y}$ a finite-dimensional vector), but also in infinitely many dimensions; it can be interpreted as an infinite-dimensional Gaussian distribution.**

The joint distribution of ${\bf y}$ is a Gaussian distribution whose mean is ${\bf 0}$ and whose covariance is the Gram matrix $K$:

p({\bf y}) = \mathcal{N}({\bf y}|{\bf 0},K)

Here the elements of the Gram matrix are $K_{nm} = k(x_n, x_m)$, where $k(x, x')$ is the kernel function. A Gaussian kernel $k(x, x') = a\exp\left(-b(x - x')^2\right)$, with constants $a$ and $b$, is often used. For $x_n$ and $x_m$ that are similar to each other, the correlation between $y(x_n)$ and $y(x_m)$ is high; what counts as "similar" depends on which kernel function you use.
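
To get a feel for this prior, here is a minimal sketch (separate from the implementation later in this post) that draws sample functions from $\mathcal{N}({\bf 0}, K)$ with the Gaussian kernel above; the constants a and b are chosen arbitrarily for illustration.

import numpy as np
import matplotlib.pyplot as plt

a, b = 1.0, 10.0
x = np.linspace(0, 1, 100)
# Gram matrix K_nm = k(x_n, x_m) for the Gaussian kernel a * exp(-b * (x - x')^2)
K = a * np.exp(-b * (x[:, None] - x[None, :]) ** 2)
# Draw a few sample functions from the prior N(0, K); a small jitter keeps K positive definite
samples = np.random.multivariate_normal(np.zeros(len(x)), K + 1e-10 * np.eye(len(x)), size=5)
for s in samples:
    plt.plot(x, s)
plt.show()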

Gaussian process regression

Let the observed target be $t_n = y_n + \epsilon_n$, where $y_n = {\bf w}^{\rm T}\phi(x_n)$ and $\epsilon_n$ is Gaussian noise added to the $n$th observation. With precision parameter $\beta$,

p(t_n|y_n) = \mathcal{N}(t_n|y_n,\beta^{-1})

so that, with ${\bf t} = (t_1, \dots, t_N)^{\rm T}$,

p({\bf t}|{\bf y}) = \mathcal{N}({\bf t}|{\bf y},\beta^{-1}{\bf I}_N)

Since the joint distribution of ${\bf y}$ was defined above, we can marginalize it out to obtain the distribution of ${\bf t}$:

\begin{align}
p({\bf t}) &= \int p({\bf t}|{\bf y})p({\bf y})d{\bf y}\\
&= \int \mathcal{N}({\bf t}|{\bf y},\beta^{-1}{\bf I}_N)\mathcal{N}({\bf y}|{\bf 0},K)d{\bf y}\\
&= \mathcal{N}({\bf t}|{\bf 0},{\bf C}_N)
\end{align}

where ${\bf C}_N = K + \beta^{-1}{\bf I}_N$.

Let $t_{N+1}$ be the new target to be predicted in addition to the $N$ observations ${\bf t}$. From the discussion above, the joint distribution of ${\bf t}$ and $t_{N+1}$ is

p({\bf t},t_{N+1}) = \mathcal{N}({\bf t},t_{N+1}|{\bf 0}, {\bf C}_{N+1})

where, with ${\bf k} = \left(k(x_1, x_{N+1}), \dots, k(x_N, x_{N+1})\right)^{\rm T}$ and $c = k(x_{N+1}, x_{N+1}) + \beta^{-1}$, the covariance matrix is

{\bf C}_{N+1} =
\begin{bmatrix}
{\bf C}_N & {\bf k}\\
{\bf k}^{\rm T} & c
\end{bmatrix}

Since we now know the joint distribution of ${\bf t}$ and $t_{N+1}$, the conditional distribution $p(t_{N+1}|{\bf t})$ follows from the standard formulas for conditioning a Gaussian:

p(t_{N+1}|{\bf t}) = \mathcal{N}(t_{N+1}|{\bf k}^{\rm T}{\bf C}_N^{-1}{\bf t},c-{\bf k}^{\rm T}{\bf C}_N^{-1}{\bf k})
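
As a minimal sketch of how these formulas translate into NumPy (the class-based implementation below wraps the same computation; the function and variable names here are only illustrative):

import numpy as np

def gp_predict(x_train, t_train, x_new, kernel, beta):
    # C_N = K + beta^{-1} I  (PRML eq. 6.62)
    K = kernel(x_train[:, None], x_train[None, :])
    C_N = K + np.eye(len(x_train)) / beta
    # k and c for the new input  (PRML eq. 6.65)
    k = kernel(x_train, x_new)
    c = kernel(x_new, x_new) + 1 / beta
    # Predictive mean and variance  (PRML eqs. 6.66, 6.67)
    mean = k @ np.linalg.solve(C_N, t_train)
    var = c - k @ np.linalg.solve(C_N, k)
    return mean, var

Here kernel could be, for example, lambda x, y: np.exp(-0.5 * (x - y) ** 2), and beta is the noise precision.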

Learning hyperparameters

The discussion above assumed a fixed kernel function $k(x, x')$; for example, the constants $a, b$ in the kernel $k(x, x') = a\exp(-b(x - x')^2)$ were fixed. How, then, should the constants used in the kernel function be chosen? Estimating them corresponds to the hyperparameter estimation implemented in PRML Chapter 3, where the evidence function $p({\bf t}|\theta)$ was maximized. Here $\theta$ denotes the kernel parameters, for example $\theta = (a, b)^{\rm T}$. The log evidence function is

\ln p({\bf t}|\theta) = -{1\over2}\ln |{\bf C}_N| - {1\over2}{\bf t}^{\rm T}{\bf C}_N^{-1}{\bf t} - {N\over2}\ln(2\pi)

Differentiating this with respect to the parameters gives

{\partial\over\partial\theta_i}\ln p({\bf t}|\theta) = -{1\over2}{\rm Tr}({\bf C}_N^{-1}{\partial{\bf C}_N\over\partial\theta_i}) + {1\over2}{\bf t}^{\rm T}{\bf C}_N^{-1}{\partial{\bf C}_N\over\partial\theta_i}{\bf C}_N^{-1}{\bf t}

I updated the parameters with the gradient method, i.e. gradient ascent on the log evidence.
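
The implementation below uses only this gradient, but it can be useful to evaluate the log evidence itself, for example to monitor convergence. A minimal sketch of the log evidence above, assuming ${\bf C}_N$ and ${\bf t}$ have already been computed (the function name is only illustrative):

import numpy as np

def log_evidence(C_N, t):
    # ln p(t|theta) = -1/2 ln|C_N| - 1/2 t^T C_N^{-1} t - N/2 ln(2 pi)
    _, logdet = np.linalg.slogdet(C_N)      # numerically stable log-determinant
    quad = t @ np.linalg.solve(C_N, t)      # quadratic form t^T C_N^{-1} t
    return -0.5 * logdet - 0.5 * quad - 0.5 * len(t) * np.log(2 * np.pi)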

Implementation

Library

import matplotlib.pyplot as plt
import numpy as np

Two libraries are imported: matplotlib for drawing graphs and numpy for matrix calculations.

Kernel function

This time I used $a\exp\left(-{b\over2}(x - x')^2\right)$ as the kernel function.

# Kernel function class
class GaussianKernel(object):
    # Initialize the kernel function parameters a, b
    def __init__(self, params):
        assert np.shape(params) == (2,)
        self.__params = params

    # Return the kernel function parameters a, b
    def get_params(self):
        return np.copy(self.__params)

    # Evaluate the kernel function for inputs x, y  PRML eq. (6.63)
    def __call__(self, x, y):
        return self.__params[0] * np.exp(-0.5 * self.__params[1] * (x - y) ** 2)

    # Derivatives of the kernel with respect to its parameters, for inputs x, y
    def derivatives(self, x, y):
        sq_diff = (x - y) ** 2
        # Derivative with respect to parameter a
        delta_0 = np.exp(-0.5 * self.__params[1] * sq_diff)
        # Derivative with respect to parameter b
        delta_1 = -0.5 * sq_diff * delta_0 * self.__params[0]
        return (delta_0, delta_1)

    # Update the kernel function parameters
    def update_parameters(self, updates):
        assert np.shape(updates) == (2,)
        self.__params += updates

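As a quick check of this class, a Gram matrix over a few training inputs can be built by evaluating the kernel on every pair of points via np.meshgrid, which is exactly how the fit method below uses it (the input values here are only illustrative):

kernel = GaussianKernel(params=np.array([1., 1.]))
x = np.array([0.1, 0.4, 0.9])
# meshgrid enumerates all (x_n, x_m) pairs, so this is the 3x3 Gram matrix
Gram = kernel(*np.meshgrid(x, x))
print(Gram.shape)  # (3, 3)
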
Gaussian process regression

# Class that performs Gaussian process regression
class GaussianProcessRegression(object):
    # Set the kernel function and the noise precision parameter
    def __init__(self, kernel, beta=1.):
        self.kernel = kernel
        self.beta = beta

    # Regression without estimating the kernel function parameters
    def fit(self, x, t):
        self.x = x
        self.t = t
        # Gram matrix  PRML eq. (6.54)
        Gram = self.kernel(*np.meshgrid(x, x))
        # Covariance matrix C_N  PRML eq. (6.62)
        self.covariance = Gram + np.identity(len(x)) / self.beta
        # Precision matrix (inverse of the covariance matrix)
        self.precision = np.linalg.inv(self.covariance)

    # Regression including estimation of the kernel function parameters
    def fit_kernel(self, x, t, learning_rate=0.1, iter_max=10000):
        for i in range(iter_max):
            params = self.kernel.get_params()
            # Regression with the current kernel parameters
            self.fit(x, t)
            # Derivatives of the Gram matrix with respect to the kernel parameters
            gradients = self.kernel.derivatives(*np.meshgrid(x, x))
            # Parameter updates from the gradient of the log evidence  PRML eq. (6.70)
            updates = np.array(
                [-np.trace(self.precision.dot(grad)) + t.dot(self.precision.dot(grad).dot(self.precision).dot(t)) for grad in gradients])
            # Update the parameters
            self.kernel.update_parameters(learning_rate * updates)
            # Stop updating once the parameter change is small
            if np.allclose(params, self.kernel.get_params()):
                break
        else:
            # Printed if the parameters have not converged within the given number of iterations
            print("parameters may not have converged")

    # Return the predictive distribution
    def predict_dist(self, x):
        K = self.kernel(*np.meshgrid(x, self.x, indexing='ij'))
        # Mean of the predictive distribution  PRML eq. (6.66)
        mean = K.dot(self.precision).dot(self.t)
        # Variance of the predictive distribution  PRML eq. (6.67)
        var = self.kernel(x, x) + 1 / self.beta - np.sum(K.dot(self.precision) * K, axis=1)
        return mean.ravel(), np.sqrt(var.ravel())

Main function

Replacing regression.fit_kernel(...) with regression.fit(x, t) performs the regression without estimating the kernel parameters. The helper create_toy_data used here is defined in the full code below.

def main():
    # Function that the training data follows
    def func(x):
        return np.sin(2 * np.pi * x)
    # Generate the training data
    x, t = create_toy_data(func, high=0.7, std=0.1)

    # Set up the kernel function with its initial parameters
    kernel = GaussianKernel(params=np.array([1., 1.]))
    # Set up Gaussian process regression with the kernel and the noise precision parameter (true value)
    regression = GaussianProcessRegression(kernel=kernel, beta=100)
    # Regression including estimation of the kernel parameters; replace with fit(x, t) to skip the estimation
    regression.fit_kernel(x, t, learning_rate=0.1, iter_max=10000)

    # Output the predictive distribution for the test data
    x_test = np.linspace(0, 1, 100)
    y, y_std = regression.predict_dist(x_test)

    # Plot the regression results
    plt.scatter(x, t, alpha=0.5, color="blue", label="observation")
    plt.plot(x_test, func(x_test), color="blue", label="sin$(2\pi x)$")
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend(loc="lower left")
    plt.show()

Whole code

gaussian_process_regression.py


import matplotlib.pyplot as plt
import numpy as np


class GaussianKernel(object):

    def __init__(self, params):
        assert np.shape(params) == (2,)
        self.__params = params

    def get_params(self):
        return np.copy(self.__params)

    def __call__(self, x, y):
        return self.__params[0] * np.exp(-0.5 * self.__params[1] * (x - y) ** 2)

    def derivatives(self, x, y):
        sq_diff = (x - y) ** 2
        delta_0 = np.exp(-0.5 * self.__params[1] * sq_diff)
        delta_1 = -0.5 * sq_diff * delta_0 * self.__params[0]
        return (delta_0, delta_1)

    def update_parameters(self, updates):
        assert np.shape(updates) == (2,)
        self.__params += updates


class GaussianProcessRegression(object):

    def __init__(self, kernel, beta=1.):
        self.kernel = kernel
        self.beta = beta

    def fit(self, x, t):
        self.x = x
        self.t = t
        Gram = self.kernel(*np.meshgrid(x, x))
        self.covariance = Gram + np.identity(len(x)) / self.beta
        self.precision = np.linalg.inv(self.covariance)

    def fit_kernel(self, x, t, learning_rate=0.1, iter_max=10000):
        for i in range(iter_max):
            params = self.kernel.get_params()
            self.fit(x, t)
            gradients = self.kernel.derivatives(*np.meshgrid(x, x))
            updates = np.array(
                [-np.trace(self.precision.dot(grad)) + t.dot(self.precision.dot(grad).dot(self.precision).dot(t)) for grad in gradients])
            self.kernel.update_parameters(learning_rate * updates)
            if np.allclose(params, self.kernel.get_params()):
                break
        else:
            print "parameters may not have converged"

    def predict_dist(self, x):
        K = self.kernel(*np.meshgrid(x, self.x, indexing='ij'))
        mean = K.dot(self.precision).dot(self.t)
        var = self.kernel(x, x) + 1 / self.beta - np.sum(K.dot(self.precision) * K, axis=1)
        return mean.ravel(), np.sqrt(var.ravel())


def create_toy_data(func, low=0, high=1., n=10, std=1.):
    x = np.random.uniform(low, high, n)
    t = func(x) + np.random.normal(scale=std, size=n)
    return x, t


def main():

    def func(x):
        return np.sin(2 * np.pi * x)

    x, t = create_toy_data(func, high=0.7, std=0.1)

    kernel = GaussianKernel(params=np.array([1., 1.]))
    regression = GaussianProcessRegression(kernel=kernel, beta=100)
    regression.fit_kernel(x, t, learning_rate=0.1, iter_max=10000)

    x_test = np.linspace(0, 1, 100)
    y, y_std = regression.predict_dist(x_test)

    plt.scatter(x, t, alpha=0.5, color="blue", label="observation")
    plt.plot(x_test, func(x_test), color="blue", label="sin$(2\pi x)$")
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend(loc="lower left")
    plt.show()

if __name__ == '__main__':
    main()

Result

Running this produces a figure like Figure 6.8 in PRML. The variance of the predictive distribution is large in regions without training data and small in regions with training data. (result_mle.png)

For comparison, the figure below shows the result on the same training data without maximum likelihood estimation of the hyperparameters. The predictive distribution above fits the training data (and the test data) better. (result.png)

In closing

In the end, what we are doing is similar to Bayesian linear regression, except that the feature vectors never appear explicitly: the kernel trick lets us regress using only their inner products. Although it is not covered in PRML, Gaussian processes are also used in Bayesian optimization, a method that has been attracting attention recently and is used to tune parameters such as neural network learning rates. I would like to implement Bayesian optimization if I get the chance.
