PRML Chapter 1 Bayesian Curve Fitting Python Implementation

I decided to pick up some of the methods introduced in PRML and implement them in Python as a study exercise. The ground rule is that, apart from the Python standard library, only numpy (and, where needed, scipy) may be used for the algorithm itself. matplotlib and the like are exceptions, since they have nothing to do with the algorithmic part.

Bayesian curve fitting

Bayesian curve fitting is a method that treats curve fitting in a fully Bayesian manner. In the predictive distribution obtained by maximum likelihood estimation, the variance is the same at every point, whereas Bayesian curve fitting derives the predictive distribution in a Bayesian way (using the sum and product rules of probability) from the posterior distribution of the parameters. This yields a variance at each point, telling you how confident the model is about its prediction there.
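To make the contrast concrete (notation follows PRML; $y(x, {\bf w}_{\rm ML})$ denotes the curve fitted by maximum likelihood): the maximum likelihood predictive distribution is

$$ p(t|x, {\bf w}_{\rm ML}, \beta_{\rm ML}) = \mathcal{N}(t|y(x, {\bf w}_{\rm ML}), \beta_{\rm ML}^{-1}) $$

whose variance $\beta_{\rm ML}^{-1}$ does not depend on $x$, while the Bayesian predictive distribution derived below has the form $\mathcal{N}(t|m(x), s^2(x))$ with an $x$-dependent variance $s^2(x)$.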

Implementation

As stated at the very beginning, the algorithm implementation uses numpy.

import matplotlib.pyplot as plt
import numpy as np

Calculation of design matrix

{\bf\Phi} = \begin{bmatrix}
\phi_0(x_0) & \phi_1(x_0) & \cdots & \phi_{M-1}(x_0) \\
\phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(x_{N-1}) & \phi_1(x_{N-1}) & \cdots & \phi_{M-1}(x_{N-1})
\end{bmatrix}
class PolynomialFeatures(object):

    def __init__(self, degree):
        self.degree = degree

    def transform(self, x):
        # Each column is x raised to a power 0, 1, ..., degree,
        # so each row is the feature vector phi(x_n).
        features = [x ** i for i in range(self.degree + 1)]
        return np.array(features).transpose()
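For instance (an illustrative check, not part of the original script), degree-2 features for three input points give a 3-by-3 design matrix:

features = PolynomialFeatures(degree=2)
Phi = features.transform(np.array([0.0, 0.5, 1.0]))
# Phi == [[1.  0.   0.  ]
#         [1.  0.5  0.25]
#         [1.  1.   1.  ]]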

Bayesian curve fitting estimation and prediction

First, estimate the posterior distribution of the parameters ${\bf w}$:

$$ p({\bf w}|{\bf x}, {\bf t}, \alpha, \beta) \propto p({\bf t}|{\bf\Phi}{\bf w}, \beta)\,p({\bf w}|\alpha) $$

Here, ${\bf x}$ and ${\bf t}$ are the training set. Both distributions on the right-hand side are assumed to be Gaussian, with precision parameters $\beta$ (likelihood) and $\alpha$ (prior) respectively, so the distribution on the left-hand side is also Gaussian:

$$ p({\bf w}|{\bf x}, {\bf t}, \alpha, \beta) = \mathcal{N}({\bf m}, {\bf S}) $$

where

$$ {\bf m} = \beta{\bf S}{\bf\Phi}^{\rm T}{\bf t},~~{\bf S}^{-1} = \alpha{\bf I} + \beta{\bf\Phi}^{\rm T}{\bf\Phi} $$

Using this posterior distribution, the probability distribution of $t$ at a new point $x$ is

$$ p(t|x,{\bf x}, {\bf t}, \alpha, \beta) = \int p(t|x,{\bf w},\beta)\,p({\bf w}|{\bf x}, {\bf t}, \alpha, \beta)\,{\rm d}{\bf w} $$

which again turns out to be a Gaussian distribution:

$$ p(t|x,{\bf x}, {\bf t}, \alpha, \beta) = \mathcal{N}\!\left(t\,\middle|\,{\bf m}^{\rm T}{\bf\phi}(x),~{1\over\beta}+{\bf\phi}(x)^{\rm T}{\bf S}{\bf\phi}(x)\right) $$

class BayesianRegression(object):

    def __init__(self, alpha=0.1, beta=0.25):
        self.alpha = alpha  # precision of the prior over w
        self.beta = beta    # precision of the observation noise

    def fit(self, X, t):
        # S = (alpha I + beta Phi^T Phi)^{-1}
        self.w_var = np.linalg.inv(
            self.alpha * np.identity(np.size(X, 1))
            + self.beta * X.T.dot(X))
        # m = beta S Phi^T t
        self.w_mean = self.beta * self.w_var.dot(X.T.dot(t))

    def predict(self, X):
        # Predictive mean m^T phi(x) and variance 1/beta + phi(x)^T S phi(x)
        y = X.dot(self.w_mean)
        y_var = 1 / self.beta + np.sum(X.dot(self.w_var) * X, axis=1)
        y_std = np.sqrt(y_var)
        return y, y_std
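A minimal usage sketch (the data here is made up purely for illustration):

X = PolynomialFeatures(degree=1).transform(np.array([0.0, 1.0]))
t = np.array([0.0, 1.0])
regression = BayesianRegression(alpha=0.1, beta=0.25)
regression.fit(X, t)
y, y_std = regression.predict(X)  # per-point predictive mean and standard deviation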

Whole code

The code below is the full implementation, including the classes above. The general flow is:

  1. Create training data (x, t = create_toy_data(func, ...))
  2. Compute the posterior distribution of the weight parameters ${\bf w}$ (regression.fit(X, t))
  3. Compute the predictive distribution using that posterior (y, y_std = regression.predict(X_test))
  4. Plot the results

bayesian_curve_fitting.py


import matplotlib.pyplot as plt
import numpy as np


class PolynomialFeatures(object):

    def __init__(self, degree):
        self.degree = degree

    def transform(self, x):
        features = [x ** i for i in range(self.degree + 1)]
        return np.array(features).transpose()


class BayesianRegression(object):

    def __init__(self, alpha=0.1, beta=0.25):
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        self.w_var = np.linalg.inv(
            self.alpha * np.identity(np.size(X, 1))
            + self.beta * X.T.dot(X))
        self.w_mean = self.beta * self.w_var.dot(X.T.dot(t))

    def predict(self, X):
        y = X.dot(self.w_mean)
        y_var = 1 / self.beta + np.sum(X.dot(self.w_var) * X, axis=1)
        y_std = np.sqrt(y_var)
        return y, y_std


def create_toy_data(func, low=0, high=1, size=10, sigma=1.):
    # Sample x uniformly and add Gaussian noise to func(x)
    x = np.random.uniform(low, high, size)
    t = func(x) + np.random.normal(scale=sigma, size=size)
    return x, t


def main():

    def func(x):
        return np.sin(2 * np.pi * x)

    x, t = create_toy_data(func, low=0, high=1, size=10, sigma=0.25)
    plt.scatter(x, t, s=50, marker='o', alpha=0.5, label="observation")

    features = PolynomialFeatures(degree=5)

    regression = BayesianRegression(alpha=1e-3, beta=2)
    X = features.transform(x)
    regression.fit(X, t)

    x_test = np.linspace(0, 1, 100)
    X_test = features.transform(x_test)
    y, y_std = regression.predict(X_test)

    plt.plot(x_test, func(x_test), color='blue', label=r"sin($2\pi x$)")
    plt.plot(x_test, y, color='red', label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std,
                     color='pink', alpha=0.5, label="predict_std")
    plt.legend()
    plt.title("Predictive distribution")
    plt.xlabel("x")
    plt.ylabel("t")
    plt.show()


if __name__ == '__main__':
    main()

Result

Bayesian curve fitting was performed on data points (blue) generated by adding noise to the function ${\rm sin}(2\pi x)$. The mean of the predictive distribution at each $x$ is shown in red, and the one-standard-deviation band in pink. The width of the pink band corresponds to the uncertainty of the prediction; had we computed the predictive distribution by maximum likelihood estimation instead, that width would be constant regardless of $x$, which is unintuitive. In this result, the prediction is relatively uncertain in the regions with few data points ($x \le 0.1$, $0.7 \le x \le 0.9$), while in the region $0.4 \le x \le 0.6$ the standard deviation does not grow much even where data points are sparse. Such behavior also depends on what kind of feature vector is used: this time we used polynomial features, but features based on Gaussian functions are also often used.

(Figure: predictive_distribution.png)
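As an example of the latter, here is a minimal sketch of Gaussian basis features that could be swapped in for PolynomialFeatures above (the class name, the choice of centers, and the width parameter are my own, not from the original code):

class GaussianFeatures(object):

    def __init__(self, means, width):
        self.means = means  # centers mu_j of the Gaussian bases
        self.width = width  # common scale s

    def transform(self, x):
        # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), preceded by a bias term
        basis = [np.ones_like(x)]
        basis.extend(np.exp(-0.5 * ((x - m) / self.width) ** 2)
                     for m in self.means)
        return np.array(basis).transpose()

For example, GaussianFeatures(means=np.linspace(0, 1, 9), width=0.1) spreads nine bases evenly over $[0, 1]$.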

Finally

This time I chose hyperparameters $\alpha$ and $\beta$ that seemed to work, but it seems the hyperparameters can be estimated from the data alone using the evidence approximation introduced in Chapter 3 of PRML. I'm thinking of implementing that as well.
