PRML Chapter 3 Evidence Approximation Python Implementation

This time, I implemented the evidence approximation that was briefly mentioned at the end of my first PRML implementation. This method is also called empirical Bayes, type 2 maximum likelihood estimation, and so on. When we use machine learning algorithms, we often have to set hyperparameters. With this method, hyperparameters that until now had to be chosen somehow by hand are determined automatically from the data.

Bayesian linear regression

Before we get into the explanation of evidence approximation, let's explain Bayesian linear regression.

Maximum likelihood estimation

Given training data $\{x_i, t_i\}_{i=1}^N$, we prepare feature vectors ${\bf\phi}$ (for example ${\bf\phi}(x) = (1, x, x^2, x^3)^{\rm T}$) and define

{\bf t} = 
\begin{bmatrix}
t_1\\
t_2\\
\vdots\\
t_N
\end{bmatrix}\\
{\bf\Phi} =
\begin{bmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_M(x_1)\\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_M(x_2)\\
\vdots & \vdots & \ddots & \vdots\\
\phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_M(x_N)
\end{bmatrix}\\
{\bf w} =
\begin{bmatrix}
w_1\\
w_2\\
\vdots\\
w_M
\end{bmatrix}\\
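For example, with polynomial features $\phi_j(x) = x^{j-1}$ up to degree 3 (so $M = 4$) and $N = 3$ inputs, the design matrix becomes

{\bf\Phi} =
\begin{bmatrix}
1 & x_1 & x_1^2 & x_1^3\\
1 & x_2 & x_2^2 & x_2^3\\
1 & x_3 & x_3^2 & x_3^3
\end{bmatrix}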

Maximum likelihood estimation then maximizes the following likelihood function with respect to the parameter ${\bf w}$, where ${\bf I}$ is the $N\times N$ identity matrix and $\beta$ is the precision parameter.

p({\bf t}|{\bf w}) = \mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I})
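As a reference, here is a minimal numpy sketch of the maximum likelihood solution under this model; `Phi` and `t` stand for the design matrix and target vector defined above, and this helper is not part of the implementation further below.

import numpy as np

def ml_estimate(Phi, t):
    """Maximum likelihood estimates: w_ML by least squares, beta_ML from the residuals."""
    w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]        # minimizes ||t - Phi w||^2
    beta_ml = 1.0 / np.mean((t - Phi.dot(w_ml)) ** 2)    # 1/beta_ML = mean squared residual
    return w_ml, beta_ml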

Posterior distribution

Since maximum likelihood estimation can overfit the training data, we place a prior distribution on the parameter ${\bf w}$. Assuming a Gaussian distribution with mean 0 and variance $\alpha^{-1}$ as the prior, Bayes' theorem gives

\begin{align}
p({\bf w}|{\bf\Phi},{\bf t}, \alpha, \beta) &= {\mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I}) \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I})\over p({\bf t}|{\bf\Phi}, \alpha, \beta)}\\
&= \mathcal{N}({\bf w}|{\bf m}_N, {\bf S}_N)
\end{align}

as the posterior distribution of the parameter ${\bf w}$, where

{\bf m}_N = \beta {\bf S}_N{\bf\Phi}^{\rm T}{\bf t}\\
{\bf S}_N^{-1} = \alpha{\bf I} + \beta{\bf\Phi}^{\rm T}{\bf\Phi}
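A minimal numpy sketch of these two formulas, assuming `Phi` is the $N\times M$ design matrix and `t` the target vector (the `fit` method further below computes the same quantities):

import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior mean m_N and covariance S_N of the weights."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T.dot(Phi))  # S_N^{-1} = alpha I + beta Phi^T Phi
    m_N = beta * S_N.dot(Phi.T).dot(t)                              # m_N = beta S_N Phi^T t
    return m_N, S_N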

Posterior predictive distribution

Using the posterior distribution obtained above, the posterior predictive distribution of $t$ for a new input $x$ is

\begin{align}
p(t|x,{\bf\Phi},{\bf t},\alpha,\beta) &= \int \mathcal{N}(t|{\bf w}^{\rm T}{\bf\phi}(x), \beta^{-1}) \mathcal{N}({\bf w}|{\bf m}_N, {\bf S}_N) {\rm d}{\bf w}\\
&= \mathcal{N}(t|{\bf m}_N^{\rm T}{\bf\phi}(x), \sigma^2_N(x))
\end{align}

which is obtained by integrating out the parameter ${\bf w}$. Its variance is

\sigma^2_N(x) = {1\over\beta} + \phi(x)^{\rm T}{\bf S}_N\phi(x)
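A corresponding sketch of the predictive mean and variance at a single new input, where `phi_x` is assumed to be the feature vector ${\bf\phi}(x)$ of that input:

def predict_one(phi_x, m_N, S_N, beta):
    """Posterior predictive mean and variance at one input."""
    mean = m_N.dot(phi_x)                               # m_N^T phi(x)
    var = 1.0 / beta + phi_x.dot(S_N).dot(phi_x)        # sigma_N^2(x)
    return mean, var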

Using the posterior predictive distribution obtained in this way, Bayesian linear regression becomes possible. The behavior of this regression depends on the hyperparameters $\alpha, \beta$; if the values we set are too large or too small, the regression result may not be what we want.

Evidence approximation

Above, the hyperparameters $\alpha, \beta$ were values that a human had to choose, but here we estimate them from the data. When finding the posterior distribution of the parameter ${\bf w}$, we used Bayes' theorem:

p({\bf w}|{\bf\Phi},{\bf t}, \alpha, \beta) = {\mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I}) \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I})\over p({\bf t}|{\bf\Phi}, \alpha, \beta)}

When computing the posterior distribution, we often focus only on the numerator on the right-hand side, but the denominator $p({\bf t}|{\bf\Phi}, \alpha, \beta)$ is called the evidence function, and the following relationship holds.

(Likelihood function)\times(Prior distribution)=(Evidence function)\times(Posterior distribution)
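Equivalently, the evidence function is the marginal likelihood obtained by integrating the parameter ${\bf w}$ out of the numerator:

p({\bf t}|{\bf\Phi},\alpha,\beta) = \int \mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I})\,\mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I})\,{\rm d}{\bf w}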

Considering what $p({\bf t}|{\bf\Phi}, \alpha, \beta)$ means, the evidence function indicates how readily the given hyperparameters $\alpha, \beta$ generate the training data ${\bf t}$. If the value of the evidence function is small, those $\alpha, \beta$ are unlikely to generate the data; conversely, if it is large, they generate the data easily. In other words, **the larger the value of the evidence function, the better**. Substituting the hyperparameter values ${\hat\alpha}, {\hat\beta}$ that maximize the evidence function into the formula for the posterior predictive distribution gives the approximation

p(t|x,{\bf\Phi},{\bf t}) \approx p(t|x,{\bf\Phi},{\bf t},{\hat\alpha},{\hat\beta})

This is the evidence approximation.

Maximizing the evidence

We look for the $\alpha, \beta$ that maximize the evidence function. With $\lambda_i$ denoting the eigenvalues of ${\bf\Phi}^{\rm T}{\bf\Phi}$, we update $\gamma$, $\alpha$ and $\beta$ as follows.

  1. \gamma=\sum_i{\beta\lambda_i\over\alpha+\beta\lambda_i}
  2. \alpha={\gamma\over{\bf m}_N^{\rm T}{\bf m}_N},\quad \beta={N-\gamma\over\sum_n\left\{t_n - {\bf m}_N^{\rm T}{\bf\phi}(x_n)\right\}^2}

By repeating these two steps, $\alpha$ and $\beta$ converge to the values ${\hat\alpha}, {\hat\beta}$ that maximize the evidence function. See PRML for the derivation of these formulas.
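A minimal sketch of one update cycle, assuming a design matrix `Phi`, targets `t`, and current values `alpha`, `beta` (the `fit` method of the `EvidenceApproximation` class below iterates essentially this until convergence):

import numpy as np

def update_hyperparameters(Phi, t, alpha, beta):
    """One fixed-point update of alpha and beta (cf. PRML section 3.5.2)."""
    N, M = Phi.shape
    eigvals = np.linalg.eigvalsh(Phi.T.dot(Phi))                   # lambda_i of Phi^T Phi
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T.dot(Phi))
    m_N = beta * S_N.dot(Phi.T).dot(t)
    gamma = np.sum(beta * eigvals / (alpha + beta * eigvals))      # effective number of parameters
    alpha_new = gamma / m_N.dot(m_N)
    beta_new = (N - gamma) / np.sum((t - Phi.dot(m_N)) ** 2)
    return alpha_new, beta_new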

Implementation

I am reusing the code I wrote for Bayesian curve fitting.

Design matrix

The PolynomialFeatures class transforms the input vector $(x_1, \cdots, x_N)^{\rm T}$ into the design matrix ${\bf\Phi}$ according to the specified degree.

import matplotlib.pyplot as plt
import numpy as np


class PolynomialFeatures(object):

    def __init__(self, degree):
        assert type(degree) is int, "%s is not int" % type(degree)
        self.degree = degree

    def transform(self, x):
        # design matrix whose columns are x^0, x^1, ..., x^degree
        features = [x ** i for i in range(self.degree + 1)]
        return np.array(features).transpose()
| PolynomialFeatures method | Description |
|---|---|
| `__init__` | Set the degree of the polynomial features |
| `transform` | Convert the input into the design matrix |
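For example (a hypothetical usage, assuming the class above has been defined):

features = PolynomialFeatures(degree=3)
X = features.transform(np.array([1., 2., 3.]))
# X is the 3x4 design matrix
# [[ 1.  1.  1.  1.]
#  [ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]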

Bayesian linear regression

The BayesianRegression class computes the posterior distribution and the posterior predictive distribution as described in the Bayesian linear regression section above.

class BayesianRegression(object):

    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        # posterior covariance S_N = (alpha I + beta Phi^T Phi)^{-1}
        self.w_var = np.linalg.inv(
            self.alpha * np.identity(np.size(X, 1))
            + self.beta * X.T.dot(X))
        # posterior mean m_N = beta S_N Phi^T t
        self.w_mean = self.beta * self.w_var.dot(X.T.dot(t))

    def predict(self, X):
        return X.dot(self.w_mean)

    def predict_dist(self, X):
        # predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)
        y = X.dot(self.w_mean)
        y_var = 1 / self.beta + np.sum(X.dot(self.w_var) * X, axis=1)
        y_std = np.sqrt(y_var)
        return y, y_std
| BayesianRegression method | Description |
|---|---|
| `__init__` | Set the values of $\alpha, \beta$ |
| `fit` | Compute the mean and variance of the posterior distribution of the parameter ${\bf w}$ |
| `predict` | Compute the mode of the posterior predictive distribution |
| `predict_dist` | Compute the mode and standard deviation of the posterior predictive distribution |
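A usage sketch with hand-picked hyperparameters (`x`, `t` and the values of `alpha`, `beta` here are placeholders):

X = PolynomialFeatures(degree=3).transform(x)
regression = BayesianRegression(alpha=1e-3, beta=2.)
regression.fit(X, t)                     # posterior of w under the fixed alpha, beta
y, y_std = regression.predict_dist(X)    # predictive mean and standard deviation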

Evidence approximation

The evidence function is maximized by alternately updating the parameter ${\bf w}$ and the hyperparameters $\alpha, \beta$. Updating the parameter ${\bf w}$ is exactly what the Bayesian curve fitting above does, so that method is reused via class inheritance. The evidence method also computes the value of the log evidence function; by comparing this value, the best model can be selected automatically from several candidates.

class EvidenceApproximation(BayesianRegression):

    def __init__(self, iter_max=100, alpha=1., beta=1.):
        self.iter_max = iter_max
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        M = X.T.dot(X)
        eigenvalues = np.linalg.eigvalsh(M)
        for i in range(self.iter_max):
            params = [self.alpha, self.beta]
            # update the posterior of w with the current alpha, beta
            super(EvidenceApproximation, self).fit(X, t)
            # re-estimate alpha and beta from the data
            self.gamma = np.sum(self.beta * eigenvalues / (self.alpha + self.beta * eigenvalues))
            self.alpha = self.gamma / self.w_mean.dot(self.w_mean)
            self.beta = (len(t) - self.gamma) / np.sum((t - X.dot(self.w_mean)) ** 2)
            if np.allclose(params, [self.alpha, self.beta]):
                break
        super(EvidenceApproximation, self).fit(X, t)

    def evidence(self, X, t):
        # log evidence function (PRML eq. 3.86) up to an additive constant
        A = self.alpha * np.identity(np.size(X, 1)) + self.beta * X.T.dot(X)
        return 0.5 * (np.size(X, 1) * np.log(self.alpha)
                      + len(t) * np.log(self.beta)
                      - self.beta * np.sum((t - X.dot(self.w_mean)) ** 2)
                      - self.alpha * self.w_mean.dot(self.w_mean)
                      - np.linalg.slogdet(A)[1])
| EvidenceApproximation method | Description |
|---|---|
| `__init__` | Set the initial values of $\alpha, \beta$ and the maximum number of update iterations |
| `fit` | Estimate $\alpha, \beta$ and compute the mean and variance of the posterior distribution of the parameter ${\bf w}$ |
| `evidence` | Compute the value of the log evidence function |
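A usage sketch (`x` and `t` are placeholders; the `alpha` and `beta` passed to the constructor are only initial values and are re-estimated by fit):

X = PolynomialFeatures(degree=3).transform(x)
regression = EvidenceApproximation(alpha=100., beta=100.)
regression.fit(X, t)
print(regression.alpha, regression.beta)   # hyperparameters estimated from the data
print(regression.evidence(X, t))           # log evidence, used for model comparison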

Whole code

Generate data points that follow the cubic polynomial $x(x-5)(x+5)$, perform curve fitting with the evidence approximation for polynomial models of degree 0 through 7, and compute the posterior predictive distribution using the model whose model evidence is largest.

evidence_approximation.py


import matplotlib.pyplot as plt
import numpy as np


class PolynomialFeatures(object):

    def __init__(self, degree):
        assert type(degree) is int, "%s is not int" % type(degree)
        self.degree = degree

    def transform(self, x):
        features = [x ** i for i in range(self.degree + 1)]
        return np.array(features).transpose()


class BayesianRegression(object):

    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        self.w_var = np.linalg.inv(
            self.alpha * np.identity(np.size(X, 1))
            + self.beta * X.T.dot(X))
        self.w_mean = self.beta * self.w_var.dot(X.T.dot(t))

    def predict(self, X):
        return X.dot(self.w_mean)

    def predict_dist(self, X):
        y = X.dot(self.w_mean)
        y_var = 1 / self.beta + np.sum(X.dot(self.w_var) * X, axis=1)
        y_std = np.sqrt(y_var)
        return y, y_std


class EvidenceApproximation(BayesianRegression):

    def __init__(self, iter_max=100, alpha=1., beta=1.):
        self.iter_max = iter_max
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        M = X.T.dot(X)
        eigenvalues = np.linalg.eigvalsh(M)
        for i in range(self.iter_max):
            params = [self.alpha, self.beta]
            super(EvidenceApproximation, self).fit(X, t)
            self.gamma = np.sum(self.beta * eigenvalues / (self.alpha + self.beta * eigenvalues))
            self.alpha = self.gamma / self.w_mean.dot(self.w_mean)
            self.beta = (len(t) - self.gamma) / np.sum((t - X.dot(self.w_mean)) ** 2)
            if np.allclose(params, [self.alpha, self.beta]):
                break
        super(EvidenceApproximation, self).fit(X, t)

    def evidence(self, X, t):
        # log evidence function (PRML eq. 3.86) up to an additive constant
        A = self.alpha * np.identity(np.size(X, 1)) + self.beta * X.T.dot(X)
        return 0.5 * (np.size(X, 1) * np.log(self.alpha)
                      + len(t) * np.log(self.beta)
                      - self.beta * np.sum((t - X.dot(self.w_mean)) ** 2)
                      - self.alpha * self.w_mean.dot(self.w_mean)
                      - np.linalg.slogdet(A)[1])


def create_toy_data(func, low=0, high=1, size=10, sigma=1.):
    x = np.random.uniform(low, high, size)
    t = func(x) + np.random.normal(scale=sigma, size=size)
    return x, t


def main():

    def func(x):
        return x * (x - 5) * (x + 5)

    x, t = create_toy_data(func, low=-5, high=5, size=20, sigma=5.)
    plt.scatter(x, t, s=50, alpha=0.5, label="observation")

    evidences = []
    regressions = []
    for i in range(8):
        features = PolynomialFeatures(degree=i)
        X = features.transform(x)
        regression = EvidenceApproximation(alpha=100., beta=100.)
        regression.fit(X, t)
        evidences.append(regression.evidence(X, t))
        regressions.append(regression)
    degree = np.argmax(evidences)
    regression = regressions[degree]

    x_test = np.linspace(-5, 5, 100)
    X_test = PolynomialFeatures(degree=int(degree)).transform(x_test)
    y, y_std = regression.predict_dist(X_test)

    plt.plot(x_test, func(x_test), color="blue", label="x(x-5)(x+5)")
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend()
    plt.show()

    plt.plot(evidences)
    plt.title("Model evidence")
    plt.xlabel("M")
    plt.ylabel("evidence")
    plt.show()


if __name__ == '__main__':
    main()

  1. Define the cubic polynomial $x(x-5)(x+5)$ (def func())
  2. Create noisy data points that follow the function (x, t = create_toy_data(...))
  3. Repeat steps 4 to 6 for $i = 0, \cdots, 7$
  4. Create the design matrix of $i$-th order polynomial features (X = features.transform(x))
  5. Set the initial values of $\alpha, \beta$ to 100 (values that would be useless if the evidence function were not maximized) and perform curve fitting by evidence approximation (regression.fit(X, t))
  6. Compute and store the value of the log evidence function from step 5 (evidences.append(regression.evidence(X, t)))
  7. Get the degree of the model with the largest log evidence (degree = np.argmax(evidences))
  8. Get the regression result at that degree (regression = regressions[degree])
  9. Plot the above results

Result

As shown in Figure 3.14 of PRML, the value of the log evidence function is largest when a cubic polynomial is used as the model.

model_evidence.png

The predictive distribution from the cubic polynomial model also looks good. Since the initial values of the hyperparameters $\alpha, \beta$ were both set to 100, ordinary Bayesian curve fitting would have produced a posterior predictive distribution that does not fit the data at all: with $\alpha$ too large, each component of ${\bf w}$ would be pushed close to 0, and with $\beta$ too large, the variance of the predictive distribution would be too small. Because we estimated $\alpha, \beta$ by maximizing the evidence function, the posterior predictive distribution fits the data.

predictive_distribution.png

In closing

Using this evidence approximation, the hyperparameters were estimated well even though poor values were deliberately set as their initial values. Until now I searched for hyperparameters by grid search, holding out part of the training data for validation, but with the evidence approximation the hyperparameters can be estimated from all of the training data in a small number of iterations. This evidence approximation is used not only for linear regression models but also for logistic regression and for estimating the hyperparameters of neural networks. If I get the chance, I will implement those as well.
