[PYTHON] PRML Chapter 7: Sparse Kernel Machines

Overview

This post implements regression with the relevance vector machine (RVM) from PRML Chapter 7 in Python.

The code and experimental results are summarized in a Jupyter notebook.

RVM model

The conditional probability distribution of the real-valued target variable $t$ given the input vector $\mathbf{x}$ is

p(t|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t|y(\mathbf{x}), \beta^{-1})

Here, $\beta = \sigma^{-2}$ is the noise precision parameter (the reciprocal of the noise variance), and the mean is given by the following linear model.

y(\mathbf{x})=\sum_{i=1}^{M}w_{i}\phi_{i}(\mathbf{x})=\mathbf{w}^{T}\mathbf{\phi}(\mathbf{x})

The basis functions are kernel functions that take an individual training data point as one of their arguments:

y(\mathbf{x})=\sum_{n=1}^{N}w_{n}k(\mathbf{x}, \mathbf{x}_{n})+b
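As a rough illustration (not the notebook's actual code), a minimal sketch of this kernel expansion with a Gaussian (RBF) kernel; the kernel choice and its width `sigma` are assumptions, not fixed by the text.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.3):
    """Gaussian (RBF) kernel; the width sigma is an assumed value."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

def rvm_mean(x, X_train, w, b):
    """y(x) = sum_n w_n k(x, x_n) + b for a single input x."""
    return sum(w_n * gaussian_kernel(x, x_n) for w_n, x_n in zip(w, X_train)) + b
```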

Including the bias, the total number of parameters is $M = N + 1$. The likelihood function is given by the following equation.

p(\mathbf{t}| \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_{n}|\mathbf{x}_{n}, \mathbf{w}, \beta)

A zero-mean Gaussian prior is used for the parameter vector $\mathbf{w}$. The RVM introduces a separate hyperparameter $\alpha_{i}$ for each weight parameter $w_{i}$, so the prior over the weights is as follows.

p(\mathbf{w}|\mathbf{\alpha}) = \prod_{i=1}^{M} \mathcal{N} (w_{i}|0, \alpha_{i}^{-1})

Here, $\alpha_{i}$ is the precision of the corresponding weight parameter $w_{i}$, and $\mathbf{\alpha} = (\alpha_{1}, \dots, \alpha_{M})^{T}$. When the evidence is maximized with respect to these hyperparameters, most of them go to infinity, and the posterior distributions of the corresponding weight parameters concentrate at zero. The basis functions associated with those parameters (the kernel functions measuring distance to the corresponding data points) then play no role in prediction, so they can be pruned and a sparse model is obtained.

The posterior distribution over the weight vector is again Gaussian and takes the following form.

p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \mathbf{\alpha}, \beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}, \mathbf{\Sigma})

Here, the mean and covariance are given by the following equations.

\mathbf{m} = \beta \mathbf{\Sigma} \mathbf{\Phi}^{T} \mathbf{t}
\mathbf{\Sigma} = \left( \mathbf{A} + \beta \mathbf{\Phi}^{T} \mathbf{\Phi} \right)^{-1}

Here, $\mathbf{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{ni} = \phi_{i}(\mathbf{x}_{n}) = k(\mathbf{x}_{n}, \mathbf{x}_{i})$ for $i = 1, \dots, N$ and $\Phi_{nM} = 1$ for $n = 1, \dots, N$, and $\mathbf{A} = \mathrm{diag}(\alpha_{i})$.
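A minimal sketch of these quantities in NumPy; `gaussian_kernel` refers to the sketch above, and the function names here are illustrative rather than the notebook's actual API.

```python
import numpy as np

def design_matrix(X_train, kernel):
    """N x M design matrix: Phi_ni = k(x_n, x_i) plus a final column of ones for the bias."""
    N = len(X_train)
    K = np.array([[kernel(x_n, x_i) for x_i in X_train] for x_n in X_train])
    return np.hstack([K, np.ones((N, 1))])  # shape (N, M) with M = N + 1

def posterior(Phi, t, alpha, beta):
    """Posterior mean m and covariance Sigma of the weights."""
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)  # Sigma = (A + beta Phi^T Phi)^{-1}
    m = beta * Sigma @ Phi.T @ t                                # m = beta Sigma Phi^T t
    return m, Sigma
```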

The values of $\mathbf{\alpha}$ and $\beta$ are obtained by type-2 maximum likelihood estimation, also known as the __evidence approximation__. To carry out type-2 maximum likelihood estimation, the weight parameters are first integrated out.

p(\mathbf{t}|\mathbf{X}, \mathbf{\alpha}, \beta) = \int p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta)p(\mathbf{w}|\mathbf{\alpha}) d\mathbf{w}

Since this equation is the convolution of two Gaussian distributions, the integral can be evaluated analytically, and the log-likelihood is obtained as follows.

\ln p(\mathbf{t}|\mathbf{X}, \mathbf{\alpha}, \beta) = \ln \mathcal{N} (\mathbf{t}|\mathbf{0}, \mathbf{C}) = -\frac{1}{2}\{ N \ln (2 \pi) + \ln |\mathbf{C}| + \mathbf{t}^{T}\mathbf{C}^{-1}\mathbf{t}\}

Here, $\mathbf{t} = (t_{1}, \dots, t_{N})^{T}$, and the $N \times N$ matrix $\mathbf{C}$ is defined as follows.

\mathbf{C} = \beta^{-1}\mathbf{I} + \mathbf{\Phi} \mathbf{A}^{-1} \mathbf{\Phi}^{T}
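For reference, a direct evaluation of this log-likelihood (the evidence) as a sketch; building $\mathbf{C}$ explicitly costs $O(N^3)$, which is fine for a small synthetic data set.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln N(t | 0, C) with C = beta^{-1} I + Phi A^{-1} Phi^T."""
    N = len(t)
    C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)                      # stable log-determinant of C
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))
```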

Setting the derivatives of this log-likelihood to zero yields the following hyperparameter update equations.

\begin{split} \alpha_{i}^{new} &= \frac{\gamma_{i}}{m_{i}^{2}} \\ (\beta^{new})^{-1} &= \frac{\|\mathbf{t} - \mathbf{\Phi m}\|^{2}}{N - \sum_{i}\gamma_{i}} \end{split}

Here, $\gamma_{i}$ measures how well the corresponding weight parameter $w_{i}$ is determined by the data, and is defined by the following equation.

\gamma_{i} = 1 - \alpha_{i} \Sigma_{ii}

Using these results, the hyperparameters are learned as follows. First, the posterior mean $\mathbf{m}$ and covariance $\mathbf{\Sigma}$ are computed from suitably chosen initial values of $\mathbf{\alpha}$ and $\beta$. Next, the hyperparameters are re-estimated from the resulting values. These two steps are repeated alternately until a suitable convergence criterion is met.
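Putting the pieces together, one possible sketch of this alternating re-estimation (reusing the `posterior` helper above); the initial values, iteration cap, tolerance, and the numerical cap on diverging $\alpha$ are all assumptions, not values from the original code.

```python
import numpy as np

def fit_rvm(Phi, t, n_iter=1000, tol=1e-6, alpha_cap=1e10):
    """Alternate between the weight posterior and the alpha/beta re-estimation."""
    N, M = Phi.shape
    alpha, beta = np.ones(M), 1.0                              # assumed initial values
    for _ in range(n_iter):
        m, Sigma = posterior(Phi, t, alpha, beta)              # posterior given current alpha, beta
        gamma = 1.0 - alpha * np.diag(Sigma)                   # gamma_i = 1 - alpha_i Sigma_ii
        alpha_new = np.minimum(gamma / (m ** 2), alpha_cap)    # cap diverging alphas for stability
        beta_new = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
        converged = np.max(np.abs(alpha_new - alpha)) < tol and abs(beta_new - beta) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    m, Sigma = posterior(Phi, t, alpha, beta)
    return alpha, beta, m, Sigma
```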

Experiment

Training data

synthetic_data_sin.png

Results

Design matrix and relevance vectors

design_matrix_relevance_vector.png

For vectors that are irrelevant to the output $\mathbf{t}$, the value of $\alpha$ goes to infinity. In other words, the columns with large values of $\alpha^{-1}$ correspond to the relevance vectors.
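Following this criterion, the relevance vectors can be read off by thresholding $\alpha$; the threshold below is an assumed value.

```python
import numpy as np

def relevance_vector_indices(alpha, threshold=1e9):
    """Indices of training points whose alpha stayed small (i.e. alpha^{-1} is large)."""
    return np.where(alpha[:-1] < threshold)[0]   # the last entry corresponds to the bias term
```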

RVM regression results

RVM_regression.png

Discussion
