In this article, we implement Bayesian principal component analysis. The main use of principal component analysis (PCA) is to find a projection from the observation space (high-dimensional), where the data live, to a latent space (low-dimensional). For visualization the latent space is usually made two- (or three-) dimensional, but when PCA is used for data preprocessing it is rarely obvious how many latent dimensions should be kept. One can compute contribution ratios, but in the end a threshold still has to be chosen by hand. In Bayesian principal component analysis, the dimensionality of the latent space is instead determined automatically via automatic relevance determination.
Interpreting principal component analysis probabilistically is what later allows us to treat it in a Bayesian manner. In probabilistic principal component analysis, an observed data point ${\bf x}$ ($D$-dimensional) is obtained by projecting a latent variable ${\bf z}$ ($M$-dimensional), sampled from the latent space, with a matrix ${\bf W}$ (a $D \times M$ matrix), then translating it by ${\bf \mu}$ ($D$-dimensional) and adding noise ${\bf \epsilon}$ ($D$-dimensional):
{\bf x} = {\bf W}{\bf z} + {\bf \mu} + {\bf \epsilon}
Now assume that ${\bf z}$ and ${\bf \epsilon}$ each follow a Gaussian distribution. There are three parameters to estimate, ${\bf W}$, ${\bf \mu}$, and $\sigma^2$ (the variance of the Gaussian distribution followed by ${\bf \epsilon}$), and the likelihood function is as follows.
p({\bf x}|{\bf W},{\bf \mu},\sigma^2) = \int p({\bf x}|{\bf Wz}+{\bf \mu},\sigma^2)p({\bf z}){\rm d}{\bf z}
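Since both $p({\bf z})$ and the conditional distribution of ${\bf x}$ are Gaussian, this marginalization can be carried out analytically (PRML Eqs. (12.35), (12.36)); the resulting marginal is again Gaussian:

p({\bf x}|{\bf W},{\bf \mu},\sigma^2) = \mathcal{N}({\bf x}|{\bf \mu},{\bf C}),\qquad {\bf C} = {\bf W}{\bf W}^\top + \sigma^2{\bf I}

This is the form that is actually maximized with respect to the parameters.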
PRML introduces two methods for maximizing this likelihood function. The first simply uses an eigenvalue (singular value) decomposition; the second uses the EM algorithm, alternating between updating the posterior distribution over ${\bf z}$ (E step) and maximizing the expected complete-data log likelihood of ${\bf x}, {\bf z}$ (M step).
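For reference, the closed-form maximum likelihood solution implemented below follows PRML Eqs. (12.45) and (12.46): with ${\bf U}_M$ the eigenvectors of the data covariance matrix belonging to the $M$ largest eigenvalues, ${\bf L}_M$ the diagonal matrix of those eigenvalues, and ${\bf R}$ an arbitrary orthogonal matrix (the code simply takes ${\bf R}={\bf I}$),

{\bf W}_{\rm ML} = {\bf U}_M({\bf L}_M - \sigma^2{\bf I})^{1/2}{\bf R},\qquad \sigma^2_{\rm ML} = {1\over D-M}\sum_{i=M+1}^D\lambda_i

where the $\lambda_i$ in the second expression are the discarded (smallest) eigenvalues.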
In the model above, the dimensionality of the latent space was fixed at $M$. Bayesian principal component analysis uses automatic relevance determination to prune the surplus dimensions (although the value of $M$ itself does not actually decrease). To this end, the parameter ${\bf W}$ is given the following prior distribution.
p({\bf W}|{\bf \alpha}) = \prod_{i=1}^M\left({\alpha_i\over2\pi}\right)^{D/2}\exp\left\{-{1\over2}\alpha_i{\bf w}_i^\top {\bf w}_i\right\}
Here ${\bf w}_i$ is the $i$th column vector of ${\bf W}$, and $\alpha_i$ acts as a precision parameter for the corresponding Gaussian distribution. When $\alpha$ is estimated, some of its components take very large values. Since the precision of such a component is very high, the corresponding column vector of ${\bf W}$ can only be zero; in other words, that dimension is pruned.
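In the EM iterations below, each precision $\alpha_i$ is re-estimated from the current ${\bf W}$ using PRML Eq. (12.62), which is exactly what the `self.alpha` update in the code implements:

\alpha_i^{\rm new} = {D\over {\bf w}_i^\top{\bf w}_i}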
As usual, we use only matplotlib and numpy.
import matplotlib.pyplot as plt
import numpy as np
First, the usual method based on eigenvalue decomposition (maximum likelihood PCA).
# Class for principal component analysis
class PCA(object):

    def __init__(self, n_component):
        # dimensionality of the latent space
        self.n_component = n_component

    # principal component analysis by maximum likelihood
    def fit(self, X):
        # PRML Eq. (12.1): maximum likelihood estimate of mu
        self.mean = np.mean(X, axis=0)

        # PRML Eq. (12.2): data covariance matrix
        cov = np.cov(X, rowvar=False)

        # eigenvalue decomposition
        values, vectors = np.linalg.eigh(cov)
        index = np.size(X, 1) - self.n_component

        # PRML Eq. (12.46): maximum likelihood estimate of sigma^2
        if index == 0:
            self.var = 0
        else:
            self.var = np.mean(values[:index])

        # PRML Eq. (12.45): maximum likelihood estimate of the projection matrix W
        self.W = vectors[:, index:].dot(np.sqrt(np.diag(values[index:]) - self.var * np.eye(self.n_component)))
All of the methods below also belong to the PCA class. For comparison, the parameters are first estimated by maximum likelihood, and those estimates are used as initial values for pruning by automatic relevance determination.
    # Bayesian principal component analysis
    def fit_bayesian(self, X, iter_max=100):
        # dimensionality of the data space
        self.ndim = np.size(X, 1)

        # estimate the parameters by maximum likelihood and use them as initial values
        self.fit(X)

        # initialize the precision parameters (first estimate)
        self.alpha = self.ndim / np.sum(self.W ** 2, axis=0)

        # center the data
        D = X - self.mean

        # repeat the EM algorithm the specified number of times
        for i in range(iter_max):
            # E step: sufficient statistics of z
            Ez, Ezz = self.expectation(D)

            # M step: estimates of W and sigma^2
            self.maximize(D, Ez, Ezz)

            # PRML Eq. (12.62): hyperparameter update
            self.alpha = self.ndim / np.sum(self.W ** 2, axis=0).clip(min=1e-10)

    # E step: compute the sufficient statistics E[z] and E[zz^T]
    def expectation(self, D):
        # PRML Eq. (12.41)
        M = self.W.T.dot(self.W) + self.var * np.eye(self.n_component)
        Minv = np.linalg.inv(M)

        # PRML Eq. (12.54): E[z]
        Ez = D.dot(self.W).dot(Minv)

        # PRML Eq. (12.55): E[zz^T]
        Ezz = self.var * Minv + np.einsum('ni,nj->nij', Ez, Ez)
        return Ez, Ezz

    # M step: estimates of W and sigma^2
    def maximize(self, D, Ez, Ezz):
        # PRML Eq. (12.63): estimate of W
        self.W = D.T.dot(Ez).dot(np.linalg.inv(np.sum(Ezz, axis=0) + self.var * np.diag(self.alpha)))

        # PRML Eq. (12.57): estimate of sigma^2
        self.var = np.mean(
            np.mean(D ** 2, axis=-1)
            - 2 * np.mean(Ez.dot(self.W.T) * D, axis=-1)
            + np.trace(Ezz.dot(self.W.T).dot(self.W).T) / self.ndim)
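The class above only estimates the parameters. If you also want the projection into the latent space mentioned at the beginning, the posterior mean $E[{\bf z}|{\bf x}]={\bf M}^{-1}{\bf W}^\top({\bf x}-{\bf \mu})$ from PRML Eq. (12.48) can be used. A minimal sketch of such a method (hypothetical, not part of the original class) might look like this:

    # hypothetical helper, not in the original article:
    # project data into the latent space via the posterior mean E[z|x] (PRML Eq. (12.48))
    def transform(self, X):
        M = self.W.T.dot(self.W) + self.var * np.eye(self.n_component)
        return (X - self.mean).dot(self.W).dot(np.linalg.inv(M))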
This time we reproduce PRML Figure 12.14. To do that we need a function that draws a Hinton diagram, which represents each element of a matrix as a square. The code below is a slightly modified version of a page found by googling "matplotlib hinton".
def hinton(matrix, max_weight=None, ax=None):
    """Draw Hinton diagram for visualizing a weight matrix."""
    ax = ax if ax is not None else plt.gca()

    if not max_weight:
        max_weight = 2 ** np.ceil(np.log(np.abs(matrix).max()) / np.log(2))

    ax.patch.set_facecolor('gray')
    ax.set_aspect('equal', 'box')
    ax.xaxis.set_major_locator(plt.NullLocator())
    ax.yaxis.set_major_locator(plt.NullLocator())

    for (x, y), w in np.ndenumerate(matrix):
        color = 'white' if w > 0 else 'black'
        size = np.sqrt(np.abs(w) / max_weight)
        rect = plt.Rectangle([y - size / 2, x - size / 2], size, size,
                             facecolor=color, edgecolor=color)
        ax.add_patch(rect)

    ax.autoscale_view()
    ax.invert_yaxis()
    plt.xlim(-0.5, np.size(matrix, 1) - 0.5)
    plt.ylim(-0.5, len(matrix) - 0.5)
    plt.show()
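As a quick check (not part of the original script), the function can be called on any small matrix, for example:

# illustrative only: Hinton diagram of a random 5x3 matrix
hinton(np.random.uniform(-1, 1, (5, 3)))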
def create_toy_data(sample_size=100, ndim_hidden=1, ndim_observe=2, std=1.):
    Z = np.random.normal(size=(sample_size, ndim_hidden))
    mu = np.random.uniform(-5, 5, size=(ndim_observe))
    W = np.random.uniform(-5, 5, (ndim_hidden, ndim_observe))

    # PRML Eq. (12.33)
    X = Z.dot(W) + mu + np.random.normal(scale=std, size=(sample_size, ndim_observe))
    return X
def main():
    # create 100 data points by projecting from a 3-dimensional latent space into a 10-dimensional space
    X = create_toy_data(sample_size=100, ndim_hidden=3, ndim_observe=10, std=1.)

    # maximum likelihood PCA with a 9-dimensional latent space
    pca = PCA(9)
    pca.fit(X)
    hinton(pca.W)

    # Bayesian PCA, pruning down from the maximum of 9 dimensions
    pca.fit_bayesian(X)
    hinton(pca.W)
pca.py
import matplotlib.pyplot as plt
import numpy as np


class PCA(object):

    def __init__(self, n_component):
        self.n_component = n_component

    def fit(self, X):
        self.mean = np.mean(X, axis=0)
        cov = np.cov(X, rowvar=False)
        values, vectors = np.linalg.eigh(cov)
        index = np.size(X, 1) - self.n_component
        if index == 0:
            self.var = 0
        else:
            self.var = np.mean(values[:index])
        self.W = vectors[:, index:].dot(np.sqrt(np.diag(values[index:]) - self.var * np.eye(self.n_component)))

    def fit_bayesian(self, X, iter_max=100):
        self.ndim = np.size(X, 1)
        self.fit(X)
        self.alpha = self.ndim / np.sum(self.W ** 2, axis=0)
        D = X - self.mean
        for i in range(iter_max):
            Ez, Ezz = self.expectation(D)
            self.maximize(D, Ez, Ezz)
            self.alpha = self.ndim / np.sum(self.W ** 2, axis=0).clip(min=1e-10)

    def expectation(self, D):
        M = self.W.T.dot(self.W) + self.var * np.eye(self.n_component)
        Minv = np.linalg.inv(M)
        Ez = D.dot(self.W).dot(Minv)
        Ezz = self.var * Minv + np.einsum('ni,nj->nij', Ez, Ez)
        return Ez, Ezz

    def maximize(self, D, Ez, Ezz):
        self.W = D.T.dot(Ez).dot(np.linalg.inv(np.sum(Ezz, axis=0) + self.var * np.diag(self.alpha)))
        self.var = np.mean(
            np.mean(D ** 2, axis=-1)
            - 2 * np.mean(Ez.dot(self.W.T) * D, axis=-1)
            + np.trace(Ezz.dot(self.W.T).dot(self.W).T) / self.ndim)


def hinton(matrix, max_weight=None, ax=None):
    """Draw Hinton diagram for visualizing a weight matrix."""
    ax = ax if ax is not None else plt.gca()

    if not max_weight:
        max_weight = 2 ** np.ceil(np.log(np.abs(matrix).max()) / np.log(2))

    ax.patch.set_facecolor('gray')
    ax.set_aspect('equal', 'box')
    ax.xaxis.set_major_locator(plt.NullLocator())
    ax.yaxis.set_major_locator(plt.NullLocator())

    for (x, y), w in np.ndenumerate(matrix):
        color = 'white' if w > 0 else 'black'
        size = np.sqrt(np.abs(w) / max_weight)
        rect = plt.Rectangle([y - size / 2, x - size / 2], size, size,
                             facecolor=color, edgecolor=color)
        ax.add_patch(rect)

    ax.autoscale_view()
    ax.invert_yaxis()
    plt.xlim(-0.5, np.size(matrix, 1) - 0.5)
    plt.ylim(-0.5, len(matrix) - 0.5)
    plt.show()


def create_toy_data(sample_size=100, ndim_hidden=1, ndim_observe=2, std=1.):
    Z = np.random.normal(size=(sample_size, ndim_hidden))
    mu = np.random.uniform(-5, 5, size=(ndim_observe))
    W = np.random.uniform(-5, 5, (ndim_hidden, ndim_observe))
    X = Z.dot(W) + mu + np.random.normal(scale=std, size=(sample_size, ndim_observe))
    return X


def main():
    X = create_toy_data(sample_size=100, ndim_hidden=3, ndim_observe=10, std=1.)
    pca = PCA(9)
    pca.fit(X)
    hinton(pca.W)
    pca.fit_bayesian(X)
    hinton(pca.W)


if __name__ == '__main__':
    main()
The first Hinton diagram shows the projection matrix W estimated by maximum likelihood PCA. After pruning the latent-space dimensions with Bayesian principal component analysis, the projection matrix W becomes as shown in the second diagram: the six columns on the left have vanished and the corresponding dimensions have been pruned, so the method captures the fact that the data were projected from a three-dimensional space. This reproduces the result shown in PRML Figure 12.14.
From this point on, PRML increasingly combines what has been learned so far and applies it to new models. Here, by combining automatic relevance determination from Chapter 6 with the EM algorithm from Chapter 9, we fit a model in which the observed data are projections from a lower-dimensional space. The same approach can apparently also be applied to data with missing values, so I would like to try that as well.