In this article, we implement Bayesian principal component analysis. The main use of principal component analysis (PCA) is to find a projection from the observation space (high-dimensional), where the data live, to a latent space (low-dimensional). For visualization the latent space is usually made two- (or three-) dimensional, but when PCA is used for data preprocessing it is rarely obvious how many latent dimensions should be kept. One can compute contribution ratios, but in the end a threshold still has to be chosen by hand. In Bayesian principal component analysis, the dimensionality of the latent space is instead determined automatically via automatic relevance determination.
Interpreting principal component analysis probabilistically is what later allows us to treat it in a Bayesian manner. In probabilistic principal component analysis, an observed data point ${\bf x}$ ($D$-dimensional) is obtained by projecting a latent variable ${\bf z}$ ($M$-dimensional), sampled from the latent space, with a matrix ${\bf W}$ (a $D \times M$ matrix), then translating it by ${\bf \mu}$ ($D$-dimensional) and adding noise ${\bf \epsilon}$ ($D$-dimensional):
{\bf x} = {\bf W}{\bf z} + {\bf \mu} + {\bf \epsilon}
Now assume that ${\bf z}$ and ${\bf \epsilon}$ each follow a Gaussian distribution. There are three parameters to estimate, ${\bf W}$, ${\bf \mu}$, and $\sigma^2$ (the variance of the Gaussian distribution followed by ${\bf \epsilon}$), and the likelihood function is as follows.
p({\bf x}|{\bf W},{\bf \mu},\sigma^2) = \int p({\bf x}|{\bf Wz}+{\bf \mu},\sigma^2)p({\bf z}){\rm d}{\bf z}
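Since both $p({\bf z})$ and the conditional distribution of ${\bf x}$ are Gaussian, this marginalization can be carried out analytically (PRML Eqs. (12.35), (12.36)); the resulting marginal is again Gaussian:

p({\bf x}|{\bf W},{\bf \mu},\sigma^2) = \mathcal{N}({\bf x}|{\bf \mu},{\bf C}),\qquad {\bf C} = {\bf W}{\bf W}^\top + \sigma^2{\bf I}

This is the form that is actually maximized with respect to the parameters.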
PRML introduces two methods for maximizing this likelihood function. The first simply uses an eigenvalue (singular value) decomposition; the second uses the EM algorithm, alternating between updating the posterior distribution over ${\bf z}$ (E step) and maximizing the expected complete-data log likelihood of ${\bf x}, {\bf z}$ (M step).
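For reference, the closed-form maximum likelihood solution implemented below follows PRML Eqs. (12.45) and (12.46): with ${\bf U}_M$ the eigenvectors of the data covariance matrix belonging to the $M$ largest eigenvalues, ${\bf L}_M$ the diagonal matrix of those eigenvalues, and ${\bf R}$ an arbitrary orthogonal matrix (the code simply takes ${\bf R}={\bf I}$),

{\bf W}_{\rm ML} = {\bf U}_M({\bf L}_M - \sigma^2{\bf I})^{1/2}{\bf R},\qquad \sigma^2_{\rm ML} = {1\over D-M}\sum_{i=M+1}^D\lambda_i

where the $\lambda_i$ in the second expression are the discarded (smallest) eigenvalues.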
In the model above, the dimensionality of the latent space was fixed at $M$. Bayesian principal component analysis uses automatic relevance determination to prune the surplus dimensions (although the value of $M$ itself does not actually decrease). To this end, the parameter ${\bf W}$ is given the following prior distribution.
p({\bf W}|{\bf \alpha}) = \prod_{i=1}^M\left({\alpha_i\over2\pi}\right)^{D/2}\exp\left\{-{1\over2}\alpha_i{\bf w}_i^\top {\bf w}_i\right\}
Here ${\bf w}_i$ is the $i$th column vector of ${\bf W}$, and $\alpha_i$ acts as a precision parameter for the corresponding Gaussian distribution. When $\alpha$ is estimated, some of its components take very large values. Since the precision of such a component is very high, the corresponding column vector of ${\bf W}$ can only be zero; in other words, that dimension is pruned.
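In the EM iterations below, each precision $\alpha_i$ is re-estimated from the current ${\bf W}$ using PRML Eq. (12.62), which is exactly what the `self.alpha` update in the code implements:

\alpha_i^{\rm new} = {D\over {\bf w}_i^\top{\bf w}_i}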
As usual, we use only matplotlib and numpy.
import matplotlib.pyplot as plt
import numpy as np
First, the usual method based on eigenvalue decomposition (maximum likelihood PCA).
# Class for principal component analysis
class PCA(object):

    def __init__(self, n_component):
        # dimensionality of the latent space
        self.n_component = n_component

    # principal component analysis by maximum likelihood
    def fit(self, X):
        # PRML Eq. (12.1): maximum likelihood estimate of mu
        self.mean = np.mean(X, axis=0)

        # PRML Eq. (12.2): data covariance matrix
        cov = np.cov(X, rowvar=False)

        # eigenvalue decomposition
        values, vectors = np.linalg.eigh(cov)
        index = np.size(X, 1) - self.n_component

        # PRML Eq. (12.46): maximum likelihood estimate of sigma^2
        if index == 0:
            self.var = 0
        else:
            self.var = np.mean(values[:index])

        # PRML Eq. (12.45): maximum likelihood estimate of the projection matrix W
        self.W = vectors[:, index:].dot(np.sqrt(np.diag(values[index:]) - self.var * np.eye(self.n_component)))
All of the methods below also belong to the PCA class. For comparison, the parameters are first estimated by maximum likelihood, and those estimates are used as initial values for pruning by automatic relevance determination.
    # Bayesian principal component analysis
    def fit_bayesian(self, X, iter_max=100):
        # dimensionality of the data space
        self.ndim = np.size(X, 1)

        # estimate the parameters by maximum likelihood and use them as initial values
        self.fit(X)

        # initialize the precision parameters (first estimate)
        self.alpha = self.ndim / np.sum(self.W ** 2, axis=0)

        # center the data
        D = X - self.mean

        # repeat the EM algorithm the specified number of times
        for i in range(iter_max):
            # E step: sufficient statistics of z
            Ez, Ezz = self.expectation(D)

            # M step: estimates of W and sigma^2
            self.maximize(D, Ez, Ezz)

            # PRML Eq. (12.62): hyperparameter update
            self.alpha = self.ndim / np.sum(self.W ** 2, axis=0).clip(min=1e-10)

    # E step: compute the sufficient statistics E[z] and E[zz^T]
    def expectation(self, D):
        # PRML Eq. (12.41)
        M = self.W.T.dot(self.W) + self.var * np.eye(self.n_component)
        Minv = np.linalg.inv(M)

        # PRML Eq. (12.54): E[z]
        Ez = D.dot(self.W).dot(Minv)

        # PRML Eq. (12.55): E[zz^T]
        Ezz = self.var * Minv + np.einsum('ni,nj->nij', Ez, Ez)
        return Ez, Ezz

    # M step: estimates of W and sigma^2
    def maximize(self, D, Ez, Ezz):
        # PRML Eq. (12.63): estimate of W
        self.W = D.T.dot(Ez).dot(np.linalg.inv(np.sum(Ezz, axis=0) + self.var * np.diag(self.alpha)))

        # PRML Eq. (12.57): estimate of sigma^2
        self.var = np.mean(
            np.mean(D ** 2, axis=-1)
            - 2 * np.mean(Ez.dot(self.W.T) * D, axis=-1)
            + np.trace(Ezz.dot(self.W.T).dot(self.W).T) / self.ndim)
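The class above only estimates the parameters. If you also want the projection into the latent space mentioned at the beginning, the posterior mean $E[{\bf z}|{\bf x}]={\bf M}^{-1}{\bf W}^\top({\bf x}-{\bf \mu})$ from PRML Eq. (12.48) can be used. A minimal sketch of such a method (hypothetical, not part of the original class) might look like this:

    # hypothetical helper, not in the original article:
    # project data into the latent space via the posterior mean E[z|x] (PRML Eq. (12.48))
    def transform(self, X):
        M = self.W.T.dot(self.W) + self.var * np.eye(self.n_component)
        return (X - self.mean).dot(self.W).dot(np.linalg.inv(M))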
This time we reproduce PRML Figure 12.14. To do that we need a function that draws a Hinton diagram, which represents each element of a matrix as a square. The code below is a slightly modified version of a page found by googling "matplotlib hinton".
def hinton(matrix, max_weight=None, ax=None):
    """Draw Hinton diagram for visualizing a weight matrix."""
    ax = ax if ax is not None else plt.gca()

    if not max_weight:
        max_weight = 2 ** np.ceil(np.log(np.abs(matrix).max()) / np.log(2))

    ax.patch.set_facecolor('gray')
    ax.set_aspect('equal', 'box')
    ax.xaxis.set_major_locator(plt.NullLocator())
    ax.yaxis.set_major_locator(plt.NullLocator())

    for (x, y), w in np.ndenumerate(matrix):
        color = 'white' if w > 0 else 'black'
        size = np.sqrt(np.abs(w) / max_weight)
        rect = plt.Rectangle([y - size / 2, x - size / 2], size, size,
                             facecolor=color, edgecolor=color)
        ax.add_patch(rect)

    ax.autoscale_view()
    ax.invert_yaxis()
    plt.xlim(-0.5, np.size(matrix, 1) - 0.5)
    plt.ylim(-0.5, len(matrix) - 0.5)
    plt.show()
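As a quick check (not part of the original script), the function can be called on any small matrix, for example:

# illustrative only: Hinton diagram of a random 5x3 matrix
hinton(np.random.uniform(-1, 1, (5, 3)))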
def create_toy_data(sample_size=100, ndim_hidden=1, ndim_observe=2, std=1.):
    Z = np.random.normal(size=(sample_size, ndim_hidden))
    mu = np.random.uniform(-5, 5, size=(ndim_observe))
    W = np.random.uniform(-5, 5, (ndim_hidden, ndim_observe))

    # PRML Eq. (12.33)
    X = Z.dot(W) + mu + np.random.normal(scale=std, size=(sample_size, ndim_observe))
    return X
def main():
    # create 100 data points by projecting from a 3-dimensional latent space into a 10-dimensional space
    X = create_toy_data(sample_size=100, ndim_hidden=3, ndim_observe=10, std=1.)

    # maximum likelihood PCA with a 9-dimensional latent space
    pca = PCA(9)
    pca.fit(X)
    hinton(pca.W)

    # Bayesian PCA, pruning down from the maximum of 9 dimensions
    pca.fit_bayesian(X)
    hinton(pca.W)
pca.py
import matplotlib.pyplot as plt
import numpy as np


class PCA(object):

    def __init__(self, n_component):
        self.n_component = n_component

    def fit(self, X):
        self.mean = np.mean(X, axis=0)
        cov = np.cov(X, rowvar=False)
        values, vectors = np.linalg.eigh(cov)
        index = np.size(X, 1) - self.n_component
        if index == 0:
            self.var = 0
        else:
            self.var = np.mean(values[:index])
        self.W = vectors[:, index:].dot(np.sqrt(np.diag(values[index:]) - self.var * np.eye(self.n_component)))

    def fit_bayesian(self, X, iter_max=100):
        self.ndim = np.size(X, 1)
        self.fit(X)
        self.alpha = self.ndim / np.sum(self.W ** 2, axis=0)
        D = X - self.mean
        for i in range(iter_max):
            Ez, Ezz = self.expectation(D)
            self.maximize(D, Ez, Ezz)
            self.alpha = self.ndim / np.sum(self.W ** 2, axis=0).clip(min=1e-10)

    def expectation(self, D):
        M = self.W.T.dot(self.W) + self.var * np.eye(self.n_component)
        Minv = np.linalg.inv(M)
        Ez = D.dot(self.W).dot(Minv)
        Ezz = self.var * Minv + np.einsum('ni,nj->nij', Ez, Ez)
        return Ez, Ezz

    def maximize(self, D, Ez, Ezz):
        self.W = D.T.dot(Ez).dot(np.linalg.inv(np.sum(Ezz, axis=0) + self.var * np.diag(self.alpha)))
        self.var = np.mean(
            np.mean(D ** 2, axis=-1)
            - 2 * np.mean(Ez.dot(self.W.T) * D, axis=-1)
            + np.trace(Ezz.dot(self.W.T).dot(self.W).T) / self.ndim)


def hinton(matrix, max_weight=None, ax=None):
    """Draw Hinton diagram for visualizing a weight matrix."""
    ax = ax if ax is not None else plt.gca()

    if not max_weight:
        max_weight = 2 ** np.ceil(np.log(np.abs(matrix).max()) / np.log(2))

    ax.patch.set_facecolor('gray')
    ax.set_aspect('equal', 'box')
    ax.xaxis.set_major_locator(plt.NullLocator())
    ax.yaxis.set_major_locator(plt.NullLocator())

    for (x, y), w in np.ndenumerate(matrix):
        color = 'white' if w > 0 else 'black'
        size = np.sqrt(np.abs(w) / max_weight)
        rect = plt.Rectangle([y - size / 2, x - size / 2], size, size,
                             facecolor=color, edgecolor=color)
        ax.add_patch(rect)

    ax.autoscale_view()
    ax.invert_yaxis()
    plt.xlim(-0.5, np.size(matrix, 1) - 0.5)
    plt.ylim(-0.5, len(matrix) - 0.5)
    plt.show()


def create_toy_data(sample_size=100, ndim_hidden=1, ndim_observe=2, std=1.):
    Z = np.random.normal(size=(sample_size, ndim_hidden))
    mu = np.random.uniform(-5, 5, size=(ndim_observe))
    W = np.random.uniform(-5, 5, (ndim_hidden, ndim_observe))
    X = Z.dot(W) + mu + np.random.normal(scale=std, size=(sample_size, ndim_observe))
    return X


def main():
    X = create_toy_data(sample_size=100, ndim_hidden=3, ndim_observe=10, std=1.)
    pca = PCA(9)
    pca.fit(X)
    hinton(pca.W)
    pca.fit_bayesian(X)
    hinton(pca.W)


if __name__ == '__main__':
    main()
The first Hinton diagram shows the projection matrix W estimated by maximum likelihood PCA. After pruning the latent-space dimensions with Bayesian principal component analysis, the projection matrix W becomes as shown in the second diagram: the six columns on the left have vanished and the corresponding dimensions have been pruned, so the method captures the fact that the data were projected from a three-dimensional space. This reproduces the result shown in PRML Figure 12.14.
From this point on, PRML increasingly combines what has been learned so far and applies it to new models. Here, by combining automatic relevance determination from Chapter 6 with the EM algorithm from Chapter 9, we fit a model in which the observed data are projections from a lower-dimensional space. The same approach can apparently also be applied to data with missing values, so I would like to try that as well.