We compare the theory of principal component analysis (PCA) and two of its extensions (Kernel PCA and 2DPCA [^1]), which are well-known methods for dimensionality reduction and feature extraction, together with their implementation in Python.
Given $M$ data points $a_1, \ldots, a_M \in \mathbb{R}^n$ with mean $\bar{a} = \frac1M\sum_{i=1}^M a_i$, the first principal component $x_1 \in \mathbb{R}^n$ is obtained from the centered data as the solution of
\begin{align}
\max_{x\in\mathbb{R}^n} &\quad \frac1M\sum_{i=1}^M\left((a_i - \bar{a})^\top x\right)^2\\
\text{s.t.} &\quad \|x\|_2 = 1
\end{align}
Here, with the variance-covariance matrix of the data $C := \frac1M \sum_{i=1}^M (a_i - \bar{a})(a_i - \bar{a})^\top$, the objective function can be rewritten as
\begin{align}
\max_{x\in\mathbb{R}^n} \ \frac1M\sum_{i=1}^M\left((a_i - \bar{a})^\top x\right)^2 &= \max_{x\in\mathbb{R}^n} \ x^\top\left[\frac1M\sum_{i=1}^M(a_i - \bar{a})(a_i - \bar{a})^\top\right] x\\
&= \max_{x\in\mathbb{R}^n}\ x^\top C x
\end{align}
Therefore, the first principal component $x_1$ is
x_1 = \mathop{\mathrm{arg~max}}_{x\in\mathbb{R}^n}\left\{x^\top C x \mid \|x\|_2 = 1\right\}
which is the eigenvector corresponding to the largest eigenvalue of the variance-covariance matrix $C$.
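As a minimal illustration, the first principal component can be computed directly from the eigendecomposition of $C$. The following NumPy sketch uses randomly generated data; all variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))               # M = 200 samples in R^5

a_bar = A.mean(axis=0)
C = (A - a_bar).T @ (A - a_bar) / len(A)    # variance-covariance matrix

# eigh returns eigenvalues of a symmetric matrix in ascending order,
# so the last eigenvector corresponds to the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
x1 = eigvecs[:, -1]                         # first principal component
```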
For the second principal component, the direction of the first principal component is subtracted from the centered data $a_i - \bar{a}$, and the same maximization as for the first principal component is performed. In other words, we solve
\begin{align}
\max_{x\in\mathbb{R}^n} &\quad \frac1M\sum_{i=1}^M\left(\left(a_i - \bar{a} - \left(x_1^\top (a_i - \bar{a})\right)x_1\right)^\top x\right)^2\\
\text{s.t.} &\quad \|x\|_2 = 1
\end{align}
Here, the deflated data can be rewritten as
\begin{align}
a_i - \bar{a} - \left(x_1^\top (a_i - \bar{a})\right)x_1 &= a_i - \bar{a} - \left(x_1x_1^\top\right)(a_i - \bar{a})\\
&= \left(I_n - x_1x_1^\top\right)(a_i - \bar{a})
\end{align}
Let $P_1 = I_n - x_1x_1^\top$, let the eigenvalues of the covariance matrix $C$ be $\lambda_1 \geq \cdots \geq \lambda_n$, and let the corresponding eigenvectors be $v_1, \ldots, v_n$. From the orthonormality of the eigenvectors of a real symmetric matrix,
\begin{cases}
P_1v_i = 0 & (i = 1)\\
P_1v_i = v_i & (i \neq 1)
\end{cases}
holds. Moreover,
\begin{align}
\max_{x\in\mathbb{R}^n} \ \frac1M\sum_{i=1}^M\left(\left(a_i - \bar{a} - \left(x_1^\top (a_i - \bar{a})\right)x_1\right)^\top x\right)^2 &= \max_{x\in\mathbb{R}^n} \ \frac1M\sum_{i=1}^M\left((a_i - \bar{a})^\top P_1 x\right)^2\\
&= \max_{x\in\mathbb{R}^n}\ (P_1x)^\top C (P_1x)
\end{align}
Therefore, the second principal component $x_2$ is
x_2 = \mathop{\mathrm{arg~max}}_{x\in\mathbb{R}^n}\left\{(P_1x)^\top C (P_1x) \mid \|x\|_2 = 1\right\}
which is the eigenvector corresponding to the second-largest eigenvalue $\lambda_2$ of the covariance matrix $C$.
A similar argument shows that the $k$-th principal component $x_k$ is the eigenvector corresponding to $\lambda_k$, so principal component analysis of the data $a_1, \ldots, a_M$ reduces to the eigenvalue problem of the variance-covariance matrix $C$.
Suppose $d$ principal components $x_1, \ldots, x_d$ have been obtained from the data $a_1, \ldots, a_M$, and consider reducing the dimension of a new data point $a \in \mathbb{R}^n$. Approximating $a$ by a linear combination of the $d$ principal components and writing $X_d = [x_1, \ldots, x_d]$, we have
a \approx X_d y + \bar{a}
Therefore, from the orthonormality of the principal components $\left(X_d^\top X_d = I_d\right)$, the coordinates
y = X_d^\top (a - \bar{a})
give the $d$-dimensional representation of $a$.
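The reduction and the approximate reconstruction can be sketched as follows (again with NumPy and random data of my own; $d = 2$ is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))                 # M = 200 samples in R^5
a_bar = A.mean(axis=0)
C = (A - a_bar).T @ (A - a_bar) / len(A)

# top-d principal components as columns (eigh gives ascending eigenvalues)
d = 2
X_d = np.linalg.eigh(C)[1][:, ::-1][:, :d]

a_new = rng.normal(size=5)                    # a new data point in R^5
y = X_d.T @ (a_new - a_bar)                   # d-dimensional representation
a_approx = X_d @ y + a_bar                    # approximate reconstruction of a_new
```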
We now turn to Kernel PCA. Let $\phi$ be the feature map from the data space $\mathbb{R}^n$ to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$. Furthermore, define
\bar{\phi} = \frac1M \sum_{i=1}^M \phi(a_i)\\
\Phi = [\phi(a_1),\ldots, \phi(a_M)]\\
\tilde{\Phi} = \left[\phi(a_1) - \bar{\phi},\ldots, \phi(a_M) - \bar{\phi}\right]
Then, the objective function of PCA is
\begin{align}
\frac1M\sum_{i=1}^M\left(\left(\phi(a_i) - \bar{\phi}\right)^\top x\right)^2 &= \frac1M x^\top\left[\sum_{i=1}^M\left(\phi(a_i) - \bar{\phi}\right)\left(\phi(a_i) - \bar{\phi}\right)^\top\right] x\\
&= \frac1M x^\top\tilde\Phi\tilde\Phi^\top x
\end{align}
Hence principal component analysis in the RKHS reduces to the eigenvalue problem of $\frac1M\tilde\Phi\tilde\Phi^\top$, so we consider
\frac1M\tilde\Phi\tilde\Phi^\top x = \lambda x
Setting $z = \tilde\Phi^\top x$, the eigenvalue equation gives $\tilde\Phi z = \tilde\Phi\tilde\Phi^\top x = M\lambda x$, and hence
x = \frac1{M\lambda}\tilde\Phi z
Substituting this into the left- and right-hand sides of the eigenvalue equation, respectively, gives
\begin{align}
\frac1M\tilde\Phi\tilde\Phi^\top x &= \frac1M\tilde\Phi\tilde\Phi^\top\left(\frac1{M\lambda}\tilde\Phi z\right)\\
&= \frac1{M^2\lambda}\tilde\Phi\left(\tilde\Phi^\top\tilde\Phi\right)z\\
\lambda x &= \frac1M\tilde\Phi z
\end{align}
Equating the two expressions,
\begin{align}
\frac1{M^2\lambda}\tilde\Phi\left(\tilde\Phi^\top\tilde\Phi\right)z = \frac1M\tilde\Phi z \iff \tilde\Phi\left(\frac1M\tilde\Phi^\top\tilde\Phi\right)z = \tilde\Phi \lambda z
\end{align}
Therefore, it suffices to find $z$ satisfying $\frac1M\tilde\Phi^\top\tilde\Phi z = \lambda z$. Here, writing the kernel as $k(a_i, a_j) = \phi(a_i)^\top \phi(a_j)$, the entries of the matrix $\tilde\Phi^\top\tilde\Phi \in \mathbb{R}^{M \times M}$ are
\begin{align}
\left(\tilde\Phi^\top\tilde\Phi\right)_{i,j} &= (\phi(a_i) - \bar{\phi})^\top(\phi(a_j) - \bar{\phi})\\
&= k(a_i, a_j) - \frac1M\sum_{\ell=1}^Mk(a_i, a_\ell) - \frac1M\sum_{\ell=1}^Mk(a_\ell, a_j) + \frac1{M^2}\sum_{\ell, m=1}^Mk(a_\ell, a_m)
\end{align}
Therefore, letting $\mathbf{1} \in \mathbb{R}^{M \times M}$ be the matrix whose entries are all $1$ and writing $\Phi^\top\Phi = K$, this is computed as
\tilde\Phi^\top\tilde\Phi = K - \frac1M K\mathbf{1} - \frac1M\mathbf{1} K+ \frac1{M^2}\mathbf{1} K \mathbf{1}
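As a quick sanity check of this centering formula, the following sketch (using a simple linear kernel and random data of my own) verifies that the matrix expression matches the element-wise definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = X @ X.T                     # Gram matrix of a linear kernel, as an example
M = K.shape[0]
ones = np.ones((M, M))

# matrix form of the centering
K_tilde = K - K @ ones / M - ones @ K / M + ones @ K @ ones / M**2

# element-wise definition for comparison
K_check = K - K.mean(axis=1, keepdims=True) - K.mean(axis=0, keepdims=True) + K.mean()
assert np.allclose(K_tilde, K_check)
```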
From the above, the eigenvector $x$ corresponding to the eigenvalue $\lambda$ of $\frac1M\tilde\Phi\tilde\Phi^\top$ is obtained from the eigenvector $z \in \mathbb{R}^M$ corresponding to the eigenvalue $\lambda$ of the $M \times M$ matrix $\frac1M\tilde\Phi^\top\tilde\Phi$ as
x = \frac1{M\lambda}\tilde\Phi z
In other words, Kernel PCA can be carried out using only kernel evaluations, without computing $\phi$ explicitly.
Suppose $d$ principal components $x_1, \ldots, x_d$ have been obtained from the data $a_1, \ldots, a_M$, and consider reducing the dimension of a new data point $a \in \mathbb{R}^n$. Approximating $\phi(a)$ by a linear combination of the $d$ principal components and writing $X_d = [x_1, \ldots, x_d]$, we have
\phi(a) \approx X_d y + \bar{\phi}
Therefore, from the orthonormality of the principal components $\left(X_d^\top X_d = I_d\right)$, the coordinates
y = X_d^\top \left(\phi(a) - \bar{\phi}\right)
give the $d$-dimensional representation of $\phi(a)$.
(Supplement) In order for the principal components to be orthonormal in $\mathcal{H}$, i.e.
\left<x_i, x_j\right>_{\mathcal H} = \delta_{i, j}
to hold, each $z_i$ is normalized so that
\left<x_i, x_i\right>_{\mathcal H} = \frac1{M\lambda_i^2}z_i^\top \left(\frac1M\tilde\Phi^\top\tilde\Phi \right)z_i = \frac{1}{M\lambda_i}\|z_i\|_2^2 = 1
is satisfied.
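A minimal NumPy sketch of this procedure on the training data follows; the RBF kernel, the random data, and all variable names are my own choices for illustration. With the normalization above, the projection of a training sample $a_j$ onto $x_i$ is simply the $j$-th entry of $z_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))               # M = 100 samples in R^3
M = A.shape[0]

# Gram matrix K_{ij} = k(a_i, a_j) with an RBF kernel
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)

# centered Gram matrix
ones = np.ones((M, M))
K_tilde = K - K @ ones / M - ones @ K / M + ones @ K @ ones / M**2

# eigenvectors z_i of (1/M) K_tilde, sorted by descending eigenvalue
eigvals, Z = np.linalg.eigh(K_tilde / M)
eigvals, Z = eigvals[::-1], Z[:, ::-1]

# rescale so that ||z_i||_2^2 = M * lambda_i, i.e. <x_i, x_i>_H = 1
d = 2
Z_d = Z[:, :d] * np.sqrt(M * eigvals[:d])

# embedding of the training samples onto the first d principal components in H
Y = Z_d                                     # shape (M, d)
```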
Given two-dimensional data $A_1, \ldots, A_M \in \mathbb{R}^{m \times n}$ such as images, PCA flattens each sample into a one-dimensional vector and works on $\mathbb{R}^{mn}$, whereas Two-Dimensional PCA (2DPCA) handles the data directly in its two-dimensional form.
Consider mapping a random matrix $A \in \mathbb{R}^{m \times n}$ to $m$ dimensions by an $n$-dimensional vector $x$, that is, obtaining the feature vector $y \in \mathbb{R}^m$ by
y = A x \tag{1}
As a criterion for choosing $x$, consider the variance-covariance matrix of $y$:
\begin{align}
S_x &= \mathbb{E}\left[\left(y - \mathbb{E}[y]\right)\left(y - \mathbb{E}[y]\right)^\top\right]\\
&= \mathbb{E}\left[\left(Ax - \mathbb{E}[Ax]\right)\left(Ax - \mathbb{E}[Ax]\right)^\top\right]\\
&= \mathbb{E}\left[\left(A - \mathbb{E}[A]\right)x(\left(A - \mathbb{E}[A]\right)x)^\top\right]\\
\end{align}
We maximize its trace $\mathrm{tr}\,S_x$, that is, the total variance of $y$. It can be rewritten as
\begin{align}
\mathrm{tr}\,S_x &= \mathrm{tr}\,\mathbb{E}\left[\left(A - \mathbb{E}[A]\right)x(\left(A - \mathbb{E}[A]\right)x)^\top\right]\\
&= \mathrm{tr}\,\mathbb{E}\left[x^\top\left(A - \mathbb{E}[A]\right)^\top\left(A - \mathbb{E}[A]\right)x\right]\\
&= x^\top\mathbb{E}\left[\left(A - \mathbb{E}[A]\right)^\top\left(A - \mathbb{E}[A]\right)\right]x
\end{align}
Therefore, when the data $A_1, \ldots, A_M \in \mathbb{R}^{m \times n}$ are observed, define
\bar{A} = \frac1M\sum_{i=1}^MA_i\\
G_t = \frac1M\sum_{i=1}^M\left(A_i - \bar{A}\right)^\top\left(A_i - \bar{A}\right)
Then $\mathrm{tr}\,S_x$ is estimated by
\mathrm{tr}\,S_x = x^\top G_t x
The first projection direction is then
x_1 = \mathop{\mathrm{arg~max}}_{x\in\mathbb{R}^n}\left\{x^\top G_t x \mid \|x\|_2 = 1\right\}
which again reduces to the problem of finding the eigenvector of $G_t$ corresponding to its largest eigenvalue.
When $d$ eigenvectors $x_1, \ldots, x_d$ of $G_t$ have been obtained from the data $A_1, \ldots, A_M$, consider the dimensionality reduction of a new data point $A \in \mathbb{R}^{m \times n}$. With $X_d = [x_1, \ldots, x_d] \in \mathbb{R}^{n \times d}$, equation (1) gives
Y_d = AX_d
which reduces $A$ to $m \times d$ dimensions. Conversely,
A \approx Y_dX_d^\top
approximately restores the original data.
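These formulas can be sketched with NumPy as follows (random matrices stand in for images, and all names are my own; the full class implementation used in the experiments appears at the end).

```python
import numpy as np

rng = np.random.default_rng(0)
A_data = rng.normal(size=(50, 8, 6))               # M = 50 samples of size m x n = 8 x 6
A_bar = A_data.mean(axis=0)

# G_t = (1/M) sum_i (A_i - A_bar)^T (A_i - A_bar), an n x n matrix
G_t = np.mean([(Ai - A_bar).T @ (Ai - A_bar) for Ai in A_data], axis=0)

# top-d eigenvectors of G_t (eigh gives ascending order)
d = 3
X_d = np.linalg.eigh(G_t)[1][:, ::-1][:, :d]       # n x d

A_new = rng.normal(size=(8, 6))
Y_d = A_new @ X_d                                  # m x d feature matrix
A_rec = Y_d @ X_d.T                                # approximate reconstruction, m x n
```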
| | PCA | 2DPCA |
|---|---|---|
| Input dimension | $mn$ | $m \times n$ |
| Compression dimension | $d$ | $m \times d$ |
| Dimensions of the variance-covariance matrix | $mn \times mn$ | $n \times n$ |
For PCA and [Kernel PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html) we use the scikit-learn implementations; for 2DPCA we use our own implementation, which is given at the end.
As an example, we use MNIST, a dataset of 28x28 handwritten digit images. As shown below, 6902 images of the digit 0 are used for training, and one image of 0 and one image of 1 are used for verification.
```python
from sklearn import datasets

mnist = datasets.fetch_openml('mnist_784', data_home='../../data/')
mnist_data = mnist.data / 255.
mnist_0 = mnist_data[mnist.target == '0']
mnist_shape = (28, 28)

train = mnist_0[:-1]
test_0 = mnist_0[-1].reshape(1, -1)
test_1 = mnist_data[mnist.target == '1'][-1].reshape(1, -1)
```
Five eigenvectors are learned from the variance-covariance matrix. Note that `KernelPCA` cannot restore data from the reduced representation with `inverse_transform` unless `fit_inverse_transform=True` is specified.
```python
from sklearn.decomposition import PCA, KernelPCA

n_components = 5
%time pca = PCA(n_components=n_components).fit(train)
%time kpca = KernelPCA(n_components=n_components, fit_inverse_transform=True).fit(train)
%time tdpca = TwoDimensionalPCA(n_components=n_components).fit(train.reshape(-1, *mnist_shape))
```
The training times, in the order PCA, Kernel PCA, 2DPCA, are as follows; Kernel PCA is by far the most expensive.
```
CPU times: user 545 ms, sys: 93.9 ms, total: 639 ms
Wall time: 310 ms
CPU times: user 10.9 s, sys: 1.22 s, total: 12.1 s
Wall time: 7.8 s
CPU times: user 97.6 ms, sys: 59.3 ms, total: 157 ms
Wall time: 195 ms
```
Dimensionality reduction is performed with `transform`, and the image of 0 is restored with `inverse_transform`.
```python
%time pca_result_0 = pca.inverse_transform(pca.transform(test_0))
%time kpca_result_0 = kpca.inverse_transform(kpca.transform(test_0))
%time tdpca_result_0 = tdpca.inverse_transform(tdpca.transform(test_0.reshape(1, *mnist_shape)))
```
The computation times, in the order PCA, Kernel PCA, 2DPCA, are as follows.
```
CPU times: user 1.17 ms, sys: 2.72 ms, total: 3.89 ms
Wall time: 4.26 ms
CPU times: user 14.3 ms, sys: 2.24 ms, total: 16.5 ms
Wall time: 11.5 ms
CPU times: user 90 µs, sys: 3 µs, total: 93 µs
Wall time: 103 µs
```
Restoration result
Dimensionality reduction and restoration were also performed for the image of 1.
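For reference, the calls for the image of 1 mirror those for the image of 0 (a sketch assuming the fitted `pca`, `kpca`, and `tdpca` objects and the `test_1` array defined above):

```python
pca_result_1 = pca.inverse_transform(pca.transform(test_1))
kpca_result_1 = kpca.inverse_transform(kpca.transform(test_1))
tdpca_result_1 = tdpca.inverse_transform(tdpca.transform(test_1.reshape(1, *mnist_shape)))
```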
The executed notebook is available on GitHub.
```python
from typing import Union

import numpy as np
from scipy.linalg import eigh
from sklearn.exceptions import NotFittedError


class TwoDimensionalPCA:
    def __init__(self, n_components: Union[int, None] = None):
        self.n_components_ = n_components
        self.components_ = None
        self.mean_ = None

    def fit(self, X: np.ndarray):
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, width)
        """
        if X.ndim != 3:
            raise ValueError(f"Expected 3D array, got {X.ndim}D array instead")
        self.mean_ = np.mean(X, axis=0)
        # G_t = (1/M) sum_i (A_i - A_bar)^T (A_i - A_bar)
        cov = np.mean([x.T @ x for x in X - self.mean_], axis=0)
        n_features = cov.shape[0]
        if self.n_components_ is None or self.n_components_ > n_features:
            self.n_components_ = n_features
        # eigh returns eigenvalues in ascending order; take the eigenvectors of the
        # top n_components eigenvalues and reverse them into descending order.
        self.components_ = eigh(cov, eigvals=(n_features - self.n_components_, n_features - 1))[1][:, ::-1]
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, width)

        Returns
        -------
        : np.ndarray, shape (n_samples, height, n_components)
        """
        if self.components_ is None:
            raise NotFittedError(
                "This PCA instance is not fitted yet. "
                "Call 'fit' with appropriate arguments before using this estimator.")
        if X.ndim != 3:
            raise ValueError(f"Expected 3D array, got {X.ndim}D array instead")
        return np.array([x @ self.components_ for x in X])

    def inverse_transform(self, X: np.ndarray) -> np.ndarray:
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, n_components)

        Returns
        -------
        : np.ndarray, shape (n_samples, height, width)
        """
        return np.array([x @ self.components_.T for x in X])

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, width)

        Returns
        -------
        : np.ndarray, shape (n_samples, height, n_components)
        """
        self.fit(X)
        return self.transform(X)
```
[^2]: Kenji Fukumizu, "Introduction to Kernel Methods: Data Analysis with Positive Definite Kernels", Asakura Shoten, 2010.