We compare the theory of principal component analysis (PCA) and two of its extensions (Kernel PCA and 2DPCA [^1]), which are well-known methods for dimensionality reduction and feature extraction, together with their implementation in Python.
Given $M$ data points $a_1, \ldots, a_M \in \mathbb{R}^n$ with mean $\bar{a} = \frac1M\sum_{i=1}^M a_i$, the first principal component $x_1 \in \mathbb{R}^n$ is obtained from the centered data as the solution of
\begin{align}
\max_{x\in\mathbb{R}^n} &\quad \frac1M\sum_{i=1}^M\left((a_i - \bar{a})^\top x\right)^2\\
\text{s.t.} &\quad \|x\|_2 = 1
\end{align}
Here, with the variance-covariance matrix of the data $C := \frac1M \sum_{i=1}^M (a_i - \bar{a})(a_i - \bar{a})^\top$, the objective function can be rewritten as
\begin{align}
\max_{x\in\mathbb{R}^n} \ \frac1M\sum_{i=1}^M\left((a_i - \bar{a})^\top x\right)^2 &= \max_{x\in\mathbb{R}^n} \ x^\top\left[\frac1M\sum_{i=1}^M(a_i - \bar{a})(a_i - \bar{a})^\top\right] x\\
&= \max_{x\in\mathbb{R}^n}\ x^\top C x
\end{align}
Therefore, the first principal component $x_1$ is
x_1 = \mathop{\mathrm{arg~max}}_{x\in\mathbb{R}^n}\left\{x^\top C x \mid \|x\|_2 = 1\right\}
which is the eigenvector corresponding to the largest eigenvalue of the variance-covariance matrix $C$.
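As a minimal illustration, the first principal component can be computed directly from the eigendecomposition of $C$. The following NumPy sketch uses randomly generated data; all variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))               # M = 200 samples in R^5

a_bar = A.mean(axis=0)
C = (A - a_bar).T @ (A - a_bar) / len(A)    # variance-covariance matrix

# eigh returns eigenvalues of a symmetric matrix in ascending order,
# so the last eigenvector corresponds to the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
x1 = eigvecs[:, -1]                         # first principal component
```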
For the second principal component, the direction of the first principal component is subtracted from the centered data $a_i - \bar{a}$, and the same maximization as for the first principal component is performed. In other words, we solve
\begin{align}
\max_{x\in\mathbb{R}^n} &\quad \frac1M\sum_{i=1}^M\left(\left(a_i - \bar{a} - \left(x_1^\top (a_i - \bar{a})\right)x_1\right)^\top x\right)^2\\
\text{s.t.} &\quad \|x\|_2 = 1
\end{align}
Here, the deflated data can be rewritten as
\begin{align}
a_i - \bar{a} - \left(x_1^\top (a_i - \bar{a})\right)x_1 &= a_i - \bar{a} - \left(x_1x_1^\top\right)(a_i - \bar{a})\\
&= \left(I_n - x_1x_1^\top\right)(a_i - \bar{a})
\end{align}
Let $P_1 = I_n - x_1x_1^\top$, let the eigenvalues of the covariance matrix $C$ be $\lambda_1 \geq \cdots \geq \lambda_n$, and let the corresponding eigenvectors be $v_1, \ldots, v_n$. From the orthonormality of the eigenvectors of a real symmetric matrix,
\begin{cases}
P_1v_i = 0 & (i = 1)\\
P_1v_i = v_i & (i \neq 1)
\end{cases}
holds. Moreover,
\begin{align}
\max_{x\in\mathbb{R}^n} \ \frac1M\sum_{i=1}^M\left(\left(a_i - \bar{a} - \left(x_1^\top (a_i - \bar{a})\right)x_1\right)^\top x\right)^2 &= \max_{x\in\mathbb{R}^n} \ \frac1M\sum_{i=1}^M\left((a_i - \bar{a})^\top P_1 x\right)^2\\
&= \max_{x\in\mathbb{R}^n}\ (P_1x)^\top C (P_1x)
\end{align}
Therefore, the second principal component $x_2$ is
x_2 = \mathop{\mathrm{arg~max}}_{x\in\mathbb{R}^n}\left\{(P_1x)^\top C (P_1x) \mid \|x\|_2 = 1\right\}
which is the eigenvector corresponding to the second-largest eigenvalue $\lambda_2$ of the covariance matrix $C$.
A similar argument shows that the $k$-th principal component $x_k$ is the eigenvector corresponding to $\lambda_k$, so principal component analysis of the data $a_1, \ldots, a_M$ reduces to the eigenvalue problem of the variance-covariance matrix $C$.
Suppose $d$ principal components $x_1, \ldots, x_d$ have been obtained from the data $a_1, \ldots, a_M$, and consider reducing the dimension of a new data point $a \in \mathbb{R}^n$. Approximating $a$ by a linear combination of the $d$ principal components and writing $X_d = [x_1, \ldots, x_d]$, we have
a \approx X_d y + \bar{a}
Therefore, from the orthonormality of the principal components $\left(X_d^\top X_d = I_d\right)$, the coordinates
y = X_d^\top (a - \bar{a})
give the $d$-dimensional representation of $a$.
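The reduction and the approximate reconstruction can be sketched as follows (again with NumPy and random data of my own; $d = 2$ is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))                 # M = 200 samples in R^5
a_bar = A.mean(axis=0)
C = (A - a_bar).T @ (A - a_bar) / len(A)

# top-d principal components as columns (eigh gives ascending eigenvalues)
d = 2
X_d = np.linalg.eigh(C)[1][:, ::-1][:, :d]

a_new = rng.normal(size=5)                    # a new data point in R^5
y = X_d.T @ (a_new - a_bar)                   # d-dimensional representation
a_approx = X_d @ y + a_bar                    # approximate reconstruction of a_new
```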
We now turn to Kernel PCA. Let $\phi$ be the feature map from the data space $\mathbb{R}^n$ to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$. Furthermore, define
\bar{\phi} = \frac1M \sum_{i=1}^M \phi(a_i)\\
\Phi = [\phi(a_1),\ldots, \phi(a_M)]\\
\tilde{\Phi} = \left[\phi(a_1) - \bar{\phi},\ldots, \phi(a_M) - \bar{\phi}\right]
Then, the objective function of PCA is
\begin{align}
\frac1M\sum_{i=1}^M\left(\left(\phi(a_i) - \bar{\phi}\right)^\top x\right)^2 &= \frac1M x^\top\left[\sum_{i=1}^M\left(\phi(a_i) - \bar{\phi}\right)\left(\phi(a_i) - \bar{\phi}\right)^\top\right] x\\
&= \frac1M x^\top\tilde\Phi\tilde\Phi^\top x
\end{align}
Hence principal component analysis in the RKHS reduces to the eigenvalue problem of $\frac1M\tilde\Phi\tilde\Phi^\top$, so we consider
\frac1M\tilde\Phi\tilde\Phi^\top x = \lambda x
Setting $z = \tilde\Phi^\top x$, the eigenvalue equation gives $\tilde\Phi z = \tilde\Phi\tilde\Phi^\top x = M\lambda x$, and hence
x = \frac1{M\lambda}\tilde\Phi z
Substituting this into the left- and right-hand sides of the eigenvalue equation, respectively, gives
\begin{align}
\frac1M\tilde\Phi\tilde\Phi^\top x &= \frac1M\tilde\Phi\tilde\Phi^\top\left(\frac1{M\lambda}\tilde\Phi z\right)\\
&= \frac1{M^2\lambda}\tilde\Phi\left(\tilde\Phi^\top\tilde\Phi\right)z\\
\lambda x &= \frac1M\tilde\Phi z
\end{align}
Equating the two expressions,
\begin{align}
\frac1{M^2\lambda}\tilde\Phi\left(\tilde\Phi^\top\tilde\Phi\right)z = \frac1M\tilde\Phi z \iff \tilde\Phi\left(\frac1M\tilde\Phi^\top\tilde\Phi\right)z = \tilde\Phi \lambda z
\end{align}
Therefore, it suffices to find $z$ satisfying $\frac1M\tilde\Phi^\top\tilde\Phi z = \lambda z$. Here, writing the kernel as $k(a_i, a_j) = \phi(a_i)^\top \phi(a_j)$, the entries of the matrix $\tilde\Phi^\top\tilde\Phi \in \mathbb{R}^{M \times M}$ are
\begin{align}
\left(\tilde\Phi^\top\tilde\Phi\right)_{i,j} &= (\phi(a_i) - \bar{\phi})^\top(\phi(a_j) - \bar{\phi})\\
&= k(a_i, a_j) - \frac1M\sum_{\ell=1}^Mk(a_i, a_\ell) - \frac1M\sum_{\ell=1}^Mk(a_\ell, a_j) + \frac1{M^2}\sum_{\ell, m=1}^Mk(a_\ell, a_m)
\end{align}
Therefore, letting $\mathbf{1} \in \mathbb{R}^{M \times M}$ be the matrix whose entries are all $1$ and writing $\Phi^\top\Phi = K$, this is computed as
\tilde\Phi^\top\tilde\Phi = K - \frac1M K\mathbf{1} - \frac1M\mathbf{1} K+ \frac1{M^2}\mathbf{1} K \mathbf{1}
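As a quick sanity check of this centering formula, the following sketch (using a simple linear kernel and random data of my own) verifies that the matrix expression matches the element-wise definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = X @ X.T                     # Gram matrix of a linear kernel, as an example
M = K.shape[0]
ones = np.ones((M, M))

# matrix form of the centering
K_tilde = K - K @ ones / M - ones @ K / M + ones @ K @ ones / M**2

# element-wise definition for comparison
K_check = K - K.mean(axis=1, keepdims=True) - K.mean(axis=0, keepdims=True) + K.mean()
assert np.allclose(K_tilde, K_check)
```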
From the above, the eigenvector $x$ corresponding to the eigenvalue $\lambda$ of $\frac1M\tilde\Phi\tilde\Phi^\top$ is obtained from the eigenvector $z \in \mathbb{R}^M$ corresponding to the eigenvalue $\lambda$ of the $M \times M$ matrix $\frac1M\tilde\Phi^\top\tilde\Phi$ as
x = \frac1{M\lambda}\tilde\Phi z
In other words, Kernel PCA can be carried out using only kernel evaluations, without computing $\phi$ explicitly.
Suppose $d$ principal components $x_1, \ldots, x_d$ have been obtained from the data $a_1, \ldots, a_M$, and consider reducing the dimension of a new data point $a \in \mathbb{R}^n$. Approximating $\phi(a)$ by a linear combination of the $d$ principal components and writing $X_d = [x_1, \ldots, x_d]$, we have
\phi(a) \approx X_d y + \bar{\phi}
Therefore, from the orthonormality of the principal components $\left(X_d^\top X_d = I_d\right)$, the coordinates
y = X_d^\top \left(\phi(a) - \bar{\phi}\right)
give the $d$-dimensional representation of $\phi(a)$.
(Supplement) In order for the principal components to be orthonormal in $\mathcal{H}$, i.e.
\left<x_i, x_j\right>_{\mathcal H} = \delta_{i, j}
to hold, each $z_i$ is normalized so that
\left<x_i, x_i\right>_{\mathcal H} = \frac1{M\lambda_i^2}z_i^\top \left(\frac1M\tilde\Phi^\top\tilde\Phi \right)z_i = \frac{1}{M\lambda_i}\|z_i\|_2^2 = 1
is satisfied.
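A minimal NumPy sketch of this procedure on the training data follows; the RBF kernel, the random data, and all variable names are my own choices for illustration. With the normalization above, the projection of a training sample $a_j$ onto $x_i$ is simply the $j$-th entry of $z_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))               # M = 100 samples in R^3
M = A.shape[0]

# Gram matrix K_{ij} = k(a_i, a_j) with an RBF kernel
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)

# centered Gram matrix
ones = np.ones((M, M))
K_tilde = K - K @ ones / M - ones @ K / M + ones @ K @ ones / M**2

# eigenvectors z_i of (1/M) K_tilde, sorted by descending eigenvalue
eigvals, Z = np.linalg.eigh(K_tilde / M)
eigvals, Z = eigvals[::-1], Z[:, ::-1]

# rescale so that ||z_i||_2^2 = M * lambda_i, i.e. <x_i, x_i>_H = 1
d = 2
Z_d = Z[:, :d] * np.sqrt(M * eigvals[:d])

# embedding of the training samples onto the first d principal components in H
Y = Z_d                                     # shape (M, d)
```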
Given two-dimensional data $A_1, \ldots, A_M \in \mathbb{R}^{m \times n}$ such as images, PCA flattens each sample into a one-dimensional vector and works on $\mathbb{R}^{mn}$, whereas Two-Dimensional PCA (2DPCA) handles the data directly in its two-dimensional form.
Consider mapping a random matrix $A \in \mathbb{R}^{m \times n}$ to $m$ dimensions by an $n$-dimensional vector $x$, that is, obtaining the feature vector $y \in \mathbb{R}^m$ by
y = A x \tag{1}
As a criterion for choosing $x$, consider the variance-covariance matrix of $y$:
\begin{align}
S_x &= \mathbb{E}\left[\left(y - \mathbb{E}[y]\right)\left(y - \mathbb{E}[y]\right)^\top\right]\\
&= \mathbb{E}\left[\left(Ax - \mathbb{E}[Ax]\right)\left(Ax - \mathbb{E}[Ax]\right)^\top\right]\\
&= \mathbb{E}\left[\left(A - \mathbb{E}[A]\right)x(\left(A - \mathbb{E}[A]\right)x)^\top\right]\\
\end{align}
We maximize its trace $\mathrm{tr}\,S_x$, that is, the total variance of $y$. It can be rewritten as
\begin{align}
\mathrm{tr}\,S_x &= \mathrm{tr}\,\mathbb{E}\left[\left(A - \mathbb{E}[A]\right)x(\left(A - \mathbb{E}[A]\right)x)^\top\right]\\
&= \mathrm{tr}\,\mathbb{E}\left[x^\top\left(A - \mathbb{E}[A]\right)^\top\left(A - \mathbb{E}[A]\right)x\right]\\
&= x^\top\mathbb{E}\left[\left(A - \mathbb{E}[A]\right)^\top\left(A - \mathbb{E}[A]\right)\right]x
\end{align}
Therefore, when the data $A_1, \ldots, A_M \in \mathbb{R}^{m \times n}$ are observed, define
\bar{A} = \frac1M\sum_{i=1}^MA_i\\
G_t = \frac1M\sum_{i=1}^M\left(A_i - \bar{A}\right)^\top\left(A_i - \bar{A}\right)
Then $\mathrm{tr}\,S_x$ is estimated by
\mathrm{tr}\,S_x = x^\top G_t x
The first projection direction is then
x_1 = \mathop{\mathrm{arg~max}}_{x\in\mathbb{R}^n}\left\{x^\top G_t x \mid \|x\|_2 = 1\right\}
which again reduces to the problem of finding the eigenvector of $G_t$ corresponding to its largest eigenvalue.
When $d$ eigenvectors $x_1, \ldots, x_d$ of $G_t$ have been obtained from the data $A_1, \ldots, A_M$, consider the dimensionality reduction of a new data point $A \in \mathbb{R}^{m \times n}$. With $X_d = [x_1, \ldots, x_d] \in \mathbb{R}^{n \times d}$, equation (1) gives
Y_d = AX_d
which reduces $A$ to $m \times d$ dimensions. Conversely,
A \approx Y_dX_d^\top
approximately restores the original data.
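These formulas can be sketched with NumPy as follows (random matrices stand in for images, and all names are my own; the full class implementation used in the experiments appears at the end).

```python
import numpy as np

rng = np.random.default_rng(0)
A_data = rng.normal(size=(50, 8, 6))               # M = 50 samples of size m x n = 8 x 6
A_bar = A_data.mean(axis=0)

# G_t = (1/M) sum_i (A_i - A_bar)^T (A_i - A_bar), an n x n matrix
G_t = np.mean([(Ai - A_bar).T @ (Ai - A_bar) for Ai in A_data], axis=0)

# top-d eigenvectors of G_t (eigh gives ascending order)
d = 3
X_d = np.linalg.eigh(G_t)[1][:, ::-1][:, :d]       # n x d

A_new = rng.normal(size=(8, 6))
Y_d = A_new @ X_d                                  # m x d feature matrix
A_rec = Y_d @ X_d.T                                # approximate reconstruction, m x n
```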
| | PCA | 2DPCA |
|---|---|---|
| Input dimension | $mn$ | $m \times n$ |
| Compression dimension | $d$ | $m \times d$ |
| Dimensions of the variance-covariance matrix | $mn \times mn$ | $n \times n$ |
For PCA and [Kernel PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html) we use the scikit-learn implementations; for 2DPCA we use our own implementation, which is given at the end.
As an example, we use MNIST, a dataset of 28x28 handwritten digit images. As shown below, 6902 images of the digit 0 are used for training, and one image of 0 and one image of 1 are used for verification.
```python
from sklearn import datasets

mnist = datasets.fetch_openml('mnist_784', data_home='../../data/')
mnist_data = mnist.data / 255.
mnist_0 = mnist_data[mnist.target == '0']
mnist_shape = (28, 28)

train = mnist_0[:-1]
test_0 = mnist_0[-1].reshape(1, -1)
test_1 = mnist_data[mnist.target == '1'][-1].reshape(1, -1)
```
Five eigenvectors are learned from the variance-covariance matrix. Note that `KernelPCA` cannot restore data from the reduced representation with `inverse_transform` unless `fit_inverse_transform=True` is specified.
```python
from sklearn.decomposition import PCA, KernelPCA

n_components = 5
%time pca = PCA(n_components=n_components).fit(train)
%time kpca = KernelPCA(n_components=n_components, fit_inverse_transform=True).fit(train)
%time tdpca = TwoDimensionalPCA(n_components=n_components).fit(train.reshape(-1, *mnist_shape))
```
The training times, in the order PCA, Kernel PCA, 2DPCA, are as follows; Kernel PCA is by far the most expensive.
```
CPU times: user 545 ms, sys: 93.9 ms, total: 639 ms
Wall time: 310 ms
CPU times: user 10.9 s, sys: 1.22 s, total: 12.1 s
Wall time: 7.8 s
CPU times: user 97.6 ms, sys: 59.3 ms, total: 157 ms
Wall time: 195 ms
```
Dimensionality reduction is performed with `transform`, and the image of 0 is restored with `inverse_transform`.
```python
%time pca_result_0 = pca.inverse_transform(pca.transform(test_0))
%time kpca_result_0 = kpca.inverse_transform(kpca.transform(test_0))
%time tdpca_result_0 = tdpca.inverse_transform(tdpca.transform(test_0.reshape(1, *mnist_shape)))
```
The computation times, in the order PCA, Kernel PCA, 2DPCA, are as follows.
```
CPU times: user 1.17 ms, sys: 2.72 ms, total: 3.89 ms
Wall time: 4.26 ms
CPU times: user 14.3 ms, sys: 2.24 ms, total: 16.5 ms
Wall time: 11.5 ms
CPU times: user 90 µs, sys: 3 µs, total: 93 µs
Wall time: 103 µs
```
Restoration result
Dimensionality reduction and restoration were also performed for the image of 1.
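For reference, the calls for the image of 1 mirror those for the image of 0 (a sketch assuming the fitted `pca`, `kpca`, and `tdpca` objects and the `test_1` array defined above):

```python
pca_result_1 = pca.inverse_transform(pca.transform(test_1))
kpca_result_1 = kpca.inverse_transform(kpca.transform(test_1))
tdpca_result_1 = tdpca.inverse_transform(tdpca.transform(test_1.reshape(1, *mnist_shape)))
```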
The executed notebook is available on GitHub.
```python
from typing import Union

import numpy as np
from scipy.linalg import eigh
from sklearn.exceptions import NotFittedError


class TwoDimensionalPCA:
    def __init__(self, n_components: Union[int, None] = None):
        self.n_components_ = n_components
        self.components_ = None
        self.mean_ = None

    def fit(self, X: np.ndarray):
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, width)
        """
        if X.ndim != 3:
            raise ValueError(f"Expected 3D array, got {X.ndim}D array instead")
        self.mean_ = np.mean(X, axis=0)
        # G_t = (1/M) sum_i (A_i - A_bar)^T (A_i - A_bar)
        cov = np.mean([x.T @ x for x in X - self.mean_], axis=0)
        n_features = cov.shape[0]
        if self.n_components_ is None or self.n_components_ > n_features:
            self.n_components_ = n_features
        # eigh returns eigenvalues in ascending order; take the eigenvectors of the
        # top n_components eigenvalues and reverse them into descending order.
        self.components_ = eigh(cov, eigvals=(n_features - self.n_components_, n_features - 1))[1][:, ::-1]
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, width)

        Returns
        -------
        : np.ndarray, shape (n_samples, height, n_components)
        """
        if self.components_ is None:
            raise NotFittedError(
                "This PCA instance is not fitted yet. "
                "Call 'fit' with appropriate arguments before using this estimator.")
        if X.ndim != 3:
            raise ValueError(f"Expected 3D array, got {X.ndim}D array instead")
        return np.array([x @ self.components_ for x in X])

    def inverse_transform(self, X: np.ndarray) -> np.ndarray:
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, n_components)

        Returns
        -------
        : np.ndarray, shape (n_samples, height, width)
        """
        return np.array([x @ self.components_.T for x in X])

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, width)

        Returns
        -------
        : np.ndarray, shape (n_samples, height, n_components)
        """
        self.fit(X)
        return self.transform(X)
```
[^2]: Kenji Fukumizu, "Introduction to Kernel Methods: Data Analysis with Positive Definite Kernels", Asakura Shoten, 2010.