[Python] Comparing the Theory and Python Implementation of Principal Component Analysis (PCA, Kernel PCA, 2DPCA)

We compare the theory of principal component analysis (PCA) and its extensions (Kernel PCA, 2DPCA [^1]), well-known methods for dimensionality reduction and feature extraction, with their implementation in Python.

Principal component analysis theory

First principal component

Given $M$ data points $a_1, \ldots, a_M \in \mathbb{R}^n$, the first principal component $x_1 \in \mathbb{R}^n$ is the vector of length $1$ pointing in the direction that maximizes the variance of the centered data $a_i - \bar{a},\quad \bar{a} = \frac1M\sum_{i=1}^M a_i$. In other words, it can be interpreted as the optimal solution of the following maximization problem.

\begin{align}
\max_{x\in\mathbb{R}^n} &\quad \frac1M\sum_{i=1}^M\left((a_i - \bar{a})^\top x\right)^2\\
\text{s.t.} &\quad \|x\|_2 = 1
\end{align}

Here, writing the variance-covariance matrix of the data as $C := \frac1M \sum_{i=1}^M (a_i - \bar{a})(a_i - \bar{a})^\top$, the objective can be rewritten as

\begin{align}
\max_{x\in\mathbb{R}^n} \ \frac1M\sum_{i=1}^M\left((a_i - \bar{a})^\top x\right)^2 &= \max_{x\in\mathbb{R}^n} \ x^\top\left[\frac1M\sum_{i=1}^M(a_i - \bar{a})(a_i - \bar{a})^\top\right] x\\
&= \max_{x\in\mathbb{R}^n}\ x^\top C x
\end{align}

Therefore, the first principal component $x_1$ is given by

x_1 = \mathop{\mathrm{arg~max}}_{x\in\mathbb{R^n}}\left\{x^\top C x \mid \|x\|_2 = 1\right\}

and coincides with the eigenvector corresponding to the largest eigenvalue of the variance-covariance matrix $C$.
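As a quick sanity check on this result, the following is a minimal NumPy sketch (toy random data, illustrative variable names) that computes the first principal component directly from the eigendecomposition of the covariance matrix:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.3])  # M = 100 toy samples in R^3

A_centered = A - A.mean(axis=0)            # a_i - a_bar
C = A_centered.T @ A_centered / len(A)     # variance-covariance matrix C

eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
x1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue = first principal component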

Second principal component

After subtracting the direction of the first principal component from the centered data $a_i - \bar{a}$, the same calculation as for the first principal component is carried out. In other words, we deal with the problem

\begin{align}
\max_{x\in\mathbb{R}^n} &\quad \frac1M\sum_{i=1}^M\left(\left(a_i - \bar{a} - \left(x_1^\top (a_i - \bar{a})\right)x_1\right)^\top x\right)^2\\
\text{s.t.} &\quad \|x\|_2 = 1
\end{align}

Here,

\begin{align}
a_i - \bar{a} - \left(x_1^\top (a_i - \bar{a})\right)x_1 &= a_i - \bar{a} - \left(x_1x_1^\top\right)(a_i - \bar{a})\\
&= \left(I_n - x_1x_1^\top\right)(a_i - \bar{a})
\end{align}

so the data term can be rewritten using $P_1 := I_n - x_1x_1^\top$. Let the eigenvalues of the covariance matrix $C$ be $\lambda_1 \geq \cdots \geq \lambda_n$ and the corresponding eigenvectors be $v_1, \ldots, v_n$. From the orthonormality of the eigenvectors of a real symmetric matrix,

\begin{cases}
P_1v_i = 0 & (i = 1)\\
P_1v_i = v_i & (i \neq 1)
\end{cases}

holds. Also,

\begin{align}
\max_{x\in\mathbb{R}^n} \ \frac1M\sum_{i=1}^M\left(\left(a_i - \bar{a} - \left(x_1^\top (a_i - \bar{a})\right)x_1\right)^\top x\right)^2 &= \max_{x\in\mathbb{R}^n} \ \frac1M\sum_{i=1}^M\left((a_i - \bar{a})^\top P_1 x\right)^2\\
&= \max_{x\in\mathbb{R}^n}\ (P_1x)^\top C (P_1x)
\end{align}

Therefore, the second principal component $x_2$ is given by

x_2 = \mathop{\mathrm{arg~max}}_{x\in\mathbb{R^n}}\left\{(P_1x)^\top C (P_1x) \mid \|x\|_2 = 1\right\}

and coincides with the eigenvector corresponding to the second largest eigenvalue $\lambda_2$ of the covariance matrix $C$.

A similar argument shows that the $k$-th principal component $x_k$ coincides with the eigenvector corresponding to $\lambda_k$, so principal component analysis of the data $a_1, \ldots, a_M$ reduces to the eigenvalue problem of the variance-covariance matrix $C$.

Dimensionality reduction

Suppose $d$ principal components $x_1, \ldots, x_d$ have been obtained from the data $a_1, \ldots, a_M$, and consider reducing the dimension of a new data point $a \in \mathbb{R}^n$. Approximating $a$ by a linear combination of the $d$ principal components, with $X_d = [x_1, \ldots, x_d]$, we write

a \approx X_d y + \bar{a}

Then, from the orthonormality of the principal components $\left(X_d^\top X_d = I_d\right)$,

y = X_d^\top (a - \bar{a})

reduces $a$ to $d$ dimensions.
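As a minimal sketch of this reduction and the corresponding reconstruction (toy data, $d = 2$, illustrative names only):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))                     # M = 100 samples in R^5
a_bar = data.mean(axis=0)
C = (data - a_bar).T @ (data - a_bar) / len(data)    # variance-covariance matrix

d = 2
eigvals, eigvecs = np.linalg.eigh(C)                 # ascending order
X_d = eigvecs[:, ::-1][:, :d]                        # top-d principal components

a = rng.normal(size=5)                               # a new data point
y = X_d.T @ (a - a_bar)                              # reduce to d dimensions
a_approx = X_d @ y + a_bar                           # approximate reconstruction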

Kernel PCA theory [^2]

Principal component

Let $\phi$ be the map from the data space $\mathbb{R}^n$ to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$. Furthermore, define

\bar{\phi} = \frac1M \sum_{i=1}^M \phi(a_i)\\
\Phi = [\phi(a_1),\ldots, \phi(a_M)]\\
\tilde{\Phi} = \left[\phi(a_1) - \bar{\phi},\ldots, \phi(a_M) - \bar{\phi}\right]

Then, the objective function of PCA is

\begin{align}
\frac1M\sum_{i=1}^M\left(\left(\phi(a_i) - \bar{\phi}\right)^\top x\right)^2 &= \frac1M x^\top\left[\sum_{i=1}^M\left(\phi(a_i) - \bar{\phi}\right)\left(\phi(a_i) - \bar{\phi}\right)^\top\right] x\\
&= \frac1M x^\top\tilde\Phi\tilde\Phi^\top x
\end{align}

Hence, principal component analysis reduces to the eigenvalue problem of $\frac1M\tilde\Phi\tilde\Phi^\top$; we consider

\frac1M\tilde\Phi\tilde\Phi^\top x = \lambda x

Setting $z = \tilde\Phi^\top x$, we have

x = \frac1{M\lambda}\tilde\Phi z

Substituting this into the left-hand and right-hand sides of the eigenvalue equation, respectively, gives

\begin{align}
\frac1M\tilde\Phi\tilde\Phi^\top x &= \frac1M\tilde\Phi\tilde\Phi^\top\left(\frac1{M\lambda}\tilde\Phi z\right)\\
&= \frac1{M^2\lambda}\tilde\Phi\left(\tilde\Phi^\top\tilde\Phi\right)z\\
\lambda x &= \frac1M\tilde\Phi z
\end{align}

and hence

\begin{align}
\frac1{M^2\lambda}\tilde\Phi\left(\tilde\Phi^\top\tilde\Phi\right)z = \frac1M\tilde\Phi z \iff \tilde\Phi\left(\frac1M\tilde\Phi^\top\tilde\Phi\right)z = \tilde\Phi \lambda z
\end{align}

holds. Here, with $k(a_i, a_j) = \phi(a_i)^\top \phi(a_j)$, the entries of the matrix $\tilde\Phi^\top\tilde\Phi \in \mathbb{R}^{M\times M}$ are given by

\begin{align}
\left(\tilde\Phi^\top\tilde\Phi\right)_{i,j} &= (\phi(a_i) - \bar{\phi})^\top(\phi(a_j) - \bar{\phi})\\
&= k(a_i, a_j) - \frac1M\sum_{\ell=1}^Mk(a_i, a_\ell) - \frac1M\sum_{\ell=1}^Mk(a_\ell, a_j) + \frac1{M^2}\sum_{\ell, \ell'=1}^Mk(a_\ell, a_{\ell'})
\end{align}

Therefore, letting $\mathbf{1} \in \mathbb{R}^{M\times M}$ be the matrix whose entries are all $1$ and writing $K = \Phi^\top\Phi$, we can compute

\tilde\Phi^\top\tilde\Phi = K - \frac1M K\mathbf{1} - \frac1M\mathbf{1} K+ \frac1{M^2}\mathbf{1} K \mathbf{1}

From the above, the eigenvector $x$ corresponding to an eigenvalue $\lambda$ of $\frac1M\tilde\Phi\tilde\Phi^\top$ is obtained, using the eigenvector $z \in \mathbb{R}^M$ corresponding to the eigenvalue $\lambda$ of the $M$-dimensional square matrix $\frac1M\tilde\Phi^\top\tilde\Phi$, as

x = \frac1{M\lambda}\tilde\Phi z


Dimensionality reduction

Suppose $d$ principal components $x_1, \ldots, x_d$ have been obtained from the data $a_1, \ldots, a_M$, and consider reducing the dimension of a new data point $a \in \mathbb{R}^n$. Approximating $\phi(a)$ by a linear combination of the $d$ principal components, with $X_d = [x_1, \ldots, x_d]$, we write

\phi(a) \approx X_d y + \bar{\phi}

Then, from the orthonormality of the principal components $\left(X_d^\top X_d = I_d\right)$,

y = X_d^\top \left(\phi(a) - \bar{\phi}\right)

reduces $\phi(a)$ to $d$ dimensions.

(Supplement) For the principal components to be orthonormal, that is,

\left<x_i, x_j\right>_{\mathcal H} = \delta_{i, j}

each $z_i$ must be normalized so that

\left<x_i, x_i\right>_{\mathcal H} = \frac1{M\lambda_i^2}z_i^\top \left(\frac1M\tilde\Phi^\top\tilde\Phi \right)z_i = \frac{1}{M\lambda_i}\|z_i\|_2^2 = 1

is satisfied, i.e. $\|z_i\|_2 = \sqrt{M\lambda_i}$.
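The derivation can be traced numerically with the short sketch below. It assumes toy random data and an RBF kernel with an arbitrary `gamma`, and it only computes the coordinates of the training samples (for a training point $a_j$ the coordinate along $x_i$ reduces to the $j$-th entry of $z_i$):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))      # M = 100 toy samples
M, d = A.shape[0], 2

K = rbf_kernel(A)                  # K = Phi^T Phi
one = np.ones((M, M))
# centering: K_tilde = K - (1/M) K 1 - (1/M) 1 K + (1/M^2) 1 K 1
K_tilde = K - K @ one / M - one @ K / M + one @ K @ one / M ** 2

# eigenvalue problem of (1/M) Phi_tilde^T Phi_tilde
eigvals, eigvecs = np.linalg.eigh(K_tilde / M)
lam = eigvals[::-1][:d]            # top-d eigenvalues lambda_i
z = eigvecs[:, ::-1][:, :d]        # corresponding unit-norm eigenvectors

# rescale so that ||z_i||_2^2 = M * lambda_i, i.e. <x_i, x_i>_H = 1
z = z * np.sqrt(M * lam)

Y_train = z                        # coordinates of the training samples on x_1, ..., x_d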

2DPCA theory [^1]

Given two-dimensional data $A_1, \ldots, A_M \in \mathbb{R}^{m\times n}$ such as images, PCA flattens each sample into a one-dimensional vector and works in $\mathbb{R}^{mn}$, whereas Two-Dimensional PCA (2DPCA) handles the data directly in its two-dimensional form.

Principal component

Consider mapping a random matrix $A \in \mathbb{R}^{m\times n}$ to $m$ dimensions by an $n$-dimensional vector $x$, that is, obtaining a feature vector $y \in \mathbb{R}^m$ via

y = A x \tag{1}

As the criterion for choosing $x$, consider the variance-covariance matrix of $y$:

\begin{align}
S_x &= \mathbb{E}\left[\left(y - \mathbb{E}[y]\right)\left(y - \mathbb{E}[y]\right)^\top\right]\\
&= \mathbb{E}\left[\left(Ax - \mathbb{E}[Ax]\right)\left(Ax - \mathbb{E}[Ax]\right)^\top\right]\\
&= \mathbb{E}\left[\left(A - \mathbb{E}[A]\right)x(\left(A - \mathbb{E}[A]\right)x)^\top\right]\\
\end{align}

We choose $x$ to maximize the trace $\mathrm{tr}\,S_x$, that is, the total variance of $y$:

\begin{align}
\mathrm{tr}\,S_x &= \mathrm{tr}\,\mathbb{E}\left[\left(A - \mathbb{E}[A]\right)x(\left(A - \mathbb{E}[A]\right)x)^\top\right]\\
&= \mathrm{tr}\,\mathbb{E}\left[x^\top\left(A - \mathbb{E}[A]\right)^\top\left(A - \mathbb{E}[A]\right)x\right]\\
&= x^\top\mathbb{E}\left[\left(A - \mathbb{E}[A]\right)^\top\left(A - \mathbb{E}[A]\right)\right]x
\end{align}

Therefore, when the data $A_1, \ldots, A_M \in \mathbb{R}^{m\times n}$ are observed, define

\bar{A} = \frac1M\sum_{i=1}^MA_i\\
G_t = \frac1M\sum_{i=1}^M\left(A_i - \bar{A}\right)^\top\left(A_i - \bar{A}\right)

Then $\mathrm{tr}\,S_x$ is estimated by

\mathrm{tr}\,S_x = x^\top G_t x

At this time, the first principal component is

x_1 = \mathop{\mathrm{arg~max}}_{x\in\mathbb{R}^n}\left\{x^\top G_t x \mid \|x\|_2 = 1\right\}

and the problem reduces to finding the eigenvector of $G_t$ corresponding to its largest eigenvalue (subsequent components are the following eigenvectors of $G_t$).

Dimensionality reduction

Suppose $d$ eigenvectors $x_1, \ldots, x_d$ of $G_t$ have been obtained from the data $A_1, \ldots, A_M$, and consider reducing the dimension of new data $A \in \mathbb{R}^{m\times n}$. With $X_d = [x_1, \ldots, x_d] \in \mathbb{R}^{n\times d}$, equation (1) gives

Y_d = AX_d

which reduces the data to $m \times d$ dimensions. Moreover,

A \approx Y_dX_d^\top

gives an approximate restoration of $A$.
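This procedure mirrors the `TwoDimensionalPCA` class listed at the end of the article; a minimal sketch on toy data (illustrative names) looks as follows:

import numpy as np

rng = np.random.default_rng(0)
images = rng.normal(size=(50, 28, 28))    # M = 50 toy "images", m = n = 28
d = 5

A_bar = images.mean(axis=0)
# G_t = (1/M) sum_i (A_i - A_bar)^T (A_i - A_bar)
G_t = np.mean([(A - A_bar).T @ (A - A_bar) for A in images], axis=0)

eigvals, eigvecs = np.linalg.eigh(G_t)    # ascending order
X_d = eigvecs[:, ::-1][:, :d]             # top-d eigenvectors, shape (n, d)

A_new = images[0]
Y_d = A_new @ X_d                         # compressed representation, shape (m, d)
A_restored = Y_d @ X_d.T                  # approximate restoration, shape (m, n)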

PCA and 2DPCA comparison table

| | PCA | 2DPCA |
|:--|:--|:--|
| Input dimension | $\mathbb{R}^{mn}$ | $\mathbb{R}^{m\times n}$ |
| Compressed dimension | $\mathbb{R}^{d}$ | $\mathbb{R}^{m\times d}$ |
| Dimension of the covariance matrix | $\mathbb{R}^{mn\times mn}$ | $\mathbb{R}^{n\times n}$ |
| Dimension of $X_d$ | $\mathbb{R}^{mn\times d}$ | $\mathbb{R}^{n\times d}$ |
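For example, with MNIST images ($m = n = 28$, so $mn = 784$) and $d = 5$, PCA works with a $784 \times 784$ covariance matrix and a $784 \times 5$ projection matrix $X_d$, whereas 2DPCA only needs a $28 \times 28$ covariance matrix and a $28 \times 5$ matrix $X_d$; in exchange, each compressed sample holds $28 \times 5$ values instead of $5$.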

Execution example in Python

For PCA and [Kernel PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html) we used the scikit-learn implementations, while 2DPCA uses our own implementation. The implemented code is given at the end.

Data set preparation

As an execution example, we used MNIST, a dataset of 28x28 handwritten digit images. As shown below, 6902 images of the digit 0 were used for training, and one image of 0 and one image of 1 were used for verification.

from sklearn import datasets
mnist = datasets.fetch_openml('mnist_784', data_home='../../data/')
mnist_data = mnist.data / 255.
mnist_0 = mnist_data[mnist.target == '0']

mnist_shape = (28, 28)
train = mnist_0[:-1]
test_0 = mnist_0[-1].reshape(1, -1)
test_1 = mnist_data[mnist.target == '1'][-1].reshape(1, -1)

PCA learning

Five eigenvectors were learned from the variance-covariance matrix. Note that `KernelPCA` cannot restore data from the reduced dimensions unless `fit_inverse_transform=True` is specified.

from sklearn.decomposition import PCA, KernelPCA

n_components = 5
%time pca = PCA(n_components=n_components).fit(train)
%time kpca = KernelPCA(n_components=n_components, fit_inverse_transform=True).fit(train)
%time tdpca = TwoDimensionalPCA(n_components=n_components).fit(train.reshape(-1, *mnist_shape))

The training times are as follows, in the order PCA, Kernel PCA, 2DPCA. Kernel PCA is clearly the heaviest.

CPU times: user 545 ms, sys: 93.9 ms, total: 639 ms
Wall time: 310 ms
CPU times: user 10.9 s, sys: 1.22 s, total: 12.1 s
Wall time: 7.8 s
CPU times: user 97.6 ms, sys: 59.3 ms, total: 157 ms
Wall time: 195 ms

Dimensionality reduction and image restoration with PCA

Dimensionality reduction was performed with `transform`, and the image of 0 was restored with `inverse_transform`.

%time pca_result_0 = pca.inverse_transform(pca.transform(test_0))
%time kpca_result_0 = kpca.inverse_transform(kpca.transform(test_0))
%time tdpca_result_0 = tdpca.inverse_transform(tdpca.transform(test_0.reshape(1, *mnist_shape)))

The computation times are as follows, in the order PCA, Kernel PCA, 2DPCA.

CPU times: user 1.17 ms, sys: 2.72 ms, total: 3.89 ms
Wall time: 4.26 ms
CPU times: user 14.3 ms, sys: 2.24 ms, total: 16.5 ms
Wall time: 11.5 ms
CPU times: user 90 µs, sys: 3 µs, total: 93 µs
Wall time: 103 µs

Restoration results (figure omitted).

Dimensionality reduction and restoration were also performed for the image of 1 (figure omitted).

2DPCA source code

The executed notebook is available on GitHub.

from typing import Union

import numpy as np
from scipy.linalg import eigh
from sklearn.exceptions import NotFittedError


class TwoDimensionalPCA:
    def __init__(self, n_components: Union[int, None] = None):
        self.n_components_ = n_components

        self.components_ = None
        self.mean_ = None

    def fit(self, X: np.ndarray):
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, width)
        """
        if X.ndim != 3:
            raise ValueError(f"Expected 3D array, got {X.ndim}D array instead")

        self.mean_ = np.mean(X, axis=0)
        # image covariance matrix G_t = (1/M) sum_i (A_i - mean)^T (A_i - mean)
        cov = np.mean([x.T @ x for x in X - self.mean_], axis=0)
        n_features = cov.shape[0]

        if self.n_components_ is None or self.n_components_ > n_features:
            self.n_components_ = n_features
        # eigh returns eigenvalues in ascending order; select the top n_components
        # eigenvectors and reverse the columns so they are sorted by descending eigenvalue
        # (newer SciPy versions use subset_by_index instead of eigvals)
        self.components_ = eigh(cov, eigvals=(n_features - self.n_components_, n_features - 1))[1][:, ::-1]
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, width)

        Returns
        -------
        : np.ndarray, shape (n_samples, height, n_components)
        """
        if self.components_ is None:
            raise NotFittedError(
                "This PCA instance is not fitted yet. "
                "Call 'fit' with appropriate arguments before using this estimator.")

        if X.ndim != 3:
            raise ValueError(f"Expected 3D array, got {X.ndim}D array instead")

        return np.array([x @ self.components_ for x in X])

    def inverse_transform(self, X: np.ndarray) -> np.ndarray:
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, n_components)

        Returns
        -------
        : np.ndarray, shape (n_samples, height, width)
        """
        return np.array([x @ self.components_.T for x in X])

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """
        Parameters
        ----------
        X : np.ndarray, shape (n_samples, height, width)

        Returns
        -------
        : np.ndarray, shape (n_samples, height, n_components)
        """
        self.fit(X)
        return self.transform(X)

[^2]: Kenji Fukumizu, "Introduction to Kernel Methods: Data Analysis with Positive Definite Kernels", Asakura Shoten, 2010.
