Kernel Method with Python

Introduction

The other day, at a seminar I read up to Chapter 3 of Professor Kenji Fukumizu's introduction to the kernel method. Since my seminar material was handwritten, I would like to reorganize it here and give a brief introduction to the kernel method, and finally implement kernel principal component analysis in Python (setting the detailed theory aside). My assigned range was the positive definite kernel in the first half and the final implementation part, so there are some major omissions in the middle, but I do not think that is a big problem.

Note
・ The emphasis is on intuition rather than rigorous theory.
・ The proposition numbers are the same as in Fukumizu-sensei's book.

Prerequisite knowledge
・ Basic statistics
・ Linear algebra
・ Functional analysis (Hilbert space theory)

Table of contents

・ [Overview of the kernel method](#overview-of-the-kernel-method)
・ [Positive definite kernels](#positive-definite-kernels)
・ [Reproducing kernel Hilbert space](#reproducing-kernel-hilbert-space)
・ [Kernel principal component analysis (kernel PCA)](#kernel-principal-component-analysis)
・ Kernel canonical correlation analysis (kernel CCA) ← will be added if I feel like it.

(Many people may wonder why support vector machines are not covered, but the seminar simply did not get that far. I will probably read that part on my own if I find the time.)

Overview of the kernel method

Take principal component analysis as an example. Given $N$ data points $X_1, ..., X_N \in \chi = \mathbb{R}^m$ of dimension $m$, we project them onto a unit vector $u$ and look for the $u$ for which the variance of $u^{\mathsf{T}}X$ is the $p$-th largest. This is called the $p$-th principal component, and by taking fewer than $m$ such axes we can perform dimensionality reduction and noise removal. Written as a formula, the quantity

\frac{1}{N}\sum_{i=1}^{N}\left\{u^{\mathsf{T}}X_{i}-\frac{1}{N}\left(\sum_{j=1}^{N}u^{\mathsf{T}}X_{j}\right)\right\}^{2}

is made the $p$-th largest while $u$ is varied subject to $\|u\|=1$. However, since this involves only linear computations, nonlinear features cannot be extracted. Therefore, the data $X_i$ are sent to another space by a nonlinear function so that nonlinear computations can be performed.

The core of the above calculation is the inner product of $u$ and $X$. In the example above we were working in $\mathbb{R}^m$, so the ordinary inner product $\langle u, X \rangle_{\mathbb{R}^m} = u^{\mathsf{T}} X$ was the natural choice. The same calculation still works when the data $X_i$ are sent to another space $H$ by a nonlinear map $\phi: \chi \rightarrow H$, as long as an inner product is defined on that space. In that case, instead of the inner product $\langle u, X \rangle_{\mathbb{R}^m}$, we take $f \in H$ and perform principal component analysis using $\langle f, \phi(X) \rangle_H$.

In other words, the kernel method consists of jumping to another space via the map $\phi$ and reconsidering the original problem there. A conventional approach to nonlinear computation was to expand the data into powers, $(X, Y, Z, ...) \rightarrow (X^2, Y^2, Z^2, XY, ...)$. However, as the dimension of the data increased over the years, the amount of computation became enormous. The kernel method instead uses a kernel function that efficiently computes the inner product in the space reached by the feature map $\phi$, enabling nonlinear operations at a computational cost not much different from the ordinary problem.
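As a minimal illustration of this point (my own sketch, not from the book): for the degree-2 polynomial kernel $k(x, y) = (x^{\mathsf{T}} y)^2$, evaluating $k$ directly in $\mathbb{R}^m$ gives the same value as explicitly building all $m^2$ degree-2 monomial features and taking their inner product, without ever forming the $m^2$-dimensional vectors.

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

#Explicit degree-2 feature map: all products x_i * x_j (m^2 features)
def phi(v): return np.outer(v, v).ravel()

lhs = phi(x) @ phi(y)   #inner product in the m^2-dimensional feature space
rhs = (x @ y) ** 2      #polynomial kernel evaluated directly in R^m
print(np.isclose(lhs, rhs))   #True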

From the above, the immediate goal is to define the feature map $\phi$ and the space $H$ in a suitable way.

Positive definite kernels

Let the data space be $\chi$ and define the feature map as $\phi: \chi \rightarrow H$, where $H$ is an inner product space with inner product $\langle \cdot, \cdot \rangle_H$.

It would be convenient if this inner product could be computed easily through a function $k(x, y) = \langle \phi(x), \phi(y) \rangle_H$. In that case everything can be expressed in terms of $k$ alone, without an explicit representation of $\phi$ and $H$ (these will appear in the concrete examples below), and the computational cost does not depend on $\phi$.

Here we define the positive definite kernel and look at the properties it satisfies.

Def positive definite kernel

A function $k: \chi \times \chi \rightarrow \mathbb{R}$ is called a positive definite kernel on $\chi$ when it satisfies the following two properties.

-Symmetry: $k(x, y) = k(y, x)$ for any $x, y \in \chi$
-Positive definiteness: $\sum_{i,j=1}^{n} c_i c_j k(x_i, x_j) \geq 0$ for any $n \in \mathbb{N}$, $x_1, ..., x_n \in \chi$, $c_1, ..., c_n \in \mathbb{R}$

Assuming symmetry, positive definiteness coincides with positive semi-definiteness of the $n \times n$ matrix whose $(i,j)$ component is $k(x_i, x_j)$. It is nevertheless customary to call kernels satisfying the above definition positive definite, so that convention is followed here.

Prop2.1 The following properties hold for a positive definite kernel:
(1) $k(x,x)\geq0 \quad(\forall x \in \chi)$
(2) $|k(x,y)|^2 \leq k(x,x)\,k(y,y) \quad(\forall x,y \in \chi)$
(3) For a subset $\Upsilon$ of $\chi$, the restriction of $k$ to $\Upsilon \times \Upsilon$ is a positive definite kernel on $\Upsilon$.

proof (2) The matrix

A = \left(\begin{matrix}
 k(x,x) & k(x,y)\\
 k(x,y) & k(y,y)
\end{matrix}\right)

is positive semi-definite, so all of its eigenvalues are real and non-negative, and hence $\det(A) \geq 0$ (the determinant is the product of the eigenvalues). Expanding the determinant, $k(x,x)k(y,y) - k(x,y)^2 \geq 0$, which is exactly (2). Alternatively, the same inequality can be obtained by substituting suitable coefficients directly into the defining inequality and rearranging.

Prop2.2 Positive definite kernels are closed under the following operations. Let $k_i$, $k_j$ be positive definite kernels:
(1) $ak_i + bk_j$ for $a, b \in \mathbb{R}_{\geq0}$
(2) $k_ik_j$
(3) $\lim_{n \to \infty}k_n$ (the pointwise limit of a sequence of positive definite kernels, when it exists)

proof (2) It suffices to show that the $n \times n$ matrix whose components are those of $k_ik_j$ is positive semi-definite. This matrix is the Hadamard product of the two Gram matrices of $k_i$ and $k_j$, so diagonalize one of them and substitute into the defining inequality of positive definiteness.
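A quick numerical sanity check of (2) (my own addition, using only NumPy): the elementwise (Hadamard) product of two Gram matrices has no negative eigenvalues, up to rounding error.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = rng.normal(size=(3, 5))

K1 = X @ X.T               #Gram matrix of the linear kernel k(x,y) = x^T y
K2 = (X @ W) @ (X @ W).T   #Gram matrix of k(x,y) = <W^T x, W^T y> (cf. Prop2.6)
print(np.linalg.eigvalsh(K1 * K2).min() >= -1e-8)   #True up to rounding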

Prop2.6 Let $V$ be an inner product space with inner product $\langle \cdot, \cdot \rangle_V$. Given a map $\phi: \chi \rightarrow V,\ x \mapsto \phi(x)$, the function $k(x, y) = \langle \phi(x), \phi(y) \rangle_V$ is a positive definite kernel.

proof Symmetry is immediate, and positive definiteness follows from $\sum_{i,j=1}^{n} c_i c_j \langle \phi(x_i), \phi(x_j) \rangle_V = \left\| \sum_{i=1}^{n} c_i \phi(x_i) \right\|_V^2 \geq 0$.

These are exactly the properties we wanted the feature map $\phi$ to provide. If $V$ is chosen appropriately and such a map is prepared, the inner product can be computed easily via the positive definite kernel $k$. In fact, as shown next, once $k$ is defined the corresponding space is determined uniquely, and as mentioned above the explicit form of $\phi$ is not needed in the actual computation. Therefore, it suffices to define the positive definite kernel $k$.

Reproducing kernel Hilbert space

Here we confirm that an inner product space $H$ with a certain property is in one-to-one correspondence with a positive definite kernel $k$.

Def reproducing kernel Hilbert space

A complete inner product space $H$ is called a Hilbert space. A Hilbert space $H$ consisting of functions on a set $\chi$ that satisfies the following property is called a reproducing kernel Hilbert space.
-Reproducing property: $\forall x \in \chi,\ \exists k_x \in H \ \mathrm{s.t.}\ \langle f, k_x \rangle_H = f(x) \ (\forall f \in H)$

For the $k_x \in H$ defined above, the kernel $k$ determined by $k(y, x) = k_x(y)$ is called the reproducing kernel of $H$.

prop2.7 The reproducing kernel $k$ of a reproducing kernel Hilbert space $H$ on $\chi$ is a positive definite kernel on $\chi$, and the reproducing kernel of $H$ is unique.

proof Since $k(x, y) = k_y(x) = \langle k_y, k_x \rangle_H = \langle k(\cdot, y), k(\cdot, x) \rangle_H$, we get $\sum_{i,j=1}^{n} c_i c_j k(x_i, x_j) = \left\| \sum_{i=1}^{n} c_i k(\cdot, x_i) \right\|_H^2 \geq 0$. For uniqueness, suppose $k$ and $\overline{k}$ are both reproducing kernels of $H$; using the reproducing property and symmetry one shows $k = \overline{k}$.

prop2.8 $\|k(\cdot,x)\|_H=\sqrt{k(x,x)}$

proof Square the left-hand side and compute it using the reproducing property.

prop2.11 For a positive definite kernel $k$ on a set $\chi$, there exists a unique reproducing kernel Hilbert space $H$ on $\chi$ satisfying the following three properties.
(1) $\forall x \in \chi,\ k(\cdot, x) \in H$
(2) The subspace spanned by $\{k(\cdot, x) : x \in \chi\}$ (the linear span) is dense in $H$.
(3) $k$ is the reproducing kernel of $H$.

Combining prop2.7 and prop2.11, we see that positive definite kernels and reproducing kernel Hilbert spaces are in one-to-one correspondence.

Now, if the feature map $\phi: \chi \rightarrow H$ is defined by $x \mapsto k(\cdot, x)$, then using the reproducing property,

\begin{align}
&\begin{split}
\langle\phi(x),\phi(y)\rangle_H &= \langle k(\cdot,x), k(\cdot,y)\rangle_H \\
&= \langle k_x, k_y \rangle_H \\
&= k_x(y) \\
&= k(y,x) \\
&= k(x,y)
 \end{split}\\
\end{align}

so the inner product we wanted to compute can be obtained from the positive definite kernel $k$ alone.

Here are some commonly used examples of positive definite kernels.

・ Linear kernel (the ordinary Euclidean inner product)

k_0(x,y) = x^\mathsf{T} y

・ Exponential kernel

k_E(x,y)=\exp(\beta x^\mathsf{T} y) \ \ (\beta>0)

・ Gaussian kernel

k_G(x,y) =\exp \left(-\frac{\|x-y\|^2}{2 \sigma^2} \right)
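For reference, these three kernels could be written as plain Python functions along the following lines (my own sketch; the parameter defaults are arbitrary):

import numpy as np

def linear_kernel(x, y):
    return x @ y

def exponential_kernel(x, y, beta=1.0):
    return np.exp(beta * (x @ y))

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))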

Kernel principal component analysis

Having briefly introduced the theory, here is an application example. Ordinary principal component analysis was the problem of finding the $u$ with $\|u\|=1$ that makes the variance of the projected data large:

\frac{1}{N}\sum_{i=1}^{N}\left\{u^{\mathsf{T}}X_{i}-\frac{1}{N}\left(\sum_{j=1}^{N}u^{\mathsf{T}}X_{j}\right)\right\}^{2}

In kernel principal component analysis, the data $X_i$ are sent into $H$ by the feature map $\phi:\chi \rightarrow H$, and the computation is carried out with the inner product $\langle \cdot, \cdot \rangle_H$ of $H$, which makes it possible to capture nonlinear features. In other words, we look for the $f$ (with $\|f\|_H=1$) that maximizes the following expression.

\frac{1}{N}\sum_{i=1}^{N}\left\{\langle f,\phi\left(X_{i}\right)\rangle_{H}\ -\ \frac{1}{N}\left(\sum_{j=1}^{N}\left\langle f,\ \phi\left(X_{j}\right)\right\rangle_H \right)\right\}^{2}

Centering with $\tilde{\phi}(X_i) = \phi(X_i) - \frac{1}{N}\sum_{j=1}^{N}\phi(X_j)$, this becomes

\frac{1}{N}\sum_{i=1}^{N}\left(\langle f,\ \tilde{\phi}\left(X_{i}\right)\rangle_{H} \right)^{2}

so we should look for the $f$ that maximizes this quantity. Since $\|f\|_{H}=1$ and only the component of $f$ lying in the span of $\tilde{\phi}(X_{1}),...,\tilde{\phi}(X_{N})$ contributes to the objective, $f$ can be taken from the subspace of $H$

Span\left\{\tilde{\phi}\left(X_{1}\right),...,\tilde{\phi}\left(X_{N}\right)\right\} \subset H

and regarded as an element of this subspace. Therefore, it can be expressed as

f = \sum_{i=1}^{N}\alpha_{i}\tilde{\phi}\left(X_{i}\right)

In other words, the inner product $\langle f, \tilde{\phi}(X_{i}) \rangle_H$ can be written as a linear combination of the $\langle \tilde{\phi}(X_i), \tilde{\phi}(X_j) \rangle_H$; unwinding the centering of $\tilde{\phi}$, these in turn are linear combinations of $k(X_i, X_j) = \langle \phi(X_i), \phi(X_j) \rangle_H$, with coefficients expressible in terms of $\alpha$. Therefore, once $\alpha$ is found, $f$ is determined as an element of $H$ and the maximization can be carried out.

Here for simplicity

\tilde{k}(X_i,X_j)= \langle \tilde{\phi}(X_i), \tilde{\phi}(X_j) \rangle_H

and let $\tilde{K}$ be the matrix whose $(i,j)$ component is $\tilde{k}(X_i, X_j)$. This is called the centered Gram matrix.
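It is worth noting here (a standard identity, and the form used in the implementation below) that if $K$ denotes the uncentered Gram matrix with components $k(X_i, X_j)$ and $\mathbf{1}$ the all-ones vector in $\mathbb{R}^N$, then

\tilde{K} = \left(I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^\mathsf{T}\right) K \left(I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^\mathsf{T}\right)

so $\tilde{K}$ can be computed directly from $K$ without ever forming $\tilde{\phi}$.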

The quantity to be maximized can be transformed as follows, which reduces the problem to a generalized eigenvalue problem.

\begin{align}
&\begin{split}

\frac{1}{N}\sum_{i=1}^{N}\left(\langle f,\ \tilde{\phi}\left(X_{i}\right)\rangle_{H} \right)^{2} 

&= \frac{1}{N}\sum_{i=1}^{N}\left\{\sum_{j=1}^{N}\alpha_j\tilde{k}(X_j,X_i)\right\}^{2}\\
&=\frac{1}{N}\sum_{i=1}^{N}\left\{\sum_{j=1}^{N}\sum_{k=1}^{N}\alpha_{j}\alpha_{k}\tilde{k}(X_j,X_i)\tilde{k}(X_k,X_i)\right\} \\
&= \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{N}\left\{\alpha_{j}\alpha_{k}\sum_{i=1}^{N}\tilde{k}(X_j,X_i)\tilde{k}(X_i,X_k)\right\}\\
&= \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{N}\alpha_{j}\alpha_{k}\left(\tilde{K}^2\right)_{jk}\\
&=\frac{1}{N}\alpha^\mathsf{T} \tilde{K}^2 \alpha\\
\end{split}\\

&\begin{split}
||f||_H^2 &= \langle f, f \rangle \\
&= \langle \sum_{i=1}^{N}\alpha_{i}\tilde{\phi}\left(X_{i}\right),  \sum_{j=1}^{N}\alpha_{j}\tilde{\phi}\left(X_{j}\right) \rangle \\
&= \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i \alpha_j \langle \tilde{\phi}(X_i), \tilde{\phi}(X_j) \rangle \\
&= \alpha^\mathsf{T} \tilde{K} \alpha
\end{split}\\ 
\end{align}

From the above, the maximization problem is

\newcommand{\argmax}{\mathop{\rm arg~max}\limits}

\argmax_{\alpha \in \mathbb{R}^N} \, \alpha^\mathsf{T} \tilde{K}^2 \alpha \ \ s.t. \ \alpha^\mathsf{T} \tilde{K} \alpha = 1

Since $\tilde{K}$ is a positive semi-definite matrix, its eigenvalue decomposition (diagonalization) can be written, using the unitary matrix $U$ whose columns are the $N$ eigenvectors $u^1, ..., u^N$ and the diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_N)$ with $\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_N \geq 0$, as

\tilde{K} = U \Lambda U^{*}

Since $\tilde{K}^2 = U \Lambda^2 U^{*}$, we have

\alpha^\mathsf{T} \tilde{K}^2 \alpha = \sum_{i=1}^{N} (\alpha^\mathsf{T} u^i)^2 \lambda_i^2

Therefore, given $\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_N \geq 0$, the objective attains its $p$-th largest value when $\alpha \parallel u^p$. The length of $\alpha$ is then adjusted so that $\alpha^\mathsf{T} \tilde{K} \alpha = 1$: putting $\alpha = c_p u^p$ and noting that the $u^i$ are the orthonormal columns of the unitary matrix,

\begin{align}
&\begin{split}
\alpha^\mathsf{T} \tilde{K} \alpha &= \sum_{i=1}^{N} (c_p u^{p^{\mathsf{T}}} u^i)^2 \lambda_i \\
&= c_p^2 \lambda_p
\end{split}\\ 
\end{align}

so $c_p = \frac{1}{\sqrt{\lambda_p}}$. Therefore, the $f$ (the $p$-th principal axis) for which the variance is the $p$-th largest is

f^p = \sum_{i=1}^{N}\frac{1}{\sqrt{\lambda_p}} u_i^p \tilde{\phi}(X_i)

and the $p$-th principal component of the data point $X_i$ is

\begin{align}
&\begin{split}
\langle \tilde{\phi}(X_i), f^p \rangle_H &= \sum_{j=1}^N \frac{1}{\sqrt{\lambda_p}} u_j^p  \langle \tilde{\phi}(X_i), \tilde{\phi}(X_j) \rangle_H \\
&= \sum_{j=1}^N \frac{1}{\sqrt{\lambda_p}} u_j^p  \tilde{K}_{ij} \\
&= \sum_{j=1}^N \frac{1}{\sqrt{\lambda_p}}  u_j^p \sum_{k=1}^N \lambda_k u_i^k u_j^k \\
&= \frac{1}{\sqrt{\lambda_p}} \sum_{k=1}^N \lambda_k u_i^k  \sum_{j=1}^N  u_j^p  u_j^k \\
&= \frac{1}{\sqrt{\lambda_p}} \sum_{k=1}^N \lambda_k u_i^k \delta(p,k) \\
&= \sqrt{\lambda_p} u_i^p
\end{split}\\ 
\end{align}

In other words, it suffices to compute the eigenvalues and the unitary matrix in the eigendecomposition of the matrix $\tilde{K}$ whose $(i,j)$ component is $\tilde{k}(X_i, X_j)$.

Let us implement the above in Python. For data we use the wine dataset (http://archive.ics.uci.edu/ml/datasets/Wine), and for the kernel function we use the Gaussian kernel.

import numpy as np
import matplotlib.pyplot as plt

#Data reading
wine = np.loadtxt("wine.csv", delimiter=",")
N = wine.shape[0] #178
labels = np.copy(wine[:,0]).flatten()
wine = wine[:, 1:]

#Standardization
mean = wine.mean(axis=0)
wine -= mean
std = wine.std(axis=0)
wine /= std

#Gaussian kernel (σ is a hyperparameter)
def kernel(x, y, σ=4):
    return np.exp(-((x - y)**2).sum() / (2 * σ**2))

#Gram matrix K and centered Gram matrix K_tilde
K = np.zeros(shape=(N,N))
for i,j in np.ndindex(N,N):
    K[i,j] = kernel(wine[i,:], wine[j,:])

Q = np.identity(N) - np.full(shape=(N,N), fill_value=1/N) 
K_tilde = Q @ K @ Q

#Eigendecomposition (K_tilde is symmetric, so eigh gives real eigenvalues/eigenvectors)
λs, us = np.linalg.eigh(K_tilde)

#Sort in descending order of eigenvalues
λ_index = np.argsort(λs)
D = np.zeros(shape=(N))
U = np.zeros(shape=(N,N))
for i, index in enumerate(λ_index[::-1]):
    D[i] = λs[index]
    U[:,i] = us[:,index]

#First and second principal components of each data point
X_1 = np.sqrt(D[0]) * U[:,0]
X_2 = np.sqrt(D[1]) * U[:,1]


#Creating a graph
clusters = dict([(i,[[],[]]) for i in range(1,4)])

for i,label in enumerate(labels): 
    clusters[label][0].append(X_1[i])
    clusters[label][1].append(X_2[i])

for label,marker in zip(range(1,4), ["+","s","o"]):
    plt.scatter(clusters[label][0], clusters[label][1], marker=marker)
plt.show()

[Figure_1.png: scatter plot of the first two kernel principal components, with each wine class drawn with a different marker]

You can see that the classes are separated reasonably well. Incidentally, the shape of the plot changes considerably depending on how the kernel function is chosen. Even with the same Gaussian kernel it changes with the hyperparameter setting, so something like a grid search can be done, as sketched below.
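For example, a minimal sketch of such a sweep over σ (my own addition; it reuses the standardized `wine` array and `labels` from the script above, and the σ values are arbitrary):

import numpy as np
import matplotlib.pyplot as plt

def kernel_pca_2d(X, σ):
    #First two kernel principal components with a Gaussian kernel of width σ
    N = X.shape[0]
    sq_dists = ((X[:, None, :] - X[None, :, :])**2).sum(axis=2)
    K = np.exp(-sq_dists / (2 * σ**2))
    Q = np.identity(N) - np.full((N, N), 1/N)
    K_tilde = Q @ K @ Q
    λs, us = np.linalg.eigh(K_tilde)   #ascending order of eigenvalues
    return np.sqrt(λs[-1]) * us[:, -1], np.sqrt(λs[-2]) * us[:, -2]

for σ in [1, 2, 4, 8]:
    X_1, X_2 = kernel_pca_2d(wine, σ)
    plt.figure()
    for label, marker in zip(range(1, 4), ["+", "s", "o"]):
        mask = labels == label
        plt.scatter(X_1[mask], X_2[mask], marker=marker)
    plt.title(f"σ = {σ}")
plt.show()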

References

Kenji Fukumizu, Introduction to Kernel Methods: Data Analysis with Positive Definite Kernels, Asakura Shoten, 2010
