Principal component analysis (PCA) is one of the most powerful methods for summarizing data (representing the original data with a smaller amount of data).
As an example, we use principal component analysis to compress the mathematics and national language scores (two dimensions) of 10 students to one dimension. The figure shows the idea.

Rather than explaining each student's scores with the mathematics score alone (one dimension), as in the figure on the right, we can explain them with a smaller error by preparing one new axis and creating new one-dimensional data along it, as in the figure on the left.
The figure on the left illustrates data compression using principal component analysis. Principal component analysis creates the axis that explains all the data most efficiently (the axis of the first principal component), and then the axis that most efficiently explains what the first axis cannot explain (the axis of the second principal component).
Because the first principal component expresses the original data well, the data can be compressed efficiently by discarding (not using) the information of the second principal component.
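The claim that one new axis explains the scores with a smaller error than a single original axis can be checked numerically. The following is a minimal sketch; the 10 students' scores are invented for illustration and are not the data behind the figure.

```python
import numpy as np

# Hypothetical scores for 10 students (invented for illustration only)
rng = np.random.default_rng(0)
math_score = rng.normal(60, 15, size=10)
language_score = 0.8 * math_score + rng.normal(10, 5, size=10)
scores = np.column_stack([math_score, language_score])

# Center the data so that every axis passes through the mean
centered = scores - scores.mean(axis=0)

# First principal axis: eigenvector of the covariance matrix with the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
pc1 = eigvecs[:, -1]

# One-dimensional data along the new axis, and its reconstruction in 2D
z = centered @ pc1
recon_pca = np.outer(z, pc1)

# Alternative: keep only the (centered) math score and drop the language score
recon_math = np.column_stack([centered[:, 0], np.zeros(len(centered))])

err_pca = np.sum((centered - recon_pca) ** 2)
err_math = np.sum((centered - recon_math) ** 2)
print("squared error with the first principal component:", err_pca)
print("squared error with the math score only:", err_math)
```

Among all single axes through the mean, the first principal axis gives the smallest squared reconstruction error, so `err_pca` should never exceed `err_math`.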
Practical applications of principal component analysis include scoring and comparing products and services (compression to one dimension), data visualization (compression to two or three dimensions), and preprocessing for regression analysis.
Principal component analysis is highly practical and has become one of the important themes in the field of machine learning.
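Data compression (feature conversion) with principal component analysis proceeds in the following steps: standardize the data, calculate the correlation matrix, obtain its eigenvalues and eigenvectors, build a transformation matrix from the eigenvectors with the largest eigenvalues, and apply it to the data.
The figure below shows the wine dataset after feature conversion, with the data summarized from 13 dimensions to 2 dimensions.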

We will now perform principal component analysis. The data used is the wine data published in the "UCI Machine Learning Repository". It contains 178 rows of wine samples, each consisting of a grape variety label (1 to 3) and 13 features describing the wine's chemical properties.
Get the data as follows.
import pandas as pd
df_wine = pd.read_csv("./5030_unsupervised_learning_data/wine.csv", header = None)
# Store the feature data in X and the label data in y.
# The first column of df_wine is the label; the second and subsequent columns are features.
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
print(X.shape)
The wine data is converted in advance so that the average is 0 and the variance is 1 for each feature. This is called standardization.
Standardization makes it possible to handle various types of data with different units and standard values, such as alcohol content and wine hue, in the same way.
import numpy as np
#Standardization
X = (X - X.mean(axis=0)) / X.std(axis=0)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df_wine = pd.read_csv("./5030_unsupervised_learning_data/wine.csv", header=None)
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
#Visualize data before standardization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.set_title('before')
ax2.set_title('before')
ax1.scatter(X[:, 0], X[:, 1])
ax2.scatter(X[:, 5], X[:, 6])
plt.show()
print("before")
print("mean: ", X.mean(axis=0), "\nstd: ", X.std(axis=0))
# Replace X with its standardized version
X = (X - X.mean(axis=0)) / X.std(axis=0)
#Visualize data after standardization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.set_title('after')
ax2.set_title('after')
ax1.scatter(X[:, 0], X[:, 1])
ax2.scatter(X[:, 5], X[:, 6])
plt.show()
print("after")
print("mean: ", X.mean(axis=0), "\nstd: ", X.std(axis=0))
Calculate the correlation matrix of the data to check how similar the features are to one another.
The correlation coefficient is an indicator of the strength of the linear relationship between two variables and takes a value between -1 and 1. When the correlation coefficient is close to 1 (strong positive correlation), the data follow a roughly linear distribution in which one variable increases as the other increases, as shown in the figure on the left.
When the negative correlation is strong (coefficient close to -1), the data follow a roughly linear distribution in which one variable decreases as the other increases.
When the correlation coefficient is close to 0, there is little linear relationship, as shown in the figure below (r = 0).
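The three cases can be reproduced with a small numerical sketch. The data here is synthetic (an assumption for illustration), generated so that one pair is positively related, one negatively related, and one unrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(size=500)

y_pos = 2.0 * x + 0.1 * noise    # increases as x increases
y_neg = -2.0 * x + 0.1 * noise   # decreases as x increases
y_none = rng.normal(size=500)    # generated independently of x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient r
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
r_none = np.corrcoef(x, y_none)[0, 1]
print("positive:", r_pos)
print("negative:", r_neg)
print("unrelated:", r_none)
```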

Here we compute the 13x13 correlation matrix that holds the correlation coefficient for every pair of the 13 wine features. The correlation matrix has the following form.

The correlation matrix is obtained as follows. By default, the corrcoef() function treats each row as a variable, so it computes correlations between rows rather than between columns. If X is passed as is, the result is therefore a correlation matrix between samples.
To obtain the correlation matrix between features instead, X is transposed with X.T.
import numpy as np
R = np.corrcoef(X.T)
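The effect of the transpose is easiest to see from the shapes. This sketch uses random stand-in data with the same shape as the wine features (178 x 13), not the wine data itself; `rowvar=False` is an alternative to transposing.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(178, 13))  # stand-in with the same shape as the wine features

R_samples = np.corrcoef(X_demo)                      # rows treated as variables
R_features = np.corrcoef(X_demo.T)                   # columns (features) treated as variables
R_features_alt = np.corrcoef(X_demo, rowvar=False)   # same result without transposing

print(R_samples.shape)   # (178, 178)
print(R_features.shape)  # (13, 13)
print(np.allclose(R_features, R_features_alt))
```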
As an advanced note, the correlation matrix itself comes out the same if it is calculated from X before standardization. We standardize in advance because the standardized data is used later.
There is also principal component analysis using the covariance matrix instead of the correlation matrix.
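Both remarks can be checked on a small example. The sketch below uses synthetic features with deliberately different scales (an assumption for illustration). It suggests that the correlation matrix is unchanged by standardization, and that the covariance matrix of standardized data coincides with the correlation matrix when the same divisor N is used.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 100 samples, 4 features with very different scales and offsets
X_raw = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 100.0, 0.1]) \
    + np.array([0.0, 5.0, -3.0, 2.0])
X_raw[:, 1] += 3.0 * X_raw[:, 0]  # make two of the features correlated

X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

R_raw = np.corrcoef(X_raw.T)
R_std = np.corrcoef(X_std.T)
# bias=True divides by N, matching X.std(axis=0), which also divides by N
C_std = np.cov(X_std.T, bias=True)
C_raw = np.cov(X_raw.T, bias=True)

same_corr = np.allclose(R_raw, R_std)       # correlation ignores scale and offset
cov_equals_corr = np.allclose(C_std, R_std) # holds only after standardization
raw_cov_equals_corr = np.allclose(C_raw, R_raw)
print(same_corr, cov_equals_corr, raw_cov_equals_corr)
```

On unstandardized data the covariance matrix depends on the units of each feature, which is why PCA on the covariance matrix and PCA on the correlation matrix can give different results.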
The code to create the identity matrix of a square matrix is as follows.
import numpy as np
# np.identity(Matrix size)
identity = np.identity(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
As a practical exercise, the following code finds the pair of features with the largest correlation coefficient.
import pandas as pd
import numpy as np
df_wine = pd.read_csv("./5030_unsupervised_learning_data/wine.csv", header=None)
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
#Create a correlation matrix (13x13)
R = np.corrcoef(X.T)
# Set the diagonal components to 0 so that a feature's correlation with itself is ignored
_R = R - np.identity(13)
#Get the index that takes the maximum correlation coefficient
index = np.where(_R == _R.max())
print(R[index[0][0], index[1][0]])
print(index)
Next, apply a mathematical method called eigenvalue decomposition to the obtained correlation matrix to get its eigenvectors and eigenvalues.
Eigenvalue decomposition breaks the original 13x13 matrix R down into 13 special 13-dimensional vectors (eigenvectors) and 13 special numbers (eigenvalues).
Intuitively, the original matrix has information concentrated in the direction of the eigenvectors. The corresponding eigenvalues can be said to indicate the degree of information concentration.
You can use numpy to calculate the eigenvalue decomposition as follows. The 13 eigenvalues and 13 eigenvectors of R are stored in eigvals and eigvecs, respectively.
import numpy as np
#Get the eigenpair from the correlation matrix. numpy.linalg.eigh returns them in ascending order of eigenvalues
eigvals, eigvecs = np.linalg.eigh(R)
The correlation matrix R, the matrix V whose columns are the eigenvectors, and the diagonal matrix D whose diagonal holds the eigenvalues satisfy the following equation.
$$RV = VD$$
The elements are as follows.
$$V = (v_1, v_2, \dots, v_{13}), \quad D = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_{13})$$
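The relation between R, V and D can be confirmed numerically. This is a minimal sketch on random stand-in data (not the wine data), assuming V is the array of eigenvectors returned by np.linalg.eigh and D is built with np.diag.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(178, 13))  # stand-in for the 13 wine features
R = np.corrcoef(X_demo.T)

eigvals, eigvecs = np.linalg.eigh(R)
V = eigvecs            # column i is the eigenvector for eigvals[i]
D = np.diag(eigvals)   # diagonal matrix of eigenvalues

satisfies_equation = np.allclose(R @ V, V @ D)     # R V = V D
orthonormal = np.allclose(V.T @ V, np.eye(13))     # eigenvectors are orthonormal
reconstructs = np.allclose(V @ D @ V.T, R)         # R is recovered from V and D
ascending = bool(np.all(np.diff(eigvals) >= 0))    # eigh sorts eigenvalues in ascending order
print(satisfies_equation, orthonormal, reconstructs, ascending)
```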

The eigenvectors of the correlation matrix are the principal component vectors, and the components of each eigenvector show how strongly each feature contributes to that principal component.
Also, the eigenvectors corresponding to larger eigenvalues play a larger role in composing the original matrix.
In other words, by ignoring the eigenvectors (principal component vectors) that correspond to small eigenvalues, we can reduce the number of features while limiting the loss of information.
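One common way to judge how many eigenvectors can be ignored is to look at each eigenvalue's share of the total. The sketch below is an illustration on synthetic data driven by three hidden factors (an assumption, not the wine data); for a correlation matrix the eigenvalues sum to the number of features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 13 correlated features generated from 3 hidden factors plus noise
latent = rng.normal(size=(178, 3))
mixing = rng.normal(size=(3, 13))
X_demo = latent @ mixing + 0.3 * rng.normal(size=(178, 13))

R = np.corrcoef(X_demo.T)
eigvals, eigvecs = np.linalg.eigh(R)

# eigh returns ascending order, so reverse to put the largest eigenvalue first
eigvals_desc = eigvals[::-1]
ratio = eigvals_desc / eigvals_desc.sum()   # share of information per component
cumulative = np.cumsum(ratio)               # share kept when using the top k components

print("sum of eigenvalues:", eigvals_desc.sum())
print("share of the top 3 components:", ratio[:3])
print("cumulative share:", cumulative[:3])
```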
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df_wine = pd.read_csv("./5030_unsupervised_learning_data/wine.csv", header=None)
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
#Create a correlation matrix (13x13)
R = np.corrcoef(X.T)
#Eigenvalue decomposition
eigvals, eigvecs = np.linalg.eigh(R)
#Visualization
plt.bar(range(13), eigvals)
plt.title("distribution of eigvals")
plt.xlabel("index")
plt.ylabel("eigvals")
plt.show()
# Print the eigenvalues (in ascending order)
print(eigvals)
In the previous section, we decomposed the correlation matrix into eigenvalues and eigenvectors. Next, using the two eigenvectors corresponding to the largest and second largest eigenvalues, we create a 13x2 matrix W that converts 13-dimensional features to 2 dimensions, and use it to convert the wine data X with 13-dimensional features into new wine data X' that has only two features: the first and second principal components.
Create the transformation matrix W as follows.
#Concatenates the eigenvectors corresponding to the largest and second largest eigenvalues in the column direction
W = np.c_[eigvecs[:,-1], eigvecs[:,-2]]
From the above, we have created a 13x2 matrix W. Multiplying the original data X by this matrix W gives the matrix X', a compressed version of X.
$$X' = XW$$
The matrix product is calculated with the following code.
import numpy as np
X_pca = X.dot(W)
Using the eigenvectors and eigenvalues, the following equations also hold:
$$Rv_1 = \lambda_1 v_1, \quad Rv_2 = \lambda_2 v_2$$
Here λ1 and λ2 are the two largest eigenvalues and v1 and v2 are the corresponding eigenvectors. These v1 and v2 form the columns of the transformation matrix W.
Multiplying R by the eigenvector v1 simply stretches it by λ1 along the v1 direction, the direction in which the data has its largest variance.
Multiplying R by the eigenvector v2, which is orthogonal to v1 (and therefore captures what v1 cannot explain), stretches it by λ2 along the v2 direction, the direction of the next largest variance.
Similarly, multiplying X by the eigenvectors v1 and v2 yields feature data that is well spread out along two orthogonal directions.
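The statement about spread along two orthogonal directions can be checked directly. This is a sketch on synthetic standardized data (an assumption for illustration): the variance of each new feature should match the corresponding eigenvalue, and the two new features should be uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data with correlated features, standardized as in the text
latent = rng.normal(size=(178, 3))
X_demo = latent @ rng.normal(size=(3, 13)) + 0.3 * rng.normal(size=(178, 13))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

R = np.corrcoef(X_demo.T)
eigvals, eigvecs = np.linalg.eigh(R)

# Transformation matrix from the two eigenvectors with the largest eigenvalues
W = np.c_[eigvecs[:, -1], eigvecs[:, -2]]
X_pca = X_demo.dot(W)

variances = X_pca.var(axis=0)                   # spread along each new axis
top_eigvals = np.array([eigvals[-1], eigvals[-2]])
cross_corr = np.corrcoef(X_pca.T)[0, 1]         # correlation between the two new features
axes_dot = W[:, 0] @ W[:, 1]                    # 0 when the axes are orthogonal

print(variances, top_eigvals)
print(cross_corr, axes_dot)
```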
So far, we have implemented feature conversion using principal component analysis. In fact, you can easily do the same thing with the PCA class in sklearn.decomposition. Use the PCA class as follows.
from sklearn.decomposition import PCA
#Create an instance of PCA by specifying the number of principal components. Specify the number of dimensions after conversion with an argument.
pca = PCA(n_components=2)
#Learn the transformation model from the data and transform.
X_pca = pca.fit_transform(X)
The fit_transform () method automatically generates a transformation matrix internally.
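To see that the PCA class and the manual procedure agree, the two results can be compared on the same standardized data. The sketch below uses random stand-in data (an assumption). The sign of each principal component is arbitrary, so the comparison is made on absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data with correlated features, standardized with the same formula as the text
latent = rng.normal(size=(178, 3))
X_demo = latent @ rng.normal(size=(3, 13)) + 0.3 * rng.normal(size=(178, 13))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Manual PCA: eigenvectors of the correlation matrix
R = np.corrcoef(X_demo.T)
eigvals, eigvecs = np.linalg.eigh(R)
W = np.c_[eigvecs[:, -1], eigvecs[:, -2]]
X_manual = X_demo.dot(W)

# PCA class from scikit-learn
pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X_demo)

# Each component may be flipped in sign, so compare absolute values
same_projection = np.allclose(np.abs(X_manual), np.abs(X_sklearn))
manual_ratio = eigvals[::-1][:2] / eigvals.sum()
same_ratio = np.allclose(pca.explained_variance_ratio_, manual_ratio)
print(same_projection, same_ratio)
```

The PCA class centers the data but does not scale it, so the match with the correlation-matrix approach relies on X having been standardized beforehand.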
Applying what you have learned so far, you can apply principal component analysis to the preprocessing of regression analysis. By compressing the data in advance, it is possible to generate a more versatile regression analysis model that is resistant to disturbances such as outliers.
First, split the data into training data and test data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
If separate transformation matrices are obtained for the training data and the test data, the transformed data cannot be compared, because each set was converted with a different matrix.
The same is true for standardization.
Therefore, when performing standardization and principal component analysis, use common criteria, learned from the training data, for both the training data and the test data.
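In plain numpy, using common criteria might look like the following sketch. The data and the 60/40 split are invented for illustration; the point is that the mean, the standard deviation and the transformation matrix W are all computed from the training data only and then reused for the test data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 100 samples, 4 features (not the wine data)
X_all = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
X_train, X_test = X_all[:60], X_all[60:]

# Learn the standardization statistics from the training data only
mean_train = X_train.mean(axis=0)
std_train = X_train.std(axis=0)
X_train_std = (X_train - mean_train) / std_train
X_test_std = (X_test - mean_train) / std_train   # same mean and std as the training data

# Learn the transformation matrix W from the training data only
R_train = np.corrcoef(X_train_std.T)
eigvals, eigvecs = np.linalg.eigh(R_train)
W = np.c_[eigvecs[:, -1], eigvecs[:, -2]]

# Apply the same W to both sets so the results live in the same coordinates
X_train_pca = X_train_std.dot(W)
X_test_pca = X_test_std.dot(W)
print(X_train_pca.shape, X_test_pca.shape)
```

The test data will generally not have exactly mean 0 and variance 1 after this step, which is expected.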
When standardizing, it is convenient to use the StandardScaler class as follows.
from sklearn.preprocessing import StandardScaler
#Create an instance for standardization
sc = StandardScaler()
#Learn the transformation model from the training data and apply the same model to the test data
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
When performing principal component analysis, use the PCA class as follows.
from sklearn.decomposition import PCA
#Create an instance of principal component analysis
pca = PCA(n_components=2)
#Learn the transformation model from the training data and apply the same model to the test data
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
As a review, the regression analysis is done as follows.
from sklearn.linear_model import LogisticRegression
#Instantiate logistic regression
lr = LogisticRegression()
#Learn classification model
lr.fit(X, y)
#Display score
print(lr.score(X, y))
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
df_wine = pd.read_csv("./5030_unsupervised_learning_data/wine.csv", header=None)
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
#Create an instance for standardization
sc = StandardScaler()
#Learn transformation model from training data and apply to test data
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
#Create an instance of principal component analysis
pca = PCA(n_components=2)
#Learn transformation model from training data and apply to test data
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
#Instantiate logistic regression
lr = LogisticRegression()
#Learn classification model with training data after dimension reduction
lr.fit(X_train_pca, y_train)
#Display score
print(lr.score(X_train_pca, y_train))
print(lr.score(X_test_pca, y_test))
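The same chain of steps can also be bundled so that the training-data criteria are reused automatically. The following is a sketch using scikit-learn's make_pipeline. As an assumption, it loads the wine data with sklearn.datasets.load_wine as a stand-in for wine.csv (its labels run from 0 to 2 rather than 1 to 3).

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for wine.csv: the wine dataset bundled with scikit-learn
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Standardization -> PCA -> logistic regression, fitted on the training data only
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
model.fit(X_train, y_train)

# score() applies the already-fitted scaler and PCA before classifying
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(train_score)
print(test_score)
```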
Many machine learning algorithms, such as regression analysis, assume that the given data is linearly separable. In practice, however, most data is difficult to separate linearly and requires non-linear separation. This section covers kernel PCA, a kernelized version of PCA that can handle data requiring non-linear separation.
Kernel PCA first remakes the given data X of size NxM (number of data x number of features) into completely new data K of size NxM´ (number of data x number of features) (the kernel trick).
When the kernel trick is used, the number of features generally increases (the features are expanded), which makes linear separation easier.
Principal component analysis is known to work poorly on highly non-linear data, but it can be applied once the data has been expanded into the kernel matrix K.
The figure below shows two-dimensional data distributed in a circle after the features were increased with the kernel trick, principal component analysis was applied, and the features were reduced back to two.
With kernel PCA, data like the following, which cannot be linearly separated in two-dimensional space, can be converted to linearly separable data.

First, calculate the kernel (similarity) matrix K. The following matrix is called a kernel matrix, and it holds the similarity calculated for each pair of samples. The kernel matrix of data X of size N x M (number of data x number of features) has size N x N (number of data x number of data). You can treat the kernel matrix K like data and perform analysis such as regression and classification on it.

The function $k(x_i, x_j)$ that appears here is called a "kernel function", and there are several types.
This time, we use the Gaussian kernel, a "radial basis function (RBF)" kernel, expressed by the following formula. It represents the similarity between two data points $x_i$ and $x_j$.
$$k(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$
Like X, K represents individual data in rows and features (similarity to other data) in columns.
Here, $k(x_i, x_j)$ is a function whose graph looks like the following.

Increasing $\gamma$ produces a feature matrix K that focuses only on points that are closer to each other.
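The effect of gamma can be illustrated numerically. This sketch assumes the Gaussian kernel is computed as exp(-gamma * squared distance), as in the code used in this section, and uses random stand-in points. As gamma grows, the similarity to every other point shrinks toward 0, so only very close neighbors keep a noticeable value.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((8, 3))  # 8 random stand-in points with 3 features

# Squared Euclidean distance for each pair of points
M = np.sum((X_demo - X_demo[:, np.newaxis]) ** 2, axis=2)

gammas = [0.1, 1, 15, 100]
mean_similarity = []
for gamma in gammas:
    K = np.exp(-gamma * M)
    # Ignore the diagonal, where each point is compared with itself (similarity 1)
    off_diag = K[~np.eye(len(K), dtype=bool)]
    mean_similarity.append(off_diag.mean())
    print("gamma =", gamma, "mean similarity to other points:", off_diag.mean())
```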
Implement the following functions used for kernel tricks.

You can calculate the kernel matrix as follows:
#Calculate the square of the distance between data (square Euclidean distance)
M = np.sum((X - X[:, np.newaxis])**2, axis=2)
#Calculate kernel matrix
K = np.exp(-gamma * M)
Here, to get the distance between data, we use a numpy feature called broadcasting (it automatically expands arrays so that their shapes match and then executes the operation).
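The shapes involved in the broadcasting step can be traced one at a time. The sketch below uses a small random array (an assumption for illustration) and compares the broadcast result with an explicit double loop.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((5, 3))  # 5 stand-in points with 3 features

print(X_demo.shape)                  # (5, 3)
print(X_demo[:, np.newaxis].shape)   # (5, 1, 3): a new axis is inserted

# (5, 3) and (5, 1, 3) are broadcast to (5, 5, 3): one difference vector per pair
diff = X_demo - X_demo[:, np.newaxis]
print(diff.shape)                    # (5, 5, 3)

# Summing the squares over the feature axis gives the pairwise squared distances
M = np.sum(diff ** 2, axis=2)

# The same result with explicit loops, for comparison
M_loop = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        M_loop[i, j] = np.sum((X_demo[i] - X_demo[j]) ** 2)

print(np.allclose(M, M_loop))
```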
import numpy as np
np.random.seed(39)
X = np.random.rand(8, 3)
#Calculate the square Euclidean distance for each pair
M =np.sum((X - X[:, np.newaxis])**2, axis=2)
#Calculate kernel matrix
gamma = 15
K = np.exp(-gamma * M)
# K is a numpy array; its shape is (number of data, number of data)
print(K.shape)
print(M)
print(K)
By substituting the kernel matrix K for the original data X and applying eigenvalue decomposition, feature conversion, and so on to K in the same way as in standard principal component analysis, the data can be converted into linearly separable data X'.
Since K is an expansion of the features of X, the matrix obtained by converting the features of K can be treated as a feature-converted version of X.
Using what we have learned so far, we will transform the circular data using kernel principal component analysis.

As an advanced note, the transformation matrix W created from the eigenvectors of the kernel matrix K can itself be treated as X', the compressed summary of X.
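One way to see why W can stand in for X' is that the columns of W are eigenvectors of K, so K.dot(W) is just each column of W multiplied by its eigenvalue. The following sketch checks this on random stand-in points (an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((50, 2))  # 50 random stand-in points

# Kernel matrix with the Gaussian kernel
M = np.sum((X_demo - X_demo[:, np.newaxis]) ** 2, axis=2)
gamma = 15
K = np.exp(-gamma * M)

# Eigenvectors for the two largest eigenvalues
eigvals, eigvecs = np.linalg.eigh(K)
W = np.column_stack((eigvecs[:, -1], eigvecs[:, -2]))

# K w = lambda w, so K.dot(W) only rescales each column of W
X_kpca = K.dot(W)
scaled_W = W * np.array([eigvals[-1], eigvals[-2]])
same_up_to_scale = np.allclose(X_kpca, scaled_W)
print(same_up_to_scale)
```

Because the two differ only by a per-column scale factor, a scatter plot of W has the same shape as a scatter plot of K.dot(W), with different axis ranges.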
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
#Get data where the data is distributed in a circle
X, y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)
#Calculate the square Euclidean distance for each pair
M = np.sum((X - X[:, np.newaxis])**2, axis=2)
#Calculate symmetric kernel matrix
gamma = 15
K = np.exp(-gamma * M)
# Get the eigenpairs from the kernel matrix. numpy.linalg.eigh returns them in ascending order of eigenvalues
eigvals, eigvecs = np.linalg.eigh(K)
# Collect the top k eigenvectors (projected samples)
W = np.column_stack((eigvecs[:, -1], eigvecs[:, -2]))
# Take the matrix product of K and W to obtain linearly separable data.
X_kpca = K.dot(W)
#Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(X[y == 0, 0], X[y == 0, 1], color="r", marker="^")
ax1.scatter(X[y == 1, 0], X[y == 1, 1], color="b", marker="o")
ax2.scatter(X_kpca[y == 0, 0], X_kpca[y == 0, 1], color="r", marker="^")
ax2.scatter(X_kpca[y == 1, 0], X_kpca[y == 1, 1], color="b", marker="o")
ax1.set_title("circle_data")
ax2.set_title("kernel_pca")
plt.show()
print(M)
print(K)
print(W)
print(X_kpca)
Like standard PCA, kernel principal component analysis is easy to implement using sklearn.decomposition.
Usage is almost the same as standard PCA. Arguments let you specify the number of dimensions after compression and the kernel type, an option that standard PCA does not have.
from sklearn.decomposition import KernelPCA
# The kernel used this time (radial basis function) is specified with kernel="rbf".
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)
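A rough way to judge whether the converted data has become easier to separate linearly is to fit a linear classifier before and after the conversion. The sketch below does this on the circular data with LogisticRegression. Using a classifier for this check is an illustrative assumption, and the scores are printed rather than stated here because they depend on the data and settings.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

# Circular data with the same settings as the manual example
X, y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)

# Kernel PCA with the RBF kernel
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)

# Training accuracy of a linear classifier on the original and converted features
acc_raw = LogisticRegression().fit(X, y).score(X, y)
acc_kpca = LogisticRegression().fit(X_kpca, y).score(X_kpca, y)
print("original features:", acc_raw)
print("kernel PCA features:", acc_kpca)
```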
Here, the moon-shaped data is separated as shown in the figure below. Get the data as follows:
from sklearn.datasets import make_moons
#Get moon data
X, y = make_moons(n_samples=100, random_state=123)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
# Import KernelPCA
from sklearn.decomposition import KernelPCA
#Get moon data
X, y = make_moons(n_samples=100, random_state=123)
#Instantiate KernelPCA class
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
#Convert data X using Kernel PCA
X_kpca = kpca.fit_transform(X)
#Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(X[y == 0, 0], X[y == 0, 1], c="r")
ax1.scatter(X[y == 1, 0], X[y == 1, 1], c="b")
ax1.set_title("moon_data")
ax2.scatter(X_kpca[y == 0, 0], X_kpca[y == 0, 1], c="r")
ax2.scatter(X_kpca[y == 1, 0], X_kpca[y == 1, 1], c="b")
ax2.set_title("kernel_PCA")
plt.show()
print(X_kpca)