Principal component analysis with python (Scikit-learn version, pandas & numpy version) ([High school information department information II] teaching materials for teacher training)

Introduction

Principal component analysis (PCA) is a method of creating new features called principal components by combining the multiple variables of multivariate data that describe the characteristics of objects. Because it can reduce the number of dimensions, it makes the structure of the data easier to see and reduces the amount of computation in machine learning, improving calculation speed. Put simply, it expresses correlated multivariate data with a smaller number of weakly correlated composite variables, focusing on the direction and magnitude of the variation. In this article, I would like to confirm how it works by implementing in Python the material covered in "Principal component analysis and dimension reduction" in the Information II teacher training materials published on the Ministry of Education, Culture, Sports, Science and Technology website.

Teaching materials

[High School Information Department "Information II" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/mext_00742.html) Chapter 3 Information and Data Science, First Half (PDF: 8.9MB)

Environment

What I want to do this time

After giving an overview of principal component analysis with as few mathematical formulas as possible, I will take up

Learning 14 Principal component analysis and dimensionality reduction: "4. Let's analyze physical fitness data with R"

and see how it works by implementing the source code written in R in Python.

Principal component analysis

Outline of principal component analysis (from teaching materials)

To explain the mechanism of principal component analysis (PCA) briefly: it creates new feature variables called principal components by combining variables that have strong covariance and correlation with each other, in such a way that the differences between individual objects are maximized.

As shown in the figure, from the p variables $(X_1, X_2, \cdots, X_p)$, p mutually independent composite variables are created as principal components $(Z_1, Z_2, \cdots, Z_p)$ by linear combination (weighted sum), without losing information.

Figure 1.png

The principal components are created in descending order of variance, and the variation (variance) of the principal component values (principal component scores) between objects becomes smaller for the lower-order principal components.

In other words, by discarding the lower-order components with small contributions and adopting only the higher-order components, the characteristics of the objects can be profiled with fewer variables than the original number p. This is called dimensionality reduction.
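For example, in scikit-learn the number of components to keep can be specified directly with the n_components argument of PCA. The following is a minimal sketch using made-up data (the array X and all values are hypothetical, purely for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))              # hypothetical data: 50 objects, p = 5 variables

pca2 = PCA(n_components=2)                # keep only the top 2 principal components
Z = pca2.fit_transform(X)                 # scores in the reduced 2-dimensional space
print(Z.shape)                            # (50, 2)
print(pca2.explained_variance_ratio_)     # how much of the variance the 2 components retain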

Consider the following simple example (results data for exams in 5 subjects).

Figure 2.png

Consider composite variables created by generalizing the coefficients attached to each subject as a, b, c, d, and e.

Composite variable = a × Japanese + b × English + c × mathematics + d × physics + e × chemistry

By giving various numerical values to a, b, c, d, and e, composite scores other than the simple total can be created. The principal component produced by principal component analysis is the single composite variable whose coefficients a, b, c, d, and e are determined so that its values **(principal component scores)** for the 50 students vary the most, that is, so that their variance is maximized.

Large variation corresponds to a large amount of information in that variable. The component obtained first is called the first principal component, and the second principal component obtained next is computed so that its variance is maximized under the constraint that it is uncorrelated with (independent of) the first component. The resulting features therefore do not correlate with each other, and their information does not overlap.
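The fact that the first principal component's coefficients give the largest variance among all unit-length coefficient vectors can be seen in a small numerical experiment. The following is a minimal sketch with made-up exam scores (the data are random and purely illustrative; only the comparison of the two variances matters):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
scores = rng.normal(60, 10, size=(50, 5))     # hypothetical scores of 50 students in 5 subjects

# Simple total: a = b = c = d = e, rescaled to unit length for a fair comparison
w_equal = np.ones(5) / np.sqrt(5)
var_equal = np.var(scores @ w_equal, ddof=1)

# Coefficients of the first principal component (unit length by construction)
pca_demo = PCA().fit(scores)
w_pc1 = pca_demo.components_[0]
var_pc1 = np.var(scores @ w_pc1, ddof=1)

print(var_equal <= var_pc1)    # True: no unit-length coefficient vector beats the first principal component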

Discussion of standardization in principal component analysis

From teaching materials

There are two positions one can take when creating composite variables.

- (1) Align only the levels (positions) of the scores of the five subjects (make the means the same, i.e. use deviations from the mean).

- (2) Align both the levels (positions) and the magnitudes of variation of the scores of the five subjects (make the means and standard deviations the same, i.e. standardize, or use deviation values).

When dealing with multiple variables that have different units, standardization as in (2) is necessary to align the units. If the units are the same, the analysis can be performed from either standpoint (1) or (2). In (1), coefficients that take the magnitude of each variable's variance into account are obtained; in (2), coefficients that ignore the differences between the variables' variances are obtained.

The article below gives a detailed discussion of which of (1) and (2) should be adopted.

https://qiita.com/koshian2/items/2e69cb4981ae8fbd3bda

- It is not always good to standardize (divide by the standard deviation) before principal component analysis.

- On the contrary, dividing by the standard deviation can have adverse effects.

- Whether to standardize depends on where principal component analysis is used (for visualization or as part of a pipeline), so think it through carefully.
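The practical difference between the two positions can be seen in a small sketch: position (1) corresponds to PCA on the covariance matrix (centering only), and position (2) to PCA on the correlation matrix (standardization first). The data below are made up, with one variable given a much larger variance than the other, purely for illustration:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(0, 100, size=200),    # variable with a large variance
                     rng.normal(0, 1, size=200)])     # variable with a small variance

pca_unscaled = PCA().fit(X)                                      # position (1): PCA centers the data itself (covariance matrix)
pca_standardized = PCA().fit(StandardScaler().fit_transform(X))  # position (2): standardize first (correlation matrix)

print(pca_unscaled.explained_variance_ratio_)      # dominated by the large-variance variable
print(pca_standardized.explained_variance_ratio_)  # the two variables are treated on an equal footing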

Data to be analyzed this time

Principal component analysis is performed on items extracted from physical fitness data (grip strength, raising the upper body, and so on). I would like to use the following preprocessed data created previously.

https://gist.github.com/ereyester/5bf6d48fe966238632eca537756a06b0/raw/805c2eea83c7608d4d85ec15e56761133dc5ff4d/high_male2.csv

python: Implementation example and result

python: Implementation example (data loading)

import numpy as np
import pandas as pd
import urllib.request
import matplotlib.pyplot as plt
import sklearn    # Machine learning library
from sklearn.decomposition import PCA   # Principal component analysis
from sklearn.preprocessing import StandardScaler
from IPython.display import display

# Load the preprocessed CSV (here the Colab path /content is assumed; the file can be downloaded from the gist URL above)
high_male2 = pd.read_csv('/content/high_male2.csv')
high_male3 = high_male2[['Grip strength', 'Raise your upper body', 'Long-seat forward bending', 'Repeated side jump', 'Shuttle run', 'X50m run', 'Standing long jump', 'Handball throw']]
display(high_male3)

This time, I would like to implement two versions: one that uses the scikit-learn module and one that does not. The scikit-learn version uses sklearn.decomposition.PCA and sklearn.preprocessing.StandardScaler. The version that does not use scikit-learn implements the processing directly with numpy and pandas.

python: Output result (data loading)

py01.png

In this implementation, I make active use of pandas.DataFrame.

python: Implementation example (PCA using scikit-learn)

# Standardization of the data matrix
# Standardization (using StandardScaler)
std_sc = StandardScaler()
std_sc.fit(high_male3)
std_data = std_sc.transform(high_male3)
std_data_df = pd.DataFrame(std_data, columns = high_male3.columns)
display(std_data_df)

# Perform principal component analysis
pca = PCA()
pca.fit(std_data_df)
# Map the data to the principal component space
pca_cor = pca.transform(std_data_df)

#print(pca.get_covariance()) # Covariance matrix


# Matrix display of the eigenvectors
eig_vec = pd.DataFrame(pca.components_.T, index = high_male3.columns, \
                          columns = ["PC{}".format(x + 1) for x in range(len(std_data_df.columns))])
display(eig_vec)

# Eigenvalues
eig = pd.DataFrame(pca.explained_variance_, index=["PC{}".format(x + 1) for x in range(len(std_data_df.columns))], columns=['eigenvalue']).T
display(eig)

# In the R source code, the standard deviation is reported instead of the eigenvalue (variance).
# Standard deviation of the principal components
dv = np.sqrt(eig)
dv = dv.rename(index = {'eigenvalue':'Standard deviation of principal components'})
display(dv)

# Contribution rate
ev = pd.DataFrame(pca.explained_variance_ratio_, index=["PC{}".format(x + 1) for x in range(len(std_data_df.columns))], columns=['Contribution rate']).T
display(ev)

# Cumulative contribution rate
t_ev = pd.DataFrame(pca.explained_variance_ratio_.cumsum(), index=["PC{}".format(x + 1) for x in range(len(std_data_df.columns))], columns=['Cumulative contribution rate']).T
display(t_ev)

# Principal component scores
print('Principal component scores')
cor = pd.DataFrame(pca_cor, columns=["PC{}".format(x + 1) for x in range(len(std_data_df.columns))])
display(cor)

First, sklearn.preprocessing.StandardScaler is used to standardize the data. Standardization transforms the data so that each variable has a mean of 0 and a variance of 1.
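As a quick check (a small sketch that simply reuses the std_data array computed above), each column of the standardized data should now have a mean of approximately 0 and a standard deviation of approximately 1:

print(std_data.mean(axis=0))   # approximately 0 for every column
print(std_data.std(axis=0))    # approximately 1 for every column (StandardScaler divides by the ddof = 0 standard deviation)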

Principal component analysis is then performed on the standardized data using sklearn.decomposition.PCA. From the principal component analysis, the eigenvectors, eigenvalues, contribution rates, and principal component scores can be obtained.

python: Output result (PCA using scikit-learn)

Standardized data

標準化1.png

Results of the eigenvectors (PC1: 1st principal component, PC2: 2nd principal component, ...)

固有ベクトル1.png

Eigenvalue results

固有値1.png

Principal component standard deviation (eigenvalue converted to standard deviation)

主成分の標準偏差.png

Contribution rate

寄与率.png

Cumulative contribution rate

累積寄与率.png

Principal component scores

主成分得点.png

From the results

These results were almost identical to the output obtained in R.

Here, the relationship between the principal component scores, the eigenvectors, and the standardized data is as follows.

Principal component score = eigenvector (grip strength) × standardized grip strength + eigenvector (raise your upper body) × standardized upper-body-raising value + ... + eigenvector (handball throw) × standardized handball-throw value
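This relationship can be verified numerically. The following is a minimal sketch that reuses std_data_df, eig_vec, and pca_cor from the scikit-learn code above (np is numpy, as imported earlier):

# The matrix of principal component scores equals the standardized data multiplied by the eigenvector matrix
score_check = std_data_df.values @ eig_vec.values    # (135 x 8) times (8 x 8)
print(np.allclose(score_check, pca_cor))             # expected to print True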

The teaching materials describe the criteria for reducing dimensions.

- Cumulative contribution rate: as a rough guide, adopt components up to a cumulative contribution rate of 70% to 80%.

- Eigenvalue (variance) of 1 or more (Kaiser criterion): adopt the principal components whose eigenvalue exceeds 1 (when analyzing the correlation matrix, as in this case); adopt the principal components whose contribution rate is at least (1 / number of variables) × 100 (%) (when analyzing the covariance matrix).

- Scree plot: plot the eigenvalues (variances) in descending order from the left as a line graph (scree plot), and adopt the components up to the point just before the decrease in variance levels off. A minimal plotting sketch follows this list.
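The following is a minimal sketch of these criteria, assuming the pca object and the matplotlib.pyplot import (plt) from the code above:

# Scree plot: eigenvalues (variances) in descending order
n_pc = len(pca.explained_variance_)
plt.plot(range(1, n_pc + 1), pca.explained_variance_, 'o-')
plt.axhline(1.0, linestyle='--')     # Kaiser criterion: eigenvalue of 1 (correlation-matrix case)
plt.xlabel('Principal component')
plt.ylabel('Eigenvalue (variance)')
plt.show()

# Number of components needed to reach a cumulative contribution rate of 80%
print(np.argmax(pca.explained_variance_ratio_.cumsum() >= 0.8) + 1)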

python: Implementation example (PCA without scikit-learn)

# Standardization (without using StandardScaler)
# default: axis = 0 (applied column-wise)
std_data_df_noskl = (high_male3 - high_male3.mean()) / high_male3.std(ddof = 0)
display(std_data_df_noskl)

# Covariance matrix
cov_vec = np.cov(std_data_df_noskl.T, bias = 0)

# Find the eigenvalues and eigenvectors
# np.linalg.svd: singular value decomposition
# np.linalg.eig: eigenvalue decomposition
# Here, singular value decomposition is used
eig_vec_noskl_l, eig_noskl_l, _ = np.linalg.svd(cov_vec)

# Principal component scores
pca_cor_noskl = np.dot(std_data_df_noskl, eig_vec_noskl_l)
#display(pca_cor_noskl)

# Matrix display of the eigenvectors
eig_vec_noskl = pd.DataFrame(eig_vec_noskl_l, index = high_male3.columns, \
                          columns = ["PC{}".format(x + 1) for x in range(len(std_data_df_noskl.columns))])
display(eig_vec_noskl)

# Eigenvalues
eig_noskl = pd.DataFrame(eig_noskl_l, index=["PC{}".format(x + 1) for x in range(len(std_data_df_noskl.columns))], columns=['eigenvalue']).T
display(eig_noskl)

# In the R source code, the standard deviation is reported instead of the eigenvalue (variance).
# Standard deviation of the principal components
dv_noskl = np.sqrt(eig_noskl)
dv_noskl = dv_noskl.rename(index = {'eigenvalue':'Standard deviation of principal components'})
display(dv_noskl)

# Contribution rate
ev_noskl_l = eig_noskl_l / eig_noskl_l.sum()
ev_noskl = pd.DataFrame(ev_noskl_l, index=["PC{}".format(x + 1) for x in range(len(std_data_df_noskl.columns))], columns=['Contribution rate']).T
display(ev_noskl)

# Cumulative contribution rate
t_ev_noskl = pd.DataFrame(ev_noskl_l.cumsum(), index=["PC{}".format(x + 1) for x in range(len(std_data_df_noskl.columns))], columns=['Cumulative contribution rate']).T
display(t_ev_noskl)

# Principal component scores
print('Principal component scores')
cor = pd.DataFrame(pca_cor_noskl, columns=["PC{}".format(x + 1) for x in range(len(std_data_df_noskl.columns))])
display(cor)

First, the data is standardized.

The standardization formula is as follows.

X_{i,j}^{(std)} = \frac{X_{i,j} - \mu_j}{\sigma_j}
\mu_j = \frac{1}{n}\sum_{i=1}^{n} X_{i,j}
\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_{i,j} - \mu_j)^2}
{\boldsymbol{X}}=
\begin{pmatrix}
X_{1,1}^{(std)} & \cdots & X_{1,8}^{(std)} \\
\vdots & \ddots & \vdots \\
X_{135,1}^{(std)} & \cdots & X_{135,8}^{(std)}
\end{pmatrix}

Here, i = 1, …, n (n = N = 135) and j = 1, …, 8.

In this Python code, the standard deviation is calculated with ddof = 0 (zero delta degrees of freedom). Note that in pandas.DataFrame.std (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html) the parameter is ddof : int, default 1, so if no argument is given, the standard deviation based on the unbiased variance (ddof = 1) is used. By contrast, numpy.std (https://numpy.org/doc/stable/reference/generated/numpy.std.html) has ddof : int, optional with a default of 0. Also note that the standard deviation used internally by sklearn.preprocessing.StandardScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) corresponds to ddof = 0.
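The difference between these defaults can be confirmed directly. The following is a small sketch reusing the high_male3 DataFrame (only the ddof arguments are the point here):

col = high_male3.iloc[:, 0]        # any one column, e.g. grip strength
print(col.std())                   # pandas default: ddof = 1 (unbiased, divides by n - 1)
print(col.std(ddof = 0))           # divides by n, matching numpy and StandardScaler
print(np.std(col.to_numpy()))      # numpy default: ddof = 0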

Next, for the standardized data, the variance-covariance matrix (auto-covariance matrix) is obtained as follows. Since

E[\boldsymbol{X}]=0

it follows that

cov({\boldsymbol{X,X}})=E[\boldsymbol{X^TX}]-E[{\boldsymbol{X}}]^TE[{\boldsymbol{X}}]=E[\boldsymbol{X^TX}]=\frac{1}{N-1}\boldsymbol{X^TX}
cov({\boldsymbol{X,X}})=\Sigma

Applying singular value decomposition to $\boldsymbol{X}$, with $U$ a $135 \times 135$ orthogonal matrix, $S$ a $135 \times 8$ rectangular diagonal matrix with non-negative entries, and $V_{svd}$ an $8 \times 8$ orthogonal matrix, we have

\boldsymbol{X}=U S {V^T_{svd}}
\boldsymbol{X^TX}=[U S {V^T_{svd}}]^T[U S {V^T_{svd}}]=V_{svd} S^T SV^T_{svd}
\Sigma=\frac{V_{svd} S^T SV^T_{svd}}{N-1}

Here the diagonal entries of $S^TS$ (the squared singular values) are the eigenvalues of $\boldsymbol{X^TX}$, and the columns of $V_{svd}$ are its eigenvectors.
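This relationship can be checked numerically. In the code above the SVD was applied to the covariance matrix, but applying it to the standardized data matrix itself gives the same eigenvectors (up to sign) and, after dividing the squared singular values by N - 1, the same eigenvalues. A minimal sketch reusing std_data_df_noskl, eig_noskl_l, and eig_vec_noskl_l (sign differences are ignored by comparing absolute values):

N = len(std_data_df_noskl)                                   # N = 135
_, s, vt = np.linalg.svd(std_data_df_noskl, full_matrices=False)

print(np.allclose(s ** 2 / (N - 1), eig_noskl_l))            # eigenvalues of the covariance matrix
print(np.allclose(np.abs(vt.T), np.abs(eig_vec_noskl_l)))    # eigenvectors, up to sign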

The following article describes singular value decomposition in detail.

https://qiita.com/kidaufo/items/0f3da4ca4e19dc0e987e

The following articles describe the relationship between principal component analysis and singular value decomposition in detail.

https://qiita.com/horiem/items/71380db4b659fb9307b4
https://qiita.com/sakami/items/50b8485159312573e3c7

The choice between eigenvalue decomposition and singular value decomposition was made from the following viewpoint.

https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html

Returns: w: ... "The eigenvalues are not necessarily ordered." Because numpy.linalg.eig does not necessarily return the eigenvalues in order, an explicit sorting step would be needed to label PC1, PC2, ... in descending order of eigenvalue.

https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html

Returns: "Vector(s) with the singular values, within each vector sorted in descending order." This is easier, because the values are already sorted in descending order.
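For reference, if np.linalg.eig were used instead, the sort would look like the following minimal sketch, reusing cov_vec and eig_noskl_l from the code above:

eig_val_unsorted, eig_vec_unsorted = np.linalg.eig(cov_vec)

order = np.argsort(eig_val_unsorted)[::-1]          # indices that sort the eigenvalues in descending order
eig_val_sorted = eig_val_unsorted[order]
eig_vec_sorted = eig_vec_unsorted[:, order]         # reorder the corresponding eigenvector columns

print(np.allclose(eig_val_sorted, eig_noskl_l))     # expected to match the result obtained via SVD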

python: Output result (PCA without scikit-learn)

Standardized data

SnapCrab_NoName_2020-7-31_1-5-42_No-00.png

Results of the eigenvectors (PC1: 1st principal component, PC2: 2nd principal component, ...)

SnapCrab_NoName_2020-7-31_1-6-33_No-00.png

Eigenvalue results

SnapCrab_NoName_2020-7-31_1-7-4_No-00.png

Principal component standard deviation (eigenvalue converted to standard deviation)

SnapCrab_NoName_2020-7-31_1-7-46_No-00.png

Contribution rate

SnapCrab_NoName_2020-7-31_1-31-40_No-00.png

Cumulative contribution rate

SnapCrab_NoName_2020-7-31_1-32-39_No-00.png

Principal component scores

SnapCrab_NoName_2020-7-31_1-33-4_No-00.png

R: [Reference] Implementation example and results (from teaching materials)

R: Implementation example

# Principal component analysis using the 8 variables from grip strength to handball throw, excluding height, weight, and sitting height
# Correlation matrix
# Import the data set (columns 4-11)
high_male2 <- read.csv("high_male2.csv")
high_male3 <- high_male2[c(4:11)]
# Execute principal component analysis (prcomp) based on the correlation matrix
(res2 <- prcomp(high_male3, scale=T))
summary(res2)
pc <- res2$x
# Write the principal component scores out to a file
write.csv(pc, file = "pca_cor.csv")

The argument scale = T of prcomp() means scale = TRUE, a parameter specifying that the principal component analysis should be performed on the correlation matrix of the data. For the data handled here, this is expected to be essentially equivalent to performing principal component analysis on standardized data.
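This expectation can also be checked from the Python side: the correlation matrix of the raw data coincides with the covariance matrix of the standardized data, as long as the same degrees of freedom are used in both. A minimal sketch reusing high_male3 (ddof = 1 here, matching R's scale()):

X_r = high_male3.to_numpy()
X_r_std = (X_r - X_r.mean(axis = 0)) / X_r.std(axis = 0, ddof = 1)   # standardize with the unbiased standard deviation

# The correlation matrix of the raw data equals the covariance matrix of the standardized data
print(np.allclose(np.corrcoef(X_r, rowvar = False), np.cov(X_r_std, rowvar = False)))   # expected to print True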

R: Output result

Standard deviations (1, .., p=8):
[1] 1.9677462 0.9524611 0.9096410 0.8553785 0.7327083 0.6857621 0.6399714
[8] 0.4949534

Rotation (n x k) = (8 x 8):
                                 PC1        PC2         PC3         PC4        PC5
Grip strength              0.3252018  0.2682176 -0.53297421  0.39993408 -0.3653663
Raise your upper body      0.3141190  0.4351668  0.42225757  0.40834395  0.4032249
Long-seat forward bending  0.3077864  0.3745785  0.01503113 -0.75987597 -0.2411453
Repeated side jump         0.3933948  0.1203619  0.05183489 -0.20404673  0.4967487
Shuttle run                0.3132617 -0.4444223  0.59760197  0.01703693 -0.3900527
X50m run                  -0.4057185  0.4620511  0.11729178 -0.10636452 -0.0709927
Standing long jump         0.3681042 -0.3669386 -0.40018514 -0.13933339  0.3055848
Handball throw             0.3844997  0.1955680  0.06075045  0.15245958 -0.3852838
                                  PC6         PC7         PC8
Grip strength             -0.31441687  0.34209544 -0.17004275
Raise your upper body     -0.33321281 -0.29431157  0.08168542
Long-seat forward bending -0.28776668 -0.10238851  0.18941208
Repeated side jump         0.35638527  0.61198108 -0.19529718
Shuttle run               -0.21759749  0.17541898 -0.34157859
X50m run                   0.04215936 -0.08597965 -0.76329592
Standing long jump        -0.10049579 -0.50594605 -0.43684157
Handball throw             0.72184877 -0.34234695  0.01636705

Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6    PC7     PC8
Standard deviation      1.968 0.9525 0.9096 0.85538 0.73271 0.68576 0.6400 0.49495
Proportion of Variance  0.484 0.1134 0.1034 0.09146 0.06711 0.05878 0.0512 0.03062
Cumulative Proportion   0.484 0.5974 0.7008 0.79229 0.85940 0.91818 0.9694 1.00000

For pca_cor.csv, the output is as follows.

pca_cor.png

Source code

python version https://gist.github.com/ereyester/3c2173eb61cbcd64b61f23b3d4d6480c

R version https://gist.github.com/ereyester/cd309a9ab46c37a3f594963d1ad55495
