2. Multivariate analysis spelled out in Python 3-2. Principal component analysis (algorithm)

In the previous section, we learned the concept of principal component analysis using scikit-learn. Here, let's look at how principal component analysis is actually calculated, without relying on scikit-learn.

We use the same example as before: test scores in 5 subjects (mathematics, science, social studies, English, and Japanese) for 20 students in one class. We want to compress the 5 subjects = 5 dimensions into fewer dimensions, so that each student's academic ability can be evaluated more simply.

Let the 5 subjects = 5 variables be $x_{1}, x_{2}, x_{3}, x_{4}, x_{5}$. The principal component $z$ we are looking for is expressed as

$$z = w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + w_{4}x_{4} + w_{5}x_{5}$$

The principal component score $z$ is the sum of the variables, each multiplied by its coefficient $w_{1}, w_{2}, w_{3}, w_{4}, w_{5}$. What we want to know is the ratio among the coefficients $w_{1}, w_{2}, w_{3}, w_{4}, w_{5}$ that maximizes the amount of information, that is, the variance. Since simply scaling all the coefficients up would inflate the variance without limit, the following condition is added:

$$ {w_{1}}^2 + {w_{2}}^2 + {w_{3}}^2 + {w_{4}}^2 + {w_{5}}^2 = 1 $$

So the task is to find the ratio of coefficients that maximizes the amount of information = variance, under the condition that the sum of the squared coefficients is $1$. The standard method for finding the maxima and minima of a multivariable function under such a constraint is **Lagrange's method of undetermined multipliers**.

**Lagrange's method of undetermined multipliers**

When the variables $x, y$ move while satisfying the constraint $g(x, y) = 0$, then at the $x, y$ where the function $z = f(x, y)$ takes its maximum or minimum value, the following holds. Define

$$F(x, y, λ) = f(x, y) - λg(x, y)$$

Then

$$\frac{∂F}{∂x} = 0, \quad \frac{∂F}{∂y} = 0, \quad \frac{∂F}{∂λ} = 0$$

The above is Lagrange's method of undetermined multipliers in its simplest case: two variables and one constraint. The symbol $λ$ (lambda) may look mysterious, but here it is enough to treat it as a constant. The symbol $∂$, which goes by many names such as round, partial, and del, denotes **partial differentiation**. For example, $\frac{∂F}{∂x} = 0$ means that the value obtained by partially differentiating the function $F$ with respect to the variable $x$ is $0$.
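Incidentally, this whole procedure can be checked mechanically in Python. The following is a minimal SymPy sketch (my own toy example, not part of the original article) that maximizes $f(x, y) = x + y$ under the constraint $g(x, y) = x^2 + y^2 - 1 = 0$ by solving the three equations above; SymPy takes care of the partial differentiation, which is explained next.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x + y              # function to maximize
g = x**2 + y**2 - 1    # constraint g(x, y) = 0
F = f - lam * g        # F(x, y, λ) = f(x, y) - λg(x, y)

# Solve ∂F/∂x = 0, ∂F/∂y = 0, ∂F/∂λ = 0 simultaneously
critical = sp.solve([sp.diff(F, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(critical)
# Two critical points at (±√2/2, ±√2/2); the positive one is the maximum
```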


**Partial differentiation**

First, let's review the meaning of ordinary differentiation. Draw the curve of a function $y = f(x)$ on the $x, y$ plane. Focus on a point $x$ on the $x$ axis; the height of the curve there is $f(x)$, since it is a point on $y = f(x)$. Also mark $x + Δx$, slightly away from $x$. The height at that point is obtained by plugging $x + Δx$ into the function, so its $y$ coordinate is $f(x + Δx)$. Now calculate the slope between the two points. The slope is the amount of change in $y$ relative to the amount of change in $x$: the denominator, the change in $x$, is $Δx$, and the numerator, the change in $y$, is $f(x + Δx) - f(x)$.

002_0301_013.PNG

Differentiation means bringing this $Δx$ closer to $0$. In other words, as $x + Δx$ approaches $x$ in the limit, we can draw a **tangent** representing the instantaneous slope at the point $x$. This slope is what differentiation means.

002_0301_014.PNG

$$\frac{dy}{dx}=\lim_{\Delta x \to 0} \frac{f(x+\Delta x)-f(x)}{\Delta x}$$

In other words, the derivative is the ratio of the change in $y$ to a minute change in $x$: how much $y$ changes when $x$ changes a little.
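For instance, here is a minimal numerical sketch (my own illustration) of this limit for $f(x) = x^2$ at $x = 3$, where the true derivative is $2x = 6$:

```python
# Difference quotient (f(x+Δx) - f(x)) / Δx for shrinking Δx
def f(x):
    return x ** 2

x = 3.0
for dx in [1.0, 0.1, 0.01, 0.001]:
    print(dx, (f(x + dx) - f(x)) / dx)
# The slope approaches 6 as Δx approaches 0: 7.0, 6.1, 6.01, 6.001
```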

So far we have looked at differentiation of a one-variable function, where $y$ is determined once a single $x$ is determined. But there are also two-variable functions $f(x, y)$, whose value is determined only after two values are given, three-variable functions $f(x, y, z)$, whose value requires three, and multivariable functions in general. **Partial differentiation** is what extends differentiation to multivariable functions. Let's draw the graph of a two-variable function.

002_0301_015.PNG

Once $x$ and $y$ are determined, the height $z$ is determined. Every point on the base plane has a corresponding height $z = f(x, y)$, which makes the graph look like a curved surface. The height $z$ changes differently depending on whether you advance a little in the $x$-axis direction or a little in the $y$-axis direction. So there are two main directions in which to ask how the height changes when you move a little, and asking this along the $x$ axis and along the $y$ axis is **partial differentiation**.

002_0301_016.PNG

The blue line shows the slope when $y$ is held fixed and $x$ moves a little, that is, the tangent line in the $x$-axis direction. Expressed as a formula:

$$\frac{∂z}{∂x}=\lim_{\Delta x \to 0} \frac{f(x+\Delta x, y)-f(x, y)}{\Delta x}$$

With $y$ held fixed, this is the ratio of the change in $z$ to a minute change in $x$: how much the height $z$ moves when we move a little in the $x$ direction. That ratio is the **partial derivative** with respect to $x$. Likewise, the partial derivative with respect to $y$ is

$$\frac{∂z}{∂y}=\lim_{\Delta y \to 0} \frac{f(x, y + \Delta y)-f(x, y)}{\Delta y}$$

The red line shows the ratio of the change in $z$ to a minute change in $y$: how much the height $z$ moves when $x$ is held fixed and we move a little in the $y$ direction.
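The same shrinking-step idea works numerically for partial derivatives. A minimal sketch (my own example) for $f(x, y) = x^2 + 3y$ at the point $(2, 1)$, where analytically $∂f/∂x = 2x = 4$ and $∂f/∂y = 3$:

```python
def f(x, y):
    return x ** 2 + 3 * y

x, y, d = 2.0, 1.0, 1e-6
df_dx = (f(x + d, y) - f(x, y)) / d   # y held fixed, x moved slightly
df_dy = (f(x, y + d) - f(x, y)) / d   # x held fixed, y moved slightly
print(df_dx, df_dy)                   # approximately 4.0 and 3.0
```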

Now, let's consider what it means for a partial derivative to be $0$. For the function $z = f(x, y)$, it means that when you move the variable $x$ a little, the change in $z$ is $0$. If there is no change, so that $z$ can become neither larger nor smaller, we are at a maximum or a minimum. Furthermore, as Lagrange's method of undetermined multipliers uses, when the partial derivatives with respect to $x$ and $y$ are both $0$ at the same time, we are at a true peak (maximum) or a true valley bottom (minimum) on the curved surface of the two-variable function.
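As a small check of this idea (a sketch with a made-up function, not from the article), SymPy can find the point where both partial derivatives vanish for the bowl-shaped surface $f(x, y) = (x-1)^2 + (y-2)^2$, whose valley bottom is clearly at $(1, 2)$:

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = (x - 1) ** 2 + (y - 2) ** 2   # a bowl with a single valley bottom

# Solve ∂f/∂x = 0 and ∂f/∂y = 0 simultaneously
print(sp.solve([sp.diff(f, x), sp.diff(f, y)], [x, y], dict=True))
# [{x: 1, y: 2}] -- the minimum of the surface
```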


**Derivation of the covariance matrix from the partial derivative equations**

Let's return to the 5-subject example. Renaming the variables $x, y, u, v, w$ and the coefficients $a, b, c, d, e$, the principal component is again:

$$p = ax + by + cu + dv + ew$$

The coefficients $a, b, c, d, e$ of the principal component $p$ are constants, determined so that the variance of $p$ is maximized. Since what we want to know is the balance among the constants $a, b, c, d, e$, they are given the following condition:

$$a^2 + b^2 + c^2 + d^2 + e^2 = 1$$

Next, define the variance ${s_{p}}^2$ of the principal component, the quantity whose maximum we want. With $n$ observations, it is expressed as:

$$ {s_{p}}^2 = \frac{1}{n} \left\{ (p_{1}-\bar{p})^2 + (p_{2}-\bar{p})^2 + \cdots + (p_{n}-\bar{p})^2 \right\} $$
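As a quick numerical check of this definition (made-up numbers; note that `np.var` also divides by $n$ by default):

```python
import numpy as np

p = np.array([3.0, 5.0, 7.0, 9.0])
n = len(p)
print(((p - p.mean()) ** 2).sum() / n)   # 5.0, the 1/n definition above
print(np.var(p))                         # 5.0, same result
```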

Now apply Lagrange's method of undetermined multipliers and define the function $L$:

$$ L = \frac{1}{n} \left\{ (p_{1}-\bar{p})^2 + (p_{2}-\bar{p})^2 + \cdots + (p_{n}-\bar{p})^2 \right\} - λ(a^2 + b^2 + \cdots + e^2 - 1) $$

Then the following equations hold:

$$ \frac{∂L}{∂a}=0, \quad \frac{∂L}{∂b}=0, \quad \frac{∂L}{∂c}=0, \quad \frac{∂L}{∂d}=0, \quad \frac{∂L}{∂e}=0 $$

Let's focus on the first of these equations:

$$\frac{∂L}{∂a}=0$$

Take the expression for the function $L$ and partially differentiate it with respect to $a$. Everything other than $a$ is held fixed; concretely, $b, c, d, e$ are treated as constants and drop out:

$$ \frac{∂L}{∂a} = \frac{2}{n} \left\{ (p_{1}-\bar{p})(x_{1}-\bar{x}) + (p_{2}-\bar{p})(x_{2}-\bar{x}) + \cdots + (p_{n}-\bar{p})(x_{n}-\bar{x}) \right\} - 2λa = 0 $$

Substituting the expression for the principal component $p$, expanding inside the braces, and rewriting with the definitions of variance and covariance gives:

$$ 2 \left\{ a{s_{x}}^2 + bs_{xy} + cs_{xu} + ds_{xv} + es_{xw} \right\} - 2λa = 0 $$

Tidying up further gives the following equation:

$$ a{s_{x}}^2 + bs_{xy} + cs_{xu} + ds_{xv} + es_{xw} = λa $$

Similar equations are obtained for the remaining coefficients $b, c, d, e$:

$$
\begin{aligned}
as_{xy} + b{s_{y}}^2 + cs_{yu} + ds_{yv} + es_{yw} &= λb \\
as_{xu} + bs_{yu} + c{s_{u}}^2 + ds_{uv} + es_{uw} &= λc \\
as_{xv} + bs_{yv} + cs_{uv} + d{s_{v}}^2 + es_{vw} &= λd \\
as_{xw} + bs_{yw} + cs_{uw} + ds_{vw} + e{s_{w}}^2 &= λe
\end{aligned}
$$

These five equations can be summarized as a single matrix equation:

$$
\begin{pmatrix}
{s_{x}}^2 & s_{xy} & s_{xu} & s_{xv} & s_{xw}\\
s_{xy} & {s_{y}}^2 & s_{yu} & s_{yv} & s_{yw}\\
s_{xu} & s_{yu} & {s_{u}}^2 & s_{uv} & s_{uw}\\
s_{xv} & s_{yv} & s_{uv} & {s_{v}}^2 & s_{vw}\\
s_{xw} & s_{yw} & s_{uw} & s_{vw} & {s_{w}}^2
\end{pmatrix}
\begin{pmatrix}
a\\ b\\ c\\ d\\ e
\end{pmatrix}
= λ
\begin{pmatrix}
a\\ b\\ c\\ d\\ e
\end{pmatrix}
$$

The square matrix on the left-hand side (a matrix with the same number of rows and columns) is the **variance-covariance matrix**. A column vector $(a, b, c, d, e)$ that satisfies this equation is called an **eigenvector**, and $λ$ is called an **eigenvalue**. This eigenvalue $λ$ is the variance of the principal component.
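To make the term concrete, here is a tiny hand-checkable example (my own, not from the data above): for the symmetric matrix below, multiplying by the vector $(1, 1)$ just rescales it by $3$, so $(1, 1)$ is an eigenvector with eigenvalue $λ = 3$.

$$
\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
=
\begin{pmatrix} 3 \\ 3 \end{pmatrix}
= 3
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$

Let's now solve our 5×5 eigenvalue problem in Python.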

**⑴ Import the libraries**

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

**⑵ Create the data**

```python
# Create scores of 20 students in 5 subjects
arr = np.array([[71,64,83,100,71], [34,48,67,57,68], [58,59,78,87,66], [41,51,70,60,72],
                [69,56,74,81,66], [64,65,82,100,71], [16,45,63,7,59], [59,59,78,59,62],
                [57,54,84,73,72], [46,54,71,43,62], [23,49,64,33,70], [39,48,71,29,66],
                [46,55,68,42,61], [52,56,82,67,60], [39,53,78,52,72], [23,43,63,35,59],
                [37,45,67,39,70], [52,51,74,65,69], [63,56,79,91,70], [39,49,73,64,60]])

# Create a data frame, specifying the column names
df = pd.DataFrame(data = arr,
                  columns = ['Math', 'Science', 'Social studies', 'English', 'Japanese'])
```

002_0301_001.PNG

**⑶ Find the variance-covariance matrix**

```python
# Convert the data frame to a NumPy array
arr = np.array(df)

# Find the variance-covariance matrix
cov = np.cov(arr, rowvar=0, bias=0)
```

The variance-covariance matrix is computed with NumPy's `cov` function, in the form `numpy.cov(data, rowvar=0, bias=0)`. The argument `bias=0` specifies the unbiased variance, dividing by $n-1$ (`bias=1` gives the sample variance, dividing by $n$). `rowvar=0` is for data arranged vertically, one variable per column; specify `rowvar=1` when the data run horizontally.

002_0301_017.PNG
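To see what `np.cov` is doing under the hood, here is a minimal manual computation (an illustrative sketch, not the article's code): center each column, then take $X^{T}X / (n-1)$, matching `bias=0`.

```python
# Manual variance-covariance matrix: center the columns, then X^T X / (n - 1)
centered = arr - arr.mean(axis=0)
cov_manual = centered.T @ centered / (len(arr) - 1)
print(np.allclose(cov_manual, cov))   # True
```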

**⑷ Find the eigenvalues and eigenvectors**

```python
# Find the eigenvalues and eigenvectors
eig_val, eig_vec = np.linalg.eig(cov)

print("Eigenvalues:\n", eig_val)
print("Eigenvectors:\n", eig_vec)
```

NumPy's `linalg.eig()` function takes a matrix and returns the eigenvalues and eigenvectors as a tuple of two arrays. A tuple is a data type that bundles multiple pieces of data together and cannot be modified once created. Here the eigenvalues are unpacked into `eig_val` and the eigenvectors into `eig_vec`.

002_0301_018.PNG

Note the order of the eigenvalues: they are not sorted by contribution ratio as they were in scikit-learn. So let's sort them and convert them into data frames to make them easier to read.

```python
# Concatenate the eigenvalues and eigenvectors into one array
eigen = np.vstack((eig_val, eig_vec))

# Sort the columns in descending order of the first row (the eigenvalues)
eigen = eigen[:, eigen[0,:].argsort()[::-1]]
```

NumPy's `vstack((a, b))` function concatenates vertically (along axis 0). NumPy's `argsort` function returns the indices that would sort an array; here the columns are reordered according to the first row, and appending `[::-1]` reverses the order to descending. Note that the sorted result must be assigned back to `eigen`.

002_0301_019.PNG
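Before converting to data frames, we can also sanity-check the defining relation from the derivation above, $Σv = λv$ (an extra check, not in the original flow), using the `cov`, `eig_val`, and `eig_vec` just computed:

```python
# Verify cov @ v == λ * v for every eigenvalue/eigenvector pair
for i in range(len(eig_val)):
    v = eig_vec[:, i]                             # i-th eigenvector (a column)
    print(np.allclose(cov @ v, eig_val[i] * v))   # True for each i
```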

```python
# Convert the eigenvalues to a data frame
eval_df = pd.DataFrame(data = eigen[0,:].reshape([1, 5]),
                       columns = ["Principal component {}".format(x+1) for x in range(len(df.columns))],
                       index = ["Eigenvalue"])
eval_df
```

002_0301_020.PNG

```python
# Convert the eigenvectors (principal component loadings) to a data frame
evec_df = pd.DataFrame(data = eigen[1:,:].reshape([5, 5]),
                       columns = ["Principal component {}".format(x+1) for x in range(len(df.columns))],
                       index = ['Math', 'Science', 'Social studies', 'English', 'Japanese'])
evec_df
```

002_0301_021.PNG
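Finally, we can verify the earlier statement that the eigenvalue $λ$ is the variance of the principal component (an extra check, not part of the original article): project the centered data onto the first sorted eigenvector and compare the variance of the scores with the first eigenvalue. `ddof=1` matches the unbiased `bias=0` setting used for `np.cov`.

```python
# Principal component scores for the first (largest-eigenvalue) component
centered = arr - arr.mean(axis=0)
scores = centered @ eigen[1:, 0]     # project onto the first sorted eigenvector
print(np.var(scores, ddof=1))        # matches eigen[0, 0], the largest eigenvalue
print(eigen[0, 0])
```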


What we have arrived at is the so-called eigenvalue problem, a famous problem in linear algebra. It would be worth digging deeper into the eigenvalue problem here, but let's move on. In the next section, we will perform factor analysis using scikit-learn.
