A 3-month deep learning course that builds skills transferable to the field
● Perform "dimensionality reduction" in tasks other than prediction and classification.
● Compress the structure of multivariate data into a smaller number of indicators. In short, compress high-dimensional objects to low dimensions while losing as little information as possible. ⇒ Turn 1000 dimensions into 10 dimensions (see the sketch below). ⇒ Trends that are hard to see in high-dimensional data become visible.
● Analysis and visualization (up to 2D or 3D) become possible with a small number of variables. Humans cannot grasp data in four or more dimensions.
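As a concrete illustration of the compression above, here is a minimal sketch (assuming NumPy and scikit-learn are available; the 1000-dimensional data is random and purely hypothetical) that reduces 1000 dimensions to 10 with PCA, the method derived in the rest of these notes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 500 samples, each with 1000 features (dimensions).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))

# Compress 1000 dimensions down to 10 principal components.
pca = PCA(n_components=10)
Z = pca.fit_transform(X)

print(Z.shape)                        # (500, 10)
print(pca.explained_variance_ratio_)  # fraction of information each axis keeps
```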
[Training data] The i-th data point consists of its first element, second element, ... up to the m-th element (m dimensions).
x_i=(x_{i1},x_{i2},\dots,x_{im}) \in \mathbb R^m
[Average (vector)]
\bar x = \frac{1}{n}\sum^{n}_{i=1} x_i
[Data matrix] Center the data so that it is distributed around the origin.
\bar X=(x_1 - \bar x,\dots,x_n - \bar x)^T \in \mathbb R^{n\times m}
[Variance-covariance matrix] It can be calculated once $\bar X$ is obtained.
\Sigma = Var(\bar X) = \frac{1}{n}\bar X^T \bar X
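A minimal NumPy sketch of the quantities defined so far, using a small hypothetical data set; it computes the mean vector, the centered data matrix $\bar X$, and the variance-covariance matrix $\frac{1}{n}\bar X^T\bar X$.

```python
import numpy as np

# Hypothetical training data: n = 5 samples, m = 3 dimensions.
X = np.array([[2.0, 0.0, 1.0],
              [1.5, 1.0, 0.5],
              [3.0, 0.5, 2.0],
              [2.5, 1.5, 1.0],
              [1.0, 2.0, 0.0]])
n, m = X.shape

x_bar = X.mean(axis=0)                 # mean vector, shape (m,)
X_centered = X - x_bar                 # centered data matrix X_bar, shape (n, m)
Sigma = X_centered.T @ X_centered / n  # variance-covariance matrix (1/n) X_bar^T X_bar

# np.cov with bias=True uses the same 1/n normalization.
print(np.allclose(Sigma, np.cov(X.T, bias=True)))  # True
```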
[Vector after linear transformation] j is the index of the projection axis: if 3D data is compressed to 2D, j runs over 1, 2; if 3D data is compressed to 1D, j = 1.
s_j = (s_{1j},\dots,s_{nj})^T = \bar X a_j, \quad a_j \in \mathbb R^m
● If the coefficient vector changes, the **values after the linear transformation** change.
● Treat the amount of information as the **magnitude of the variance**.
● Search for the projection axis that maximizes the variance of the variables after the linear transformation.
That is, linearly combine the centered data $\bar X$ in the original space with the coefficient vector $a_j$. If $a_j$ changes, the projection line changes as well.
[Variance after linear transformation] We want to maximize the variance of $s_j$.
Var(s_j) = \frac{1}{n}s_j^Ts_j = \frac{1}{n}(\bar Xa_j)^T(\bar Xa_j)=\frac{1}{n}a_j^T\bar X^T\bar Xa_j = a_j^TVar(\bar X)a_j
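Continuing the same sketch, project the centered data onto an arbitrary coefficient vector and check that the two expressions for the variance above agree.

```python
# Project the centered data onto one coefficient vector a_j (an arbitrary example).
a = np.array([1.0, 0.0, 0.0])

s = X_centered @ a           # s_j = X_bar a_j, shape (n,)
var_s = (s @ s) / n          # Var(s_j) = (1/n) s_j^T s_j
var_s_alt = a @ Sigma @ a    # equivalent form a_j^T Var(X_bar) a_j

print(np.allclose(var_s, var_s_alt))  # True
```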
● Add the constraint that the norm is 1. ⇒ Otherwise there are infinitely many solutions: for example, the vectors (1,1) and (2,2) point in the same direction, so any rescaling gives another solution.
[Objective function]
\arg\max_{a_j \in \mathbb R^m} \ a_j^T Var(\bar X)a_j
[Constraints] Consider only those with a norm of 1.
a_j^Ta_j = 1
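A quick numerical illustration (still with the hypothetical data above) of why the norm constraint is needed: scaling the coefficient vector scales the objective, so without $a_j^Ta_j = 1$ a larger objective value can always be found.

```python
# Without the norm constraint the objective a^T Var(X_bar) a is unbounded:
a1 = np.array([1.0, 1.0, 0.0])
a2 = 2 * a1                   # same direction, twice the length

print(a1 @ Sigma @ a1)        # some value v
print(a2 @ Sigma @ a2)        # 4 * v, so a "better" solution always exists

# Restricting to unit norm removes this degree of freedom.
a1_unit = a1 / np.linalg.norm(a1)
print(a1_unit @ Sigma @ a1_unit)
```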
● How to solve a constrained optimization problem: use the method of Lagrange multipliers and search for the coefficient vector that maximizes the Lagrange function (the point where the derivative equals 0).
[Lagrange function] Objective function minus the multiplier λ times the constraint.
E(a_j) = a^T_jVar(\bar X)a_j - λ(a^T_ja_j - 1)
Once the Lagrange function is constructed, differentiate it and solve for the point where the derivative equals 0. The expression is in terms of $a$, and one solution is $\hat a$. This solution $\hat a$ coincides with the solution of the constrained optimization problem above.
● Differentiate the Lagrange function to find the optimum solution. ⇒ The eigenvalues and eigenvectors of the variance-covariance matrix of the original data are the solutions to the constrained optimization problem.
[Derivative of the Lagrange function] This is a vector derivative; review the calculation rules if needed.
\frac{\partial E(a_j)}{\partial a_j} = 2Var(\bar X)a_j - 2λa_j = 0
The solution is:
Var(\bar X)a_j = λa_j
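Continuing the sketch, the eigenvalue equation can be solved numerically; np.linalg.eigh is used here because the variance-covariance matrix is symmetric.

```python
# Solve Var(X_bar) a = lambda a; eigh handles symmetric matrices like Sigma.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)  # eigenvalues in ascending order

lam = eigenvalues[-1]         # largest eigenvalue
a1 = eigenvectors[:, -1]      # corresponding eigenvector (already unit norm)
print(np.allclose(Sigma @ a1, lam * a1))  # True: the eigenvector equation holds
```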
● The variance of the projected data matches the eigenvalue.
Var(s_1) = a^T_1Var(\bar X)a_1 = λ_1a^T_1a_1 = λ_1 (a^T_1a_1=1)
⇒ This has the same form as the eigenvalue problem from applied mathematics. $a$ is an eigenvector of the variance-covariance matrix of the original data, with eigenvalue λ, and it is the axis that maximizes the variance. This is the result of principal component analysis.
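The claim that the variance of the projection equals the eigenvalue can be checked with the values from the block above.

```python
# Check: the variance of the first principal component equals the largest eigenvalue.
s1 = X_centered @ a1
print(np.allclose((s1 @ s1) / n, lam))  # True: Var(s_1) = lambda_1
```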
"Calculate the covariance matrix" ⇒ "Solve the eigenvalue problem" ⇒ "A maximum of m eigenvalue and eigenvector pairs appear"
● We want to know how much information was lost as a result of the compression.
● The total variance of the 1st through m-th principal components matches the total variance of the original data. ⇒ The sum of all eigenvalues matches the total variance of the original space, so the total amount of information is the sum of all m variances (eigenvalues).
● When 2D data is represented with 2 principal components, the sum of the eigenvalues matches the total variance of the original data.
● The variance of the k-th principal component is the eigenvalue corresponding to that principal component.
Var_{(total)} = \sum_{i=1}^m \ λ_i
[Contribution rate] The ratio of the variance of the kth principal component to the total variance. (Ratio of the amount of information possessed by the kth principal component)
c_k = \frac{λ_k}{\sum_{i=1}^m \ λ_i}
(Variance of the k-th principal component / total variance of all principal components)
[Cumulative contribution rate] The proportion of the original information retained when the data is compressed to the 1st through k-th principal components.
r_k = \frac{\sum_{j=1}^kλ_j}{\sum_{i=1}^m \ λ_i}
(Variance of the 1st through k-th principal components / total variance of all principal components)
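Given the eigenvalues from the sketch above, the contribution rate and the cumulative contribution rate follow directly.

```python
# Contribution rate c_k and cumulative contribution rate r_k from the eigenvalues.
total_variance = eigenvalues.sum()

contribution_rate = eigenvalues / total_variance           # c_k for each component
cumulative_rate = np.cumsum(eigenvalues) / total_variance  # r_k for k = 1..m

print(contribution_rate)
print(cumulative_rate)  # last entry is 1.0: all m components keep all the information
```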