A 3-month deep learning course that builds skills transferable to the field
● Perform "dimensionality reduction" in tasks other than prediction and classification.
● Compress the structure of multivariate data into a smaller number of indicators. In short, compress high-dimensional objects to low dimensions while losing as little information as possible. ⇒ Turn 1000 dimensions into 10 dimensions (see the sketch below). ⇒ Trends that are hard to see in high-dimensional data become visible.
● Analysis and visualization (up to 2D or 3D) become possible with a small number of variables. Humans cannot grasp data in four or more dimensions.
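As a concrete illustration of the compression above, here is a minimal sketch (assuming NumPy and scikit-learn are available; the 1000-dimensional data is random and purely hypothetical) that reduces 1000 dimensions to 10 with PCA, the method derived in the rest of these notes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 500 samples, each with 1000 features (dimensions).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))

# Compress 1000 dimensions down to 10 principal components.
pca = PCA(n_components=10)
Z = pca.fit_transform(X)

print(Z.shape)                        # (500, 10)
print(pca.explained_variance_ratio_)  # fraction of information each axis keeps
```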
[Training data] The i-th data point consists of its first element, second element, ... up to the m-th element (m dimensions).
x_i=(x_{i1},x_{i2},\dots,x_{im}) \in \mathbb R^m
[Average (vector)]
\bar x = \frac{1}{n}\sum^{n}_{i=1} x_i
[Data matrix] Center the data so that it is distributed around the origin.
\bar X=(x_1 - \bar x,\dots,x_n - \bar x)^T \in \mathbb R^{n\times m}
[Variance-covariance matrix] It can be calculated once $\bar X$ is obtained.
\Sigma = Var(\bar X) = \frac{1}{n}\bar X^T \bar X
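A minimal NumPy sketch of the quantities defined so far, using a small hypothetical data set; it computes the mean vector, the centered data matrix $\bar X$, and the variance-covariance matrix $\frac{1}{n}\bar X^T\bar X$.

```python
import numpy as np

# Hypothetical training data: n = 5 samples, m = 3 dimensions.
X = np.array([[2.0, 0.0, 1.0],
              [1.5, 1.0, 0.5],
              [3.0, 0.5, 2.0],
              [2.5, 1.5, 1.0],
              [1.0, 2.0, 0.0]])
n, m = X.shape

x_bar = X.mean(axis=0)                 # mean vector, shape (m,)
X_centered = X - x_bar                 # centered data matrix X_bar, shape (n, m)
Sigma = X_centered.T @ X_centered / n  # variance-covariance matrix (1/n) X_bar^T X_bar

# np.cov with bias=True uses the same 1/n normalization.
print(np.allclose(Sigma, np.cov(X.T, bias=True)))  # True
```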
[Vector after linear transformation] j is the index of the projection axis: if 3D data is compressed to 2D, j runs over 1, 2; if 3D data is compressed to 1D, j = 1.
s_j = (s_{1j},\dots,s_{nj})^T = \bar X a_j, \quad a_j \in \mathbb R^m
● If the coefficient vector changes, the **values after the linear transformation** change.
● Treat the amount of information as the **magnitude of the variance**.
● Search for the projection axis that maximizes the variance of the variables after the linear transformation.
That is, linearly combine the centered data $\bar X$ in the original space with the coefficient vector $a_j$. If $a_j$ changes, the projection line changes as well.
[Variance after linear transformation] We want to maximize the variance of $s_j$.
Var(s_j) = \frac{1}{n}s_j^Ts_j = \frac{1}{n}(\bar Xa_j)^T(\bar Xa_j)=\frac{1}{n}a_j^T\bar X^T\bar Xa_j = a_j^TVar(\bar X)a_j
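Continuing the same sketch, project the centered data onto an arbitrary coefficient vector and check that the two expressions for the variance above agree.

```python
# Project the centered data onto one coefficient vector a_j (an arbitrary example).
a = np.array([1.0, 0.0, 0.0])

s = X_centered @ a           # s_j = X_bar a_j, shape (n,)
var_s = (s @ s) / n          # Var(s_j) = (1/n) s_j^T s_j
var_s_alt = a @ Sigma @ a    # equivalent form a_j^T Var(X_bar) a_j

print(np.allclose(var_s, var_s_alt))  # True
```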
● Add the constraint that the norm is 1. ⇒ Otherwise there are infinitely many solutions: for example, the vectors (1,1) and (2,2) point in the same direction, so any rescaling gives another solution.
[Objective function]
\arg\max_{a_j \in \mathbb R^m} \ a_j^T Var(\bar X)a_j
[Constraints] Consider only those with a norm of 1.
a_j^Ta_j = 1
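A quick numerical illustration (still with the hypothetical data above) of why the norm constraint is needed: scaling the coefficient vector scales the objective, so without $a_j^Ta_j = 1$ a larger objective value can always be found.

```python
# Without the norm constraint the objective a^T Var(X_bar) a is unbounded:
a1 = np.array([1.0, 1.0, 0.0])
a2 = 2 * a1                   # same direction, twice the length

print(a1 @ Sigma @ a1)        # some value v
print(a2 @ Sigma @ a2)        # 4 * v, so a "better" solution always exists

# Restricting to unit norm removes this degree of freedom.
a1_unit = a1 / np.linalg.norm(a1)
print(a1_unit @ Sigma @ a1_unit)
```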
● How to solve a constrained optimization problem: use the method of Lagrange multipliers and search for the coefficient vector that maximizes the Lagrange function (the point where the derivative equals 0).
[Lagrange function] Objective function minus the multiplier λ times the constraint.
E(a_j) = a^T_jVar(\bar X)a_j - λ(a^T_ja_j - 1)
Once the Lagrange function is constructed, differentiate it and solve for the point where the derivative equals 0. The expression is in terms of $a$, and one solution is $\hat a$. This solution $\hat a$ coincides with the solution of the constrained optimization problem above.
● Differentiate the Lagrange function to find the optimum solution. ⇒ The eigenvalues and eigenvectors of the variance-covariance matrix of the original data are the solutions to the constrained optimization problem.
[Derivative of the Lagrange function] This is a vector derivative; review the calculation rules if needed.
\frac{\partial E(a_j)}{\partial a_j} = 2Var(\bar X)a_j - 2λa_j = 0
The solution is:
Var(\bar X)a_j = λa_j
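Continuing the sketch, the eigenvalue equation can be solved numerically; np.linalg.eigh is used here because the variance-covariance matrix is symmetric.

```python
# Solve Var(X_bar) a = lambda a; eigh handles symmetric matrices like Sigma.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)  # eigenvalues in ascending order

lam = eigenvalues[-1]         # largest eigenvalue
a1 = eigenvectors[:, -1]      # corresponding eigenvector (already unit norm)
print(np.allclose(Sigma @ a1, lam * a1))  # True: the eigenvector equation holds
```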
● The variance of the projected data matches the eigenvalue.
Var(s_1) = a^T_1Var(\bar X)a_1 = λ_1a^T_1a_1 = λ_1 (a^T_1a_1=1)
⇒ This has the same form as the eigenvalue problem from applied mathematics. $a$ is an eigenvector of the variance-covariance matrix of the original data, with eigenvalue λ, and it is the axis that maximizes the variance. This is the result of principal component analysis.
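The claim that the variance of the projection equals the eigenvalue can be checked with the values from the block above.

```python
# Check: the variance of the first principal component equals the largest eigenvalue.
s1 = X_centered @ a1
print(np.allclose((s1 @ s1) / n, lam))  # True: Var(s_1) = lambda_1
```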
"Calculate the covariance matrix" ⇒ "Solve the eigenvalue problem" ⇒ "A maximum of m eigenvalue and eigenvector pairs appear"
● We want to know how much information was lost as a result of the compression.
● The total variance of the 1st through m-th principal components matches the total variance of the original data. ⇒ The sum of all eigenvalues matches the total variance of the original space, so the total amount of information is the sum of all m variances (eigenvalues).
● When 2D data is represented with 2 principal components, the sum of the eigenvalues matches the total variance of the original data.
● The variance of the k-th principal component is the eigenvalue corresponding to that principal component.
Var_{(total)} = \sum_{i=1}^m \ λ_i
[Contribution rate] The ratio of the variance of the kth principal component to the total variance. (Ratio of the amount of information possessed by the kth principal component)
c_k = \frac{λ_k}{\sum_{i=1}^m \ λ_i}
(Variance of the k-th principal component / total variance of all principal components)
[Cumulative contribution rate] The proportion of the original information retained when the data is compressed to the 1st through k-th principal components.
r_k = \frac{\sum_{j=1}^kλ_j}{\sum_{i=1}^m \ λ_i}
(Variance of the 1st through k-th principal components / total variance of all principal components)
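Given the eigenvalues from the sketch above, the contribution rate and the cumulative contribution rate follow directly.

```python
# Contribution rate c_k and cumulative contribution rate r_k from the eigenvalues.
total_variance = eigenvalues.sum()

contribution_rate = eigenvalues / total_variance           # c_k for each component
cumulative_rate = np.cumsum(eigenvalues) / total_variance  # r_k for k = 1..m

print(contribution_rate)
print(cumulative_rate)  # last entry is 1.0: all m components keep all the information
```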