[PYTHON] [Rabbit Challenge (E qualification)] Applied Mathematics

Introduction

This is a learning record from when I took the Rabbit Challenge with the aim of passing the Japan Deep Learning Association (JDLA) E qualification exam, to be held on January 19th and 20th, 2021.

Rabbit Challenge is a course that uses teaching materials edited from recorded videos of the in-person course "Deep Learning Course That Can Be Applied in the Field". There is no support for questions, but it is an inexpensive course (the lowest price as of June 2020) for preparing for the E qualification exam.

Please see the link below for details.

Chapter 1: Linear Algebra

- Scalar
  - Generally, so-called ordinary numbers
  - Can be calculated with +, −, ×, ÷
  - Can be a coefficient for a vector

Identity matrix and inverse matrix

A matrix that, like the number 1, leaves its partner unchanged when multiplied is called an identity matrix.

$ I = \begin{pmatrix} 1 & & \\ & 1 & \\ & & \ddots \end{pmatrix} $

A matrix that acts like a reciprocal is called an inverse matrix:

$ AA^{-1} = A^{-1}A = I $
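As a quick check in NumPy (a minimal sketch; the 2×2 matrix below is an arbitrary example, not from the course materials):

```python
import numpy as np

# An arbitrary invertible 2x2 matrix for illustration
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

I = np.eye(2)                  # identity matrix
A_inv = np.linalg.inv(A)       # inverse matrix

# Multiplying by the identity leaves A unchanged,
# and A times its inverse gives the identity (up to rounding error).
print(np.allclose(A @ I, A))        # True
print(np.allclose(A @ A_inv, I))    # True
print(np.allclose(A_inv @ A, I))    # True
```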

Features of determinant

When you think of a 2×2 matrix as two row vectors stacked on top of each other, $ \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} \vec{v_1} \\ \vec{v_2} \end{pmatrix} $, the area of the parallelogram formed by these vectors determines whether an inverse matrix exists (the inverse exists only when the area is nonzero). This area is written as $ \begin{vmatrix} a & b \\ c & d \end{vmatrix} = \begin{vmatrix} \vec{v_1} \\ \vec{v_2} \end{vmatrix} $ and is called the determinant. When $ \vec{v_1} = (a, b, c), \vec{v_2} = (d, e, f), \vec{v_3} = (g, h, i) $,

$ \begin{vmatrix} \vec{v_1} \\ \vec{v_2} \\ \vec{v_3} \end{vmatrix} = \begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = \begin{vmatrix} a & b & c \\ 0 & e & f \\ 0 & h & i \end{vmatrix} + \begin{vmatrix} 0 & b & c \\ d & e & f \\ 0 & h & i \end{vmatrix} + \begin{vmatrix} 0 & b & c \\ 0 & e & f \\ g & h & i \end{vmatrix} = a \begin{vmatrix} e & f \\ h & i \end{vmatrix} - d \begin{vmatrix} b & c \\ h & i \end{vmatrix} + g \begin{vmatrix} b & c \\ e & f \end{vmatrix} $
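The same expansion can be checked numerically with NumPy (a minimal sketch; the matrices are arbitrary examples):

```python
import numpy as np

# 2x2 case: the determinant equals the (signed) area of the parallelogram
A2 = np.array([[3.0, 1.0],
               [1.0, 2.0]])
print(np.linalg.det(A2))       # 3*2 - 1*1 = 5.0

# 3x3 case: compare np.linalg.det with the cofactor expansion along the first column
A3 = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0],
               [7.0, 8.0, 10.0]])
a, b, c = A3[0]
d, e, f = A3[1]
g, h, i = A3[2]
cofactor = (a * np.linalg.det([[e, f], [h, i]])
            - d * np.linalg.det([[b, c], [h, i]])
            + g * np.linalg.det([[b, c], [e, f]]))
print(np.isclose(np.linalg.det(A3), cofactor))   # True
```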

A determinant made up of $n$ row vectors has the following properties.

- The determinant is zero if the same row vector appears twice: $ \begin{vmatrix} \vec{v_1} \\ \vdots \\ \vec{w} \\ \vdots \\ \vec{w} \\ \vdots \\ \vec{v_n} \end{vmatrix} = 0 $
- When one row vector is multiplied by $\lambda$, the determinant is multiplied by $\lambda$: $ \begin{vmatrix} \vec{v_1} \\ \vdots \\ \lambda\vec{v_i} \\ \vdots \\ \vec{v_n} \end{vmatrix} = \lambda \begin{vmatrix} \vec{v_1} \\ \vdots \\ \vec{v_i} \\ \vdots \\ \vec{v_n} \end{vmatrix} $
- If all other rows are the same and only the $i$-th row differs, the determinants add: $ \begin{vmatrix} \vec{v_1} \\ \vdots \\ \vec{v_i} + \vec{w} \\ \vdots \\ \vec{v_n} \end{vmatrix} = \begin{vmatrix} \vec{v_1} \\ \vdots \\ \vec{v_i} \\ \vdots \\ \vec{v_n} \end{vmatrix} + \begin{vmatrix} \vec{v_1} \\ \vdots \\ \vec{w} \\ \vdots \\ \vec{v_n} \end{vmatrix} $
- The sign changes when two rows are swapped: $ \begin{vmatrix} \vec{v_1} \\ \vdots \\ \vec{v_s} \\ \vdots \\ \vec{v_t} \\ \vdots \\ \vec{v_n} \end{vmatrix} = - \begin{vmatrix} \vec{v_1} \\ \vdots \\ \vec{v_t} \\ \vdots \\ \vec{v_s} \\ \vdots \\ \vec{v_n} \end{vmatrix} $
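These properties can also be verified numerically (a minimal sketch using an arbitrary random 3×3 matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))   # arbitrary example matrix
lam = 2.5

# Scaling one row scales the determinant by the same factor
A_scaled = A.copy()
A_scaled[1] *= lam
print(np.isclose(np.linalg.det(A_scaled), lam * np.linalg.det(A)))  # True

# Swapping two rows flips the sign of the determinant
A_swapped = A[[1, 0, 2]]
print(np.isclose(np.linalg.det(A_swapped), -np.linalg.det(A)))      # True

# A repeated row makes the determinant zero
A_repeat = A.copy()
A_repeat[2] = A_repeat[0]
print(np.isclose(np.linalg.det(A_repeat), 0.0))                     # True
```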

Eigenvalues and eigenvectors

For a matrix $A$, there may exist a special vector $\vec{x}$ and a coefficient $\lambda$ such that the following equation holds: $ A\vec{x} = \lambda\vec{x} $. The product of the matrix $A$ and this special vector $\vec{x}$ equals the product of the mere scalar $\lambda$ and the same vector $\vec{x}$. This special vector $\vec{x}$ and its coefficient $\lambda$ are called an eigenvector and an eigenvalue of the matrix $A$.
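In NumPy this can be checked with `np.linalg.eig` (a minimal sketch; the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])     # arbitrary example matrix

eigvals, eigvecs = np.linalg.eig(A)   # eigenvalues and column eigenvectors

# Check A x = lambda x for each eigenpair
for lam, x in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ x, lam * x))   # prints True for each pair
```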

Eigenvalue decomposition

Suppose a square matrix $A$ of real numbers has eigenvalues $\lambda_1, \lambda_2, \dots$ and eigenvectors $\vec{v_1}, \vec{v_2}, \dots$. Form the matrix with these eigenvalues arranged on the diagonal (all other components 0),

$ \Lambda = \begin{pmatrix} \lambda_1 & & \\ & \lambda_2 & \\ & & \ddots \end{pmatrix} $

and the matrix with the corresponding eigenvectors arranged side by side,

$ V = (\vec{v_1} \quad \vec{v_2} \quad \dots) $

Then they are related by

$ AV = V\Lambda $

and therefore $A$ can be rewritten as

$ A = V \Lambda V^{-1} $

Transforming a square matrix into the product of three matrices in this way is called eigenvalue decomposition. This decomposition has advantages such as making it easy to compute powers of the matrix.
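A quick check of the decomposition, including a matrix power computed through it (a minimal sketch; the matrix is an arbitrary diagonalizable example):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])          # arbitrary diagonalizable matrix

eigvals, V = np.linalg.eig(A)
Lam = np.diag(eigvals)              # eigenvalues on the diagonal
V_inv = np.linalg.inv(V)

# A = V Lam V^{-1}
print(np.allclose(A, V @ Lam @ V_inv))                  # True

# Matrix powers become easy: A^5 = V Lam^5 V^{-1}
print(np.allclose(np.linalg.matrix_power(A, 5),
                  V @ np.diag(eigvals ** 5) @ V_inv))   # True
```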

Singular value decomposition

For matrices other than square ones, a decomposition similar to eigenvalue decomposition is still possible. If there exist special unit vectors such that $ M \vec{v} = \sigma\vec{u} $ and $ M^\top \vec{u} = \sigma\vec{v} $, then $M$ admits a singular value decomposition: $ MV = US \qquad M^\top U = VS^\top $, and hence $ M = USV^{-1} \qquad M^\top = VS^\top U^{-1} $. Their product is $ MM^\top = USV^{-1}VS^\top U^{-1} = USS^\top U^{-1} $. In other words, the eigenvalue decomposition of $MM^\top$ yields the left singular vectors and the squares of the singular values.
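With NumPy's `np.linalg.svd` (a minimal sketch; the non-square matrix is an arbitrary example):

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # arbitrary 2x3 example

U, s, Vt = np.linalg.svd(M, full_matrices=False)
S = np.diag(s)

# M = U S V^T (for orthogonal U, V the inverse is the transpose)
print(np.allclose(M, U @ S @ Vt))                     # True

# Eigenvalues of M M^T are the squared singular values
eigvals = np.linalg.eigvalsh(M @ M.T)
print(np.allclose(np.sort(eigvals)[::-1], s ** 2))    # True
```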

Chapter 2: Probability / Statistics

- Random variable $x$: the value that is actually realized (realization) … e.g., rolling a die gives an integer from 1 to 6
- Probability distribution $P(x)$: how likely each realized value $x$ is … in the die example, $P(1) = \cdots = P(6) = \frac{1}{6}$
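A quick empirical check with NumPy (a minimal sketch; the number of simulated rolls is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # realized values of the die

# The empirical distribution approaches P(1) = ... = P(6) = 1/6
values, counts = np.unique(rolls, return_counts=True)
for v, c in zip(values, counts):
    print(v, c / len(rolls))   # each close to 0.1667
```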

Conditional probability

P(Y=y|X=x) = \frac{P(Y=y,X=x)}{P(X=x)}

The probability that $Y = y$ given that the event $X = x$ has occurred.

Joint probability of independent events

P(X=x,Y=y) = P(X=x)P(Y=y)=P(Y=y,X=x)

Bayes' theorem

P(x|y) = \frac{P(y|x)P(x)}{\sum_x P(y|x)P(x)}

It is obtained by substituting the addition theorem (law of total probability)

P(y) = \sum_x P(x,y) = \sum_x P(y|x)P(x)

into the conditional probability

P(x|y) = \frac{P(x,y)}{P(y)} = \frac{P(y|x)P(x)}{P(y)}
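A numerical sanity check of Bayes' theorem (a minimal sketch; the joint distribution below is an arbitrary example):

```python
import numpy as np

# Arbitrary joint distribution P(x, y) over x in {0, 1}, y in {0, 1}
P_xy = np.array([[0.1, 0.3],
                 [0.4, 0.2]])   # rows: x, columns: y

P_x = P_xy.sum(axis=1)               # marginal P(x)
P_y = P_xy.sum(axis=0)               # law of total probability: P(y) = sum_x P(x, y)
P_y_given_x = P_xy / P_x[:, None]    # conditional P(y | x)

# Bayes' theorem: P(x | y) = P(y | x) P(x) / sum_x P(y | x) P(x)
P_x_given_y_bayes = (P_y_given_x * P_x[:, None]) / P_y[None, :]
P_x_given_y_direct = P_xy / P_y[None, :]
print(np.allclose(P_x_given_y_bayes, P_x_given_y_direct))   # True
```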

Expected value

- Expected value: the average value of the random variable under the distribution, i.e. the value that is "likely" to occur: $ E(f) = \sum_{k=1}^n P(X=x_k)f(X=x_k) $; for continuous values, $ E(f) = \int P(X=x)f(X=x)dx $
- Variance: how scattered the data is: $ Var(f) = E\Bigl(\bigl(f_{(X=x)}-E_{(f)}\bigr)^2\Bigr) = E\bigl(f^2_{(X=x)}\bigr)-\bigl(E_{(f)}\bigr)^2 $
- Covariance: the difference in trends between two data series: $ Cov(f,g) = E\Bigl(\bigl(f_{(X=x)}-E(f)\bigr)\bigl(g_{(Y=y)}-E(g)\bigr)\Bigr) = E(fg)-E(f)E(g) $
- Standard deviation: the spread of the data (the variance squares the data, so its unit differs from the original data; taking the square root restores the unit): $ \sigma = \sqrt{Var(f)} = \sqrt{E\bigl((f_{(X=x)}-E_{(f)})^2\bigr)} $
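A quick check of these definitions for the fair die (a minimal sketch):

```python
import numpy as np

x = np.arange(1, 7)            # possible die values
p = np.full(6, 1 / 6)          # P(X = x) = 1/6

expectation = np.sum(p * x)                    # E(X) = 3.5
variance = np.sum(p * x**2) - expectation**2   # E(X^2) - E(X)^2 ≈ 2.9167
std = np.sqrt(variance)                        # ≈ 1.7078

print(expectation, variance, std)
```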

Various probability distributions

- Bernoulli distribution: the distribution of a trial with only two possible outcomes (coin-toss image): $ P(x|\mu) = \mu^x(1-\mu)^{1-x} $
- Multinoulli (categorical) distribution: the distribution of a trial with several possible outcomes (die-roll image): $ P(x|\mu) = \prod_{k=1}^K \mu_k^{x_k} $
- Binomial distribution: the multi-trial version of the Bernoulli distribution: $ P(x|\lambda,n) = \frac{n!}{x!(n-x)!}\lambda^x(1-\lambda)^{n-x} $
- Gaussian distribution: a bell-shaped continuous distribution: $ N(x;\mu,\sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}\exp\bigl(-\frac{1}{2\sigma^2}(x-\mu)^2\bigr) $
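A sketch of evaluating and sampling these distributions with NumPy; all parameter values are arbitrary examples:

```python
import numpy as np
from math import comb, sqrt, pi, exp

rng = np.random.default_rng(0)

# Bernoulli: P(x | mu) = mu^x (1 - mu)^(1 - x)  (coin toss)
mu = 0.3
print(mu**1 * (1 - mu)**0)                 # P(x = 1) = 0.3
print(rng.binomial(1, mu, size=10))        # ten simulated tosses

# Categorical (Multinoulli): die roll with probabilities mu_k
print(rng.choice(np.arange(1, 7), size=5, p=np.full(6, 1 / 6)))

# Binomial: n-trial version of the Bernoulli distribution
n, lam, x = 10, 0.3, 3
print(comb(n, x) * lam**x * (1 - lam)**(n - x))   # P(x = 3) ≈ 0.267
print(rng.binomial(n, lam, size=5))

# Gaussian: bell-shaped continuous distribution
m, s2 = 0.0, 1.0
print(sqrt(1 / (2 * pi * s2)) * exp(-(0.0 - m)**2 / (2 * s2)))   # density at x = 0 ≈ 0.399
print(rng.normal(m, sqrt(s2), size=3))
```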

Summary of the easily confused discrete distributions (table omitted)

Chapter 3: Information Theory

Amount of self-information

I(x) = -\log{P(x)}

Observing a frequently occurring event does not provide much information, while the rarer an event is, the greater the amount of information it carries. The reciprocal of the probability, $\frac{1}{P(x)}$, is therefore a candidate for the definition of the amount of information. However, the amount of information obtained by observing two independent events $x$ and $y$ should be the sum of the amounts of information each carries, not $\frac{1}{P(x)P(y)}$; since this additive definition is more natural, we take the logarithm. When the base of the logarithm is 2, the unit is the bit; when the base is the Napier number $e$, the unit is the nat.
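A minimal sketch computing self-information in bits and nats (the helper function `self_information` is my own, not from the course materials):

```python
import numpy as np

def self_information(p, base=2):
    """Self-information I(x) = -log P(x); base 2 gives bits, base e gives nats."""
    return -np.log(p) / np.log(base)

# A fair coin flip carries 1 bit of information
print(self_information(0.5))                 # 1.0 bit

# Two independent fair coin flips: the information adds up
print(self_information(0.5 * 0.5))           # 2.0 bits

# In nats (natural logarithm)
print(self_information(0.5, base=np.e))      # ≈ 0.693 nats
```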

Shannon entropy (average amount of information)

H(x) = E\bigl(I(x)\bigr) = -E\Bigl(\log\bigl(P(x)\bigr)\Bigr) = -\sum_x P(x)\log\bigl(P(x)\bigr)

The expected value of the self-information (the average of the information content over all observed values).
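A minimal sketch computing the Shannon entropy of a few distributions (the `entropy` helper is my own):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H = -sum_x P(x) log P(x)."""
    p = np.asarray(p)
    return -np.sum(p * np.log(p) / np.log(base))

print(entropy(np.full(6, 1 / 6)))   # fair die: log2(6) ≈ 2.585 bits
print(entropy([0.5, 0.5]))          # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))          # biased coin: ≈ 0.469 bits (less uncertainty)
```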

Kullback-Leibler divergence

D_{KL}(P||Q) = E_x \Bigl[\log\frac{P(x)}{Q(x)}\Bigr] = \sum_x P(x)\bigl(\log{P(x)}-\log{Q(x)}\bigr)

An index of how much the information differs when a new distribution $P$ is viewed from the reference distribution $Q$. In general, the KL divergence is a downward-convex function of $Q$ that reaches its minimum value, $D_{KL}(P||P)=0$, only when $P=Q$. So it does behave like a distance between distributions, but swapping $P$ and $Q$ gives a different value $D_{KL}(Q||P)$, so it is not a true mathematical distance.
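A minimal sketch computing the KL divergence and confirming its asymmetry (the distributions $P$ and $Q$ are arbitrary examples):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)), in nats."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * (np.log(p) - np.log(q)))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(P, Q))   # ≈ 0.184
print(kl_divergence(Q, P))   # ≈ 0.192 (a different value: not symmetric)
print(kl_divergence(P, P))   # 0.0 (minimum when P = Q)
```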

Cross entropy

H(P,Q) = -E_{X \sim P} \log{Q(x)} = -\sum_xP(x)\log{Q(x)}

An index of how far apart two probability distributions are, obtained by averaging the self-information of $Q$ over the distribution $P$.

The cross entropy of the probability distributions $P(x)$ and $Q(x)$ is the sum of the entropy of $P(x)$ and the KL divergence of $Q(x)$ as seen from $P(x)$:

$
\begin{align}
H(P,Q) &= -\sum_x P(x)\log{Q(x)} \\
&= -\sum_x P(x)\log{\frac{P(x)Q(x)}{P(x)}} \\
&= -\sum_x P(x)\bigl(\log{P(x)}+\log{Q(x)}-\log{P(x)}\bigr) \\
&= -\sum_x P(x)\log{P(x)} + \sum_x P(x)\bigl(\log{P(x)}-\log{Q(x)}\bigr) \\
&= H(P) + D_{KL}(P||Q)
\end{align}
$
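This relationship can be verified numerically (a minimal sketch; the distributions are the same arbitrary examples as above):

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

cross_entropy = -np.sum(P * np.log(Q))        # H(P, Q)
entropy_p = -np.sum(P * np.log(P))            # H(P)
kl_pq = np.sum(P * (np.log(P) - np.log(Q)))   # D_KL(P || Q)

# Cross entropy = entropy of P + KL divergence from P to Q
print(np.isclose(cross_entropy, entropy_p + kl_pq))    # True
```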
