[PYTHON] DeepRunning ~ Level3.3 ~

Level3. Applied mathematics ③

3-3. Information theory

・ Learning goals

(1) Confirm the definition of self-information content and Shannon entropy. (2) Get an overview of KL divergence and cross entropy.

3-3-1. Amount of self-information

・ When the base of the logarithm is 2, the unit is the bit.
・ When the base of the logarithm is Napier's number $e$, the unit is the nat.

I(x) = -\log(P(x)) = \log(W(x))

・ $x$ ranges over the possible events.
・ Self-information can also be written with $W = \frac{1}{P}$: taking the log of the reciprocal flips the sign (attach a minus), so $-\log(P(x)) = \log(W(x))$.
・ When the base of the log is omitted, the base is $e$ (convert the base as needed).

(Example) The number of switches needed for a given amount of information is found by taking $\log$.
・ Distinguishing 1–2 messages: 1 switch ($= \log_{2} 2$)
・ Distinguishing 1–4 messages: 2 switches ($= \log_{2} 4$)
・ Distinguishing 1–8 messages: 3 switches ($= \log_{2} 8$)
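The switch example can be checked numerically. Below is a minimal Python sketch (not part of the original notes; the function name `self_information` is my own) that evaluates $I(x) = -\log P(x)$ in bits or nats.

```python
import numpy as np

# Self-information I(x) = -log(P(x)), in bits (base 2) or nats (base e).
def self_information(p, base=2):
    log = np.log2 if base == 2 else np.log
    return -log(p)

# Distinguishing 2, 4, or 8 equally likely messages needs
# log2(2)=1, log2(4)=2, log2(8)=3 bits ("switches").
for n in (2, 4, 8):
    print(n, self_information(1 / n, base=2))   # -> 1.0, 2.0, 3.0
```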

3-3-2. Shannon entropy

・ It is sometimes also called differential entropy, but no differentiation is involved.
・ It is the expected value of the self-information.

\begin{align}
H(x) &= E(I(x))\\
&= -E(\log(P(x)))\\
&= -\sum_{x} P(x)\log(P(x))
\end{align}

⇒ The sum of (probability × random variable), i.e. an expected value; here the random variable is the self-information $I(x)$.
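As a numeric check, here is a minimal sketch (my own illustration, with an illustrative function name `entropy`) of $H = -\sum_x P(x)\log P(x)$, the expected self-information.

```python
import numpy as np

# Shannon entropy H = -sum(P(x) * log P(x)) = expected self-information.
def entropy(p, base=2):
    p = np.asarray(p, dtype=float)
    log = np.log2 if base == 2 else np.log
    p = p[p > 0]                 # treat 0 * log(0) as 0
    return -np.sum(p * log(p))

print(entropy([0.5, 0.5]))       # fair coin   -> 1.0 bit
print(entropy([0.9, 0.1]))       # biased coin -> about 0.469 bits
```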

3-3-3. Kullback-Leibler Divergence

・ Measures the difference between two probability distributions $P$ and $Q$ over the same events / random variable.
⇒ Example: toss a coin. We assume the distribution is $\frac{1}{2}$ for each side, but later find that a loaded ("ikasama") coin has different probabilities. KL divergence tells us how different the two distributions are.
・ Also called relative entropy.

\begin{align}
D_{KL}(P\|Q) &= E_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]\\
&= E_{x \sim P}\left[\log P(x) - \log Q(x)\right]
\end{align}

・ D stands for Divergence, KL for Kullback-Leibler.
・ $(P\|Q)$ means we want to see the difference between the two distributions.

\begin{align}
I(Q(x)) - I(P(x)) &= (-\log(Q(x))) - (-\log(P(x)))\\
&= \log\frac{P(x)}{Q(x)}
\end{align}
E(f(x)) = \sum_{x}P(x)f(x)
\begin{align}
D_{KL}(P\|Q) &= \sum_{x}P(x)\left((-\log(Q(x))) - (-\log(P(x)))\right)\\
&= \sum_{x}P(x)\log\frac{P(x)}{Q(x)}
\end{align}

⇒ The form is similar to Shannon entropy. Note that $D_{KL}(P\|Q)$ and $D_{KL}(Q\|P)$ are generally different values, so KL divergence is not symmetric and cannot be treated as a distance between the two distributions.
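The coin example above can be worked through numerically. The sketch below (my own illustration; `kl_divergence` and the example distributions are illustrative) computes $D_{KL}(P\|Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)}$ and shows the asymmetry.

```python
import numpy as np

# KL divergence D_KL(P||Q) = sum(P(x) * log(P(x)/Q(x))), in nats.
# Assumes Q(x) > 0 wherever P(x) > 0.
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

fair = [0.5, 0.5]      # the distribution we assumed
loaded = [0.8, 0.2]    # the loaded ("ikasama") coin we actually observed

print(kl_divergence(loaded, fair))   # D_KL(P||Q) -> about 0.193
print(kl_divergence(fair, loaded))   # D_KL(Q||P) -> about 0.223 (a different value)
```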

3-3-4. Cross entropy

・ It takes out part of the KL divergence.
・ It is the self-information of $Q$ averaged under the distribution $P$.
・ Related to data compression, e.g. preparing an encoding table in advance.
・ Used to define loss functions in machine learning and optimization (e.g. the logistic regression model).

\begin{align}
H(P,Q) &= H(P) + D_{KL}(P\|Q)\\
H(P,Q) &= -E_{x \sim P}\,\log Q(x)\\
&= -\sum_{x}P(x)\log Q(x)
\end{align}

・ Like KL divergence, it compares probability distributions $P$ and $Q$ over the same events / random variable.
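A minimal sketch (my own illustration; the names and example distributions are not from the original notes) of $H(P,Q) = -\sum_x P(x)\log Q(x)$, with a numeric check of $H(P,Q) = H(P) + D_{KL}(P\|Q)$ in nats:

```python
import numpy as np

# Cross entropy H(P, Q) = -sum(P(x) * log Q(x)), using the natural log (nats).
def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = np.array([0.8, 0.2])              # true distribution P
q = np.array([0.5, 0.5])              # model distribution Q

h_p = -np.sum(p * np.log(p))          # Shannon entropy H(P)
d_kl = np.sum(p * np.log(p / q))      # KL divergence D_KL(P||Q)

print(cross_entropy(p, q))            # -> about 0.693
print(h_p + d_kl)                     # -> the same value: H(P) + D_KL(P||Q)
```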
