(1) Review the definitions of self-information and Shannon entropy. (2) Get an overview of KL divergence and cross entropy.
- When the base of the logarithm is 2, the unit is the bit.
- When the base of the logarithm is Napier's constant $e$, the unit is the nat.
I(x) = -\log(P(x)) = \log(W(x))
・ $x$ stands for the various possible events.
・ $W$ can be expressed as $W = \frac{1}{P}$, i.e. the number of equally likely outcomes.
(Example) The number of switches needed for a given amount of information is found by taking the $\log$:
- 1 of 2 pieces of information: 1 switch (= $\log_{2} 2$)
- 1 of 4 pieces of information: 2 switches (= $\log_{2} 4$)
- 1 of 8 pieces of information: 3 switches (= $\log_{2} 8$)
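As a quick check of the example above, here is a minimal Python sketch (not from the original post; the event counts simply mirror the list): it computes the self-information $-\log_{2} P(x)$ for 1-of-2, 1-of-4, and 1-of-8 equally likely outcomes, and also shows the same quantity in nats.

```python
import math

# Self-information of one outcome among n equally likely alternatives:
# I(x) = -log2(P(x)) = log2(n) "switches" (bits); with the natural log
# the same quantity is measured in nats.
for n_outcomes in (2, 4, 8):
    p = 1.0 / n_outcomes          # P(x) for equally likely outcomes
    bits = -math.log2(p)          # self-information in bits (base 2)
    nats = -math.log(p)           # self-information in nats (base e)
    print(f"1 of {n_outcomes}: {bits:.0f} bit(s) = {nats:.3f} nat(s)")
```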
・ Although it is also called differential entropy, no differentiation is actually involved.
・ It is the expected value of the self-information.
\begin{align}
H(x) &= E(I(x))\\
&= -E(\log(P(x)))\\
&= -\sum_{x} P(x)\log(P(x))
\end{align}
⇒ The sum over all outcomes of (probability × self-information), i.e. the expected value.
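A minimal sketch of that expectation in Python (the helper name `shannon_entropy` and the example probabilities are my own, not from the source): the entropy is computed as the probability-weighted sum of self-information.

```python
import math

def shannon_entropy(probs, base=2):
    """H(x) = -sum(P(x) * log P(x)): the expected self-information."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin carries exactly 1 bit of entropy; a biased coin carries less.
print(shannon_entropy([0.5, 0.5]))   # 1.0
print(shannon_entropy([0.9, 0.1]))   # ~0.469
```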
- Represents how different two probability distributions $P$ and $Q$ over the same event / random variable are.
⇒ Toss a coin. We assume the probability distribution is $\frac{1}{2}$ for each side, but it later turns out that the coin is rigged and its probabilities are different. KL divergence measures how different the actual distribution is from the assumed one.
・ Also called relative entropy.
\begin{align}
D_{KL}(P\|Q) &= E_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]\\
&= E_{x \sim P}\left[\log P(x) - \log Q(x)\right]
\end{align}
・ D: Divergence, KL: Kullback-Leibler
・ Derivation: take the difference between the self-information under $Q$ and under $P$, then average it over $P$ using the definition of the expected value.
\begin{align}
I(Q(x)) - I(P(x)) &= (-\log(Q(x))) - (-\log(P(x)))\\
&= \log\frac{P(x)}{Q(x)}
\end{align}
E(f(x))= \sum_{x}P(x)f(x)
\begin{align}
D_{KL}(P\|Q) &= \sum_{x}P(x)\bigl((-\log(Q(x))) - (-\log(P(x)))\bigr)\\
&= \sum_{x}P(x)\log\frac{P(x)}{Q(x)}
\end{align}
⇒ Doesn't the form look similar to the Shannon entropy?
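To tie this back to the coin example, a small sketch of the summation form in Python (the function name and the 0.8 / 0.2 split for the rigged coin are illustrative assumptions, not from the source):

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

# We assumed a fair coin P, but the rigged coin actually follows Q.
p_fair   = [0.5, 0.5]
q_rigged = [0.8, 0.2]
print(kl_divergence(p_fair, q_rigged))   # ~0.322 bits: the distributions differ
print(kl_divergence(p_fair, p_fair))     # 0.0: identical distributions
```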
・ Cross entropy takes out one part of the KL divergence.
・ It is the self-information of $Q$ averaged over the distribution $P$.
・ It also appears in data compression, e.g. when a coding table is prepared in advance.
・ It is used to define loss functions in machine learning and optimization (logistic regression models, etc.).
\begin{align}
H(P,Q) &= H(P) + D_{KL}(P\|Q)\\
H(P,Q) &= -E_{x \sim P}\,\log Q(x)\\
&= -\sum_{x}P(x)\log Q(x)
\end{align}
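A minimal numerical check of the decomposition above (the helper names and the example distributions are assumptions for illustration): the cross entropy of $P$ and $Q$ should equal $H(P) + D_{KL}(P\|Q)$.

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(P)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def cross_entropy(p, q, base=2):
    """H(P, Q) = -sum_x P(x) * log Q(x)."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # "true" distribution (fair coin)
q = [0.8, 0.2]   # assumed / model distribution (illustrative values)

# H(P, Q) = H(P) + D_KL(P || Q)
print(cross_entropy(p, q))               # ~1.322
print(entropy(p) + kl_divergence(p, q))  # same value
```

Since $H(P)$ does not depend on $Q$, minimizing the cross entropy with respect to the model $Q$ is equivalent to minimizing the KL divergence, which is why it serves as a loss function in machine learning.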