I had only been talking about Bayesian updating in vague terms, so I decided to study it properly and summarize it here. This is yet another rehash of a well-covered topic. I mainly referred to the following:
- Bayesian statistics (range) in 5 minutes (https://www.slideshare.net/matsukenbook/15-59154892)
- Bayesian estimation of probability distribution (http://arduinopid.web.fc2.com/P19.html)
- Bayesian estimation of the winning probability of Cocoichi's spoon (https://rmizutaa.hatenablog.com/entry/2019/02/15/200700)
To summarize Bayesian estimation roughly: instead of point-estimating the parameters of the likelihood (= model) by maximum likelihood estimation, we estimate the distribution of the parameters using Bayes' theorem. Bayes' theorem is expressed by the following equation.
p(w|x) = \frac{p(x|w)p(w)}{p(x)}
Here $p(w|x)$ is the posterior distribution, $p(x|w)$ the likelihood, $p(w)$ the prior distribution, and $p(x)$ the normalizing constant. Bayesian updating is a mechanism that uses Bayes' theorem to update the posterior distribution sequentially. The following derivation is based on here. First, we transform the formula of Bayes' theorem.
\begin{align}
&p(A, B, C) = p(A|B, C)p(B, C) = p(B, C|A)p(A) \\
\Rightarrow& p(A|B, C) = \frac{p(B, C|A)p(A)}{p(B, C)}
\end{align}
Everything up to this point holds in general.
Next, suppose B and C are independent and conditionally independent given A. Then we get the following equation:
\begin{align}
p(A|B, C) &= \frac{p(B, C|A)p(A)}{p(B, C)} \\
&= \frac{p(B|A)p(C|A)p(A)}{p(B)p(C)} \\
&= \frac{p(C|A)}{p(C)}\frac{p(B|A)p(A)}{p(B)} \\
&= \frac{p(C|A)}{p(C)}p(A|B)
\end{align}
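The identity can also be checked numerically. Below is a minimal sketch using a small discrete distribution whose probability tables I made up for illustration (none of these numbers come from the article or its references); they are chosen so that B and C are both marginally independent and conditionally independent given A.

```python
import numpy as np

# Made-up example: A takes 3 values uniformly; B and C are binary.
# Values picked by hand for illustration, not from the article.
p_A = np.array([1/3, 1/3, 1/3])
p_B1_given_A = np.array([0.2, 0.5, 0.8])  # p(B=1 | A=a)
p_C1_given_A = np.array([0.7, 0.1, 0.7])  # p(C=1 | A=a)

# Joint p(a, b, c) = p(a) p(b|a) p(c|a), so B and C are conditionally independent given A
p_B_given_A = np.stack([1 - p_B1_given_A, p_B1_given_A], axis=1)  # shape (3, 2)
p_C_given_A = np.stack([1 - p_C1_given_A, p_C1_given_A], axis=1)  # shape (3, 2)
joint = p_A[:, None, None] * p_B_given_A[:, :, None] * p_C_given_A[:, None, :]

# Marginals
p_B = joint.sum(axis=(0, 2))
p_C = joint.sum(axis=(0, 1))
p_BC = joint.sum(axis=0)

# These particular numbers also make B and C marginally independent
assert np.allclose(p_BC, np.outer(p_B, p_C))

# Left-hand side: p(A | B=1, C=1)
lhs = joint[:, 1, 1] / joint[:, 1, 1].sum()

# Right-hand side: p(C=1 | A) / p(C=1) * p(A | B=1)
p_A_given_B1 = joint[:, 1, :].sum(axis=1) / p_B[1]
rhs = p_C1_given_A / p_C[1] * p_A_given_B1

assert np.allclose(lhs, rhs)  # the two sides agree
```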
Now substitute $w$ for A, the first observed data point $x_1$ for B, and the second observed data point $x_2$ for C. Saying that "$x_1$ and $x_2$ are independent and conditionally independent given $w$" means that the observed data are generated independently from the distribution, and that the likelihood (= model) is also built so that $x_1$ and $x_2$ are independent. Note that this property does not hold for data with time-series correlation. Under these assumptions, the following equation holds.
\begin{align}
p(w|x_1, x_2) = \frac{p(x_2|w)}{p(x_2)}p(w|x_1)
\end{align}
In other words, the posterior distribution of $w$ given the data $x_1$ and $x_2$ is just Bayes' theorem applied to $x_2$ and $w$, with $p(w|x_1)$ playing the role of the prior distribution. Using this relationship, we can keep updating the distribution: give an appropriate prior distribution → compute the posterior distribution once data arrive → treat that posterior as the new prior → compute the posterior when the next data arrive → and so on. This is how Bayesian updating works.
Let's work through a Bayesian update calculation, referring to here.
Consider a biased coin, and let's estimate the probability that it comes up heads in a Bayesian way. Assume that the outcome of each toss is obtained independently. First, consider the likelihood, that is, the model of a coin toss. Since heads/tails is a binary variable, it follows a Bernoulli distribution, so the likelihood is a Bernoulli distribution. Let $x = 1$ correspond to heads and $x = 0$ to tails. $w$ is a real number between 0 and 1 and is exactly the probability of heads. Therefore, the distribution of $w$ obtained by Bayesian updating is itself the distribution of the probability of heads.
The Bernoulli distribution can be written as:
\begin{align}
p(x|w) = w^x(1-w)^{1-x}
\end{align}
Let's take the prior distribution to be the non-informative prior $p(w) = 1$. $p(x)$ is unknown, but as a function of $w$ it is just a normalizing constant, so it can be determined automatically from the condition that $p(w|x)$ integrates to 1.
Now, let's toss the coin. Suppose the first toss comes up heads ($x_1 = 1$).
\begin{align}
p(w|x_1=1) \propto p(x_1=1|w)p(w) = w
\end{align}
Integrating this from 0 to 1 gives 1/2, so after normalization we get the following.
\begin{align}
p(w|x_1=1) = 2w
\end{align}
This gives the second prior distribution $p(w) = p(w|x_1=1) = 2w$. Suppose the second toss also comes up heads ($x_2 = 1$).
\begin{align}
p(w|x_1=1, x_2=1) \propto p(x_2=1|w)p(w) = 2w^2
\end{align}
Integrating this from 0 to 1 gives 2/3, so after normalization we get the following.
\begin{align}
p(w|x_1=1, x_2=1) = 3w^2
\end{align}
This gives the third prior distribution $p(w) = p(w|x_1=1, x_2=1) = 3w^2$. Suppose the third toss comes up tails ($x_3 = 0$).
\begin{align}
p(w|x_1=1, x_2=1, x_3=0) \propto p(x_3=0|w)p(w) = 3(1-w)w^2
\end{align}
Integrating this from 0 to 1 gives 1/4, so after normalization we get the following.
\begin{align}
p(w|x_1=1, x_2=1, x_3=0) = 12(1-w)w^2
\end{align}
In this way, we can find the distribution of the probability $w$ that the coin comes up heads.
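As a sanity check, these hand calculations agree with the closed form for this model: starting from the uniform prior, observing $n_1$ heads and $n_0$ tails gives a Beta$(n_1+1, n_0+1)$ posterior,

\begin{align}
p(w|x_1, \dots, x_n) = \frac{w^{n_1}(1-w)^{n_0}}{B(n_1+1, n_0+1)},
\end{align}

and for $n_1 = 2$, $n_0 = 1$ we have $B(3, 2) = 1/12$, which reproduces the $12(1-w)w^2$ obtained above.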
Let's implement the example seen above in Python. For the implementation, I referred to here.
```python
import sympy
import numpy as np
import matplotlib.pyplot as plt

# Sequence of coin tosses (1 = heads, 0 = tails)
xs = [1, 1, 0]  # heads, heads, tails

# The prior is the non-informative (uniform) prior p(w) = 1
prior_prob = 1

# Symbol to integrate over
w = sympy.Symbol('w')

# Initialize
# Needed when re-running the cell repeatedly in a Jupyter notebook
posterior_prob = None

# Sequential computation of the posterior distribution
for x in xs:
    # Unnormalized posterior: likelihood times prior
    if x == 1:
        posterior_prob = w * prior_prob
    else:
        posterior_prob = (1 - w) * prior_prob
    # Normalization
    Z = sympy.integrate(posterior_prob, (w, 0, 1))
    posterior_prob = posterior_prob / Z
    # The posterior becomes the next prior
    prior_prob = posterior_prob

# Plot the final posterior distribution of w
plt.figure(figsize=(5, 4))
X = np.linspace(0, 1, 100)
plt.plot(X, [posterior_prob.subs(w, i) for i in X])
plt.xlabel("w")
plt.show()
```
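As a small check that is not in the referenced implementation, you can print the difference between the final symbolic posterior and the hand-calculated $12(1-w)w^2$; for the sequence heads, heads, tails it should be zero.

```python
# Compare the computed posterior with the hand calculation p(w | 1, 1, 0) = 12 w^2 (1 - w)
print(sympy.simplify(posterior_prob - 12 * w**2 * (1 - w)))  # expect 0
```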
Let's try the estimation with a bit more data. Using a coin whose probability of coming up heads is 0.35, we increase the number of trials to 30 and estimate again. As the data increase, the estimate becomes more accurate and the distribution becomes sharper.
```python
import sympy
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # fix the random seed for reproducibility

def bernoulli_sampler(w, n):
    """w is the probability of getting 1; n is the number of samples to generate."""
    xs = np.random.rand(n)
    xs = xs < w
    return xs.astype("int")

# Sequence of coin tosses (1 = heads, 0 = tails)
xs = bernoulli_sampler(0.35, 30)

# The prior is the non-informative (uniform) prior p(w) = 1
prior_prob = 1

# Symbol to integrate over
w = sympy.Symbol('w')

# Initialize
# Needed when re-running the cell repeatedly in a Jupyter notebook
posterior_prob = None

# Sequential computation of the posterior distribution
for x in xs:
    # Unnormalized posterior: likelihood times prior
    if x == 1:
        posterior_prob = w * prior_prob
    else:
        posterior_prob = (1 - w) * prior_prob
    # Normalization
    Z = sympy.integrate(posterior_prob, (w, 0, 1))
    posterior_prob = posterior_prob / Z
    # The posterior becomes the next prior
    prior_prob = posterior_prob

# Plot the final posterior distribution of w
plt.figure(figsize=(5, 4))
X = np.linspace(0, 1, 100)
plt.plot(X, [posterior_prob.subs(w, i) for i in X])
plt.xlabel("w")
plt.show()
```
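To see how close the estimate gets to the true value 0.35, you can also compute the posterior mean symbolically; this check is my addition and is not part of the referenced code. With the uniform prior, the posterior mean equals (number of heads + 1) / (n + 2).

```python
# Posterior mean of w under the uniform prior: (heads + 1) / (n + 2)
posterior_mean = sympy.integrate(w * posterior_prob, (w, 0, 1))
print(float(posterior_mean))           # should land near the true value 0.35
print((xs.sum() + 1) / (len(xs) + 2))  # same value computed directly from the counts
```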
The key point is that B and C are independent and conditionally independent given A.