Bayesian statistics has been getting a lot of attention lately, so I decided to study "Introduction to Bayesian Statistics". The book is easy to follow up to the middle, but because it relies on figures and avoids mathematical formulas as much as possible, I found it hard to follow once continuous distributions appeared. So, as a study exercise, I will use coin tossing as a simple example and look at the posterior distribution after each toss. I don't understand the MCMC method yet, so I will simply evaluate the probability density function on a discretized grid.
Hiroyuki Kojima "Introduction to Bayesian Statistics", Diamond, November 2015
An explanation that uses the same figures as the book above, again with coin tossing as the example: Kita Mathematics 108th Educational Practice Study Group, Saturday, January 26, 2019, "Bayesian Inference Thought"
The same subject, but more advanced content computed with the MCMC method using PyMC; I would like to be able to use PyMC as well: "Look at the posterior distribution of coins on the front and back of each trial (PyMC)"
This covers the first chapter of "Introduction to Bayesian Statistics"; it makes frequent use of figures and is easy to understand: "Bayesian inference for a clear understanding in 5 minutes"
The strength of Bayesian statistics is that the distribution can be updated one after another as new information arrives, and that subjective prior knowledge can be reflected in the prior [^1]. Bayesian inference is based on Bayes' theorem.
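Written out, with $\theta$ the probability that the coin lands on the front side and $x$ the observed data, the theorem reads

$$p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{p(x)}, \qquad p(x) = \int_0^1 p(x|\theta)\,p(\theta)\,d\theta .$$

The denominator $p(x)$ is only a normalizing constant; the update function defined below approximates it with a simple Riemann sum.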
The posterior distribution $p(\theta|x)$ obtained after incorporating new information can in turn be used as the prior distribution $p(\theta)$ for the next update. Because the order of multiplication does not change the answer, in the coin-tossing example the final posterior distribution is the same whether $x=$ {front, front, back, back} or $x=$ {front, back, front, back}. This is called the sequential rationality of Bayesian inference. Since the results of the data observed so far are reflected in the prior distribution, it can be regarded as a kind of learning function [^2].
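Concretely, for two observations $x_1$ and $x_2$,

$$p(\theta|x_1,x_2) \propto p(x_2|\theta)\,\bigl[p(x_1|\theta)\,p(\theta)\bigr] = p(x_1|\theta)\,\bigl[p(x_2|\theta)\,p(\theta)\bigr],$$

so updating with $x_1$ first and then $x_2$ gives the same posterior as the reverse order.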
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "Times New Roman" #Set the entire font
plt.rcParams["xtick.direction"] = "in" #Inward the x-axis scale line
plt.rcParams["ytick.direction"] = "in" #Inward the y-axis scale line
plt.rcParams["xtick.minor.visible"] = True #Addition of x-axis auxiliary scale
plt.rcParams["ytick.minor.visible"] = True #Addition of y-axis auxiliary scale
plt.rcParams["xtick.major.width"] = 1.5 #Line width of x-axis main scale line
plt.rcParams["ytick.major.width"] = 1.5 #Line width of y-axis main scale line
plt.rcParams["xtick.minor.width"] = 1.0 #Line width of x-axis auxiliary scale line
plt.rcParams["ytick.minor.width"] = 1.0 #Line width of y-axis auxiliary scale line
plt.rcParams["xtick.major.size"] = 10 #x-axis main scale line length
plt.rcParams["ytick.major.size"] = 10 #Length of y-axis main scale line
plt.rcParams["xtick.minor.size"] = 5 #x-axis auxiliary scale line length
plt.rcParams["ytick.minor.size"] = 5 #Length of y-axis auxiliary scale line
plt.rcParams["font.size"] = 14 #Font size
plt.rcParams["axes.linewidth"] = 1.5 #Enclosure thickness
If you don't know anything in advance, assume a uniform prior distribution for the time being. This is known as the principle of insufficient reason. Let's follow this principle first.
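In other words, the prior density is constant over the whole range of $\theta$:

$$p(\theta) = 1 \qquad (0 \le \theta \le 1).$$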
N = 1000 # number of grid points for the numerical calculation
p_x_front = np.linspace(0.0,1.0,N) # p(x=front|θ) = θ, evaluated on the grid
p_x_back = 1.0 - p_x_front # p(x=back|θ) = 1 - θ
prior_dist = np.ones(N) # prior distribution p(θ)
fig,axes = plt.subplots(figsize=(8,6))
axes.plot(p_x_front,prior_dist,label="Prior distribution")
axes.scatter(np.sum(prior_dist*p_x_front/N),0.5,s=100,label="expected value")
axes.legend(loc="best")
axes.set_xlabel(r"$\theta$")
axes.set_ylabel("Probability density")
axes.set_ylim(-0.1,2.5)
axes.set_xticks(np.linspace(0.0,1.0,11))
The vertical axis is the probability density, which integrates to 1. The round marker is the expected value; its position on the vertical axis is chosen only so that it is easy to see.
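As a quick sanity check (my own addition, not from the book), the discretized density does integrate to approximately 1, and its expected value is approximately 0.5:

print(np.sum(prior_dist) / N) # ≈ 1.0: the density integrates to 1
print(np.sum(prior_dist * p_x_front) / N) # ≈ 0.5: expected value of θ under the uniform prior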
def update(p_x_theta,p_theta):
    # Bayes' theorem: posterior ∝ likelihood × prior,
    # normalized so that the density integrates to 1 (Riemann sum with spacing 1/N)
    return (p_x_theta * p_theta) / np.sum(p_x_theta * p_theta / N)
fig,axes = plt.subplots(figsize=(8,6))
dist_front1_back0 = update(p_x_front,prior_dist)
axes.plot(p_x_front,dist_front1_back0,label="front=1, back=0")
axes.scatter(np.sum(dist_front1_back0*p_x_front/N),0.5,s=100)
dist_front1_back1 = update(p_x_back,dist_front1_back0)
axes.plot(p_x_front,dist_front1_back1,label="front=1, back=1")
axes.scatter(np.sum(dist_front1_back1*p_x_front/N),0.5,s=100)
dist_front2_back1 = update(p_x_front,dist_front1_back1)
axes.plot(p_x_front,dist_front2_back1,label="front=2, back=1")
axes.scatter(np.sum(dist_front2_back1*p_x_front/N),0.5,s=100)
dist_front2_back2 = update(p_x_back,dist_front2_back1)
axes.plot(p_x_front,dist_front2_back2,label="front=2, back=2")
axes.scatter(np.sum(dist_front2_back2*p_x_front/N),0.2,s=100)
axes.legend(loc="best")
axes.set_xlabel(r"$\theta$")
axes.set_ylabel("Probability density")
axes.set_ylim(-0.1,2.5)
axes.set_xticks(np.linspace(0.0,1.0,11))
The update function is Bayes' theorem itself. The posterior distribution changes each time we learn whether a toss came up front or back. Comparing the orange (front=1, back=1) and red (front=2, back=2) curves also shows that the more data we have, the more sharply the distribution concentrates, i.e. the more precise the estimate becomes.
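As a check (my own addition; it uses scipy, which the rest of this article does not need), we can update with the same data in a different order to confirm sequential rationality, and compare the numerical posterior with the analytic result that a uniform prior plus 2 fronts and 2 backs gives a Beta(3, 3) distribution:

from scipy import stats

# same data in a different order: front, front, back, back
dist_alt = update(p_x_back,update(p_x_back,update(p_x_front,update(p_x_front,np.ones(N)))))
print(np.allclose(dist_alt,dist_front2_back2)) # True: the order of the data does not matter

# with a uniform prior, 2 fronts and 2 backs give a Beta(3, 3) posterior analytically
print(np.max(np.abs(dist_front2_back2 - stats.beta(3,3).pdf(p_x_front)))) # small (grid discretization error only)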
Let's look at the posterior distribution after the coin has come up front 100 times and back 100 times.
def updates(front,back):
    # apply Bayes' theorem repeatedly, starting from the (global) prior distribution
    p_theta = prior_dist.copy()
    for i in range(front):
        p_theta = update(p_x_front,p_theta)
    for i in range(back):
        p_theta = update(p_x_back,p_theta)
    return p_theta
p_theta = updates(100,100)
fig,axes = plt.subplots(figsize=(8,6))
axes.plot(p_x_front,p_theta)
axes.scatter(np.sum(p_theta*p_x_front/N),0.5,s=100)
axes.set_xlabel(r"$\theta$")
axes.set_ylabel("Probability density")
axes.set_ylim(-0.1,12.0)
axes.set_xticks(np.linspace(0.0,1.0,11))
The variance has become much smaller, and the expected value is 0.5.
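To put a number on "much smaller" (again my own check), we can compute the standard deviation of the discretized posterior; analytically this posterior is a Beta(101, 101) distribution, whose standard deviation is about 0.035:

mean = np.sum(p_theta * p_x_front) / N
std = np.sqrt(np.sum(p_theta * (p_x_front - mean)**2) / N)
print(mean, std) # ≈ 0.5 and ≈ 0.035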
Bayesian inference can also reflect subjectivity. This time, let's assume the subjective belief that the coin is more likely to land on the front side, and use a beta distribution with $\alpha = 3, \beta = 2$ as the prior distribution.
a = 3.0
b = 2.0
x = np.linspace(0.0,1.0,N)
prior_dist = x**(a-1.0) * (1.0-x)**(b-1.0) # unnormalized Beta(a,b) density
prior_dist /= np.sum(prior_dist) / N # normalize so that the density integrates to 1
fig,axes = plt.subplots(figsize=(8,6))
axes.plot(p_x_front,prior_dist,label="Prior distribution")
axes.scatter(np.sum(prior_dist*p_x_front/N),0.5,s=100,label="expected value")
axes.legend(loc="best")
axes.set_xlabel(r"$\theta$")
axes.set_ylabel("Probability density")
axes.set_ylim(-0.1,2.5)
axes.set_xticks(np.linspace(0.0,1.0,11))
The expected value is around 0.6.
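This matches the analytic value: the mean of a Beta distribution is $\alpha/(\alpha+\beta)$, which here is $3/(3+2) = 0.6$ (up to the grid discretization).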
fig,axes = plt.subplots(figsize=(8,6))
dist_front1_back0 = update(p_x_front,prior_dist)
axes.plot(p_x_front,dist_front1_back0,label="front=1, back=0")
axes.scatter(np.sum(dist_front1_back0*p_x_front/N),0.5,s=100)
dist_front1_back1 = update(p_x_back,dist_front1_back0)
axes.plot(p_x_front,dist_front1_back1,label="front=1, back=1")
axes.scatter(np.sum(dist_front1_back1*p_x_front/N),0.5,s=100)
dist_front2_back1 = update(p_x_front,dist_front1_back1)
axes.plot(p_x_front,dist_front2_back1,label="front=2, back=1")
axes.scatter(np.sum(dist_front2_back1*p_x_front/N),0.5,s=100)
dist_front2_back2 = update(p_x_back,dist_front2_back1)
axes.plot(p_x_front,dist_front2_back2,label="front=2, back=2")
axes.scatter(np.sum(dist_front2_back2*p_x_front/N),0.2,s=100)
axes.legend(loc="best")
axes.set_xlabel(r"$\theta$")
axes.set_ylabel("Probability density")
axes.set_ylim(-0.1,2.5)
axes.set_xticks(np.linspace(0.0,1.0,11))
Compared with the uniform prior, the expected value is higher because of the subjective belief that the front is more likely to appear.
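Incidentally, because the beta distribution is conjugate to this likelihood, the Beta(3, 2) prior updated with 2 fronts and 2 backs gives a Beta(3+2, 2+2) = Beta(5, 4) posterior, whose expected value is $5/9 \approx 0.556$, still above 0.5.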
Again, let's look at the posterior distribution after the coin has come up front 100 times and back 100 times.
p_theta = updates(100,100)
fig,axes = plt.subplots(figsize=(8,6))
axes.plot(p_x_front,p_theta)
axes.scatter(np.sum(p_theta*p_x_front/N),0.5,s=100)
axes.set_xlabel(r"$\theta$")
axes.set_ylabel("Probability density")
axes.set_ylim(-0.1,12.0)
axes.set_xticks(np.linspace(0.0,1.0,11))
The expected value is 0.502, slightly larger than 0.5. As the amount of data increases, the influence of the initial subjectivity gradually fades, and an accurate distribution is obtained.
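This again matches the analytic result: the Beta(3, 2) prior combined with 100 fronts and 100 backs gives a Beta(103, 102) posterior, whose expected value is $103/205 \approx 0.502$.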
I now have a rough idea of Bayesian inference. I still don't understand the MCMC method, so next I hope to read [Introduction to Data Analysis with Bayesian Statistical Modeling Beginning with R and Stan](https://logics-of-blue.com/r-stan-bayesian-model-intro-book-support/) and write about more advanced content than this time.
[^1]: Hiroyuki Kojima, "Complete Self-Study: An Introduction to Bayesian Statistics", Diamond, November 2015, p. 6.
[^2]: Hiroyuki Kojima, "Complete Self-Study: An Introduction to Bayesian Statistics", Diamond, November 2015, p. 145.