- The trend in big data and machine learning these days is to learn all of the rules from data.
- In reality, however, it is rare that **a large amount of comprehensive data covering every event has been accumulated**.
- That is why I want to focus on **know-how (knowledge) that has not been turned into data, but has been built up through years of experience and lives in people's heads**.
- Bayesian statistics (Bayesian modeling) may be able to compensate for insufficient data by making use of this know-how.
- Bayesian statistics has been attracting renewed attention because of the limits of applying a purely data-driven approach to everything, and terms such as **Bayesian deep learning** and **Bayesian machine learning** have recently appeared.
- So, in this article, I explain the basics of Bayesian statistics using the Python library PyMC.
- The concepts of Bayesian statistics are hard for beginners to "understand", but they are easy to picture if you start from "using" them on actual data.
- Here I introduce only the minimum ideas (intuition) and the basic theorem.
- There are many accessible books with accurate and detailed explanations of Bayesian statistics, so please refer to them for more depth.
- People who start studying statistics usually begin with frequentist statistics, and I think this is one reason why Bayesian statistics is hard to grasp.
- In frequentist statistics, the true value of the population is the "cause" and the observed data is the "result"; the population parameter is treated as a fixed constant.
- Bayesian statistics, by contrast, estimates the population (cause) from the observed data (result).
- Bayesian statistics therefore treats the population parameter as a random variable whose distribution changes depending on the observed data, not as a fixed value or constant.
- When estimating the "cause" variable (the parameter), Bayesian statistics uses the concepts of a "prior distribution" and a "posterior distribution".
- The prior distribution can be set from knowledge that humans already have (this is why Bayesian statistics can make use of the know-how in people's heads).
- The distribution obtained by correcting (updating) this prior distribution with the observed data is called the "posterior distribution".
- In Bayesian statistics, this process of revising the distribution based on data is called a "probability update" (Bayesian updating).
- Here I introduce the most important theorem in Bayesian statistics, "Bayes' theorem".
- It contains no new information; it simply expresses the ideas above as a formula.
- The theorem itself can be derived easily by rearranging the definition of conditional probability, so check the derivation if you are interested.
- At a minimum, keep the image that the prior distribution on the right-hand side is updated with the data to produce the posterior distribution on the left-hand side (the formula is written out below).
- Bayesian statistics basically uses this theorem to infer the parameter θ, either analytically or, when it cannot be solved analytically, by random-number simulation or optimization methods.
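For reference, since the formula itself is not shown in the text above, Bayes' theorem in its standard form is (with θ the parameter and X the observed data):

```math
P(\theta \mid X) = \frac{P(X \mid \theta)\,P(\theta)}{P(X)}
```

Here P(θ) is the prior distribution, P(X | θ) is the likelihood, P(X) is the marginal likelihood (evidence), and P(θ | X) is the posterior distribution.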
- The following is a digression, so skip it if you are not interested.
- P(X | θ) is called the likelihood.
- **Maximum likelihood estimation**, which also appears in machine learning, searches for the θ that maximizes this P(X | θ).
- In other words, maximum likelihood estimation ignores the prior distribution and can be seen as a procedure that simply finds the θ that maximizes P(X | θ).
- Because this approach fits the parameters exactly to the data at hand, it can cause overfitting.
- The parameter-search method called **MAP estimation** searches for the θ that maximizes P(X | θ) × P(θ).
- This can be seen as a parameter search with the prior distribution P(θ) acting as a constraint (regularization).
- Since the prior is taken into account, it overfits the data at hand less than maximum likelihood estimation, but the computation is more involved because the probability density of the prior must also be evaluated.
- Bayes' theorem goes one step further than MAP estimation and obtains the full posterior distribution by also accounting for the marginal likelihood (the probability of obtaining the data, averaged over the parameters); the formulas below summarize the three.
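Written as formulas, the three estimation targets mentioned in this digression line up as follows (a standard textbook summary added here for reference, not part of the original article):

```math
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} P(X \mid \theta), \qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(X \mid \theta)\,P(\theta), \qquad
P(\theta \mid X) = \frac{P(X \mid \theta)\,P(\theta)}{\int P(X \mid \theta')\,P(\theta')\,d\theta'}
```

Maximum likelihood keeps only the likelihood, MAP estimation adds the prior as a regularizer, and the full Bayesian posterior additionally divides by the marginal likelihood so that the result is a whole probability distribution rather than a single point.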
- Here we explain how to compute the parameters (i.e., the posterior distribution).
- There are various computation methods; three typical ones are an exact analytical solution (possible only for simple models, e.g. with conjugate priors), random-number simulation with MCMC, and approximation by optimization (variational inference).
- In this article, we estimate the parameters with MCMC (Markov chain Monte Carlo); a toy sketch of the underlying idea follows below.
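Before using the library, it may help to see what MCMC does in principle. The following is a deliberately tiny sketch of the Metropolis algorithm written for this article (my own illustration, not the article's original code; PyMC3 itself defaults to the more sophisticated NUTS sampler). It samples the posterior of a single mean parameter using nothing but accept/reject steps:

```python
import numpy as np

# Toy Metropolis sampler for the mean of normally distributed data
np.random.seed(0)
data = np.random.normal(1.0, 1.0, size=50)  # observed data (known sd = 1)

def log_posterior(mu):
    log_prior = -0.5 * (mu / 10.0) ** 2                # prior: mu ~ Normal(0, 10)
    log_likelihood = -0.5 * np.sum((data - mu) ** 2)   # likelihood: data ~ Normal(mu, 1)
    return log_prior + log_likelihood

samples = []
mu_current = 0.0
for _ in range(5000):
    mu_proposal = mu_current + np.random.normal(0, 0.5)  # random-walk proposal
    # Accept with probability min(1, posterior ratio)
    if np.log(np.random.rand()) < log_posterior(mu_proposal) - log_posterior(mu_current):
        mu_current = mu_proposal
    samples.append(mu_current)

print(np.mean(samples[1000:]))  # posterior mean after discarding burn-in
```

The chain of accepted values is exactly the kind of "trace" that PyMC3 returns; the library is simply a far more efficient and general way of generating it.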
- This is hard to grasp from sentences alone, so let's try it in Python right away.
- Here we solve a very simple linear regression model with Bayesian statistics.
- Honestly, this problem does not need Bayesian statistics, but once you have learned the procedure, you can solve complex models with the same approach.
- First, create the sample data used in this demo.
```python
# Creating sample data
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)

# Sample data generation parameters
N = 100
alpha_real = 2.5
beta_real = 0.9
eps_real = np.random.normal(0, 0.5, size=N)

# model: y = α + βx + noise
x = np.random.normal(10, 1, N)
y_real = alpha_real + beta_real * x
y = y_real + eps_real

# Data visualization
plt.plot(x, y, '.')
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x, y_real, 'orange')
plt.show()
```
- The problem setting: given the data (blue dots), we want to estimate the linear model (orange line).
- The linear model is defined as y = α + β * x, and α = 2.5 and β = 0.9 are the values we want to infer with Bayesian statistics.
- First, install PyMC.
```bash
conda install pymc3==3.6
```
- At the time of writing (November 2020) the latest version is 3.9, but some functions had problems in my environment, so I downgraded to 3.6.
- Please adjust the version as appropriate for your environment.
- In PyMC, you describe the model as shown below.
- It may look like a lot, but the point is **1. model definition (priors), 2. likelihood calculation, 3. running MCMC**.
- If you check the official documentation, various other options (initial-value settings, other MCMC algorithms, and so on) are introduced.
- First get comfortable with the minimal model description below, then try other options as needed.
```python
# In PyMC, the model is defined inside a "with" block
import pymc3 as pm

with pm.Model() as model:
    # 1. Definition of prior distributions (define the variables you want to use as parameters)
    '''
    PyMC provides many probability distributions; here we use the normal distribution.
    Give each variable a name string, then set the parameters of its prior distribution.
    '''
    α = pm.Normal('alpha', mu=0, sd=10)   # normal distribution (mean=0, standard deviation=10)
    β = pm.Normal('beta', mu=0, sd=1)     # normal distribution (mean=0, standard deviation=1)
    noise = pm.HalfNormal('noise', sd=1)  # a standard deviation must be positive, so use a half-normal prior
    # 2. Likelihood calculation
    '''
    The probability update is performed by passing the observed values to "observed"
    '''
    y_pred = pm.Normal('y_pred', mu=α + β * x, sd=noise, observed=y)
    # 3. Run MCMC
    '''
    There are various settings (initial values, algorithms, ...); the basic, general-purpose
    "NUTS" algorithm is used by default.
    draws = number of samples to generate, chains = number of parallel chains
    '''
    trace = pm.sample(draws=5000, chains=1)  # 5000 samples * 1 chain = 5000 random draws
```
- Let's perform the Bayesian modeling following the above procedure.
```python
import pymc3 as pm

# model
with pm.Model() as model:
    # 1. Prior distributions
    alpha = pm.Normal('alpha', mu=0, sd=10)  # assume a normal distribution with mean 0, standard deviation 10 for α
    beta = pm.Normal('beta', mu=0, sd=1)     # assume a normal distribution with mean 0, standard deviation 1 for β
    epsilon = pm.HalfCauchy('epsilon', 5)    # assume a half-Cauchy distribution with scale parameter 5 for the noise
    # 2. Likelihood calculation
    y_pred = pm.Normal('y_pred', mu=alpha + beta * x, sd=epsilon, observed=y)
    # 3. Run MCMC
    trace = pm.sample(11000, chains=1)
```
- With this, the simulation of the posterior distribution of each parameter is complete.
- Let's check the simulation results right away.
```python
# Check / evaluate the posterior distribution of each parameter (graphs)
trace_n = trace[1000:]  # discard the first 1000 samples (see "burn-in" below)
pm.traceplot(trace_n)

# Check / evaluate the posterior distribution of each parameter (statistics)
pm.summary(trace_n)
```
- How to read the output:
- The left column of the graphs shows the distribution (probability density) of each parameter.
- The distributions are centered around α ≈ 2.64 and β ≈ 0.88, so the estimates are close to the true values (α = 2.5, β = 0.9).
- The right column of the graphs shows the trace of the random-number simulation; here it converges to a stationary distribution, but if it does not converge well, you need to revisit the prior distributions and their parameters, or apply preprocessing such as standardizing the features.
- Burn-in
- In MCMC it takes a while before the samples actually come from the target distribution, so it is better not to use the samples generated at the very beginning.
- Therefore, the posterior distributions of the parameters are evaluated excluding the first 1000 generated samples.
Finally, use the simulation results to draw a regression line and evaluate the inference results of the model.
```python
# Get α and β (the mean of the simulation results is used as the representative value of each parameter)
alpha_m = trace_n['alpha'].mean()
beta_m = trace_n['beta'].mean()

# Visualization
plt.plot(x, y, '.')
plt.plot(x, y_real, 'orange', label='True')
plt.plot(x, alpha_m + beta_m * x, 'red', label='Estimated')
plt.legend()
```
- Linear model plot
- The regression line is drawn using the **mean** of the posterior samples as the representative value of each parameter.
- You can see that the line is almost identical to the true regression line plotted in "Creating sample data".
```python
# Sample generation from the posterior distribution
ppc = pm.sample_posterior_predictive(trace_n, samples=1000, model=model)

# Visualization of the scatter plot and regression line
plt.plot(x, y, '.')
plt.plot(x, alpha_m + beta_m * x)

# HPD (credible) intervals
idx = np.argsort(x)
x_ord = x[idx]
# 50% HPD interval
sig0 = pm.stats.hpd(ppc['y_pred'], alpha=0.5)[idx]
# 95% HPD interval
sig1 = pm.stats.hpd(ppc['y_pred'], alpha=0.05)[idx]
# Fill in the intervals
plt.fill_between(x_ord, sig0[:, 0], sig0[:, 1], color='gray', alpha=1)
plt.fill_between(x_ord, sig1[:, 0], sig1[:, 1], color='gray', alpha=0.5)
```
- Because a Bayesian model (MCMC) gives you random-number samples from the posterior distribution, you can express intervals such as the 50% and 95% credible intervals above.
- In Bayesian models, HPD intervals (ranges of values the parameter can take) are often used.
- HPD is short for "highest posterior density"; the HPD interval is also called the highest-posterior-density credible interval.
- It is the interval defined so that, whatever the shape of the distribution, the mode of the parameter is always included: in effect, the narrowest interval containing the specified probability mass.
- The formula looks complicated, but the definition feels very natural once you check it with a picture (graph); the small sketch below computes one by brute force.
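To make the definition concrete, here is a small brute-force sketch added for this article (not from the original text): sort the posterior draws and take the narrowest window that contains the requested probability mass.

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Narrowest interval containing `prob` of the samples (1-D HPD interval)."""
    sorted_s = np.sort(samples)
    n = len(sorted_s)
    window = int(np.ceil(prob * n))                        # number of draws the interval must cover
    widths = sorted_s[window - 1:] - sorted_s[:n - window + 1]
    start = int(np.argmin(widths))                         # position of the narrowest window
    return sorted_s[start], sorted_s[start + window - 1]

# Example with a skewed distribution, where the HPD interval differs
# clearly from an equal-tailed interval
draws = np.random.gamma(2.0, 1.0, size=10000)
print(hpd_interval(draws, prob=0.95))
```

Because it is the narrowest such interval, it always covers the region of highest density (including the mode), which is exactly what the bullet points above describe.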
- In this article, I covered the basics of Bayesian statistics and solved a linear regression model with MCMC.
- **The key feature of Bayesian modeling is that the estimation result is expressed as a probability distribution rather than a single value.**
- When making decisions, you need to consider not only the prediction itself but also its variability (reliability), so I think this is very practical.
- For simplicity, all the prior distributions here were normal distributions, but PyMC supports many probability distributions, and you can set distributions that match your prior knowledge.
- Besides linear regression, the same approach can solve a variety of problems: for example, simply changing the likelihood (link function) gives a Bayesian logistic regression model whose estimation uncertainty can be visualized in the same way (a sketch follows this list).
- In that sense, **Bayesian modeling is a highly flexible and versatile modeling method**.
- On the other hand, MCMC simulation takes a long time once the model becomes complex (as you can see by running the code above).
- In such cases, you can instead solve the model by optimization using variational inference (I intended to explain variational inference with the Python library Pyro, but the article grew too long, so I will cover it in a separate article).
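As an illustration of that flexibility, here is a minimal sketch (my own example, reusing `x` and `y` from the demo above and creating hypothetical binary labels `y_bin` purely for illustration) of how the linear model turns into a Bayesian logistic regression simply by swapping the likelihood:

```python
import pymc3 as pm

# Hypothetical binary labels derived from the demo data (illustration only)
y_bin = (y > y.mean()).astype(int)

with pm.Model() as logistic_model:
    # Same kind of priors as before
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=1)
    # Likelihood: Bernoulli with a logistic (sigmoid) link instead of a Normal
    p = pm.math.sigmoid(alpha + beta * x)
    y_obs = pm.Bernoulli('y_obs', p=p, observed=y_bin)
    trace_logistic = pm.sample(3000, chains=1)
```

The posteriors of `alpha` and `beta` can then be inspected with `pm.traceplot` and `pm.summary` exactly as in the linear case.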
- In this article, I explained the basics of Bayesian statistics using Python, one of the most widely used languages in practice today.
- In this age of DX and data utilization, I often hear the complaint, "We want to push these efforts forward, but we don't have the right data!"
- Building a database from scratch and starting to collect data is of course important, but that alone will not catch up with the global companies that are already leading in data utilization.
- I wrote this article because I believe Bayesian modeling can be put to use precisely by **Japanese companies that lack data but have abundant business know-how cultivated over many years**.
- I hope it provides some hints for businesses that want to promote data utilization but have not yet succeeded.