[PYTHON] A first Markov chain Monte Carlo method with PyStan

Overview

I participated in a company reading circle on "Introduction to Statistical Modeling for Data Analysis" (the so-called Midorimoto, "green book"). It is a very easy-to-understand book, but sadly for Mac users, the implementation samples are in WinBUGS. So I implemented the Bayesian inference approach to the generalized linear model from Chapter 9 in Python + Stan.

What I did

Roughly, I followed the steps below:

  1. Generate dummy data from a probability distribution with specific parameters
  2. Define the prediction model
  3. Using the dummy data and the model, estimate the parameters (posterior distribution) that generated the data with MCMC, and check the result against the true values

Specifically, the body size of a plant (taking discrete values in 0.1 increments from 3.0 to 7.0) is used as the explanatory variable, and we estimate the probability distribution of the number of seeds (a non-negative integer), which follows a Poisson distribution.

Tools and libraries used

Python, PyStan, NumPy, and matplotlib (see the imports in the code below).

Estimation method

MCMC

MCMC is an abbreviation for the Markov chain Monte Carlo method. "A Markov chain using the Monte Carlo method"... that has become a tautology, so I will explain a little more.

Markov chain

The property that a state depends only on the immediately preceding state is called the **Markov property**. A **Markov chain** is a stochastic model in which states, each depending only on the previous one, occur in a chain. From an engineer's point of view, it is easier to understand if you think of it as a **kind of automaton** with probabilistic transitions.
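As a minimal illustration, here is a tiny Markov chain in Python (a hypothetical two-state weather model I added for this explanation; it is not from the book):

import random

# Hypothetical two-state weather model: tomorrow's weather depends
# only on today's weather (the Markov property).
transition = {
    'sunny': (('sunny', 'rainy'), (0.8, 0.2)),
    'rainy': (('sunny', 'rainy'), (0.4, 0.6)),
}

def simulate(start, steps):
    state, chain = start, [start]
    for _ in range(steps):
        states, weights = transition[state]
        state = random.choices(states, weights=weights)[0]  # next state depends on current state only
        chain.append(state)
    return chain

print(simulate('sunny', 10))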

Monte Carlo method

The Monte Carlo method is a general term for numerical computation and simulation using random numbers.
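A classic example is estimating π from uniform random numbers (a standard textbook sketch, added here for illustration):

import random

# The fraction of random points in the unit square that fall inside
# the quarter circle approaches pi / 4.
n = 100000
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)
print(4.0 * inside / n)  # roughly 3.14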

In other words, the **Markov chain Monte Carlo method** is a method of generating a Markov chain using random numbers. Here, in particular, it refers to algorithms that exploit the properties of Markov chains to sample from a probability distribution (the posterior distribution of the parameters).

Bayesian inference x MCMC

By combining the Bayesian estimation framework (posterior distribution ∝ likelihood × prior distribution) with MCMC, which can sample from a probability distribution known only up to such a proportionality, the posterior distribution can be estimated even when the model cannot be solved analytically, as long as it can be expressed as a formula.
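For intuition, here is a minimal random-walk Metropolis sampler, one of the simplest concrete MCMC algorithms (my own illustrative sketch; Stan actually uses the more sophisticated NUTS sampler, as the summary output later shows). Note that it only needs the target density up to a constant:

import math
import random

def metropolis(log_target, x0, n, step=0.5):
    # Random-walk Metropolis: propose a nearby point, accept it with
    # probability min(1, target(proposal) / target(current)).
    x, samples = x0, []
    for _ in range(n):
        proposal = x + random.gauss(0.0, step)
        if math.log(random.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)  # the chain's stationary distribution is the target
    return samples

# Example: sample a standard normal via its log density (up to a constant).
draws = metropolis(lambda x: -0.5 * x * x, 0.0, 10000)
print(sum(draws) / len(draws))  # should be near 0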

Implementation

A detailed explanation of generalized linear models and MCMC would get very long, so let's go straight to the implementation. For the background, "Introduction to Statistical Modeling for Data Analysis: Generalized Linear Models, Hierarchical Bayesian Models, MCMC (The Science of Probability and Information)" is easy to understand.

Generation of training data

Body size x takes values from 3.0 to 7.0. For each individual, the number of seeds is generated from a Poisson distribution with mean μ = exp(1.5 + 0.1x). (For example, an individual with body size 5.0 has mean exp(1.5 + 0.5) = exp(2.0) ≈ 7.4 seeds.)

import math
import random
import numpy as np

def generate(n):
    for i in range(n):
        x = round(random.random() * 4 + 3, 1)  # random body size in 3.0 ~ 7.0 (0.1 steps)
        mu = math.exp(1.5 + 0.1 * x)           # log link: log(mu) = 1.5 + 0.1 * x
        print(x, np.random.poisson(mu))        # body size and seed count drawn from Poisson(mu)

"Body size" and "number of seeds".

6.1 11
5.9 6
4.1 7
5.1 6
6.8 13
5.6 7
5.0 7
5.4 16
5.4 6

STAN

mcmc.stan

data {
  int<lower=1> N;
  vector[N] x;
  int<lower=0> y[N];
}
parameters {
  real beta1;
  real beta2;
}
model {
  for (i in 1:N) {
    y[i] ~ poisson(exp(beta1 + beta2 * x[i])); // Poisson distribution with log link function
  }
  beta1 ~ normal(0, 1000); // mean 0, standard deviation 1000: effectively a non-informative prior
  beta2 ~ normal(0, 1000); // mean 0, standard deviation 1000: effectively a non-informative prior
}

How to read

data

This is the data passed to the Stan program. Each entry is declared in the format **{data type} {variable name};**. The data is passed from Python by referring to the variable names declared here.

parameters

These are the variables estimated by the model described in Stan. This time they are the intercept β1 and the coefficient β2 of the log link function of the Poisson distribution. They are given a non-informative prior (approximated by a normal distribution with a very large standard deviation).

model

This is the prediction model. The operator **~** means that the left-hand side follows the probability distribution on the right-hand side. Here, the number of seeds **y** follows a Poisson distribution with mean **exp(β1 + β2x)**, which corresponds to a log link function.

Python

import numpy as np
import pystan
import matplotlib.pyplot as plt

data = np.loadtxt('./data.txt', delimiter=' ')

# Build the data to pass to the Stan interface
x = data[:, 0]                 # NumPy slicing: first column (body size)
y = data[:, 1].astype(int)     # seed counts; the Poisson outcome must be integer
N = data.shape[0]              # number of data points

stan_data = {'N': N, 'x': x, 'y': y}

fit = pystan.stan(file='./mcmc.stan',
                  data=stan_data, iter=10000, chains=3, thin=10)
# iter   = number of iterations per chain (the first half is warm-up)
# chains = number of independent chains (sampling repeated n times)
# thin   = keep only every n-th sample (thinning interval)
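Once sampling finishes, the summary table shown under "Statistics" below can be displayed by printing the fit object (a sketch assuming the PyStan 2 interface used above):

# The fit object's string form is the summary table
# (mean, credible intervals, n_eff, Rhat).
print(fit)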

Execution result

If it goes well, a log like the one below will appear. Stan itself is implemented in C++, so the model is compiled first.

INFO:pystan:COMPILING THE C++ CODE FOR MODEL ~ NOW.
Chain 1, Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 0, Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 2, Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1, Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 2, Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 0, Iteration: 1000 / 10000 [ 10%]  (Warmup)
...
Chain 0, Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 2, Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1, Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 0, Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 2, Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1, Iteration: 10000 / 10000 [100%]  (Sampling)# 
...
#  Elapsed Time: 9.51488 seconds (Warm-up)
#                10.7133 seconds (Sampling)
#                20.2282 seconds (Total)
# 

Graph

mcmc9.png
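A figure like this can be produced with the fit object's built-in plotting (a sketch assuming PyStan 2 and the matplotlib import above):

# Trace plots and posterior distributions for each sampled parameter.
fit.plot()
plt.show()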

Statistics

summary

Inference for Stan model: anon_model_381b30a63720cfb3906aa9ce3e051d13.
3 chains, each with iter=10000; warmup=5000; thin=10; 
post-warmup draws per chain=500, total post-warmup draws=1500.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
beta1   1.41  2.4e-3   0.05   1.31   1.38   1.41   1.45   1.52    481    1.0
beta2   0.12  4.6e-4   0.01    0.1   0.11   0.12   0.13   0.14    478    1.0
lp__  7821.4    0.04   0.97 7818.8 7821.1 7821.7 7822.1 7822.4    496    1.0

Samples were drawn using NUTS(diag_e) at Tue Feb  9 23:31:02 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The output above is obtained when the fit result is displayed (e.g., with print(fit) or the fit.summary() method). Looking at β1 first, the mean of the samples is 1.41, and with 95% probability the value lies in the range 1.31 to 1.52 (called a credible interval in Bayesian terminology).

Since the posterior distribution has a single peak this time, we can use the mean as a representative value: β1 (true value 1.5) is estimated as 1.41 and β2 (true value 0.1) as 0.12. These are close to the true values, so the parameters could indeed be recovered.
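The same figures can be recomputed from the raw draws (a sketch assuming the fit object above; in PyStan 2, fit.extract() returns the post-warmup samples as arrays):

import numpy as np

samples = fit.extract()
for name in ('beta1', 'beta2'):
    lo, hi = np.percentile(samples[name], [2.5, 97.5])
    # posterior mean and 95% credible interval
    print(name, samples[name].mean(), lo, hi)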

Rhat

When calling the Stan code from Python, sampling was repeated 3 times via the *chains* parameter. In posterior estimation by MCMC, the initial values of the parameters are chosen more or less arbitrarily, and depending on the model, the estimated distribution can vary with the starting point. Sampling is therefore repeated several times, and convergence is checked with **Rhat**, which quantifies the variation between chains. A common rule of thumb is that 1.1 or less is acceptable; since both beta1 and beta2 are well below that here, we can judge that there is no problem.

Reference book

Since the focus here is on introducing the implementation, I have skipped over the meaning of each argument such as *thin* when calling Stan, why the first half of the sampling is spent on *warm-up*, and, since MCMC is really a general term for a family of methods, which specific algorithms it covers. If you want to know more, please read the following.

[Introduction to Statistical Modeling for Data Analysis: Generalized Linear Models, Hierarchical Bayesian Models, MCMC (The Science of Probability and Information)](http://www.amazon.co.jp/%E3%83%87%E3%83%BC%E3%82%BF%E8%A7%A3%E6%9E%90%E3%81%AE%E3%81%9F%E3%82%81%E3%81%AE%E7%B5%B1%E8%A8%88%E3%83%A2%E3%83%87%E3%83%AA%E3%83%B3%E3%82%B0%E5%85%A5%E9%96%80%E2%80%95%E2%80%95%E4%B8%80%E8%88%AC%E5%8C%96%E7%B7%9A%E5%BD%A2%E3%83%A2%E3%83%87%E3%83%AB%E3%83%BB%E9%9A%8E%E5%B1%A4%E3%83%99%E3%82%A4%E3%82%BA%E3%83%A2%E3%83%87%E3%83%AB%E3%83%BBMCMC-%E7%A2%BA%E7%8E%87%E3%81%A8%E6%83%85%E5%A0%B1%E3%81%AE%E7%A7%91%E5%AD%A6-%E4%B9%85%E4%BF%9D-%E6%8B%93%E5%BC%A5/dp/400006973X)
