Introduction

As a result of collecting various observation data, there may be a desire to apply the probability distribution behind the observation data to the model and estimate it. It's convenient for calculating various statistics and simulation calculations. There are normal distribution, exponential distribution, binomial distribution, beta distribution, etc. as models of probability distribution, but this time, while implementing MCMC method (Markov Chain Monte Carlo method) "simplely" from the principle for gamma distribution. , I would like to try how much I can fit. In addition, WinBUGS and PyMC3 are excellent tools for executing MCMC. There are //docs.pymc.io/) etc., so you don't have to make it yourself **. However, there are many things that you can see by making it yourself, and I think there is a merit in promoting understanding.

What is the gamma distribution?

What is Gamma Distribution? The distribution is characterized by two parameters: the shape parameter $ k $ and the scale parameter $ q $.

P(x) = \frac{1}{\Gamma(k)q^k}x^{k-1}e^{-x/q},\ x \geq 0 \ (k>0,\ q>0)

It is a distribution that takes non-negative real numbers, and is an excellent guy who can express exponential distribution, chi-square distribution, or Erlang distribution. Especially, I like the fact that it can express a distorted distribution with a wide hem. There are many observational data in the world that are represented by non-negative values such as the number, length, weight, time interval, population, amount of money, etc., and it seems to be useful for expressing the distribution of such variables. In fact, it seems to be used in engineering and economics, such as waiting time distribution and income distribution.

What is MCMC?

The MCMC method (Markov Chain Monte Carlo method) is a method that efficiently generates a sample of a certain probability distribution and estimates the nature of the probability distribution from the sample. The Markov chain is responsible for the generation of the sample according to the probability distribution, and Monte Carlo is responsible for the estimation from the sample. For details, the Video of Professor Yukito Iba of the Institute of Statistical Mathematics is very easy to understand. In this article, we will use MCMC to estimate the gamma distribution, so we define the following terms:

$ \ Theta $: Parameters of the $ N $ dimension you want to estimate
$ P (x | \ theta) $: Probability distribution model calculated according to parameter $ \ theta $ (I know it because it is a model)
$ X_1, X_2, ..., X_n $: Observed sample values. However, independence is assumed.
$ D_n = (X_1, X_2, ..., X_n) $: Observed data (set of values)
$ P (\ theta | D_n) $: Probability distribution of parameter $ \ theta $ estimated from data $ D_n $

What I want to do here is to collect a lot of data $ D_n $, and if the probability distribution $ P (\ theta | D_n) $ of the parameter $ \ theta $ can be calculated well, then the parameter $ \ theta $ of the assumed model ( You can calculate the expected value (based on posterior probabilities), right? The probability distribution $ P (\ theta | D_n) $ is expressed by the following equation from Bayes' theorem (or from the normalization condition of the probability distribution).

P(\theta|D_n) = P(D_n | \theta)P(\theta) / \int_{\theta} d\theta P(D_n | \theta)P(\theta)

Here, the integral should be taken for all $ \ theta $ with a non-zero probability of producing the data $ D_n $. One of the simple ways to estimate this distribution is grid calculation. In other words, if $ \ theta $ is evenly distributed on an appropriate grid and represented by the discrete value $ \ theta_i (i = 1,2, ..., I) $,

P(\theta_i|D_n) \simeq P(D_n | \theta_i)P(\theta_i) / \sum_{j \in I}  P(D_n | \theta_j)P(\theta_j)

You can approximate the integral of the denominator by the sum, as in. It looks like this when I write it in a diagram.

However, in order to improve the estimation accuracy of $ \ theta $, the number of discrete values $ I $ must be large enough. If you simply try to represent a $ N $ dimension parameter with $ M $ meshes, you need $ I = M ^ N $ discrete values, which will explode as the dimension goes up.

So, ** MCMC changes the way of thinking and does not fix the discrete value $ \ theta_i $, but move it steadily **. Move $ k (k = 1,2, ...) $, $ k $ th $ \ theta_i (i = 1, .., I) $ to $ \ theta_i (k) $, $ \ theta_i ( Let $ P_k (\ theta) $ be the probability distribution that k) $ follows. Also, let $ \ pi (x \ rightarrow x') $ move from $ \ theta_i (k) $ to $ \ theta_i (k + 1) $. Then it can be illustrated as follows. P_k_k+1_map1.jpg

I designed this way of moving $ \ pi (x \ rightarrow x') $ well, and after moving it many times,

P_k(\theta) \rightarrow P(\theta | D_n) \  ({\rm as \ } k \rightarrow \infty)

If is true, you can achieve your goal. Next, let's look at the Metropolis algorithm as one of the ways to move it.

Metropolis algorithm

The Metropolis algorithm is a sampling algorithm from a probability distribution. In the figure above, we will increase the number of characters.

$ P _ * (\ theta) : Distribution according to the model you want to estimate ( P (\ theta | D_n) $ in this article)
$ P_ {o_k} (\ theta) $: Distribution according to other than $ P_ * (\ theta) $ in $ P_k (\ theta) $

In other words, let's say that $ \ theta_i $ is generated by synthesizing distributions that follow and do not follow the model. At this time, let's assume that the movement by the map $ \ pi (x \ rightarrow x') $ is restricted as shown in the figure below.

P_k_k+1_map2.jpg

In other words, the restriction is that $ \ theta_i $ that followed ** $ P_ * (\ theta) $ will always move to $ P_ * (\ theta) $ ** ((1) in the figure). Furthermore, let's assume that $ \ theta_i $ that followed ** $ P_ {o_k} (\ theta) $ can always be moved to $ P_ * (\ theta) $ ** (② in the figure). Then, if ② exists moderately, from the limitation of ①, it can be seen that as $ k $ increases, it becomes $ P_k (\ theta) \ rightarrow P_ * (\ theta) $. Strictly speaking, (1) is called stationary, (2) is called irreducible, and it seems that the convergence of the probability distribution from these two conditions is called ergodicity. (For details, see Explainer video of Professor Yukito Iba).

One of the mappings $ \ pi (x \ rightarrow x') $ that satisfy these properties is the Metropolis algorithm, which is represented by the following flow.

Generate $ x'$ from the symmetric probability distribution $ Q (x, x') (= Q (x', x)) $ with the current point $ x $.
Generate a uniform random number $ 0 \ leq r <1 $.
If $ r <{P_ * (x')} / {P_ * (x)} $, move to $ x \ rightarrow x'. If not, don't move ( x \ rightarrow x $).
Repeat steps 1 to 3.

As long as the probability distribution of 1 is symmetric, a normal distribution can be used, for example.

Q(x,x') = \frac{1}{\sqrt{2\pi \sigma^2}}exp\left(-\frac{(x-x')^2}{2\sigma^2}\right)

By the way, the algorithm extended to the non-symmetrical case ($ Q (x, x') \ neq Q (x', x) $) is called the Metropolis-Hastings algorithm. Looking at step 3,

$ P_ * (x') \ geq P_ * (x) $ will always move from x to x'. In other words, we will move to the one with the higher probability. (Equivalent to ② in the above figure)
$ P_ * (x') <P_ * (x) $ moves to x'or x, depending on the ratio of the probabilities $ P_ * (x') $ to $ P_ * (x) $. (Equivalent to ① in the above figure)

Try to calculate with Python

Now, let's estimate the gamma distribution while implementing MCMC in Python.

1. Preparation

Set the library.

import numpy as np
import matplotlib.pyplot as plt
from math import gamma

Defines a gamma function, a histogram generation, and a function that displays the shape of the probability distribution.

def gamma_pdf(x, k, q):
    p = 1./(gamma(k) * pow(q, k)) * pow(x, k-1) * np.exp( -x/q )
    return p

def genPdfHist(f, xmin, xmax, N):
    x = np.linspace(xmin,xmax, N)
    p = [f(v) for v in x]
    return x,p

def showPdf(f, xmin, xmax, N, yscale=1.0):
    x, p = genPdfHist(f, xmin, xmax, N)
    y = [e*yscale for e in p]
    plt.plot(x,y)
    plt.show()

2. Prepare observation data

Next, let's create observation data using the gamma distribution. Define a function to sample for creating observational data.

def PlainSampler(pdf, xmin, xmax, N):
    n = 0
    xs = []
    while n < N:
        x = xmin + np.random.rand()*(xmax-xmin)
        p = pdf(x)
        if np.random.rand() < p:
            # accept
            xs.append(x)
            n = n + 1
    return xs

Use the above function to generate the observation data. (Meaning experimental observation data) First, let's generate the parameters of the correct model as $ (k ^ *, q ^ *) = (2,4) $.

k_true = 2.0
q_true = 4.0
gp1 = lambda x : gamma_pdf(x, k_true, q_true)

smp1 = PlainSampler(gp1, 0, 50, 1000)
h1 = plt.hist(smp1, bins=50)
showPdf(gp1, 0, 50, 100, h1[0].sum()/(h1[1][1]-h1[1][0]))
plt.show()

The row of the showPdf function multiplies the adjustment factor to align the histogram with the Y-axis of the probability distribution. The data smp1 is the observed value, and its histogram looks like this. Of course, since it is generated with random numbers, the shape of the histogram changes each time it is executed. pdf_sample_k2.0_q4.0.png

3. Prepare a function for MCMC

We will define the functions used to calculate MCMC. First, posterior probability distributionP(D_n |\theta)Is a function that calculates. Simultaneous probability by selecting 50 sample values from the original sample to prevent the calculated probability value from becoming too small and to ensure data independence.P(D_n |\theta)=P(X_1 | \theta)P(X_2 | \theta) \cdots P(X_n | \theta)Is being calculated.

def pdf_gamma_sample(x, sample):
    k = x[0]
    q = x[1]
    #Prevent P from becoming too small
    sample1 = np.random.choice(sample, 50)
    p = [gamma_pdf(s, k, q) for s in sample1]
    return np.prod(p)

A function that gives the prior distribution (initial distribution) of the parameter $ \ theta = (k, q) $. Since there is no prior information, let's generate it from a uniform random number. However, I will give only the maximum value.

def genInit( cdim, xdim, xmax):
    x = np.random.rand(cdim, xdim)*xmax
    return x

Next is the core part. Map $ \ pi (x \ rightarrow x') $ by Metropolis algorithm and a function that executes MCMC.

def MHmap(pdf, x, xdim, sig):
    #Gibbs sampling (change only one variable)
    xnew = x + np.eye(1, xdim, np.random.randint(xdim))[0,:] * np.random.normal(loc=0, scale=sig)
    #Limit the range of x
    xnew = np.maximum(xnew, 0.01)
    #Selected by Metropolis criteria
    r = np.random.rand()
    if r < pdf(xnew)/pdf(x):
        x = xnew
    return x

def MCMC_gamma_01(sample, x0, sig, N):
    #
    pdf1 = lambda x: pdf_gamma_sample(x, sample)
    #
    x = x0
    xs = []
    for i in range(N):
        #Map point clouds according to Metropolis standards
        x = [MHmap(pdf1, e, 2, sig) for e in x]
        #Random resampling
        s = np.random.randint(0,len(x),(len(x)))
        x = [x[j] for j in s]
        #
        xs.append(x)
    return np.array(xs)

This is the point when coding.

Use Gibbs sampling (maybe not required)
Transition function $ Q (x, x') $ uses a normal distribution with standard deviation $ \ sigma $
Limit the range of the variable x (do not jump to the negative area) After mapping with + π, resample it further (in theory, it shouldn't be necessary, but I want to reduce the possibility that points that fly too far will not come back easily).

** I'm not sure if this implementation is optimal **, but I can imagine that the actual libraries (PyMC3, WinBUGS, etc.) would have been devised in various ways. This is where the gap between theory and implementation becomes visible.

4. Try MCMC

Now, let's MCMC with the function created in 3 to estimate the parameter $ \ theta = (k, q) $ of the gamma distribution model from the observation data smp1 created in 2. Set the point cloud size to 500, the parameter dimension to 2 dimensions, and the maximum value to 10. Also, let the standard deviation of the transition function $ Q (x, x') $ be 3. Calculate MCMC 100 times.

p0 = genInit( 500, 2, 10)
sig = 3
xs = MCMC_gamma_01(smp1, p0, sig, 100)

After a few minutes, the result will come out, so let's make a graph.

fig = plt.figure()
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax1.hist(xs[-1,:,0], bins=50)
ax2.hist(xs[-1,:,1], bins=50)
ax1.set_xlabel("k")
ax2.set_xlabel("q")
k_est = xs[-1,:,0].mean()
q_est = xs[-1,:,1].mean()
print("k*:{}, q*:{}".format(k_est, q_est))
plt.show()

The result is shown below. It's not a very clean histogram, but when calculated by the average value, it is $ (\ bar {k}, \ bar {q}) = (2.21, 4.46) $. MCMC_sample_k2.0_q4.0.png

A comparison of the original generated distribution and the distribution using $ (\ bar {k}, \ bar {q}) = (2.21, 4.46) $ is shown in the figure below. It's a little off, but it seems that the long-tailed distribution can be adjusted to some extent. MCMC_match_k2.0_q4.0.png

5. Other results

If you try changing the parameter $ \ theta = (k, q) $ in various ways, it seems to fit. The impression is that it will converge after repeating it about 100 times.

Consideration

From the above, the following trends were seen from the results of estimating the gamma distribution by simply implementing MCMC.

It is possible (to some extent) to estimate the parameters of the original gamma distribution from the sample mean using the observation data generated from the gamma distribution.
However, ** there was a tendency to depend on the initial distribution as the prior distribution given to MCMC and the transition function $ Q (x, x') $ (variance) **.
It seems to be useful because it can be used for a function with strong non-linearity called gamma distribution, and it can be used in principle even if the parameter dimension is high.

Furthermore ...

The part that calculates the probability ratio based on the model from the data $ D_n $ was written very simply, but it seems that we can devise various measures such as measures when the probability becomes very small and speeding up.
Estimating the probability distribution behind the sample set scattered as a point cloud seems to be somewhat similar to the particle nature and wave nature of light, and I feel that they are related.
What happens when the probability distribution changes over time? Is also interesting.

Reference link

I referred to the following page.

[PYTHON] Try to estimate the parameters of the gamma distribution while simply implementing MCMC