I recently learned about Pyro, a library for probabilistic models, and found it interesting, so I gave it a try. This article shares the source code, which was written in a Jupyter Notebook. Please note that there is almost no theoretical explanation.
The ipynb file is also published on GitHub: https://github.com/isuya0508/exercise_Pyro/blob/master/bayesian_inference_bernulli.ipynb
Pyro is a library for probabilistic models that uses PyTorch as its backend. It can be installed via pip:
pip install pyro-ppl
Note that PyTorch must be installed in advance. Please refer to the official page for details.
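As a quick sanity check after installation, the versions of both libraries can be printed (this uses only the standard `__version__` attributes, nothing specific to this article):

import torch
import pyro
print(torch.__version__)
print(pyro.__version__)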
This time, we consider estimating the parameter $p$ from data that follows the Bernoulli distribution $Ber(p)$. First, import the required modules.
from collections import Counter
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import torch
import torch.distributions.constraints as constraints
import pyro
import pyro.distributions as dist
from pyro.optim import SGD, Adam
from pyro.infer import SVI, Trace_ELBO
%matplotlib inline
pyro.set_rng_seed(0)
pyro.enable_validation(True)
Create the data with random numbers. Note that the data must be a PyTorch tensor.
obs = stats.bernoulli.rvs(0.7, size=30, random_state=1)
obs = torch.tensor(obs, dtype=torch.float32)
obs
> tensor([1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1.,
1., 1., 0., 0., 1., 1., 0., 0., 1., 1., 1., 0.])
Check the number of 1s in the data.
Counter(obs.numpy())
> Counter({1.0: 23, 0.0: 7})
Therefore, the maximum likelihood estimate of the parameter $p$ is $23/30 \fallingdotseq 0.77$.[^2] From here on, we estimate $p$ by Bayesian inference instead.
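Since the maximum likelihood estimate of $p$ for Bernoulli data is just the sample mean (see footnote [^2]), it can be verified in one line:

# The MLE of p equals the sample mean: 23/30 ≈ 0.767
p_mle = obs.mean().item()
print(p_mle)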
In Bayesian estimation, a prior distribution is assumed for the parameters, and the posterior distribution is computed by combining the prior with the observed data.
For the parameter $p$ of the Bernoulli distribution $Ber(p)$, it is common to assume a beta distribution as the prior.[^1]
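Because of this conjugacy, the exact posterior is available in closed form, which gives a reference point for the SVI result below. A minimal sketch, assuming the $Beta(1, 1)$ prior used later:

# With a Beta(a, b) prior and k successes in n Bernoulli trials,
# the posterior is Beta(a + k, b + n - k) by conjugacy.
a0, b0 = 1, 1
k, n = 23, 30
exact_posterior = stats.beta(a0 + k, b0 + n - k)  # Beta(24, 8)
print(exact_posterior.mean())  # 0.75
print(exact_posterior.var())   # approx. 0.00568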
In Pyro, the `model` function describes the prior distribution and the model of the data.
def model(data):
    # Assume the prior distribution
    prior = dist.Beta(1, 1)
    # Model the data
    p = pyro.sample('p', prior)
    for i in range(len(data)):
        pyro.sample(f'obs{i}', dist.Bernoulli(p), obs=data[i])
This time we assume $Beta(1, 1)$, which is in fact identical to the uniform distribution on $[0, 1]$.
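As an aside, the per-observation loop in `model` can also be vectorized with `pyro.plate`; the following equivalent version is just a sketch and is not used in the rest of this article:

def model_vectorized(data):
    p = pyro.sample('p', dist.Beta(1, 1))
    # plate declares the observations conditionally independent,
    # so the whole tensor is scored at once
    with pyro.plate('data', len(data)):
        pyro.sample('obs', dist.Bernoulli(p), obs=data)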
The (approximate) posterior distribution is described by the `guide` function. Like the prior, the posterior is also a beta distribution. Give its parameters appropriate initial values.
def guide(data):
    # Define the (approximate) posterior distribution;
    # initial values must be float tensors so they can be optimized
    alpha_q = pyro.param('alpha_q', torch.tensor(15.0), constraint=constraints.positive)
    beta_q = pyro.param('beta_q', torch.tensor(15.0), constraint=constraints.positive)
    posterior = dist.Beta(alpha_q, beta_q)
    pyro.sample('p', posterior)
We will estimate the parameters $\alpha, \beta$ of this posterior distribution (the `alpha_q` and `beta_q` above).
To estimate the posterior parameters, we use stochastic variational inference (SVI); it is essentially Pyro's standard method. In this Bernoulli example the posterior can be derived analytically, so using an approximation such as variational inference is unnecessary, but I use it here for practice. (It is properly the method for cases where the posterior cannot be computed analytically.)
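For reference, SVI maximizes the evidence lower bound (ELBO) with respect to the guide's parameters, $\mathrm{ELBO} = \mathbb{E}_{q(p)}[\log p(\mathrm{data}, p) - \log q(p)]$, which is equivalent to minimizing the KL divergence between the guide $q(p)$ and the true posterior.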
For the implementation, instantiate an optimizer (gradient descent, SGD) and the stochastic variational inference object (SVI). When instantiating SVI, pass the `model` and `guide` defined above.
optimizer = SGD(dict(lr=0.0001, momentum=0.9))
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())
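Adam was also imported above; as a sketch, it could be used in place of SGD (the learning rate here is an untuned guess):

# Alternative: Pyro's Adam wrapper instead of SGD
optimizer = Adam(dict(lr=0.01))
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())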
All that remains is to call `svi.step(obs)` repeatedly to update the posterior parameters.
NUM_STEPS = 2000
pyro.clear_param_store()
history = {
    'loss': [],
    'alpha': [],
    'beta': []
}

for step in range(1, NUM_STEPS + 1):
    loss = svi.step(obs)
    history['loss'].append(loss)
    history['alpha'].append(pyro.param('alpha_q').item())
    history['beta'].append(pyro.param('beta_q').item())
    if step % 100 == 0:
        print(f'STEP: {step} LOSS: {loss}')
>
STEP: 100 LOSS: 17.461310371756554
STEP: 200 LOSS: 18.102468490600586
(omitted)
STEP: 1900 LOSS: 17.97793820500374
STEP: 2000 LOSS: 17.95139753818512
Here, the loss being minimized and the posterior parameters $\alpha$ and $\beta$ are recorded in `history`.
Plotting these values at each step gives the following figure. (The plotting source code is omitted.)

[Figure: loss and the parameters $\alpha$, $\beta$ at each step]
Finally, check the obtained $\alpha, \beta$ and the mean and variance of the resulting posterior distribution.
inferred_alpha = pyro.param('alpha_q').item()
inferred_beta = pyro.param('beta_q').item()
posterior = stats.beta(inferred_alpha, inferred_beta)

print(f'alpha: {inferred_alpha}')
print(f'beta: {inferred_beta}')
print(f'Expected: {posterior.expect()}')
print(f'Variance: {posterior.var()}')
>
alpha: 25.764650344848633
beta: 7.556574821472168
Expected: 0.7732203787899605
Variance: 0.005109101547631603
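As an additional sketch, SciPy can also report a central 95% credible interval from the fitted posterior:

# Central 95% interval of the fitted Beta posterior
low, high = posterior.interval(0.95)
print(f'95% credible interval: [{low:.3f}, {high:.3f}]')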
Let's plot the prior and posterior distributions. (The source code is omitted.)
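The plotting code was omitted in the original; a minimal sketch of how the two densities might be drawn:

x = np.linspace(0, 1, 200)
plt.plot(x, stats.beta(1, 1).pdf(x), label='prior Beta(1, 1)')
plt.plot(x, posterior.pdf(x), label='posterior (SVI)')
plt.xlabel('p')
plt.ylabel('density')
plt.legend()
plt.show()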
The estimate looks quite good.
In this article, I tried simple Bayesian inference using Pyro. The example was simple, but Pyro's strength is that it can describe complex models flexibly and concisely. The official documentation contains many examples of probabilistic methods, and just reading through them is instructive.
[^1]: It is known that if the prior distribution of $p$ is a beta distribution, the posterior distribution also follows a beta distribution. Such a prior is called a conjugate prior.
[^2]: The maximum likelihood estimator of $p$ equals the sample mean.