Probabilistic programming languages are used for Bayesian statistical modeling. Using Stan, BUGS, or JAGS from R is probably the most common approach, judging by the number of books on it. Python's probabilistic programming libraries include PyMC, PyStan, and Edward (now TensorFlow Probability), an extension of TensorFlow. Pyro, released in 2018, is a probabilistic programming library developed by Uber and built on PyTorch. More recently, the same team has also released NumPyro, which is based on JAX.
Pyro's paper: http://jmlr.org/papers/v20/18-403.html
NumPyro's paper: https://arxiv.org/abs/1912.11554
Pyro and NumPyro make it easy to implement Bayesian deep learning, and I expect they will see wider use in the future. (Implementation article)
This time, I will work through the basics from Pyro's official tutorial. I also referred to the following blog articles:
https://www.hellocybernetics.tech/entry/2019/12/08/033824
https://www.hellocybernetics.tech/entry/2019/12/08/193220
It is assumed that Python 3.4 or higher and PyTorch are installed. My environment was Ubuntu 16.04, CUDA 10, Anaconda, Python 3.6, and PyTorch 1.2.0.
```bash
pip install pyro-ppl
```
This installs Pyro 1.3.1. PyTorch was also automatically upgraded to 1.4.0.
Import the libraries and set the random seed. Setting the seed is required for reproducible sampling. Calling pyro.set_rng_seed also sets PyTorch's seed.
```python
import torch
import pyro

pyro.set_rng_seed(101)
```
pyro.distributions can be used much like torch.distributions. Here, we define the weather with a Bernoulli distribution and the temperature with a normal distribution.
```python
def weather():
    cloudy = pyro.sample('cloudy', pyro.distributions.Bernoulli(0.3))
    cloudy = 'cloudy' if cloudy.item() == 1.0 else 'sunny'
    mean_temp = {'cloudy': 55.0, 'sunny': 75.0}[cloudy]
    scale_temp = {'cloudy': 10.0, 'sunny': 15.0}[cloudy]
    temp = pyro.sample('temp', pyro.distributions.Normal(mean_temp, scale_temp))
    return cloudy, temp.item()
```
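As a quick check (my own addition, not in the tutorial), calling weather() a few times draws joint samples of the weather and the temperature; the output varies with the seed:

```python
for _ in range(3):
    cloudy, temp = weather()
    print(cloudy, temp)  # e.g. 'sunny' with a temperature around 75
```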
Building on this, we define ice cream sales. In other words, this is a hierarchical probabilistic model.
```python
def ice_cream_sales():
    cloudy, temp = weather()
    expected_sales = 200. if cloudy == 'sunny' and temp > 80.0 else 50.
    ice_cream = pyro.sample('ice_cream', pyro.distributions.Normal(expected_sales, 10.0))
    return ice_cream
```
Let's sample ice cream sales and display them as a histogram.
```python
import matplotlib.pyplot as plt

N = 1000  # number of samples
x_list = []
for _ in range(N):
    x = ice_cream_sales()
    x_list.append(x)
plt.hist(x_list)
```
A bimodal distribution was obtained.
If you use pyro.plate, it seems you can sample by vectorization instead of repeating with a for loop; that is, samples are drawn in parallel from independent and identically distributed variables. I rewrote the code so far using pyro.plate. Model definition:
```python
def model(N):
    with pyro.plate("plate", N):
        cloudy = pyro.sample('cloudy', pyro.distributions.Bernoulli(0.3))
        mean_temp = 75.0 - 20.0 * cloudy
        scale_temp = 15.0 - 5.0 * cloudy
        temp = pyro.sample('temp', pyro.distributions.Normal(mean_temp, scale_temp))
        expected_sales = 50.0 + 150.0 * (1 - cloudy) * (temp > 80.0)
        ice_cream = pyro.sample('ice_cream', pyro.distributions.Normal(expected_sales, 10.0))
    return ice_cream
```
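A quick way to confirm the vectorization (my own check, not in the tutorial) is to look at the shape of the returned tensor; all N samples come back at once:

```python
samples = model(5)
print(samples.shape)  # torch.Size([5])
```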
Display the histogram.
```python
import matplotlib.pyplot as plt

N = 1000  # number of samples
x_list = model(N)
plt.hist(x_list)
```
The histogram looks similar. Perhaps the slight difference comes from the parallel sampling due to vectorization?
An important feature of Pyro is that each random variable is given a name as the first argument, as in pyro.sample("hoge", dist). This name is needed to register the variable as a named random variable in the model, which Pyro's inference machinery then refers to. (I'm not sure about the details.)
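As an illustration of my own (not from the tutorial), the names passed to pyro.sample can be inspected by tracing one run of the model with pyro.poutine.trace; each named site appears in the trace:

```python
import pyro.poutine as poutine

# Record a single execution of the model and list the names of its sites
trace = poutine.trace(model).get_trace(3)
print(list(trace.nodes.keys()))  # includes 'cloudy', 'temp', 'ice_cream'
```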
In the previous section, we saw how to generate data from a probabilistic model. In this section, we will try to infer a probabilistic model from data.
```python
import torch
import pyro
import matplotlib.pyplot as plt

pyro.set_rng_seed(101)
```
As data, prepare 10 values in total: six 1s and four 0s.
```python
data = []
for _ in range(6):
    data.append(torch.tensor(1.0))
for _ in range(4):
    data.append(torch.tensor(0.0))
data = torch.tensor(data)  # convert to a tensor so that Pyro can handle it
plt.hist(data)
```
The histogram will look like the one above.
We now model a probability distribution for this data. Here, we use a Bernoulli distribution whose parameter "latent_fairness" follows a beta distribution.
```python
import pyro.distributions as dist

def model(data):
    alpha0 = torch.tensor(10.0)
    beta0 = torch.tensor(10.0)
    f = pyro.sample("latent_fairness", dist.Beta(alpha0, beta0))
    with pyro.plate("data_plate"):
        pyro.sample("obs", dist.Bernoulli(f), obs=data)
```
Next, we infer parameters that fit this model. There are several inference methods; variational Bayes and MCMC are the main ones.
Among these, variational Bayes seems to be the most fully supported in Pyro. This is probably because PyTorch's automatic differentiation and optimizers, on which Pyro is built, can be used directly for stochastic variational inference (SVI).
First, prepare a variational approximating distribution (a guide). You can define one yourself, or generate one automatically with the pyro.infer.autoguide module.
```python
import pyro.infer.autoguide

guide = pyro.infer.autoguide.guides.AutoDelta(model)
```
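For reference, a hand-written guide equivalent to AutoDelta might look like the following sketch (the parameter name f_map is my own choice, not from the tutorial):

```python
from pyro.distributions import constraints

def manual_guide(data):
    # A point-mass (Delta) guide: a single learnable parameter constrained to [0, 1]
    f_map = pyro.param("f_map", torch.tensor(0.5),
                       constraint=constraints.unit_interval)
    pyro.sample("latent_fairness", dist.Delta(f_map))
```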
Perform variational inference.
```python
from pyro.optim import Adam
from pyro.infer import SVI, Trace_ELBO

adam_params = {"lr": 0.0005, "betas": (0.90, 0.999)}
optimizer = Adam(adam_params)

pyro.clear_param_store()
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

n_steps = 1000  # number of steps
losses, f = [], []  # record values for plotting later
for step in range(n_steps):
    loss = svi.step(data)
    losses.append(loss)
    f.append(pyro.param("AutoDelta.latent_fairness").item())
```
Inference is now complete. Let's plot the progress.
```python
plt.subplot(2, 1, 1)
plt.plot(losses)
plt.xlabel("step")
plt.ylabel("loss")
plt.subplot(2, 1, 2)
plt.plot(f)
plt.xlabel("step")
plt.ylabel("latent_fairness")
```
The parameter "latent_fairness" has converged to about 0.5357. The final value can be displayed as follows.
```python
for name in pyro.get_param_store():
    print("{}: {}".format(name, pyro.param(name)))
```

```
AutoDelta.latent_fairness: 0.535700261592865
```
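As a sanity check of my own (not in the tutorial), the Beta prior is conjugate to the Bernoulli likelihood, so the exact posterior is Beta(10 + 6, 10 + 4) = Beta(16, 14). AutoDelta performs MAP estimation, and the posterior mode (16 - 1) / (16 + 14 - 2) = 15/28 ≈ 0.5357 matches the value found by SVI:

```python
alpha_post = 10.0 + 6  # prior alpha + number of ones
beta_post = 10.0 + 4   # prior beta + number of zeros
map_estimate = (alpha_post - 1) / (alpha_post + beta_post - 2)
print(map_estimate)  # 0.5357142857142857
```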
Let's display the predictive distribution from the inferred model, taking 1000 samples.
```python
from pyro.infer import Predictive

predictive_dist = Predictive(model=model, guide=guide,
                             num_samples=1000, return_sites=["obs"])
plt.hist(predictive_dist.get_samples(None)["obs"].view(-1))
```
The distribution is similar to the histogram of the data. (Perhaps the slight difference is because there are only 10 data points?)
MCMC
MCMC is a brute-force technique and takes time.
```python
from pyro.infer import NUTS, MCMC

nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=5000, warmup_steps=1000)
mcmc.run(data)
```
Inference is now complete. Let's look at the result.
```python
mcmc.summary()
```

| parameter | mean | std | median | 5.0% | 95.0% | n_eff | r_hat |
|---|---|---|---|---|---|---|---|
| latent_fairness | 0.53 | 0.09 | 0.54 | 0.38 | 0.68 | 1894.58 | 1.00 |
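Again as my own check, the exact posterior Beta(16, 14) has mean 16/30 ≈ 0.533 and standard deviation ≈ 0.09, so the MCMC summary agrees well with the analytical result:

```python
posterior = dist.Beta(torch.tensor(16.0), torch.tensor(14.0))
print(posterior.mean.item())             # about 0.533
print(posterior.variance.sqrt().item())  # about 0.090
```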
With MCMC, the predictive distribution can be obtained by passing the parameter samples to Predictive via posterior_samples.
```python
from pyro.infer import Predictive

f_mcmc = mcmc.get_samples()["latent_fairness"]
predictive_dist = Predictive(model=model,
                             posterior_samples={"latent_fairness": f_mcmc},
                             return_sites=["obs"])
plt.hist(predictive_dist.get_samples(None)["obs"].view(-1))
```
This also has a distribution similar to the histogram of the data.
Other options are MAP estimation and the Laplace approximation. MAP estimation is a special case of Bayesian inference; since it is a point estimate, it does not produce a probability distribution, so strictly speaking it is an estimation rather than an inference. (Reference article) **The variational Bayes performed above is actually equivalent to MAP estimation, because it uses a delta distribution as the variational approximating distribution.** The Laplace approximation appears to be available through AutoLaplaceApproximation in autoguide.
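For example, here is a sketch I have not verified against the tutorial: swapping the guide for AutoLaplaceApproximation and rerunning the same SVI loop should give a Gaussian (Laplace) approximation of the posterior, reusing the model, optimizer, and Trace_ELBO defined above:

```python
from pyro.infer.autoguide import AutoLaplaceApproximation

pyro.clear_param_store()
laplace_guide = AutoLaplaceApproximation(model)
svi = SVI(model, laplace_guide, optimizer, loss=Trace_ELBO())
for step in range(1000):
    svi.step(data)

# After fitting, laplace_approximation() returns a Gaussian guide around the MAP point
posterior_approx = laplace_guide.laplace_approximation(data)
```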
It seems that Pyro can also handle various other models, such as Gaussian mixture models, Gaussian processes, Bayesian optimization, NMF, Kalman filters, and hidden Markov models.
I've only just started with Pyro, but I now have a rough idea of how to use it. I think it's a wonderful technology that lets you implement so concisely models whose theory is hard to work through with mathematical formulas.