[PYTHON] Try using a probabilistic programming language (Pyro)

Probabilistic programming languages are used for Bayesian statistical modeling. Using Stan, BUGS, or JAGS from R is probably the most common approach, since there are many books about it. Python's probabilistic programming libraries include PyMC, PyStan, and Edward (now TensorFlow Probability), an extension of TensorFlow. Pyro, released in 2018, is a probabilistic programming library based on PyTorch and developed by Uber. More recently, the same team also released NumPyro, which is based on JAX.
Pyro's paper: http://jmlr.org/papers/v20/18-403.html
NumPyro's paper: https://arxiv.org/abs/1912.11554

Pyro and NumPyro make it easy to implement Bayesian deep learning, and I expect they will be used more widely in the future. (Implementation article)

This time, I will work through the basics from Pyro's official tutorial. I also referred to the following blog posts.
https://www.hellocybernetics.tech/entry/2019/12/08/033824
https://www.hellocybernetics.tech/entry/2019/12/08/193220

Install Pyro

It is assumed that you have Python 3.4 or higher and PyTorch installed. My environment was Ubuntu 16.04, CUDA 10, Anaconda, Python 3.6, and PyTorch 1.2.0.

pip install pyro-ppl

Pyro 1.3.1 was installed. PyTorch was also automatically upgraded to 1.4.0.
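To make sure the installation worked, here is a quick version check of my own (not part of the original article):

import torch
import pyro
print(pyro.__version__)   # e.g. 1.3.1
print(torch.__version__)  # e.g. 1.4.0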

Implementation (1): Defining and sampling from probability distributions

Import the libraries and set the random seed. Setting the seed is required for reproducible sampling. Calling pyro.set_rng_seed also sets PyTorch's random seed.

import torch
import pyro
pyro.set_rng_seed(101)

pyro.distributions can be used much like torch.distributions. This time, we model the weather with a Bernoulli distribution and the temperature with a normal distribution.

def weather():
    # 30% chance of cloudy weather
    cloudy = pyro.sample('cloudy', pyro.distributions.Bernoulli(0.3))
    cloudy = 'cloudy' if cloudy.item() == 1.0 else 'sunny'
    # temperature distribution depends on the weather
    mean_temp = {'cloudy': 55.0, 'sunny': 75.0}[cloudy]
    scale_temp = {'cloudy': 10.0, 'sunny': 15.0}[cloudy]
    temp = pyro.sample('temp', pyro.distributions.Normal(mean_temp, scale_temp))
    return cloudy, temp.item()

We build on this to define ice cream sales; in other words, it is a hierarchical probabilistic model.

def ice_cream_sales():
    cloudy, temp = weather()
    expected_sales = 200. if cloudy == 'sunny' and temp > 80.0 else 50.
    ice_cream = pyro.sample('ice_cream', pyro.distributions.Normal(expected_sales, 10.0))
    return ice_cream

Let's sample ice cream sales and show them in a histogram.

import matplotlib.pyplot as plt
N = 1000 #The number of samples
x_list = []
for _ in range(N):
    x = ice_cream_sales()
    x_list.append(x)
plt.hist(x_list)

(Figure: histogram of 1000 samples of ice cream sales.) A bimodal distribution was obtained.

Vectorization with the pyro.plate method

Using pyro.plate, you can draw vectorized samples instead of repeating a for loop; sampling is performed in parallel from independent and identically distributed variables. I rewrote the code above using pyro.plate. Model definition:

def model(N):
    with pyro.plate("plate", N):
        # weather: 30% chance of cloudy (cloudy = 1, sunny = 0)
        cloudy = pyro.sample('cloudy', pyro.distributions.Bernoulli(0.3))
        # temperature parameters depend on the weather
        mean_temp = 75.0 - 20.0 * cloudy
        scale_temp = 15.0 - 5.0 * cloudy
        temp = pyro.sample('temp', pyro.distributions.Normal(mean_temp, scale_temp))
        # expected sales jump to 200 only when it is sunny and hotter than 80
        expected_sales = 50.0 + 150.0 * (1 - cloudy) * (temp > 80.0)
        ice_cream = pyro.sample('ice_cream', pyro.distributions.Normal(expected_sales, 10.0))
    return ice_cream

Display the histogram.

import matplotlib.pyplot as plt
N = 1000 #The number of samples
x_list = model(N)
plt.hist(x_list)

(Figure: histogram of ice cream sales from the vectorized model.)

The histogram is similar. The slight difference is presumably just due to the different random draws produced by the parallel, vectorized sampling.

An important feature of Pyro is that a name is given as the first argument of each sample statement, as in pyro.sample("hoge", dist). This registers the draw as a named random variable (a sample site) that Pyro's inference machinery can refer to. (I'm not sure about the finer details.)
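To see what this naming actually does, here is a minimal sketch of my own (not from the original article) that traces the vectorized model and lists its named sample sites:

import pyro.poutine as poutine

trace = poutine.trace(model).get_trace(3)  # run the vectorized model with N = 3
for name, site in trace.nodes.items():
    if site["type"] == "sample":
        print(name, site["value"].shape)  # the names 'cloudy', 'temp', 'ice_cream' appear here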

Implementation (2): Bayesian inference

In the previous section, we saw how to generate data from a probabilistic model. In this section, we will try to infer a probabilistic model from data.

import torch
import pyro
pyro.set_rng_seed(101)
import matplotlib.pyplot as plt

As data, prepare 10 observations in total: six 1s and four 0s.

data = []
for _ in range(6):
    data.append(torch.tensor(1.0))
for _ in range(4):
    data.append(torch.tensor(0.0))
data = torch.tensor(data) #Make it a tensor for handling with Pyro
plt.hist(data)

(Figure: histogram of the data, six 1s and four 0s.)

Let's model a probability distribution for this data. This time, we consider a Bernoulli distribution whose parameter "latent_fairness" follows a beta distribution.

import pyro.distributions as dist
def model(data):
    # Beta(10, 10) prior on the latent fairness
    alpha0 = torch.tensor(10.0)
    beta0 = torch.tensor(10.0)
    f = pyro.sample("latent_fairness", dist.Beta(alpha0, beta0))
    # observations are Bernoulli draws with probability f
    with pyro.plate("data_plate"):
        pyro.sample("obs", dist.Bernoulli(f), obs=data)

Let's infer parameters that fit this model. There are several inference methods, but variational Bayes and MCMC are the main ones.

Variational Bayes

Among the inference methods, variational Bayes seems to be the most fully developed in Pyro. This is probably because stochastic variational inference (SVI) can directly use the automatic differentiation and optimizers of PyTorch, on which Pyro is built.

First, prepare a variational approximation distribution (a guide). You can define it yourself, or generate one automatically with the pyro.infer.autoguide module.

guide = pyro.infer.autoguide.guides.AutoDelta(model)
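You can also write the guide by hand instead of using autoguide. A minimal hand-written guide for this model might look like the sketch below (my own code in the style of Pyro's official SVI tutorial; the variational parameter names alpha_q and beta_q are my choice). Passing it to SVI in place of the AutoDelta guide gives a full Beta approximate posterior rather than a point estimate.

from torch.distributions import constraints

def manual_guide(data):
    # variational parameters to be optimized
    alpha_q = pyro.param("alpha_q", torch.tensor(15.0), constraint=constraints.positive)
    beta_q = pyro.param("beta_q", torch.tensor(15.0), constraint=constraints.positive)
    # the guide must contain a sample site with the same name as in the model
    pyro.sample("latent_fairness", dist.Beta(alpha_q, beta_q))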

Perform variational inference.

from pyro.optim import Adam
from pyro.infer import SVI, Trace_ELBO
adam_params = {"lr": 0.0005, "betas": (0.90, 0.999)}
optimizer = Adam(adam_params)
pyro.clear_param_store()
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())
n_steps = 1000  # number of steps
losses, f = [], []  # record the loss and parameter for plotting later
for step in range(n_steps):
    loss = svi.step(data)
    losses.append(loss)
    f.append(pyro.param("AutoDelta.latent_fairness").item())

Inference is complete. Let's plot the progress.

plt.subplot(2,1,1)
plt.plot(losses)
plt.xlabel("step")
plt.ylabel("loss")
plt.subplot(2,1,2)
plt.plot(f)
plt.xlabel("step")
plt.ylabel("latent_fairness")

(Figure: loss and latent_fairness plotted against the optimization step.)

The parameter "latent_fairness" converged to about 0.5357. The final value can be displayed as follows.

for name in pyro.get_param_store():
    print("{}: {}".format(name, pyro.param(name)))

AutoDelta.latent_fairness: 0.535700261592865
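As a quick sanity check of my own (not in the original article): AutoDelta performs MAP estimation, and for a Beta(10, 10) prior updated with six 1s and four 0s, the posterior is Beta(16, 14), whose mode is (16 - 1) / (16 + 14 - 2) = 15 / 28 ≈ 0.5357, matching the value above.

alpha_post, beta_post = 10.0 + 6, 10.0 + 4              # posterior Beta parameters
print((alpha_post - 1) / (alpha_post + beta_post - 2))  # 0.5357142857142857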

Let's display the predictive distribution from the inferred model, taking 1000 samples.

from pyro.infer import Predictive
predictive_dist = Predictive(model=model, guide=guide, num_samples=1000, return_sites=["obs"])
plt.hist(predictive_dist.get_samples(None)["obs"].view(-1))

(Figure: histogram of 1000 predictive samples of obs from the variational fit.)

The distribution is similar to the histogram of the data. (Perhaps it differs slightly because there are only 10 data points?)

MCMC

MCMC is a brute-force technique and takes time.

from pyro.infer import NUTS, MCMC
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=5000, warmup_steps=1000)
mcmc.run(data)

Inference is complete. Let's look at the result.

mcmc.summary()
parameter            mean      std   median     5.0%    95.0%     n_eff    r_hat
latent_fairness      0.53     0.09     0.54     0.38     0.68   1894.58     1.00
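Another check of my own: the exact posterior Beta(16, 14) has mean 16 / 30 ≈ 0.533 and standard deviation sqrt(16 * 14 / (30^2 * 31)) ≈ 0.090, so the MCMC summary agrees with the analytic answer.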

With MCMC, the predictive distribution can be obtained by passing the parameter samples to Predictive via the posterior_samples argument.

f_mcmc = mcmc.get_samples()["latent_fairness"]
from pyro.infer import Predictive
predictive_dist = Predictive(model=model, posterior_samples={"latent_fairness": f_mcmc}, return_sites=["obs"])
plt.hist(predictive_dist.get_samples(None)["obs"].view(-1))

(Figure: histogram of predictive samples of obs from the MCMC posterior.)

This also has a distribution similar to the histogram of the data.

Other inference methods

There are also MAP estimation and the Laplace approximation. MAP estimation is a special case of Bayesian inference; since it is a point estimate, it does not produce a probability distribution, so strictly speaking it is estimation rather than inference. (Reference article) Note that the variational Bayes performed above is equivalent to MAP estimation, because it uses a delta distribution (AutoDelta) as the variational approximation. The Laplace approximation appears to be available via autoguide's AutoLaplaceApproximation, as sketched below.
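A minimal sketch of my own (based on the autoguide API, not code from the original article) of how swapping in the Laplace approximation might look for the same model:

from pyro.optim import Adam
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoLaplaceApproximation

pyro.clear_param_store()
laplace_guide = AutoLaplaceApproximation(model)
svi = SVI(model, laplace_guide, Adam({"lr": 0.005}), loss=Trace_ELBO())
for step in range(1000):
    svi.step(data)
# after fitting, build the Gaussian approximation around the mode
posterior_guide = laplace_guide.laplace_approximation(data)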

Applications

It seems that you can implement various models, such as Gaussian mixture models, Gaussian processes, Bayesian optimization, NMF, Kalman filters, and hidden Markov models.

Conclusion

I'm only just getting started with Pyro, but I now have a rough idea of how to use it. Being able to implement so concisely a theory that is hard to handle with mathematical formulas strikes me as a wonderful technology.
