[PYTHON] A note on trying a simple MCMC tutorial with PyMC3

Overview

One of the free MCMC samplers that can be used from Python is PyMC3. The other day, a high school girl sitting next to me at Starbucks was saying things like "PyMC3 may be faster than PyMC2..." and "with Stan, discrete parameters are...", so I tried the official tutorial (https://pymc-devs.github.io/pymc3/getting_started/) and took notes on a few errors and some additional experiments. This is for people who think "I want to use hierarchical Bayes in Python, but writing MCMC from scratch is hard..." or "I need discrete parameters...". A student who is not very familiar with MCMC wrote this in the middle of the night, so there are probably many hard-to-read parts, small mistakes, and gaps in understanding; please forgive me.

Environment

My PC environment is OS X 10.9.5 with Python 2.7+. Everything you need can be installed with pip.

pip install git+https://github.com/pymc-devs/pymc3

When installing PyMC3, note that slightly older articles may point to a different URL. Pandas and Patsy are convenient to have, so I recommend installing them with pip as well.

pip install pandas
pip install patsy

Basic processing

To explain where the errors occurred, I will reuse parts of the tutorial, so let me first summarize the minimum necessary pieces from its first half. In this article, the PyMC3-dependent parts are written explicitly with the pm. prefix; the tutorial omits the pm. prefix, but otherwise the code is the same.

First, some imports.

import numpy as np
%pylab inline
np.random.seed(27)
import pymc3 as pm

Create a sample data set as follows.

alpha, sigma = 1, 1
beta = [1, 2.5]
size = 100
X1 = np.linspace(0, 1, size)
X2 = np.linspace(0,.2, size)
Y = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma

Define the model.

# A model that linearly regresses Y on X1 and X2
# The with statement below defines a model named model
with pm.Model() as model:

    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    sigma = pm.HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = alpha + beta[0]*X1 + beta[1]*X2

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)

Run MCMC using the model above. (It took 13.6 seconds on my CPU.)

from scipy import optimize
with model:
    # obtain starting values via MAP
    start = pm.find_MAP(fmin=optimize.fmin_powell)
    # instantiate sampler
    step = pm.Slice(vars=[sigma])
    # draw 5000 posterior samples
    trace = pm.sample(5000, start=start, step=step)

Regarding the arguments of pm.sample(): the first, 5000, is the number of samples you want to draw. start sets the initial values; here the MAP estimate is used, but it can also be omitted. step specifies the sampling method, slice sampling in this case (only for sigma). If step is not specified, a default is assigned per variable type:

・ Binary variables: BinaryMetropolis
・ Discrete variables: Metropolis
・ Continuous variables: NUTS

With the simple model used here, there was almost no difference between sampling methods. The Hamiltonian MC and NUTS available in Stan can also be used with PyMC3.
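
For comparison, here is a minimal sketch of calling pm.sample() without the step argument, so the defaults above are assigned automatically (the name trace_default is mine):

with model:
    # no step argument: PyMC3 assigns samplers automatically
    # (NUTS here, since alpha, beta and sigma are all continuous)
    trace_default = pm.sample(5000)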

Visualization is very easy; a trace plot takes just one line:

pm.traceplot(trace[4000:])

The summary information can also be viewed as follows.

pm.summary(trace[4000:])

For simplicity, only the normal distribution is used this time, but of course many other distributions can be handled as well: gamma, beta, binomial, Poisson, and so on.
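
As a small hypothetical sketch (not from the tutorial), a gamma prior with a Poisson likelihood plugs in the same way; the toy data counts and the model name poisson_model are my own:

counts = np.random.poisson(3, size=100)  # toy count data, for illustration only
with pm.Model() as poisson_model:
    lam = pm.Gamma('lam', alpha=2, beta=1)            # gamma prior on the rate
    obs = pm.Poisson('obs', mu=lam, observed=counts)  # Poisson likelihood
    trace_poisson = pm.sample(1000)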

Trouble

Convenient as PyMC3 is, it has a few problems. Let's work through the tutorial.

1. I get an error when using Theano methods

I think I saw a related discussion on Twitter just the other day; in this case the cause of the error is probably this line at the end of https://github.com/pymc-devs/pymc3/blob/master/pymc3/model.py:

theano.config.compute_test_value = 'raise'

(Reference: https://sites.google.com/site/iwanamidatascience/vol1/support_tokushu). I do not know the detailed cause, but typing the following after importing PyMC3

import theano
theano.config.compute_test_value = 'ignore'

solved it for me. By the way, it seems you can also write 'off' instead of 'ignore', but with glm calculations and the like 'off' gave me

scratchpad instance has no attribute 'test_value'

so I think it is safest to use 'ignore' for the time being. Incidentally, the Probabilistic Matrix Factorization page in the official Examples also uses 'ignore'.

2. Handling self-made functions and classes is troublesome

2.1 You may not actually need as_op

In the Arbitrary deterministics section of the tutorial, the deterministic variable b is computed with the following crazy_modulo3() function.

import theano.tensor as T 
from theano.compile.ops import as_op

@as_op(itypes=[T.lscalar], otypes=[T.lscalar])
def crazy_modulo3(value):
    if value > 0: 
        return value % 3
    else :
        return (-value + 1) % 3

with pm.Model() as model_deterministic:
    a = pm.Poisson('a', 1)
    b = crazy_modulo3(a)

The @as_op(...) part is a kind of incantation for using a function that operates on theano.tensor variables: you write it on the line just before the function definition to declare the input and output types. The PyMC3 tutorial also says:

Theano needs to know the types of the inputs and outputs of a function, which are specified for as_op by itypes for inputs and otypes for outputs.

**However, this can actually be defined as follows without using @as_op.**

import theano.ifelse
def symbolf(x):
    return theano.ifelse.ifelse(T.gt(x,0), x%3, (-x+1)%3)
def crazy_modulo3(value):
    return symbolf(value)

I got a bit stuck on this point, so let me explain.

For example, if you write it naively without ifelse,

def crazy_modulo3(value):
    if value > 0:
        return value % 3
    else :
        return (-value + 1) % 3

Theano complains that it cannot evaluate the comparison value > 0. Also, if you write

def crazy_modulo3(value):
    if T.lt(0,value):
        return value % 3
    else :
        return (-value + 1) % 3

then no error occurs, but when I traced the values, T.lt(0, value) never becomes False ("Theano has no boolean dtype. Instead, all boolean tensors are represented in 'int8'." -- http://deeplearning.net/software/theano/library/tensor/basic.html), so the else branch is never taken. Therefore ifelse is used to force the comparison. This is one of Theano's annoyances, but with it the function can be defined without @as_op.

2.2 How to trace plot a deterministic variable

Defining the deterministic variable b as in the code above is fine, but the tutorial does not describe how to sample afterwards. If you try

with model_deterministic:
    trace = pm.sample(500)
pm.traceplot(trace)

only the value of the variable a gets plotted. I was also interested in b, so I looked through the contents of pm. and found pm.Deterministic(). If you change the model definition from b = crazy_modulo3(a) to b = pm.Deterministic("b", crazy_modulo3(a)) and sample again, the values of b are plotted correctly as well.
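
Putting those pieces together, a sketch of the modified model and the sampling (same as before except for the pm.Deterministic line):

with pm.Model() as model_deterministic:
    a = pm.Poisson('a', 1)
    b = pm.Deterministic('b', crazy_modulo3(a))

with model_deterministic:
    trace = pm.sample(500)
pm.traceplot(trace)  # now both a and b appear in the trace plot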

2.3 Avoid using as_op to calculate random variables

The Arbitrary distributions section of the tutorial describes how to define the probability distribution you want to sample from, as follows.

with pm.Model() as model1:
    alpha = pm.Uniform('intercept', -100, 100)
    # Create custom densities
    beta = pm.DensityDist('beta', lambda value: -1.5 * T.log(1 + value**2), testval=0)
    eps = pm.DensityDist('eps', lambda value: -T.log(T.abs_(value)), testval=1)
    # Create likelihood
    like = pm.Normal('y_est', mu=alpha + beta * X, sd=eps, observed=Y)

This approach, defining the density with a lambda expression and pm.DensityDist, poses no problem. If you go on to sample as follows, it works correctly.

with model1:
    trace = pm.sample(500)
pm.traceplot(trace)

As code with a similar meaning, the tutorial also shows the following class-based approach (**as I will point out shortly, this code is probably broken**). It does not use a lambda expression, so it has the advantage that the function is easier to write. The parameters other than beta are ones I added so that the model has the same shape as model1 above.

class Beta(pm.distributions.Continuous):
    def __init__(self, mu, *args, **kwargs):
        super(Beta, self).__init__(*args, **kwargs)
        self.mu = mu
        self.mode = mu

    def logp(self, value):
        mu = self.mu
        return beta_logp(value - mu)

@as_op(itypes=[T.dscalar], otypes=[T.dscalar])
def beta_logp(value):
    return -1.5 * np.log(1 + (value)**2)

with pm.Model() as model2:
    beta = Beta('slope', mu=0, testval=0)

    #I add other parameters to follow model1 above
    alpha = pm.Uniform('intercept', -100, 100)
    eps = pm.DensityDist('eps', lambda value: -T.log(T.abs_(value)), testval=1)
    like = pm.Normal('y_est', mu=alpha + beta * X, sd=eps, observed=Y)

The familiar @as_op appears in the beta calculation. When it was used to compute the deterministic variable earlier, sampling worked fine, but what about this time?

with model2:
    trace = pm.sample(100)
pm.traceplot(trace)

If you run this, you will probably get the following error:

AttributeError: 'FromFunctionOp' object has no attribute 'grad'

Apparently, **a function wrapped with @as_op does not have the grad attribute needed for the gradient computation in MCMC sampling**. Presumably it was fine for the deterministic variable earlier because no gradient is computed there. When I googled around this, I found an answer from one of the developers saying, in effect, that there is no workaround:

The only way to get the automatic gradient computation is by expressing your density in terms of theano operators. as_op creates a blackbox function for which autodiff will not work so there is no way I know of (except numerical differentiation) to make this work. https://github.com/pymc-devs/pymc3/issues/601

However, this is where the result of 2.1 comes in: if @as_op is the problem, why not define the function without @as_op in the first place? So the definition

@as_op(itypes=[T.dscalar], otypes=[T.dscalar])
def beta_logp(value):
    return -1.5 * np.log(1 + (value)**2)

is rewritten as follows:

def beta_logp(value):
    return -1.5 * T.log(1 + (value)**2)

As in 2.1, as long as you follow the rules for handling theano.tensor variables (here, using T.log instead of np.log), the computation goes through without problems. The results are almost the same (with some wobble, since it is sampling).
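
For reference, a sketch assembling the pieces into the full model2 with the Theano-based beta_logp (X and Y here stand for the same data used in model1 from this section of the tutorial):

def beta_logp(value):
    return -1.5 * T.log(1 + value**2)  # T.log keeps the expression symbolic, so grad works

class Beta(pm.distributions.Continuous):
    def __init__(self, mu, *args, **kwargs):
        super(Beta, self).__init__(*args, **kwargs)
        self.mu = mu
        self.mode = mu

    def logp(self, value):
        return beta_logp(value - self.mu)

with pm.Model() as model2:
    beta = Beta('slope', mu=0, testval=0)
    alpha = pm.Uniform('intercept', -100, 100)
    eps = pm.DensityDist('eps', lambda value: -T.log(T.abs_(value)), testval=1)
    like = pm.Normal('y_est', mu=alpha + beta * X, sd=eps, observed=Y)
    trace = pm.sample(100)
pm.traceplot(trace)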

**The bottom line: when you use your own functions and classes, you need to be careful with Theano's syntax, whether or not you use lambda expressions.**

3. Speed and performance with glm

Finally, a word about speed and performance when fitting generalized linear models (glm). Details can be found at https://github.com/pymc-devs/pymc3/issues/544; the gist is that "NUTS is often slow with the default settings", "Metropolis is fast", and "depending on the data and the machine, Hamiltonian MC can be fast". For reference, I fit a glm (ordinary linear regression) to the tutorial data in my environment.

import pandas as pd
df = pd.DataFrame({'x1': X1, 'x2': X2, 'y': Y})
with pm.Model() as model:
    pm.glm.glm('y ~ x1+x2', df)
    start = pm.find_MAP(fmin=optimize.fmin_powell)
    step = pm.HamiltonianMC()
    trace = pm.sample(1000, start=start, step=step)
pm.traceplot(trace)
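
The other timings below were presumably obtained by swapping only the step line; a minimal sketch (everything else stays the same):

with model:
    step = pm.Metropolis()  # or pm.NUTS(); only this line changes for the comparison
    trace = pm.sample(1000, start=start, step=step)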

・ HamiltonianMC: finished in 1.7 seconds; convergence is also fine (Figure 1).
・ Metropolis: finished in 0.3 seconds; convergence is poor.
・ NUTS: only 8 samples drawn even after 5 minutes; far too slow.
・ Hand-written sampling without the pm.glm class, as in the basic example earlier in this article: finished in 3.2 seconds; convergence is inferior to Hamiltonian MC but not bad (Figure 2).

True parameters: `alpha = 1, beta[0] = 1, beta[1] = 2.5, sigma = 1`

(Figure 1: trace plot of the HamiltonianMC glm run; screenshot omitted)

(Figure 2: hand-written sampling; for beta, the blue line corresponds to the regression coefficient of X1 and the green line to that of X2, and sigma corresponds to sd; screenshot omitted)

Those were the results. Incidentally, when I cut the NUTS run short and plotted it, the parameter scaling looked strange. As a remedy for this case, the issue page above says that

C = pm.approx_hessian(model.test_point)
step = pm.NUTS(scaling=C)

should give good speed. However, when I updated my version of Sphinx (install it with pip if you don't have it) and ran the two lines above, I was told that a leading minor is not positive definite and the computation failed:

LinAlgError: 2-th leading minor not positive definite

I passed on this since it did not look like something I could solve, but there seems to be a related discussion here: https://github.com/scikit-learn/scikit-learn/issues/2640

My personal impression: **if you just want to use glm, try glm with NUTS using approx_hessian for the scaling, and if that does not work, try glm with HMC.**

Also, the approach that does not use the pm.glm class, as in the basic example earlier in this article, can probably be made to perform better with some care, such as choosing the priors well, and it lets you watch the behavior of the hyperparameters. When I want to use a more complicated hierarchical Bayesian model, I think I would go with that approach rather than pm.glm.

Finally

This memo ended up rather long, but I hope it serves as a reference for the spots where it is easy to get stuck. With @as_op there were some real struggles, like wasting a whole night trying to redefine grad on the Theano side.

Also, on the official page the Examples (Probabilistic Matrix Factorization, Survival Analysis, etc.) look more interesting than the tutorial, so I will give them a try some other midnight when I feel like it.

I will add to this as appropriate.
