[Python] Bayesian inference concepts (3): computing the change point in the number of messages received, with PyMC3

From the book "Bayesian Inference Experienced in Python"

Has the number of messages changed?

Now let's actually work through the example. The Python script in the book is written in PyMC2, but, to the extent of my understanding, I have rewritten it based on the PyMC3 script from the official website.

The question is this: given a record of how many messages a user received each day, did the number of received messages change at some point? Get the data from the official GitHub repository.

# -*- coding: utf-8 -*-
#Data visualization
from matplotlib import pyplot as plt
import numpy as np

count_data = np.loadtxt("data/txtdata.csv")
n_count_data = len(count_data)
plt.bar(np.arange(n_count_data), count_data, color="#348ABD")
plt.xlabel("Time (days)")
plt.ylabel("count of text-msgs received")
plt.title("Did the user's texting habits change over time?")
plt.xlim(0, n_count_data)
plt.show()

Here is the graph.

[Figure 1_4_1: bar chart of the daily message counts]

Can you spot a change somewhere? It is hard to tell just by looking at the graph. What do you think?

For the idea behind Bayesian inference, see the earlier article "Bayesianism and probability distribution".

First, since this is the number of messages received per day, the data are discrete counts, so we model them with a Poisson distribution. If $C_i$ is the number of messages on day $i$, then


C_i \sim \text{Poisson}(\lambda) 
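As a quick standalone illustration (not part of the book's code), a Poisson-distributed count with rate $\lambda$ has both mean and variance equal to $\lambda$, which is why a single rate parameter summarizes the daily counts:

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 20.0
# Simulate a large number of Poisson-distributed "daily message counts".
counts = rng.poisson(lam, size=100_000)

# For a Poisson distribution, both the sample mean and the sample
# variance should be close to lambda.
print(counts.mean())  # ~20
print(counts.var())   # ~20
```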

What about $\lambda$ here? I suspect $\lambda$ may have changed at some point during this period. Suppose the parameter $\lambda$ changed on day $\tau$. Such a sudden change is called a switchpoint.

$\lambda$ is then expressed as follows.

\lambda = 
\begin{cases}
\lambda_1  & \text{if } t \lt \tau \cr
\lambda_2 & \text{if } t \ge \tau
\end{cases}

If the number of received messages does not change, then $\lambda_1 = \lambda_2$ should hold.

Now, to place a prior on $\lambda$, we model it with an exponential distribution (since $\lambda$ is a continuous value). If its parameter is $\alpha$, this is written as:

\begin{align}
&\lambda_1 \sim \text{Exp}( \alpha ) \\
&\lambda_2 \sim \text{Exp}( \alpha )
\end{align}

The expected value of an exponential distribution is the reciprocal of its parameter, so $\alpha$ can be set from the sample mean:


\frac{1}{N}\sum_{i=0}^N \;C_i \approx E[\; \lambda \; |\; \alpha ] = \frac{1}{\alpha} 
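As a quick numerical check of this relation (a standalone sketch, not from the book), we can draw samples from an exponential distribution and confirm that the sample mean approaches $1/\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05  # hypothetical exponential rate parameter
# NumPy parameterizes the exponential by its scale = 1/alpha.
samples = rng.exponential(scale=1.0 / alpha, size=100_000)

# The sample mean should be close to the theoretical mean 1/alpha = 20.
print(samples.mean())  # ~20
```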

Setting $\alpha$ from the data this way keeps subjectivity out of the prior. Using two separate variables $\lambda_1$ and $\lambda_2$, rather than a single $\lambda$, expresses our belief that the number of received messages changed somewhere during the observation period.

As for $\tau$, we give it a uniform distribution: the prior belief that the change is equally likely on any day.


\begin{align}
& \tau \sim \text{DiscreteUniform}(1,70) \\
& \Rightarrow P( \tau = k ) = \frac{1}{70}
\end{align}
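This discrete-uniform prior is easy to verify empirically (a standalone sketch with a hypothetical 70-day window, not part of the model code):

```python
import numpy as np

rng = np.random.default_rng(1)
n_days = 70
# Draw many samples from a discrete uniform over days 0..69.
tau_draws = rng.integers(0, n_days, size=200_000)

# The empirical probability of any particular day, e.g. day 45,
# should be close to 1/70.
p_k = np.mean(tau_draws == 45)
print(p_k)  # ~1/70 ≈ 0.0143
```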

The question now is how to write this in Python; from here on it is written with PyMC3.

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    alpha = 1.0/count_data.mean()
    # Recall count_data is the
    # variable that holds our txt counts
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1)

lambda_1 = pm.Exponential("lambda_1", alpha) creates the PyMC3 variables corresponding to $\lambda_1$ and $\lambda_2$.

tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1) gives $\tau$ a uniform prior via pm.DiscreteUniform.

with model:
    idx = np.arange(n_count_data) # Index
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)


with model:
    observation = pm.Poisson("obs", lambda_, observed=count_data)

lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2) creates a new variable lambda_. The switch() function selects lambda_1 or lambda_2 as the value of lambda_ depending on which side of tau each day falls: days before tau get lambda_1, and days from tau onward get lambda_2.
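The same selection logic can be sketched in plain NumPy, with made-up fixed values for tau, lambda_1, and lambda_2 (in the real model these are random variables, and the switch lives inside the PyMC3 graph):

```python
import numpy as np

n_count_data = 10
idx = np.arange(n_count_data)  # day indices 0..9

# Hypothetical fixed values for illustration only.
tau, lambda_1, lambda_2 = 4, 18.0, 23.0

# np.where mirrors pm.math.switch: lambda_1 where tau > idx, else lambda_2.
lambda_ = np.where(tau > idx, lambda_1, lambda_2)
print(lambda_)  # days 0..3 get lambda_1, days 4..9 get lambda_2
```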

observation = pm.Poisson("obs", lambda_, observed=count_data)

This declares that the observed data count_data follow a Poisson distribution whose rate is given by the variable lambda_.

with model:
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=5000,step=step, cores=1)


lambda_1_samples = trace['lambda_1']
lambda_2_samples = trace['lambda_2']
tau_samples = trace['tau']

The details of this code are covered in Chapter 3 of the book. This is the learning step, computed by an algorithm called Markov chain Monte Carlo (MCMC). It returns thousands of random samples from the posterior distributions of $\lambda_1$, $\lambda_2$, and $\tau$, called a trace. By plotting these samples you can see the shape of each posterior distribution.
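Once you have the sample arrays, summarizing a posterior is just array arithmetic. A standalone sketch with synthetic NumPy arrays standing in for the real trace (the values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-ins for the arrays returned by the trace.
lambda_1_samples = rng.normal(18.0, 0.7, size=10_000)
tau_samples = rng.choice([44, 45, 46], p=[0.2, 0.5, 0.3], size=10_000)

# Posterior mean of lambda_1, and the empirical posterior
# probability that tau == 45.
print(lambda_1_samples.mean())     # ~18
print(np.mean(tau_samples == 45))  # ~0.5
```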

The computation starts here. Since multi-core sampling and CUDA are not available in my environment (hence cores=1), it takes a while, but the answer does come out.

The resulting graph:

[Figure 1_4_2: posterior distributions of $\lambda_1$, $\lambda_2$, and $\tau$]

As you can see from the results, roughly $\lambda_1 \approx 18$, $\lambda_2 \approx 23$, $\tau = 45$.

$\tau$ is most often 45 days, with probability about 50%. In other words, for some unknown reason, around day 45 there was some change (some action that altered the reception rate, known only to the user), and $\lambda$ changed with it.

Finally, let's plot the expected number of messages received each day.

[Figure 1_4_3: expected number of messages received each day]

Thus, with Bayesian inference, the data tell us when a change occurred. Note, however, that the expected reception count is itself a distribution, and the cause of the change is something you must reason about yourself.

Finally, the full script.


# -*- coding: utf-8 -*-
#Data visualization
from matplotlib import pyplot as plt
import numpy as np
count_data = np.loadtxt("data/txtdata.csv")
n_count_data = len(count_data)
plt.bar(np.arange(n_count_data), count_data, color="#348ABD")
plt.xlabel("Time (days)")
plt.ylabel("count of text-msgs received")
plt.title("Did the user's texting habits change over time?")
plt.xlim(0, n_count_data)
plt.show()

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    alpha = 1.0/count_data.mean()
    # Recall count_data is the
    # variable that holds our txt counts
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1)




with model:
    idx = np.arange(n_count_data) # Index
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)




with model:
    observation = pm.Poisson("obs", lambda_, observed=count_data)




### Mysterious code to be explained in Chapter 3.
with model:
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=5000,step=step, cores=1)


lambda_1_samples = trace['lambda_1']
lambda_2_samples = trace['lambda_2']
tau_samples = trace['tau']


#histogram of the samples:

ax = plt.subplot(311)
ax.set_autoscaley_on(False)

plt.hist(lambda_1_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label=r"posterior of $\lambda_1$", color="#A60628", density=True)
plt.legend(loc="upper left")
plt.title(r"""Posterior distributions of the variables
    $\lambda_1,\;\lambda_2,\;\tau$""")
plt.xlim([15, 30])
plt.xlabel(r"$\lambda_1$ value")

ax = plt.subplot(312)
ax.set_autoscaley_on(False)
plt.hist(lambda_2_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label=r"posterior of $\lambda_2$", color="#7A68A6", density=True)
plt.legend(loc="upper left")
plt.xlim([15, 30])
plt.xlabel(r"$\lambda_2$ value")

plt.subplot(313)
w = 1.0 / tau_samples.shape[0] * np.ones_like(tau_samples)
plt.hist(tau_samples, bins=n_count_data, alpha=1,
         label=r"posterior of $\tau$",
         color="#467821", weights=w, rwidth=2.)
plt.xticks(np.arange(n_count_data))

plt.legend(loc="upper left")
plt.ylim([0, .75])
plt.xlim([35, len(count_data)-20])
plt.xlabel(r"$\tau$ (in days)")
plt.ylabel("probability");
plt.show()

# tau_samples, lambda_1_samples, lambda_2_samples contain
# N samples from the corresponding posterior distribution
N = tau_samples.shape[0]
expected_texts_per_day = np.zeros(n_count_data)
for day in range(0, n_count_data):
    # ix is a bool index of all tau samples corresponding to
    # the switchpoint occurring prior to value of 'day'
    ix = day < tau_samples
    # Each posterior sample corresponds to a value for tau.
    # for each day, that value of tau indicates whether we're "before"
    # (in the lambda1 "regime") or
    #  "after" (in the lambda2 "regime") the switchpoint.
    # by taking the posterior sample of lambda1/2 accordingly, we can average
    # over all samples to get an expected value for lambda on that day.
    # As explained, the "message count" random variable is Poisson distributed,
    # and therefore lambda (the poisson parameter) is the expected value of
    # "message count".
    expected_texts_per_day[day] = (lambda_1_samples[ix].sum()
                                   + lambda_2_samples[~ix].sum()) / N


plt.plot(range(n_count_data), expected_texts_per_day, lw=4, color="#E24A33",
         label="expected number of text-messages received")
plt.xlim(0, n_count_data)
plt.xlabel("Day")
plt.ylabel("Expected # text-messages")
plt.title("Expected number of text-messages received")
plt.ylim(0, 60)
plt.bar(np.arange(len(count_data)), count_data, color="#348ABD", alpha=0.65,
        label="observed texts per day")

plt.legend(loc="upper left")
plt.show()

