[PYTHON] Analysis by Bayesian inference (1) ... Which is better, A or B?

Bayesian inference analysis example: Which is better, A or B?

From the book "Bayesian Inference Experienced in Python".

The problem: use Bayesian inference to decide which is better, Site A or Site B.

Question: What is the true conversion rate for Site A?

Let $p_A$ be the probability that a user who sees Site A eventually takes a profitable action, such as requesting materials or making a purchase (this is called a conversion). Suppose $N$ people look at Site A and $n$ of them convert. Then

 p_A = \frac{n}{N} 

You might think so, but that is not correct. That is, we do not know whether $p_A$ and $n/N$ are equal: there is a difference between the observed frequency and the true frequency, and it is the true frequency that can be interpreted as the probability that the event occurs. In other words, from the observations alone we cannot tell how likely a visitor to Site A is to take the action. Take dice as an example: the true frequency of rolling a 1 is $1/6$, but if you actually roll a die 6 times you may not get a single 1; that result is the observed frequency. Unfortunately, reality is too complicated and noise is unavoidable, so we never know the true frequency and have no choice but to estimate it from the observed frequency. Bayesian statistics estimates the true frequency $p_A$ from a prior distribution and the observed data $n/N$.
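As a quick illustration of the gap between the two, here is a minimal simulation of the dice example above (the 6 rolls and the face value 1 come from the paragraph; everything else is just illustrative):

import numpy as np

rolls = np.random.randint(1, 7, size=6)   # roll a fair die 6 times
observed = np.mean(rolls == 1)            # observed frequency of rolling a 1
print("rolls:", rolls)
print("observed frequency: %.4f, true frequency: %.4f" % (observed, 1 / 6))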

First, we have to choose a prior distribution for the unknown $p_A$. What do you believe about $p_A$ before seeing any data? Since we are not sure yet, let's assume that $p_A$ follows a uniform distribution on [0, 1]. Give the variable p a uniform distribution from 0 to 1 with pm.Uniform.

import pymc3 as pm

# The parameters are the bounds of the Uniform.
with pm.Model() as model:
    p = pm.Uniform('p', lower=0, upper=1)

For example, let $p_A = 0.05$ and $N = 1500$, and let's simulate how many users converted (purchased or requested materials after looking at the site) when they looked at Site A. In reality we do not know $p_A$, but here we pretend that we do for the moment and simulate the process to create observation data.

Since there are N people, we use the [Bernoulli distribution](https://ja.wikipedia.org/wiki/%E3%83%99%E3%83%AB%E3%83%8C%E3%83%BC%E3%82%A4%E5%88%86%E5%B8%83) to simulate N trials. Since the Bernoulli distribution is a probability distribution for binary variables (variables that take either 0 or 1), it is a natural choice for modeling whether each user converts or not.

 X\sim \text{Ber}(p) 

At this time, $ X $ is $ 1 $ with probability $ p $ and $ 0 $ with probability $ 1-p $.
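Note that $E[X] = p$, so the average of many independent Bernoulli draws (the observed frequency $n/N$) clusters around the true $p$. That is exactly why $n/N$ is a reasonable, but noisy, estimate of $p_A$.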

The Python script that simulates this is:

import scipy.stats as stats
import numpy as np

#set constants
p_true = 0.05  # remember, this is unknown.
N = 1500

# sample N Bernoulli random variables from Ber(0.05).
# each random variable has a 0.05 chance of being a 1.
# this is the data-generation step
occurrences = stats.bernoulli.rvs(p_true, size=N)

print(occurrences) # Remember: Python treats True == 1, and False == 0
print(np.sum(occurrences))

#[0 0 0 ... 0 0 0]
#67

# occurrences.mean() is equal to n/N.
print("What is the observed frequency in Group A? %.4f" % np.mean(occurrences))
print("Does this equal the true frequency? %s" % (np.mean(occurrences) == p_true))

#What is the observed frequency in Group A? 0.0447
#Does this equal the true frequency? False

Now that we have the observation data (occurrences), we pass it to PyMC3 as the observed value of the obs variable, run the inference algorithm, and plot the posterior distribution of the unknown $p_A$.

#include the observations, which are Bernoulli
with model:
    obs = pm.Bernoulli("obs", p, observed=occurrences)
    # To be explained in chapter 3
    step = pm.Metropolis()
    trace = pm.sample(18000, step=step, cores=1)
    burned_trace = trace[1000:]

import matplotlib.pyplot as plt
plt.title("Posterior distribution of $p_A$, the true effectiveness of site A")
plt.vlines(p_true, 0, 90, linestyle="--",color="r", label="true $p_A$ (unknown)")
plt.hist(burned_trace["p"], bins=25, histtype="bar",density=True)
plt.legend()
plt.show()

The resulting graph looks like this. (Figure: posterior distribution of $p_A$, with the true $p_A$ marked by a dashed red line.)

Whereas the prior density was flat over [0, 1], the posterior is now concentrated around the observed frequency, close to the true value of 0.05. If you rerun this, you will get a slightly different graph each time, because the simulated data change.
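To summarize the posterior numerically rather than only visually, you can compute the posterior mean and a 95% credible interval from the samples (a minimal sketch, reusing the burned_trace from the code above):

p_samples = burned_trace["p"]
print("posterior mean: %.4f" % np.mean(p_samples))
print("95%% credible interval: [%.4f, %.4f]"
      % tuple(np.percentile(p_samples, [2.5, 97.5])))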

Question: Comparing Site A and Site B, which one is better?

If Site B is analyzed in the same way, the posterior distribution of $p_B$ can be obtained. What we actually want to compare is the difference, so we compute $delta = p_A - p_B$.

Like $p_A$, the true value of $p_B$ is unknown. Let's assume that $p_B = 0.04$. Then $delta = 0.01$. We calculate $p_B$ and $delta$ in the same way as $p_A$. The script below repeats the whole calculation from the start.

# -*- coding: utf-8 -*-

import pymc3 as pm
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

#these two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04

#notice the unequal sample sizes -- no problem in Bayesian analysis.
N_A = 1500
N_B = 750

#generate some observations
observations_A = stats.bernoulli.rvs(true_p_A, size=N_A)
observations_B = stats.bernoulli.rvs(true_p_B, size=N_B)
print("Obs from Site A: ", observations_A[:30], "...")
print("Obs from Site B: ", observations_B[:30], "...")

#Obs from Site A:  [0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] ...
#Obs from Site B:  [0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] ...


print(np.mean(observations_A))
#0.04666666666666667
print(np.mean(observations_B))
#0.03866666666666667

# Set up the pymc3 model. Again assume Uniform priors for p_A and p_B.
with pm.Model() as model:
    p_A = pm.Uniform("p_A", 0, 1)
    p_B = pm.Uniform("p_B", 0, 1)
    
    # Define the deterministic delta function. This is our unknown of interest.
    delta = pm.Deterministic("delta", p_A - p_B)

    
    # Set of observations, in this case we have two observation datasets.
    obs_A = pm.Bernoulli("obs_A", p_A, observed=observations_A)
    obs_B = pm.Bernoulli("obs_B", p_B, observed=observations_B)

    # To be explained in chapter 3.
    step = pm.Metropolis()
    trace = pm.sample(20000, step=step, cores=1)
    burned_trace = trace[1000:]


p_A_samples = burned_trace["p_A"]
p_B_samples = burned_trace["p_B"]
delta_samples = burned_trace["delta"]

#histogram of posteriors

ax = plt.subplot(311)

plt.xlim(0, .1)
plt.hist(p_A_samples, histtype='stepfilled', bins=25, alpha=0.85,
         label="posterior of $p_A$", color="#A60628", density=True)
plt.vlines(true_p_A, 0, 80, linestyle="--", label="true $p_A$ (unknown)")
plt.legend(loc="upper right")
plt.title("Posterior distributions of $p_A$, $p_B$, and delta unknowns")

ax = plt.subplot(312)

plt.xlim(0, .1)
plt.hist(p_B_samples, histtype='stepfilled', bins=25, alpha=0.85,
         label="posterior of $p_B$", color="#467821", density=True)
plt.vlines(true_p_B, 0, 80, linestyle="--", label="true $p_B$ (unknown)")
plt.legend(loc="upper right")

ax = plt.subplot(313)
plt.hist(delta_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of delta", color="#7A68A6", density=True)
plt.vlines(true_p_A - true_p_B, 0, 60, linestyle="--",
           label="true delta (unknown)")
plt.vlines(0, 0, 60, color="black", alpha=0.2)
plt.legend(loc="upper right")

plt.show()

# Count the number of samples less than 0, i.e. the area under the curve
# before 0, representing the probability that site A is worse than site B.
print("Probability site A is WORSE than site B: %.3f" % \
    np.mean(delta_samples < 0))

print("Probability site A is BETTER than site B: %.3f" % \
    np.mean(delta_samples > 0))

#Probability site A is WORSE than site B: 0.201
#Probability site A is BETTER than site B: 0.799

The resulting graph looks like this. (Figure: posterior distributions of $p_A$, $p_B$, and delta, each with its true value marked by a dashed line.)

It can be seen that the base of the posterior distribution of $p_B$ is slightly wider than that of $p_A$. This is because Site B has less data: with fewer observations, we are less confident about the value of $p_B$, so the posterior stays more spread out.
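You can confirm this numerically: the posterior standard deviation of $p_B$ comes out larger than that of $p_A$ (a quick check, reusing the sample arrays defined above):

print("posterior std of p_A: %.4f" % np.std(p_A_samples))
print("posterior std of p_B: %.4f" % np.std(p_B_samples))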

In the third graph, most of the posterior mass of $delta = p_A - p_B$ lies above 0. This means that Site A very likely converts users better than Site B does.

You can get a better understanding by changing the values of $p_A$ and $p_B$ and the amount of data, as in the sketch below.
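One convenient way to experiment is to wrap the whole pipeline in a helper function and vary its inputs. The sketch below assumes the imports from the script above (pymc3 as pm, scipy.stats as stats, numpy as np); the function name and its defaults are my own, not from the book:

def run_ab_experiment(true_p_A=0.05, true_p_B=0.04, N_A=1500, N_B=750,
                      n_samples=20000, burn=1000):
    """Simulate data for both sites and return P(site A is better than B)."""
    observations_A = stats.bernoulli.rvs(true_p_A, size=N_A)
    observations_B = stats.bernoulli.rvs(true_p_B, size=N_B)
    with pm.Model():
        p_A = pm.Uniform("p_A", 0, 1)
        p_B = pm.Uniform("p_B", 0, 1)
        pm.Deterministic("delta", p_A - p_B)
        pm.Bernoulli("obs_A", p_A, observed=observations_A)
        pm.Bernoulli("obs_B", p_B, observed=observations_B)
        trace = pm.sample(n_samples, step=pm.Metropolis(), cores=1)
    delta_samples = trace[burn:]["delta"]
    return np.mean(delta_samples > 0)

# For example, a smaller true gap with more data for Site B:
print(run_ab_experiment(true_p_B=0.045, N_B=1500))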
