[PYTHON] Analysis by Bayesian inference (2) ... Test cheat discovery algorithm

An example of analysis by Bayesian inference: did any students cheat?

This example is based on the book "Bayesian Inference Experienced in Python".

Using Bayesian inference, let us consider the problem of whether any students cheated on an exam.

Question: What percentage of students cheat?

The setting of the question is to use the binomial distribution to estimate how often students cheated during an exam. Of course, if asked directly, students will not honestly admit to cheating, so we consider the following "cheat discovery" algorithm.

"Cheat discovery algorithm" At the interview after the test, the student throws a coin invisible to the interviewer. If the table appears here, let the students agree to answer honestly. If the back comes out, throw the coin again so that you can't see it. Then, when the front appears, "Yes, I cheated." When the back appears, answer "No. I haven't cheated." In other words

| First flip | Second flip | Answer if the student cheated | Answer if the student did not cheat |
|------------|-------------|-------------------------------|-------------------------------------|
| Heads      | --          | "I cheated"                   | "I did not cheat"                   |
| Tails      | Heads       | "I cheated"                   | "I cheated"                         |
| Tails      | Tails       | "I did not cheat"             | "I did not cheat"                   |

With this scheme, half of the responses are pure noise from coin flips, so each student's privacy is protected. Moreover, the students who get heads on the first flip have agreed to answer honestly, so there are no lies in the data. From the interview results we can then obtain the posterior distribution of the true cheating frequency.
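To see why this works, here is a minimal simulation sketch (not part of the original post; true_p and the variable names are chosen only for illustration). For a true cheating rate p, the expected proportion of "I cheated" answers is 0.5*p + 0.25:

#Sanity-check simulation of the privacy algorithm (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.2          #hypothetical true fraction of cheaters
n_students = 100_000  #large sample so the simulated proportion is stable

cheated     = rng.random(n_students) < true_p
first_flip  = rng.random(n_students) < 0.5   #heads -> answer honestly
second_flip = rng.random(n_students) < 0.5   #heads -> answer "I cheated"

#If the first flip is heads, report the truth; otherwise report the second flip.
answers_yes = np.where(first_flip, cheated, second_flip)
print(answers_yes.mean())   #close to 0.5*true_p + 0.25 = 0.35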

Here the binomial distribution is used. The binomial distribution has two parameters: N, the number of trials, and p, the probability that the event occurs in a single trial. Like the Poisson distribution, the binomial distribution is discrete, but unlike the Poisson distribution it assigns probability only to the integers from 0 to N (the Poisson distribution assigns probability to every integer from 0 to infinity). The probability mass function is

P(X = k) = \binom{N}{k}\, p^k (1-p)^{N-k}
X \sim \text{Bin}(N, p)

Here X is the random variable, and p and N are the parameters. Saying that X follows the binomial distribution means that X is the number of events that occur in N trials, and the larger p (which lies between 0 and 1), the more likely the event is to occur.
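As a quick illustration (the numbers here are chosen for this example, not taken from the post), the probability mass function can be evaluated with scipy.stats:

import scipy.stats as stats

N, p = 100, 0.35
print(stats.binom.pmf(35, N, p))   #P(X = 35) for X ~ Bin(100, 0.35)
print(stats.binom.mean(N, p))      #expected number of events, N*p = 35.0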

# -*- coding: utf-8 -*-

import pymc3 as pm
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

#The true cheating rate p is given a prior distribution.
#Since p is unknown, use a uniform prior, Uniform(0, 1). The number of students N is 100.
N = 100
with pm.Model() as model:
    p = pm.Uniform("freq_cheating", 0, 1)

#Assign a Bernoulli random variable to each of the 100 students
#(1 means the student cheated, 0 means they did not).
with model:
    true_answers = pm.Bernoulli("truths", p, shape=N, testval=np.random.binomial(1, 0.5, N))

#Model each student's first coin flip in the privacy algorithm:
#a Bernoulli trial with p = 1/2, repeated for all 100 students.
with model:
    first_coin_flips = pm.Bernoulli("first_flips", 0.5, shape=N, testval=np.random.binomial(1, 0.5, N))
print(first_coin_flips.tag.test_value)

#Model the second coin flip in the same way.
with model:
    second_coin_flips = pm.Bernoulli("second_flips", 0.5, shape=N, testval=np.random.binomial(1, 0.5, N))

import theano.tensor as tt

with model:
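    #The answer each student gives: if the first flip is heads (1), report the truth;
    #if it is tails (0), report the result of the second flip.
    #observed_proportion is then the fraction of "I cheated" answers.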
    val = first_coin_flips*true_answers + (1 - first_coin_flips)*second_coin_flips
    observed_proportion = pm.Deterministic("observed_proportion", tt.sum(val)/float(N))

print(observed_proportion.tag.test_value)

#Interview result: 35 out of 100 students (X) answered "I cheated".
#With this algorithm, if nobody cheated, about 25 students would still answer "I cheated",
#and if everyone cheated, about 75 would.
#The true rate lies somewhere in between.

X = 35
with model:
    observations = pm.Binomial("obs", N, observed_proportion, observed=X)
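For intuition, a rough point estimate can be obtained by inverting the expected "yes" proportion 0.5*p + 0.25 (this back-of-the-envelope check is not part of the original post; the posterior computed below is the proper answer):

p_point = (X / N - 0.25) / 0.5   #with X = 35 and N = 100 this gives 0.2
print(p_point)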

#Run the sampling. On Spyder, cores=1 is needed; on Jupyter Notebook the cores argument can be omitted.
with model:
    step = pm.Metropolis(vars=[p])
    trace = pm.sample(40000, step=step,cores=1)
    burned_trace = trace[15000:]
    
p_trace = burned_trace["freq_cheating"]  #burn-in was already removed above
plt.hist(p_trace, histtype="stepfilled", density=True, alpha=0.85, bins=30, 
         label="posterior distribution", color="#348ABD")
plt.vlines([.05, .35], [0, 0], [5, 5], alpha=0.3)
plt.xlim(0, 1)
plt.legend()
plt.show()

The result (figure: 0812_jupyter_result.png, the posterior distribution of p).

In this graph, the horizontal axis is the value of p (the true fraction of cheaters) and the vertical axis is the posterior density. Looking at the posterior distribution of the true cheating frequency p obtained with the "cheat discovery algorithm", the distribution is wide, so you might think we have learned nothing. However, one thing can be seen: although p was given a uniform prior, the posterior assigns very little probability to values near $p = 0$, that is, to the possibility that nobody cheated. Surprisingly, even from this noisy data, the algorithm concludes that some students almost certainly cheated!
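As a follow-up check (a sketch using the p_trace samples obtained above), you can quantify how little posterior mass lies near $p = 0$ and report a credible interval:

#How much posterior probability does the model put on "almost nobody cheated"?
print((p_trace < 0.05).mean())              #posterior P(p < 0.05); expect a very small value
print(np.percentile(p_trace, [2.5, 97.5]))  #a 95% credible interval for p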

Running this calculation in Spyder takes a long time, since it is limited to a single thread. I am using PyMC3, whose backend is theano, and since there is not much information about theano, I do not know how to run the sampling with multiple threads there. With Jupyter Notebook, on the other hand, the sampling is parallelized without any extra work, so if you have no particular preference, run it in Jupyter Notebook. PyMC4 uses TensorFlow as its backend, so once I have finished working through this book, I will switch to PyMC4.
