[PYTHON] Calculate sample distribution with Scipy (discrete distribution)

Purpose

Given the probability distribution of the population, calculate the sampling distribution. This time we will deal with discrete distributions.

Method

policy

--Use the statistical package scipy.stats of scipy, which is a collection of scientific and technological calculation packages of python, and numpy of the numerical calculation library. --Calculate in the following 3 steps.

  1. Define the discrete probability distribution of the population as a discrete distribution on scipy.stats
  2. Sampling
  3. Calculate the sample distribution from the obtained sample

Definition of discrete distribution

scipy.stats.rv_discrete The following is an excerpt from the scipy documentation. You can find it in the Help Example section, which you can read as follows:

from scipy import stats
help(stats.rv_discrete)

Definition of discrete distribution

Just pass the discrete random variable in $ xk $, the probability corresponding to $ pk $, and pass it to stats.rv_discrete. This is an example of a random variable that takes 7 discrete values.

from scipy import stats
xk = np.arange(7)
pk = (0.1, 0.2, 0.3, 0.1, 0.1, 0.0, 0.2)
custm = stats.rv_discrete(name='custm', values=(xk, pk))

Visualization of discrete distribution

To visualize the probability mass function (pmf) of a discrete distribution, do the following:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
ax.plot(xk, custm.pmf(xk), 'ro', ms=12, mec='r')
ax.vlines(xk, 0, custm.pmf(xk), colors='r', lw=4)
plt.show()

sampling

Sampling is performed from the created discrete distribution. For example, if you want to randomly sample 1000 times, do as follows. (rvs: random variables) If you want to fix the seed of the random number to ensure reproducibility, uncomment the first line.

import numpy as np
# np.random.seed(0)
sampled_rvs = custm.rvs(size=1000)

Calculation of sample distribution

Calculate using the histogram function of numpy. Be careful not to forget to add +1 to bin.

f = np.histogram(sampled_rvs, bins = np.arange(7 + 1), density=True)[0]

Put together into a function

The above contents are summarized. Up until now, we have dealt with the case of one-dimensional probability distributions, but so that we can handle multidimensional joint probability distributions as well. First, crush it to one dimension with p.ravel (), then calculate the sample distribution, reshape it, and then return it. Pass the argument p as an array of nupmy.

import numpy as np
from scipy import stats

def sampling_dist(p, size, seed = 0):
    xk = np.arange(p.size)
    pk = p.ravel()
    true_dist = stats.rv_discrete(name='true_dist', values=(xk, pk))

    np.random.seed(seed)
    sampled_rvs = true_dist.rvs(size=size)

    f = np.histogram(sampled_rvs, bins = np.arange(p.size + 1), density=True)[0]
    f.reshape(p.shape)

    return(f.reshape(p.shape))

p = np.array([[0.10,0.10],[0.10,0.15],[0.15,0.10],[0.10,0.20]])
p_est = sampling_dist(p, 1000)
print(p)
print(p_est)

Recommended Posts

Calculate sample distribution with Scipy (discrete distribution)
Generate a normal distribution with SciPy
LPC with Scipy
ICA with Scipy
CORDIC with Scipy
1. Statistics learned with Python 2-1. Probability distribution [discrete variable]
Create filter with scipy
Normarize data with Scipy
Calculate tf-idf with scikit-learn
Sample data created with python
3. Normal distribution with neural network!
Use OpenBLAS with numpy, scipy
[MCMC] Calculate WAIC with pystan
Extract peak values with scipy