Test whether the observed data follow the Poisson distribution (Test of the goodness of fit of the Poisson distribution by Python)

Problem setting

In a supermarket, ** the number of customer complaints (per day) ** was aggregated for $ 365 $ a day, and the results were as follows.

Number of complaints 0 cases 1 2 cases 3 cases 4 cases 5 cases 6 7 cases
Observation frequency
(Number of days)
22 49 82 86 63 39 16 8

Can we think of this data as ** following a Poisson distribution **? Test at the significance level $ 5 % $.

Visualization

Let's graph the "** expected frequency " and " observed frequency **" calculated by hypothesizing that they follow the Poisson distribution.

"** Expected frequency " is calculated as follows. First, multiply the "number of complaints" by the " observation frequency " and divide by the number of observation days $ 365 $ to get the ** average number of complaints per day ** $ \ mu = 2.931 . Then, find the probability mass ( k = 0.7 $) for the ** Poisson distribution ** with an average of $ 2.931 $, and multiply it by $ 365 $ for the number of observation days to obtain " expected frequency **".

Visualization of observed and expected frequencies


import numpy as np
import scipy.stats as st
import japanize_matplotlib
import matplotlib.pyplot as plt

n = np.arange(0,7+1)
f = np.array((22,49,82,86,63,39,16,8)) #frequency

mu = (n*f).sum()/f.sum() #Calculate average number of claims 2.931

Po = st.poisson(mu=mu) # μ=2.Poisson distribution of 931
xp = Po.pmf(k=n)       # k=1...Probability mass of 7
xp[-1] = 1 - st.poisson.cdf(n[-2],mu=mu) # k>=Probability mass of 7 k=Integrated into 7
fp = xp * f.sum()      #Expected frequency

plt.figure(figsize=(6,3),dpi=120,facecolor='white')
plt.bar(n,f,width=0.9, label='Observation frequency')
plt.plot(n,fp,'o--', c='tab:orange', label=f'Expected frequency$Po\,(\\lambda={mu:.2f})$')
plt.tick_params(axis='x',length=0)
plt.legend(framealpha=1, fancybox=False)
plt.gca().set_axisbelow(True)
plt.grid(axis='y')
plt.show()

The execution result is as follows.
Download (1) .png

Number of complaints 0 cases 1 2 cases 3 cases 4 cases 5 cases 6 7 cases
Observation frequency
f
22 49 82 86 63 39 16 8
Expected frequency
f_{p}
19.5 57.0 83.6 81.7 59.9 35.1 17.2 11.0

Goodness of fit test

Is the hypothesis that "the object follows a Poisson distribution" (null hypothesis) correct? Is evaluated by ** statistical test **.

If this hypothesis is correct, then the following ** statistic ** $ t calculated from ** observation frequency ** $ \ boldsymbol {x} $ and ** hypothesis-based expected frequency ** $ \ boldsymbol {x} _p $ The test is performed using the property that $ follows (approximately) ** chi-square distribution ** (degrees of freedom $ n-2 $).

t = \sum \frac{(x-x_p)^2}{x_p}

In the current problem, the number of claims is in the $ 8 $ range from $ 0 $ to $ 7 $, so ** degrees of freedom ** is $ n-2 = 8-2 = 6 $.

Let's draw a probability density function for a chi-square distribution with $ 6 $ degrees of freedom.

Chi-square distribution with 6 degrees of freedom


import numpy as np
import scipy.stats as st
import japanize_matplotlib
import matplotlib.pyplot as plt

x = np.linspace(0,25,200)
p = st.chi2.pdf(x, df=6)

x_p95 = st.chi2.ppf(0.95, df=6)

y_range = (0.00,0.16)
plt.figure(figsize=(6,3),dpi=120,facecolor='white')
plt.plot(x,p)
plt.vlines(x_p95,*y_range,lw=1,ls=':',color='tab:red')

x2 = np.linspace(0,x_p95,200)
plt.fill_between(x2,st.chi2.pdf(x2,df=6),np.zeros(len(x2)),facecolor='tab:blue',alpha=0.3)

plt.text(5,0.05,'0.95',ha='center',va='center')
plt.text(x_p95,-0.004,f'{x_p95:.2f}',c='tab:red', ha='center',va='top')

plt.text(x_p95,0.08,f'→ Rejection area',c='tab:red')

arrowprops=dict(arrowstyle='->',connectionstyle='angle,angleA=0,angleB=60, rad=1')
kw = dict(xycoords='data',textcoords='data',ha='left', arrowprops=arrowprops)
plt.gca().annotate('0.05', xy=(13.3, 0.003), xytext=(15.5,0.016), **kw)

plt.xlim(0,25)
plt.ylim(*y_range)

plt.xlabel('Statistics$t$')
plt.show()

The execution result is as follows.
Download.png

In fact, find the statistic $ t $ and its corresponding $ p $ value.

Calculation of test statistic and p-value


t = ( ((f-fp)**2)/fp ).sum()        #Statistic t
p = 1 - st.chi2.cdf(t,df=len(n)-2)  #6 degrees of freedom(8-2)P-value corresponding to t in the chi-square distribution of

print(f'Statistic t= {t:.2f}, P-value= {p:.2f} ',end='')
if p >= 0.05 :
  print('( >= 0.05 ) \n Null hypothesis (following Poisson distribution) is not rejected')
else :
  print('( < 0.05 ) \n Null hypothesis (following Poisson distribution) is rejected')

The execution result is as follows.

Statistic t= 3.22, p-value= 0.78 ( >= 0.05 ) 
Null hypothesis (following Poisson distribution) is not rejected

Recommended Posts

Test whether the observed data follow the Poisson distribution (Test of the goodness of fit of the Poisson distribution by Python)
Test the goodness of fit of the distribution
Test of the difference between the mean values of count data according to the Poisson distribution
Pandas of the beginner, by the beginner, for the beginner [Python]
[Python] Test the moon matagi of relative delta
The story of reading HSPICE data in Python
Find the cumulative distribution function by sorting (Python version)
[Python] Generate random numbers that follow the Rayleigh distribution
Not being aware of the contents of the data in python
Python: Diagram of 2D data distribution (kernel density estimation)
Let's use the open data of "Mamebus" in Python
Understand the status of data loss --Python vs. R
Extract the band information of raster data with python
Check the asymptotic nature of the probability distribution in Python
Primality test by Python
the zen of Python
Try scraping the data of COVID-19 in Tokyo with Python
How to test the attributes added by add_request_method of pyramid
[Python3] Call by dynamically specifying the keyword argument of the function
The story of rubyist struggling with python :: Dict data with pycall
[Homology] Count the number of holes in data with Python
[Python] I tried to visualize the follow relationship of Twitter
Debug by attaching to the Python process of the SSH destination
[Python] I tried collecting data using the API of wikipedia
Find the diameter of the graph by breadth-first search (Python memory)
I passed the Python data analysis test, so I summarized the points
What I saw by analyzing the data of the engineer market
[Data science memorandum] Confirmation of the contents of DataFrame type [python]
Graph the Poisson distribution and the Poisson cumulative distribution in Python and Java, respectively.