[PYTHON] Test the goodness of fit of the distribution

Last time I explained various things about Hadoop on the premise that the entire population would be surveyed. When we instead make an assumption about the distribution of the population, a goodness-of-fit test is required to check that assumption.

What a test can tell you

Let's think about what can be judged by a statistical test.

Test of proportions

Test of a difference in proportions → Is there a difference in the population proportion between two different populations? In the one-sample case, the question is whether the population proportion P is equal to a certain value P_0.

Test of average value

Test of difference in mean value → Is there a difference in the population mean between two different populations? This is determined by whether the population mean μ is equal to a certain value μ_0.

Variance test

Test of variance difference → Is there a difference in variance between two different normal populations? This is determined by whether the variance σ^2 of the normal population is equal to a certain value σ_0^2.

Goodness-of-fit test

Can we say that the observed data are consistent with a particular distribution? The same idea can also be used to ask whether the probability distributions of two populations differ.

Poisson distribution

The Poisson distribution is as explained earlier in Hypothesis test and probability distribution.

When a random variable takes discrete values from an infinite set (0, 1, 2, ...), the Poisson distribution gives the probability that an event occurring λ times on average per unit time occurs exactly k times in a unit of time.

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad \lambda \gt 0, \quad k = 0, 1, 2, \dots
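As a minimal sketch, the formula can be evaluated directly with the standard library. The function name `poisson_pmf` and the value λ = 2 are assumptions chosen only for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam > 0."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 2.0  # illustrative value, not taken from the article
for k in range(6):
    print(k, round(poisson_pmf(k, lam), 4))

# The probabilities over all k sum to 1 (checked here up to k = 49).
total = sum(poisson_pmf(k, lam) for k in range(50))
print(total)
```

Because the support is infinite, the sum is truncated at a k large enough for the remaining tail to be negligible.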

Goodness-of-fit test of distribution

Consider how many times device-specific (terminal-specific) information is observed in a set of location data. Suppose 100 devices were surveyed to count how often each device's identifier was observed in a specific area, with the results shown in the table below.

Number of observations	Number of terminals
0	43
1	31
2	14
3	8
4	3
5	1

We test, at a significance level of 5%, the null hypothesis that this number of observations follows a Poisson distribution.

λ = unknown parameter (estimated from the data), X = class k

Since the mean of a Poisson distribution is λ, the sample mean is used as the estimate of the unknown parameter λ.

\hat{\lambda} = \frac{1}{100} (0 \times 43 + 1 \times 31 + 2 \times 14 + 3 \times 8 + 4 \times 3 + 5 \times 1) = 1

Therefore, the expected frequencies, 100 × P(X = k) with λ = 1, are as follows.

Class k	Observed frequency	Expected frequency
0	43	36.8
1	31	36.8
2	14	18.4
3	8	6.13
4	3	1.53
5	1	0.307
6 or more	0	0.0594
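The estimate of λ and the expected frequencies can be reproduced with a short sketch like the one below. It assumes the last class means "6 or more" and assigns it the remaining probability mass. Values are computed without intermediate rounding, so the final digits may differ from a hand-rounded table.

```python
import math

# Observed table from the text: number of observations -> number of terminals
observed = {0: 43, 1: 31, 2: 14, 3: 8, 4: 3, 5: 1}
n = sum(observed.values())                             # 100 devices
lam_hat = sum(k * f for k, f in observed.items()) / n  # sample mean

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Expected frequencies for k = 0..5, then the remainder for "6 or more"
# (treating the last class as open-ended is an assumption).
expected = [n * poisson_pmf(k, lam_hat) for k in range(6)]
expected.append(n - sum(expected))

for k, e in enumerate(expected):
    label = str(k) if k < 6 else "6+"
    print(label, f"{e:.3g}")
```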

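Because the expected frequencies for k = 3 and above are small, those classes are pooled into one (observed 8 + 3 + 1 + 0 = 12, expected 100 - 36.8 - 36.8 - 18.4 = 8.0) before computing the statistic.

\chi^2 = \frac{(43-36.8)^2}{36.8} + \frac{(31-36.8)^2}{36.8} + \frac{(14-18.4)^2}{18.4} + \frac{(12-8.0)^2}{8.0} = 5.011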

The value obtained in this way is compared with the chi-square distribution table.

The degrees of freedom are the number of classes minus 1, minus the number of parameters estimated from the data. With all 7 classes this would be 7 - 1 - 1 = 5. If k = 3 and above are grouped together so that the number of classes is 4, it is 4 - 1 - 1 = 2. In the table, the entry for 2 degrees of freedom at an upper-tail probability of 0.05 is 5.99146. The statistic 5.011 is below this critical value, so the null hypothesis is not rejected. In other words, **it cannot be said that the data do not follow the Poisson distribution**.
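The test above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the author's code: classes k ≥ 3 are pooled, expected frequencies are left unrounded (so the statistic should come out slightly below the hand-computed 5.011), and the critical value uses the fact that for 2 degrees of freedom the chi-square upper-tail probability is exactly exp(-x / 2).

```python
import math

observed = [43, 31, 14, 8, 3, 1]   # counts for k = 0..5 from the text
n = sum(observed)
lam_hat = sum(k * f for k, f in enumerate(observed)) / n

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Pool k >= 3 into one class so that no expected frequency is very small.
obs = observed[:3] + [sum(observed[3:])]            # [43, 31, 14, 12]
exp = [n * poisson_pmf(k, lam_hat) for k in range(3)]
exp.append(n - sum(exp))                            # remaining mass, k >= 3

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
df = len(obs) - 1 - 1        # classes - 1 - number of estimated parameters

alpha = 0.05
critical = -2 * math.log(alpha)   # valid only for df = 2
p_value = math.exp(-chi2 / 2)     # valid only for df = 2
reject = chi2 > critical

print(f"chi2 = {chi2:.3f}, df = {df}, critical = {critical:.5f}")
print(f"p-value = {p_value:.4f}, reject H0: {reject}")
```

For other degrees of freedom the closed form no longer applies. If SciPy is available, `scipy.stats.chi2.sf(chi2, df)` returns the upper-tail probability for any df.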

Poisson distribution and the limit theorem

If n is taken to infinity while λ is held constant in the binomial distribution with parameters n and p = λ / n, the distribution approaches the Poisson distribution with parameter λ.

The mean of Poisson samples, in turn, approaches a normal distribution by the central limit theorem. I simulated the central limit theorem in a brute-force way before, but it can be done more easily as follows.
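The convergence of the binomial distribution with parameters n and p = λ / n to the Poisson distribution can be checked numerically without any sampling. The sketch below compares the two probability mass functions for growing n. The value λ = 3 and the helper `max_gap` are assumptions made for illustration.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for a binomial distribution with parameters n and p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0  # illustrative value

def max_gap(n, lam=lam, kmax=10):
    """Largest pmf difference |Binomial(n, lam/n) - Poisson(lam)| over k = 0..kmax."""
    return max(abs(binom_pmf(k, n, lam / n) - poisson_pmf(k, lam))
               for k in range(kmax + 1))

for n in [10, 100, 1000, 10000]:
    print(n, max_gap(n))
```

The printed gap should shrink as n grows, which is what the limit statement predicts.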

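import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

M = 1000                              # number of sample means per sample size
bins = np.arange(1, 5, 0.1)
centers = (bins[:-1] + bins[1:]) / 2  # x positions for the density curves
for N in [10, 30, 50, 100]:
    # Mean of N Poisson(3) draws, repeated M times
    data = np.random.poisson(3, (M, N)).mean(axis=1)
    hist, _ = np.histogram(data, bins=bins, density=True)
    ax.plot(centers, hist, label=str(N))

ax.legend(loc='best')
fig.savefig("image.png")  # save before show(), otherwise the saved image can be blank
plt.show()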

When N = 100, the distribution of the sample mean is observed to be close to a normal distribution centered on λ = 3.
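The observation can also be checked numerically rather than by eye. By the central limit theorem, the mean of N Poisson(λ) draws has mean λ and standard deviation close to sqrt(λ / N). The sketch below assumes NumPy's `default_rng` generator with a fixed seed for reproducibility; the helper `sample_means` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
lam, M = 3, 1000

def sample_means(N, M=M, lam=lam, rng=rng):
    """Return M sample means, each computed from N Poisson(lam) draws."""
    return rng.poisson(lam, size=(M, N)).mean(axis=1)

for N in [10, 30, 50, 100]:
    means = sample_means(N)
    # Compare the empirical spread with the theoretical sqrt(lam / N)
    print(N, means.mean(), means.std(ddof=1), np.sqrt(lam / N))
```

The empirical standard deviation should track sqrt(λ / N) more and more closely as N increases, while the mean stays near λ.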