Last time I explained various things about Hadoop on the premise that the population would be surveyed completely. When we instead make assumptions about the distribution of the population, a goodness-of-fit test is required.
Let's think about what can be judged by the test.
Test of difference in ratio → Is there a difference in population ratio between two different populations? This is determined by whether the population ratio P is equal to a certain value P_0.
Test of difference in mean value → Is there a difference in the population mean between two different populations? This is determined by whether the population mean μ is equal to a certain value μ_0.
Test of variance difference → Is there a difference in variance between two different normal populations? This is determined by whether the variance σ^2 of the normal population is equal to a certain value σ_0^2.
Goodness-of-fit test → Can we say that the observed data are consistent with a particular distribution? This is determined by whether the observed distribution differs from the assumed probability distribution.
The Poisson distribution was explained earlier in "Hypothesis test and probability distribution".
It applies when the possible values of a random variable are discrete and countably infinite (0, 1, 2, ...). For an event that occurs λ times on average per unit time, it gives the probability that the event occurs k times per unit time.
{P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \\
\text{where} \\
\lambda \gt 0
}
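As a quick numerical check of the formula above, here is a minimal Python sketch that uses only the standard library. The function name `poisson_pmf` and the value λ = 1 are illustrative assumptions, not something defined earlier in this post.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam > 0."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 1.0  # illustrative value (assumption)
probs = [poisson_pmf(k, lam) for k in range(50)]

# The probabilities over k = 0, 1, 2, ... should sum to 1;
# the tail beyond k = 49 is negligible for a small lam.
total = sum(probs)

for k in range(4):
    print(k, round(probs[k], 4))
print("sum:", total)
```

Summing the probabilities is a cheap way to confirm that the exponent and the factorial both use the same k.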
Consider the number of times terminal-specific information is observed in some location data. Suppose we surveyed 100 devices to count how many times each device's specific information was observed in a particular area, and obtained the table below.
Number of observations | Number of terminals |
---|---|
0 | 43 |
1 | 31 |
2 | 14 |
3 | 8 |
4 | 3 |
5 | 1 |
We test, at a significance level of 5%, whether this number of observations can be said to follow a Poisson distribution.
Here λ is an unknown parameter (estimated from the data), and X takes the class value k.
Therefore, the sample mean is used as an estimate of the unknown parameter λ of the Poisson distribution.
\hat{\lambda} = \frac{1}{100} (0 \times 43 + 1 \times 31 + 2 \times 14 + 3 \times 8 + 4 \times 3 + 5 \times 1) = 1
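The same estimate can be reproduced from the frequency table in a few lines. This is only a sketch; the variable names `counts`, `n_devices`, and `lambda_hat` are chosen here for illustration.

```python
# Frequency table: number of observations -> number of terminals
counts = {0: 43, 1: 31, 2: 14, 3: 8, 4: 3, 5: 1}

n_devices = sum(counts.values())

# Sample mean = (sum of value * frequency) / (number of devices)
lambda_hat = sum(k * f for k, f in counts.items()) / n_devices

print(n_devices, lambda_hat)  # 100 1.0
```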
Therefore, the expected frequencies, 100 × P(X = k) with λ = 1, are as follows.
Class k | Observed frequency | Expected frequency |
---|---|---|
0 | 43 | 36.8 |
1 | 31 | 36.8 |
2 | 14 | 18.4 |
3 | 8 | 6.13 |
4 | 3 | 1.53 |
5 | 1 | 0.307 |
6 or more | 0 | 0.0594 |
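The expected frequencies can be recomputed with a short sketch like the one below. It assumes λ = 1 and 100 devices as above, and it treats the last class as "6 or more" (the remaining probability mass) so that the expected frequencies sum to 100. The helper name `poisson_pmf` is illustrative.

```python
import math

n_devices = 100   # number of devices surveyed (from the table)
lambda_hat = 1.0  # sample mean used as the estimate of lambda

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Classes 0..5 use the Poisson probability directly.
expected = [n_devices * poisson_pmf(k, lambda_hat) for k in range(6)]

# The last class is "6 or more": whatever probability mass is left over.
expected.append(n_devices - sum(expected))

for k, e in enumerate(expected):
    print(k, f"{e:.3g}")
```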
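With the classes k ≥ 3 grouped into one (observed frequency 12, expected frequency 8.0), the test statistic is
\chi^2 = \frac{(43-36.8)^2}{36.8} + \frac{(31-36.8)^2}{36.8} + \frac{(14-18.4)^2}{18.4} + \frac{(12-8.0)^2}{8.0} = 5.011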
The value obtained in this way is compared with a chi-square distribution table.
The degrees of freedom are the number of classes minus 1, minus the number of unknown parameters estimated from the data. With all 7 classes this would be 7 - 1 - 1 = 5. Here the classes for k = 3 and above are grouped together (a common practice when expected frequencies are small), so there are 4 classes and 4 - 1 - 1 = 2 degrees of freedom. In the table, the entry for 2 degrees of freedom at a significance level of 0.05 is 5.99146. The statistic is below this value, so the null hypothesis is not rejected. In other words, **it cannot be said that the data do not follow the Poisson distribution**.
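The test can be sketched in Python as follows, under the same assumptions (λ = 1, classes 0, 1, 2 and "3 or more"). Because the expected frequencies here are not rounded, the statistic comes out at about 4.97, slightly different from the hand calculation with rounded values. The sketch also relies on the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the chi-square distribution is exp(-x / 2), which avoids needing a table or an extra library. All variable names are illustrative.

```python
import math

observed = [43, 31, 14, 8 + 3 + 1]  # classes 0, 1, 2 and "3 or more"
n_devices = sum(observed)
lambda_hat = 1.0  # estimated from the data

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

expected = [n_devices * poisson_pmf(k, lambda_hat) for k in range(3)]
expected.append(n_devices - sum(expected))  # remaining mass: "3 or more"

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Valid only for 2 degrees of freedom: P(chi2 > x) = exp(-x / 2)
alpha = 0.05
critical = -2.0 * math.log(alpha)
p_value = math.exp(-chi2_stat / 2.0)
reject = chi2_stat > critical

print(f"chi2 = {chi2_stat:.3f}, critical = {critical:.5f}, reject = {reject}")
```

For other degrees of freedom the closed form does not apply, and a table or a statistics library would be needed instead.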
Let's check by simulation that a binomial distribution with parameters n and p = λ / n approaches the Poisson distribution as n goes to infinity while λ is held constant.
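One way to see this convergence without any sampling is to compare the two probability mass functions directly. The sketch below is an illustration using only the standard library. The value λ = 3, the list of n values, and the range k = 0..10 are arbitrary choices, not taken from the original post.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with parameters n and p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0  # held constant while n grows (illustrative value)
max_diffs = {}
for n in [10, 100, 1000, 10000]:
    p = lam / n
    # Largest gap between the two pmfs over k = 0..10
    max_diffs[n] = max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11)
    )
    print(n, max_diffs[n])
```

The printed gap shrinks as n grows, which is the behaviour the limit statement describes.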
I simulated the central limit theorem in a brute-force way before, but it can be done more simply, here using sample means of Poisson-distributed data.
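import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
M = 1000  # number of sample means drawn for each N
bins = np.arange(1, 5, 0.1)
centers = (bins[:-1] + bins[1:]) / 2  # x positions for plotting
for N in [10, 30, 50, 100]:
    # Each value is the mean of N draws from a Poisson distribution with lambda = 3
    data = [np.average(np.random.poisson(3, N)) for i in range(M)]
    hist, key = np.histogram(data, bins=bins, density=True)
    ax.plot(centers, hist, label=str(N))
plt.legend(loc='best')
# Save before show(); saving after the window is closed can write an empty figure
plt.savefig("image.png")
plt.show()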
When N = 100, the distribution of the sample mean is observed to be close to a normal distribution.
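The plot can be supplemented with a numerical check. The central limit theorem implies that the mean of N Poisson(λ) draws has mean λ and standard deviation of about sqrt(λ / N). The sketch below compares that value with the simulated one. The seed, the use of `np.random.default_rng`, and the variable names are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable (arbitrary choice)
lam = 3
M = 1000  # number of sample means per N

results = {}
for N in [10, 30, 50, 100]:
    # M rows of N Poisson draws; one sample mean per row
    means = rng.poisson(lam, size=(M, N)).mean(axis=1)
    simulated_sd = means.std(ddof=1)
    theoretical_sd = np.sqrt(lam / N)
    results[N] = (means.mean(), simulated_sd, theoretical_sd)
    print(N, round(means.mean(), 3), round(simulated_sd, 3), round(theoretical_sd, 3))
```

The simulated and theoretical standard deviations should agree to within a few percent, and both shrink as N grows.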