[PYTHON] Test the goodness of fit of the distribution

Last time I explained various things about Hadoop on the premise that the whole population is surveyed. When we instead make an assumption about the distribution of the population, a goodness-of-fit test is needed to check that assumption.

What a test can tell you

Let's think about what kinds of questions a statistical test can answer.

  1. Proportion test

Test of a difference in proportions → Is there a difference in the population proportion between two populations? In its basic form, this tests whether the population proportion P is equal to a given value P_0.

  2. Test of the mean

Test of a difference in means → Is there a difference in the population mean between two populations? In its basic form, this tests whether the population mean μ is equal to a given value μ_0.

  3. Variance test

Test of a difference in variances → Is there a difference in variance between two normal populations? In its basic form, this tests whether the variance σ^2 of a normal population is equal to a given value σ_0^2.

  4. Goodness-of-fit test

Can we say that the observed data are consistent with a particular distribution? For two populations, the question becomes whether their probability distributions differ.

Poisson distribution

The Poisson distribution was explained earlier in "Hypothesis test and probability distribution".

It applies to a discrete random variable that can take the values 0, 1, 2, ... with no upper bound, and gives the probability that an event occurring λ times on average per unit time occurs exactly k times in one unit of time.

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad \lambda \gt 0, \quad k = 0, 1, 2, \dots
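As a minimal sketch, the probability mass function can be evaluated directly with the standard library. The helper name `poisson_pmf` and the value λ = 1 are chosen here only for illustration.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with mean lam (lam > 0)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Probabilities for k = 0..5 with lam = 1 (an illustrative value)
probs = [poisson_pmf(k, 1.0) for k in range(6)]
for k, p in enumerate(probs):
    print(k, round(p, 4))
```

Summing the probabilities over enough values of k should give a total very close to 1, which is a quick sanity check on the formula.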

Goodness-of-fit test of distribution

Consider how often device-specific identifiers are observed in some location data. Suppose we surveyed 100 devices and counted how many times each device's identifier was observed in a specific area, with the results shown in the table below.

| Number of observations | Number of devices |
|---|---|
| 0 | 43 |
| 1 | 31 |
| 2 | 14 |
| 3 | 8 |
| 4 | 3 |
| 5 | 1 |

We test, at a significance level of 5%, whether this number of observations can be said to follow a Poisson distribution.

Here λ is an unknown parameter that has to be estimated from the data, and the random variable X takes the class value k.

The sample mean is used as the estimate of the unknown parameter λ of the Poisson distribution.

\hat{\lambda} = \frac{1}{100} (0 \times 43 + 1 \times 31 + 2 \times 14 + \dots) = 1
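The estimate can be reproduced from the observation table in a few lines. This is only a sketch: the dictionary `observed` is a hand-typed copy of the table, and the variable names are made up for illustration.

```python
# Observed table from the text: number of observations k -> number of devices
observed = {0: 43, 1: 31, 2: 14, 3: 8, 4: 3, 5: 1}

n_devices = sum(observed.values())
total_observations = sum(k * count for k, count in observed.items())

# For a Poisson model, the maximum-likelihood estimate of lambda is the sample mean
lambda_hat = total_observations / n_devices
print(n_devices, lambda_hat)
```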

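The expected frequencies, 100 · P(X = k) with the estimate λ = 1, are therefore:

| Class k | Observed frequency | Expected frequency |
|---|---|---|
| 0 | 43 | 36.8 |
| 1 | 31 | 36.8 |
| 2 | 14 | 18.4 |
| 3 | 8 | 6.13 |
| 4 | 3 | 1.53 |
| 5 | 1 | 0.307 |
| 6 | 0 | 0.0511 |

Pooling the classes with k ≥ 3 into one class (observed 8 + 3 + 1 = 12, expected 100 − 36.8 − 36.8 − 18.4 = 8.0), the test statistic is

\chi^2 = \frac{(43-36.8)^2}{36.8} + \frac{(31-36.8)^2}{36.8} + \frac{(14-18.4)^2}{18.4} + \frac{(12-8.0)^2}{8.0} = 5.011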
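The statistic can be recomputed without rounding the expected frequencies. The sketch below assumes the 4-class setup in which all classes with k ≥ 3 are pooled into a single tail class whose expected frequency is whatever remains of the 100 devices. The helper `poisson_pmf` is defined here for illustration. With unrounded values the result comes out at roughly 4.97, slightly below the 5.011 obtained from the rounded table values.

```python
import math

observed = [43, 31, 14, 8, 3, 1]   # counts for k = 0..5 from the table
n = sum(observed)                  # 100 devices
lam = 1.0                          # estimated lambda (the sample mean)

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Expected frequencies for k = 0, 1, 2; the tail class k >= 3 gets the remainder
expected = [n * poisson_pmf(k, lam) for k in range(3)]
expected.append(n - sum(expected))

# Pool the observed counts the same way
pooled_observed = observed[:3] + [sum(observed[3:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(pooled_observed, expected))
print([round(e, 2) for e in expected], round(chi2, 3))
```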

The value obtained in this way is compared with the chi-square distribution table.

The degrees of freedom are (number of classes − 1) minus the number of estimated parameters, so with 7 classes they are 7 − 1 − 1 = 5. If the classes with k ≥ 3 are pooled, as is commonly done when expected frequencies are small, there are 4 classes and the degrees of freedom become 4 − 1 − 1 = 2. The column of the table for 2 degrees of freedom at the 0.05 significance level gives a critical value of 5.99146. The test statistic is below this value, so the null hypothesis is not rejected. In other words, **it cannot be said that the data do not follow the Poisson distribution**.
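The table lookup can be cross-checked without a statistics library, because a chi-square distribution with 2 degrees of freedom has the closed-form survival function exp(−x / 2). The sketch below relies on that special case only, so it does not generalize to other degrees of freedom. The value 5.011 is the test statistic quoted in the text.

```python
import math

alpha = 0.05        # significance level
chi2_stat = 5.011   # test statistic from the text

# For 2 degrees of freedom, P(chi-square > x) = exp(-x / 2),
# so both the critical value and the p-value have closed forms.
critical_value = -2.0 * math.log(alpha)
p_value = math.exp(-chi2_stat / 2.0)

reject = chi2_stat > critical_value
print(round(critical_value, 5), round(p_value, 4), reject)
```

The p-value comes out at about 0.08, above the 5% level, which agrees with the decision not to reject the null hypothesis.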

Poisson distribution and the limit theorem

Recall the Poisson limit theorem: if n goes to infinity in a binomial distribution with parameters n and p = λ / n while λ is kept constant, the distribution approaches the Poisson distribution with mean λ.
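This convergence can be checked numerically by comparing the two probability mass functions for growing n. The following is a rough sketch using only the standard library. The helper names, the choice λ = 3, and the range k = 0..10 are assumptions made for illustration.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0
max_diffs = []
for n in [10, 100, 1000, 10000]:
    p = lam / n  # keep n * p = lam fixed while n grows
    # Largest gap between the two pmfs over k = 0..10
    diff = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))
    max_diffs.append(diff)
    print(n, diff)
```

The printed gap should shrink as n increases, which is what the limit theorem predicts.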

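I simulated the central limit theorem in a brute-force way in an earlier post, but it can be done more simply. The code below draws M = 1000 sample means of N Poisson-distributed values (λ = 3) for several values of N and plots their distribution.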

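import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

M = 1000  # number of sample means drawn for each sample size N
bins = np.arange(1, 5, 0.1)
centers = (bins[:-1] + bins[1:]) / 2  # x positions: centers of the histogram bins
for N in [10, 30, 50, 100]:
    # Mean of N draws from Poisson(3), repeated M times
    data = [np.average(np.random.poisson(3, N)) for i in range(M)]
    hist, edges = np.histogram(data, bins=bins, density=True)
    ax.plot(centers, hist, label=str(N))

plt.legend(loc='best')
# Save before show(): once the window is closed the figure may be empty
plt.savefig("image.png")
plt.show()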

[Figure: image.png, distribution of the sample mean for N = 10, 30, 50, 100]

With N = 100, the distribution of the sample mean is visibly close to a normal distribution.
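The same observation can be checked without random sampling. The sum of N independent Poisson(λ) values is itself Poisson with mean Nλ, so its exact probability mass function can be compared with a normal density of the same mean and variance. The sketch below is an alternative check, not the simulation used above. The helper names and the ±6 standard deviation window are assumptions made for illustration.

```python
import math

def poisson_pmf(k, mu):
    # Evaluate in log space so that large k and mu do not overflow
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

lam = 3.0
rel_errors = []
for N in [10, 30, 50, 100]:
    mu = N * lam  # the sum of N Poisson(lam) draws is Poisson(N * lam)
    half_width = int(6 * math.sqrt(mu))
    ks = range(max(0, int(mu) - half_width), int(mu) + half_width + 1)
    pmf = [poisson_pmf(k, mu) for k in ks]
    # Largest gap to the normal density, relative to the peak of the pmf
    gap = max(abs(p - normal_pdf(k, mu, mu)) for k, p in zip(ks, pmf))
    rel_errors.append(gap / max(pmf))
    print(N, round(rel_errors[-1], 4))
```

The relative gap should decrease as N grows, consistent with the curves in the plot becoming more bell-shaped.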
