[PYTHON] [Statistics] Grasp the image of the central limit theorem with a graph

1. What is the Central Limit Theorem?

When you study statistics, you run into a theorem with a rather stiff name: the central limit theorem. According to Wikipedia:

According to the law of large numbers, the sample mean randomly sampled from a population approaches the true mean as the sample size increases. The central limit theorem, on the other hand, discusses the error between the sample mean and the true mean. In many cases, whatever the distribution of the population, the error will approximately follow a normal distribution when the sample size is increased. http://ja.wikipedia.org/wiki/中心極限定理

That is what it says, but I did not really follow it at first ^^; Put simply, whatever the shape of the original distribution (as long as its variance is finite), the distribution of the mean of samples drawn from it comes close to a normal distribution. The sample variance also appears to come close to a normal distribution. (To be precise, for a normal population the sample variance follows a scaled chi-square distribution, which can be approximated by a normal distribution when N is large.) Whether it is explained in words or proved with formulas (for example, by showing that the moment generating functions match), I do not think it is intuitive, so the purpose of this article is to draw graphs and understand it visually.
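For reference, the classical form of the theorem can be written compactly as below. This assumes independent, identically distributed observations $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$; these symbols are introduced here for illustration and are not used elsewhere in the article.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1) \quad (n \to \infty)
```

In other words, for large $n$ the sample mean is approximately distributed as $\mathcal{N}(\mu, \sigma^2 / n)$, whatever the shape of the distribution of the individual $X_i$.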

2. Preparation for graph drawing

I will draw the graphs in Python. The preparation is shown below: it imports the libraries and defines helper functions for drawing the graphs.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rd
import matplotlib.mlab as mlab
import scipy.stats as st

#Sample parameters
n = 10000            #Number of samples to draw (repetitions)
sample_size = 10000  #Number of observations in each sample

#Function to calculate mean and variance for each sample
def sample_to_mean_var(sample):
    mean = np.mean(sample)
    var  = np.var(sample)
    return [mean, var]
    
#A function that draws a histogram of the mean and variance
def plot_mean_var(stats, dist_name=""):
    mu = stats[:,0]
    var = stats[:,1]
    bins = 40
    
    #Histogram of sample mean
    plt.figure(figsize=(7,5))
    plt.hist(mu, bins=bins, density=True, color="plum")
    plt.title("mu from %s distribution"%(dist_name))
    plt.show()
    
    #Histogram of sample variance
    plt.figure(figsize=(7,5))
    plt.hist(var, bins=bins, color="lightblue", density=True)
    plt.title("var from %s distribution"%(dist_name))
    plt.show()
    
def plot_dist(data, bins, title =""):
    plt.figure(figsize=(7,5))
    plt.title(title)
    plt.hist(data, bins, color="lightgreen", density=True)
    plt.show()

3. Draw

3-1. Exponential distribution

First, let's try the [exponential distribution](http://qiita.com/kenmatsu4/items/c1a64cf69bc8c9e07aa2#geometricp-sizenone --- geometric distribution). The following graph shows 10,000 values generated with the exponential distribution parameter $\lambda$ set to 0.1. It is a strongly asymmetric distribution with a long tail to the right.

#Graph drawing of exponential distribution
lam = 0.1  
x = rd.exponential(1./lam, size=sample_size)
plot_dist(x, 100, "exponential dist")

[Figure: histogram of the exponential distribution]

Treat these 10,000 values as one sample and calculate its sample mean and sample variance. Repeating this 10,000 times and drawing histograms of the sample means and sample variances gives the following.

#Generate many samples from the exponential distribution and draw histograms of the sample mean and sample variance
lam = 0.1
stats = np.array([sample_to_mean_var(rd.exponential(1./lam, size=sample_size)) for i in range(n)])
plot_mean_var(stats, dist_name="exponential")

[Figures: histograms of the sample mean and sample variance for the exponential distribution]

Although the original distribution was quite skewed, the sample mean and the sample variance both form clean, symmetrical bell shapes. The central limit theorem says that the sample mean approximately follows a normal distribution.
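As a rough numerical check of this claim, the sketch below compares the spread of simulated sample means with the value the theorem predicts, $\sigma / \sqrt{n}$ (for an exponential distribution both the mean and the standard deviation equal $1/\lambda$). The function name `check_exponential_clt`, the fixed seed, and the smaller sample counts are assumptions made here to keep the run short; they are not part of the original code.

```python
import numpy as np

def check_exponential_clt(lam=0.1, sample_size=1000, n_repeats=2000, seed=0):
    """Compare simulated sample means of an exponential distribution with the CLT prediction."""
    rng = np.random.default_rng(seed)
    # One row per repetition, one column per observation
    samples = rng.exponential(1.0 / lam, size=(n_repeats, sample_size))
    means = samples.mean(axis=1)
    return {
        "empirical_mean": means.mean(),
        "empirical_std": means.std(ddof=1),
        "theory_mean": 1.0 / lam,
        "theory_std": (1.0 / lam) / np.sqrt(sample_size),
    }

result = check_exponential_clt()
for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

If the theorem applies, the empirical and theoretical values should agree to within simulation noise, and increasing `sample_size` should shrink both standard deviations in proportion to $1/\sqrt{n}$.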

Below, I will try other skewed distributions.

3-2. Chi-square distribution

Next is the [chi-square distribution](http://qiita.com/kenmatsu4/items/c1a64cf69bc8c9e07aa2#chisquaredf-sizenone --- chi-square distribution). This is also quite skewed.

#Chi-square distribution with 5 degrees of freedom
df = 5
x = rd.chisquare(df, sample_size)
plot_dist(x, 50, "chi square dist")

[Figure: histogram of the chi-square distribution with 5 degrees of freedom]

#Histogram of mean and variance of chi-square distribution
df = 5   #Degrees of freedom

#Generate many samples from the chi-square distribution
chi_stats = np.array([sample_to_mean_var(rd.chisquare(df, sample_size)) for i in range(n)])
plot_mean_var(chi_stats, dist_name="chi square")

Again, the histograms come out as symmetrical bell shapes.

[Figures: histograms of the sample mean and sample variance for the chi-square distribution]
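One way to put a number on "symmetrical" is the skewness, which is 0 for a normal distribution. A chi-square distribution with $k$ degrees of freedom has skewness $\sqrt{8/k}$, and the skewness of the mean of $n$ independent draws is smaller by a factor of $\sqrt{n}$. The sketch below checks this; the helper `skewness`, the seed, and the reduced sizes are illustrative assumptions, not part of the original code.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
df = 5
sample_size = 1000   # observations per sample (smaller than in the article, for speed)
n_repeats = 2000     # number of samples

raw = rng.chisquare(df, size=100_000)
means = rng.chisquare(df, size=(n_repeats, sample_size)).mean(axis=1)

raw_skew = skewness(raw)
mean_skew = skewness(means)
print(f"skewness of raw draws:    {raw_skew:.3f} (theory {np.sqrt(8 / df):.3f})")
print(f"skewness of sample means: {mean_skew:.3f} (theory {np.sqrt(8 / df / sample_size):.3f})")
```

The raw draws should show a clearly positive skewness, while the skewness of the sample means should be close to 0, which matches the bell-shaped histogram.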

3-3. Bimodal normal distribution

I will also try an oddly shaped distribution with two peaks.

#Bimodal normal distribution
def generate_bimodal_norm():
    x = np.random.normal(0, 4, sample_size)
    y = np.random.normal(25, 8, sample_size)
    return np.append(x,y)

z = generate_bimodal_norm()
plot_dist(z, 70, "bi-modal normal dist")

[Figure: histogram of the bimodal normal distribution]

#Histogram of mean and variance of bimodal normal distribution

#Generate many samples from the bimodal normal distribution
binorm_stats = np.array([sample_to_mean_var(generate_bimodal_norm()) for i in range(n)])
plot_mean_var(binorm_stats, dist_name="bi-modal normal")

Even with a distribution like this, the sample mean and sample variance look normally distributed. The central limit theorem is amazing.

[Figures: histograms of the sample mean and sample variance for the bimodal normal distribution]
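A side note on `generate_bimodal_norm`: because `np.append(x, y)` always puts exactly `sample_size` observations in each peak, its mean is the average of two normal sample means and is therefore exactly normal at any sample size. To watch the central limit theorem act on observations that are each drawn at random from the two-peaked mixture, one could use a variant like the sketch below. The function `generate_mixture_sample`, the mixing weight of 0.5, the seed, and the reduced sizes are assumptions for illustration only.

```python
import numpy as np

def generate_mixture_sample(rng, size, weight=0.5):
    """Draw `size` i.i.d. observations from a two-component normal mixture."""
    # Choose a component independently for every observation
    from_first = rng.random(size) < weight
    first = rng.normal(0, 4, size)
    second = rng.normal(25, 8, size)
    return np.where(from_first, first, second)

rng = np.random.default_rng(0)
sample_size = 1000   # observations per sample
n_repeats = 2000     # number of samples

means = np.array([generate_mixture_sample(rng, sample_size).mean()
                  for _ in range(n_repeats)])

# Mixture moments for weight 0.5:
# mean = 0.5*0 + 0.5*25, variance = 0.5*(4^2 + 8^2) + 0.5*0.5*(25 - 0)^2
theory_mean = 0.5 * 0 + 0.5 * 25
theory_var = 0.5 * (4 ** 2 + 8 ** 2) + 0.5 * 0.5 * (25 - 0) ** 2
theory_std = np.sqrt(theory_var / sample_size)
print(f"mean of sample means: {means.mean():.3f} (theory {theory_mean:.3f})")
print(f"std of sample means:  {means.std(ddof=1):.3f} (theory {theory_std:.3f})")
```

Passing `means` to the article's `plot_dist` helper should again give a bell-shaped histogram, this time as a consequence of the theorem rather than of the construction.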

4. Conclusion

The central limit theorem looks difficult when you read the formulas and proofs, so I tried to understand it intuitively by looking at graphs. This seems to be why the normal distribution is so important in statistics.
