[Introduction to statistics] What are the t distribution, chi-square distribution, and F distribution? A short summary of how to use them [Python]

I wrote this article as a memo for my own study of statistics. It summarizes the t distribution, chi-square distribution, and F distribution, together with the related confidence interval estimation and the F-test.

t distribution

The t distribution is a continuous probability distribution used in place of the standard normal distribution when estimating or testing a population mean from a small sample whose population variance is unknown. Before explaining the t distribution, let's first look at the z-value for the sample mean $\bar x$.

z_\bar x=\frac{\bar x -\mu}{\sqrt{\frac{\sigma^2}{n}}}

z follows a standard normal distribution, but it cannot be calculated without knowing the population variance $\sigma^2$, and in practice the population variance is rarely known. Therefore, the population variance $\sigma^2$ is replaced by the unbiased variance $\hat\sigma^2$. The unbiased variance is the sample variance $s^2$ multiplied by $\frac{n}{n-1}$ so that its expected value matches the population variance. The factor $\frac{n}{n-1}$ is needed because the expected value of the sample variance is $\frac{n-1}{n}\sigma^2$, which is smaller than the population variance; the gap is noticeable when n is small.
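The bias of the sample variance can be checked with a small simulation. The following is only a sketch: the seed, the sample size n = 5, and the number of trials are arbitrary choices made for illustration and do not come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for this sketch
n = 5                           # deliberately small sample size
trials = 100_000

# Draw many samples of size n from N(0, 1), whose population variance is 1
samples = rng.standard_normal((trials, n))

sample_var = samples.var(axis=1, ddof=0)    # divides by n
unbiased_var = samples.var(axis=1, ddof=1)  # divides by n - 1

# The average sample variance should be near (n - 1) / n = 0.8,
# while the average unbiased variance should be near 1.0
print(f"mean of sample variance:   {sample_var.mean():.3f}")
print(f"mean of unbiased variance: {unbiased_var.mean():.3f}")

# The two estimators differ exactly by the factor n / (n - 1)
assert np.allclose(unbiased_var, sample_var * n / (n - 1))
```

In NumPy the only difference between the two estimators is the ddof argument of var.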

t_\bar x=\frac{\bar x -\mu}{\sqrt{\frac{\hat\sigma^2}{n}}}=\frac{\bar x -\mu}{\sqrt{\frac{s^2}{n-1}}}

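This is the t-value. Some sites write the formula with $s^2/n$ instead (formula below). That form follows the t distribution exactly only when $s^2$ denotes the unbiased variance; with the sample variance as defined above, it is an approximation that is reasonable only for large n.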

t_\bar x=\frac{\bar x -\mu}{\sqrt{\frac{s^2}{n}}}=\frac{\bar x -\mu}{\frac{s}{\sqrt{n}}}

Incidentally, $\frac{s}{\sqrt{n}}$ is called the standard error of the sample mean. The t-value follows the probability density function below.

f(t)=\frac{\Gamma((\nu+1) / 2)}{\sqrt{\nu \pi} \Gamma(\nu / 2)}\left(1+t^{2} / \nu\right)^{-(\nu+1) / 2}

$\nu$ is the degrees of freedom ($n-1$). $\Gamma(\cdot)$ is the gamma function, which generalizes the factorial to real and complex arguments ($\Gamma(n) = (n-1)!$ for a positive integer $n$). The shape of the t distribution changes depending on $\nu$.
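As a quick check of the formula, the density can be written out directly and compared with scipy.stats.t.pdf. This is a minimal sketch; the function name t_pdf is made up for this example.

```python
import numpy as np
from scipy import stats
from scipy.special import gamma

def t_pdf(t, nu):
    """Density of the t distribution with nu degrees of freedom, written out from the formula."""
    coef = gamma((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma(nu / 2))
    return coef * (1 + t**2 / nu) ** (-(nu + 1) / 2)

# The gamma function generalizes the factorial: gamma(5) equals 4! = 24
assert np.isclose(gamma(5), 24)

# The hand-written density agrees with SciPy for several degrees of freedom
t = np.linspace(-6, 6, 7)
for nu in (1, 5, 30):
    assert np.allclose(t_pdf(t, nu), stats.t.pdf(t, nu))
```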

Now let's actually see the distribution.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 1000)
fig, ax = plt.subplots(1, 1, figsize=(10, 7))

# Plot the t distribution for several degrees of freedom
linestyles = [':', '--', '-.']
deg_of_freedom = [1, 5, 30]
for df, ls in zip(deg_of_freedom, linestyles):
    ax.plot(x, stats.t.pdf(x, df), linestyle=ls, label=f'df={df}')

# Standard normal distribution for comparison
ax.plot(x, stats.norm.pdf(x, 0, 1), linestyle='-', label='Standard Normal Distribution')

plt.xlim(-6, 6)
plt.ylim(0, 0.4)
plt.title('t-distribution')
plt.legend()
plt.savefig('t-distribution.png')
plt.show()

t-distribution.png

You can see that the shape of the distribution changes with the degrees of freedom, and that the larger the degrees of freedom (that is, the larger the sample size), the closer the t distribution gets to the standard normal distribution.
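The same convergence can be seen numerically by comparing the upper 2.5% points of the t distribution with that of the standard normal distribution. This is a small sketch; the degrees of freedom listed (including 1000) are chosen only for illustration.

```python
from scipy import stats

# Upper 2.5% point of the standard normal distribution (about 1.96)
z_crit = stats.norm.ppf(0.975)

# Upper 2.5% points of the t distribution for increasing degrees of freedom
dfs = (1, 5, 30, 1000)
t_crits = [stats.t.ppf(0.975, df) for df in dfs]

for df, t_crit in zip(dfs, t_crits):
    print(f"df={df:>4}: t={t_crit:.3f}  (normal: {z_crit:.3f})")
```

The critical value shrinks toward the normal one as the degrees of freedom grow, which is why the normal approximation is acceptable for large samples.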

Estimating the confidence interval for the mean

Let's use this distribution to estimate a confidence interval in which we expect the population mean $\mu$ to lie.

Suppose you get the following data.

data = [np.random.randn() for _ in range(10)]
print(data)
# >>> [-0.14917153222917484, 0.7951720064790415, 0.662152983830839, 0.430521357874449, -2.48235088848113, 0.6166315938744059, 1.055076432212844, 0.7400193126962409, 0.90477126838906, -0.10509107744284621]
print(f"Sample mean:{np.mean(data)}")
# >>>Sample mean:0.24677314572037296
print(f"Sample variance:{np.var(data)}")
# >>>Sample variance:0.9702146524752354
print(f"Sample standard deviation:{np.sqrt(np.var(data))}")
# >>>Sample standard deviation:0.9849947474353533

This data is generated from a standard normal distribution, so the true mean $\mu$ is 0. The 95% confidence interval with 9 degrees of freedom is obtained by rearranging the inequality as follows.

-2.262≤t_{\bar x}≤2.262
-2.262≤\frac{\bar x-\mu}{\frac{s}{\sqrt{n-1}}}≤2.262
\bar x - 2.262\frac{s}{\sqrt{n-1}}≤\mu≤\bar x + 2.262\frac{s}{\sqrt{n-1}}

Since the sample is as small as 10 here, we used the form with the unbiased variance substituted. The number 2.262 is taken from the t distribution table at the intersection of 9 degrees of freedom and an upper-tail probability of 2.5%. Therefore, the population mean is estimated, at the 95% confidence level, to lie in the following interval around the sample mean.

bottom = np.mean(data) - 2.262*(np.sqrt(np.var(data))/(np.sqrt(len(data)-1)))
up = np.mean(data) + 2.262*(np.sqrt(np.var(data))/(np.sqrt(len(data)-1)))
print(f'{bottom} ≤ μ ≤ {up}')
# >>> -0.4959128938458835 ≤ μ ≤ 0.9894591852866295

The same interval can be obtained with the library.

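# The confidence level is passed positionally because its keyword name differs between SciPy versions (alpha in older releases, confidence in newer ones)
bottom, up = stats.t.interval(0.95, loc=np.mean(data), scale=np.sqrt(np.var(data)/(len(data)-1)), df=len(data)-1)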
print(f'{bottom} ≤ μ ≤ {up}')
# >>> -0.49596449533733994 ≤ μ ≤ 0.9895107867780859

Both calculations give nearly the same interval, and it contains the true mean 0.
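The table value 2.262 does not have to be looked up by hand: stats.t.ppf returns it. The sketch below wraps the whole calculation in a helper; the name mean_confidence_interval is hypothetical, and the data list is copied from the output printed earlier so that the result can be reproduced.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(data, confidence=0.95):
    """Two-sided confidence interval for the population mean (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # Upper critical value of the t distribution with n - 1 degrees of freedom
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    # Standard error from the unbiased variance; equal to s / sqrt(n - 1) with the sample variance
    se = np.sqrt(data.var(ddof=1) / n)
    return data.mean() - t_crit * se, data.mean() + t_crit * se

# Values copied from the printed data above
data = [-0.14917153222917484, 0.7951720064790415, 0.662152983830839,
        0.430521357874449, -2.48235088848113, 0.6166315938744059,
        1.055076432212844, 0.7400193126962409, 0.90477126838906,
        -0.10509107744284621]

print(f"t critical value: {stats.t.ppf(0.975, df=len(data) - 1):.3f}")
bottom, up = mean_confidence_interval(data)
print(f"{bottom} ≤ μ ≤ {up}")
```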

Chi-square distribution

The chi-square distribution is a distribution **that can handle multiple variables at once**. It is used, for example, as the distribution of the sample variance.

First, the definition of the chi-square value is shown below.

\chi^2_{(n)}\equiv\sum^n_{i=1}z^2_i=\frac{\sum^n_{i=1}(x_i-\mu)^2}{\sigma^2}

The subscript $n$ of $\chi^2$ is the degrees of freedom. Because the deviations are taken from the population mean, the degrees of freedom equal $n$. As the degrees of freedom increase, the chi-square value also tends to increase.
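The definition can be tried out directly by summing squared standard normal values. In this sketch the seed and the number of trials are arbitrary; the simulated average should be close to the theoretical mean of the chi-square distribution, which equals the degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
trials = 100_000

simulated_means = {}
for n in (1, 3, 5, 10):
    # Each row holds n independent z-values; the row sum of squares is one chi-square value
    z = rng.standard_normal((trials, n))
    chi2_values = (z**2).sum(axis=1)
    simulated_means[n] = chi2_values.mean()
    print(f"n={n:>2}: simulated mean={simulated_means[n]:.2f}, "
          f"theoretical mean={stats.chi2.mean(n):.2f}")
```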

And the probability density function of the chi-square distribution is as follows.

f(x ; k)=\frac{1}{2^{k / 2} \Gamma(k / 2)} x^{k / 2-1} e^{-x / 2}

The distribution is as follows.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots(1, 1, figsize=(10, 7))

# Plot the chi-square distribution for several degrees of freedom
linestyles = [':', '--', '-.', '-']
deg_of_freedom = [1, 3, 5, 10]
for df, ls in zip(deg_of_freedom, linestyles):
    ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls, label=f'df={df}')

plt.xlim(0, 10)
plt.ylim(0, 1.0)

plt.title('chi 2 distribution')
plt.legend()
plt.savefig('chi2distribution.png')
plt.show()

chi2distribution.png

Interval estimation of population variance

The formula above used the population mean $\mu$, which is usually unknown, so let's replace it with the sample mean $\bar x$.

\chi^2_{(n-1)}=\frac{\sum^n_{i=1}(x_i-\bar x)^2}{\sigma^2}

The degrees of freedom are now $n-1$. Expressed using the unbiased variance $\hat\sigma^2$, this becomes

\chi^2_{(n-1)}=\frac{(n-1)\hat \sigma^2}{\sigma^2}

We can see that the $\chi^2$ value is proportional to the unbiased variance. Solving for the population variance gives

\sigma^2=\frac{(n-1)\hat \sigma^2}{\chi^2_{(n-1)}}

This relation lets us estimate a confidence interval for the population variance.

We reuse the data from the t distribution section.

# data is the list generated in the t distribution section; it is not regenerated here
print(data)
# >>> [-0.14917153222917484, 0.7951720064790415, 0.662152983830839, 0.430521357874449, -2.48235088848113, 0.6166315938744059, 1.055076432212844, 0.7400193126962409, 0.90477126838906, -0.10509107744284621]
print(f"Sample mean:{np.mean(data)}")
# >>>Sample mean:0.24677314572037296
print(f"Sample variance:{np.var(data)}")
# >>>Sample variance:0.9702146524752354
print(f"Sample standard deviation:{np.sqrt(np.var(data))}")
# >>>Sample standard deviation:0.9849947474353533

With 9 degrees of freedom, the 95% confidence interval ($\alpha = 0.05$) can be calculated as follows.

\frac{(n-1)\hat \sigma^2}{\chi^2_{(n-1,\alpha/2)}}≤\sigma^2≤\frac{(n-1)\hat \sigma^2}{\chi^2_{(n-1,1-\alpha/2)}}
\frac{(n-1)\hat \sigma^2}{19.02}≤\sigma^2≤\frac{(n-1)\hat \sigma^2}{2.7}

Therefore,

bottom = ((len(data)-1)*np.var(data, ddof=1))/19.02
up = ((len(data)-1)*np.var(data, ddof=1))/2.7
print(f'{bottom} ≤ σ^2 ≤ {up}')
# >>> 0.5101023409438672 ≤ σ^2 ≤ 3.593387601760131

The unbiased variance is obtained by passing ddof=1, as in np.var(data, ddof=1). You can also use the library instead of looking up the values in the chi-square distribution table.

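# The confidence level is passed positionally because its keyword name differs between SciPy versions (alpha in older releases, confidence in newer ones)
chi2_025, chi2_975 = stats.chi2.interval(0.95, df=len(data)-1)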
bottom = ((len(data)-1)*np.var(data, ddof=1))/chi2_975
up = ((len(data)-1)*np.var(data, ddof=1))/chi2_025
print(f'{bottom} ≤ σ^2 ≤ {up}')
# >>> 0.5100281214306344 ≤ σ^2 ≤ 3.5928692971228506
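The two table values used in the manual calculation, 19.02 and 2.7, are the upper and lower 2.5% points of the chi-square distribution with 9 degrees of freedom. As a small sketch, they can be obtained with stats.chi2.ppf; the variable names here are only illustrative.

```python
from scipy import stats

df = 9        # degrees of freedom, n - 1 for a sample of 10
alpha = 0.05  # 1 - confidence level

# Upper 2.5% point (about 19.02) and lower 2.5% point (about 2.70)
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df)
chi2_lower = stats.chi2.ppf(alpha / 2, df)

print(f"upper 2.5% point: {chi2_upper:.3f}")
print(f"lower 2.5% point: {chi2_lower:.3f}")
```

The larger value goes in the denominator of the lower bound and the smaller value in the denominator of the upper bound.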

F distribution

The F distribution is **the distribution followed by a statistic built from two samples drawn at random from two populations**. This property is used to test whether the variances of the two populations from which the samples are drawn are equal. The F value is **the ratio of two $\chi^2$ values, each divided by its degrees of freedom, obtained from samples drawn at random from two normally distributed populations**. It is important that the populations follow a normal distribution.

F_{(\nu_1, \nu_2)}=\frac{\chi^2_{(\nu_1)}/\nu_1}{\chi^2_{(\nu_2)}/\nu_2}
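The definition can be checked by simulation: draw two independent chi-square values, divide each by its degrees of freedom, and take the ratio. In this sketch the seed, the number of trials, and the choice of (9, 9) degrees of freedom are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
nu1, nu2 = 9, 9
trials = 200_000

# Two independent chi-square samples, each scaled by its degrees of freedom
chi2_1 = rng.chisquare(nu1, trials)
chi2_2 = rng.chisquare(nu2, trials)
f_values = (chi2_1 / nu1) / (chi2_2 / nu2)

# The simulated quantiles should be close to those of the F distribution
for q in (0.5, 0.9, 0.95):
    simulated = np.quantile(f_values, q)
    theoretical = stats.f.ppf(q, nu1, nu2)
    print(f"{q:.0%} point: simulated={simulated:.3f}, stats.f.ppf={theoretical:.3f}")
```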

In addition, the probability density function is as follows.

f\left(x ; k_{1}, k_{2}\right)=\frac{\Gamma\left(\frac{k_{1}+k_{2}}{2}\right) x^{\frac{k_{1}-2}{2}}}{\Gamma\left(\frac{k_{1}}{2}\right) \Gamma\left(\frac{k_{2}}{2}\right)\left(1+\frac{k_{1}}{k_{2}} x\right)^{\frac{k_{1}+k_{2}}{2}}}\left(\frac{k_{1}}{k_{2}}\right)^{\frac{k_{1}}{2}}

The distribution is as follows.

fdistribution.png
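The code for this figure is not included in the article. A plot in the same style as the earlier ones could be drawn roughly as follows; the pairs of degrees of freedom and the output file name are assumptions for this sketch and are not taken from the original figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Start slightly above 0 because the density can diverge at exactly 0
x = np.linspace(0.000001, 8, 1000)
fig, ax = plt.subplots(1, 1, figsize=(10, 7))

# (k1, k2) pairs are illustrative choices, not the ones used in the original figure
linestyles = [':', '--', '-.', '-']
deg_of_freedom = [(1, 1), (5, 2), (9, 9), (30, 30)]

curves = {}
for (k1, k2), ls in zip(deg_of_freedom, linestyles):
    curves[(k1, k2)] = stats.f.pdf(x, k1, k2)
    ax.plot(x, curves[(k1, k2)], linestyle=ls, label=f'k = {k1}, {k2}')

ax.set_xlim(0, 8)
ax.set_ylim(0, 1.0)
ax.set_title('F distribution')
ax.legend()
fig.savefig('fdistribution_sketch.png')  # hypothetical file name
```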

F-test

The F-test tests whether there is a difference between the variances of two groups. If the population variances ($\sigma^2_1, \sigma^2_2$) of the two groups are equal, the F value is as follows.

F=\frac{\chi^2_{(\nu_1)}/\nu_1}{\chi^2_{(\nu_2)}/\nu_2}=\frac{\frac{\nu_1\hat \sigma^2_1}{\sigma^2_1}/\nu_1}{\frac{\nu_2\hat \sigma^2_2}{\sigma^2_2}/\nu_2}=\frac{\hat \sigma^2_1}{\hat \sigma^2_2}

If the two samples come from populations with the same variance, the F value should be close to 1. Conversely, the further the F value is above 1 (by convention the larger variance goes in the numerator), the more likely it is that the population variances differ.

Let's actually run the test on the following generated data.

# Sample of 10 Japanese men
np.random.seed(1)
Japan = np.round([np.random.normal(64, 9, 10)],1).reshape(10)
jp_var = np.var(Japan, ddof=1)
# Sample of 10 American men
# The seed is reset to the same value, so both samples are built from the same underlying random draws
np.random.seed(1)
US = np.round([np.random.normal(87, 12, 10)],1).reshape(10)
us_var = np.var(US, ddof=1)
print(f'Unbiased variance of Japanese sample:{jp_var}')
# >>>Unbiased variance of Japanese sample:127.71955555555557
print(f'Unbiased variance of American sample:{us_var}')
# >>>Unbiased variance of American sample:226.57377777777785
print(f'F value:{us_var/jp_var}')
# >>>F value:1.7739944113665709

The hypotheses of the F-test are as follows.

Null hypothesis: the variances of the two groups are equal
Alternative hypothesis: the variances of the two groups differ

In the F distribution with degrees of freedom (9, 9), the yellow area in the graph below corresponds to the p-value. If the p-value is 0.05 or less, the null hypothesis is rejected.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.000001, 8, 1000)
fig, ax = plt.subplots(1, 1, figsize=(10, 7))

# Degrees of freedom as (numerator, denominator); the US variance is the numerator
df = (len(US)-1, len(Japan)-1)
f_value = us_var/jp_var
y = stats.f.pdf(x, df[0], df[1])
ax.plot(x, y, linestyle='-', label=f'k = {df[0]}, {df[1]}')

plt.xlim(0, 8)
plt.ylim(0, 1.0)
# Shade the upper tail beyond the observed F value; its area is the p-value
plt.fill_between(x, y, 0, where=x>=f_value, facecolor='y', alpha=0.5)
plt.title('F distribution')
plt.legend()
plt.savefig('fdistribution_p.png')
print(f'p-value:{stats.f.sf(f_value, df[0], df[1])}')
# >>> p-value:0.20301975133837194
plt.show()

fdistribution_p.png

The p-value is 0.203..., which is larger than 0.05, so the null hypothesis cannot be rejected. In other words, this data gives no evidence that the variances of the two groups differ.
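The same decision can be reached by comparing the F value with the critical value of the F distribution instead of computing the p-value. The sketch below bundles both into a helper; the name f_test_upper is hypothetical, and the data is regenerated exactly as in the code above so that the block runs on its own.

```python
import numpy as np
from scipy import stats

def f_test_upper(sample_a, sample_b):
    """Upper-tail F-test with the larger unbiased variance in the numerator (hypothetical helper)."""
    var_a = np.var(sample_a, ddof=1)
    var_b = np.var(sample_b, ddof=1)
    if var_a >= var_b:
        f_value, dfn, dfd = var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    else:
        f_value, dfn, dfd = var_b / var_a, len(sample_b) - 1, len(sample_a) - 1
    # Upper-tail probability of the observed F value under the null hypothesis
    p_value = stats.f.sf(f_value, dfn, dfd)
    return f_value, p_value

# Same data generation as in the article
np.random.seed(1)
Japan = np.round([np.random.normal(64, 9, 10)], 1).reshape(10)
np.random.seed(1)
US = np.round([np.random.normal(87, 12, 10)], 1).reshape(10)

f_value, p_value = f_test_upper(Japan, US)

# Upper 5% point of the F distribution with (9, 9) degrees of freedom, about 3.18
f_crit = stats.f.ppf(0.95, len(US) - 1, len(Japan) - 1)

print(f"F value: {f_value:.3f}, critical value: {f_crit:.3f}, p-value: {p_value:.3f}")
print("reject H0" if f_value > f_crit else "cannot reject H0")
```

Comparing with the critical value and comparing the p-value with 0.05 are two views of the same rule, so they always lead to the same conclusion.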

References

- Degree of freedom
- Sample variance and unbiased variance
- Standard error
- [Gamma function](https://ja.wikipedia.org/wiki/Gamma function)
- t distribution table
- Chi-square distribution table
