[PYTHON] I tried to estimate the interval.

Interval estimation

This post continues the estimation topic from last time. The previous post covered point estimation, so please take a look if you are interested.

Regarding data

This time, I tried interval estimation using the Pokemon dataset. The data looks like this (screenshot: 2020-10-20.png). As before, I will use the HP column of this dataset as the population for interval estimation.

First, I would like to find the average and variance of HP.

import numpy as np

# df is the Pokemon DataFrame loaded beforehand
score = np.array(df['HP'])
mean = np.mean(score)
var = np.var(score)  # population variance (ddof=0)
print("HP mean: {}, variance: {}".format(mean, var))

HP mean: 69.25875, variance: 651.2042984374999

I will treat these as the population mean and population variance.

Interval estimation of the population mean of the normal distribution (if the population variance is known)

Here we treat the HP data above as the population, assume that it follows a normal distribution, and assume that the population variance is known.

Since we assume a normal distribution for the population, the sample mean $\bar{X}$ follows $N(μ, σ^2/n)$. In other words, the sample mean, used as an estimator, has an expected value equal to the population mean $μ$ and a standard deviation of $\sqrt{σ^2/n}$. The standard deviation of an estimator is called the "standard error".

Also, since the sample mean $\bar{X}$ follows $N(μ, σ^2/n)$, it can be standardized as $Z = (\bar{X}-μ)/\sqrt{σ^2/n}$, and $Z$ follows the standard normal distribution. The advantage of this standardization is that the confidence interval becomes easier to calculate.
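The standard error and the standardization can be checked with a small simulation. This is only a sketch: the Pokemon data is not loaded here, so the population is a hypothetical normal distribution whose `mu` and `sigma` are values I made up for illustration.

```python
import numpy as np

# Hypothetical normal population (NOT the Pokemon HP data); values chosen for illustration.
rng = np.random.default_rng(0)
mu, sigma, n = 69.0, 25.0, 20

# Draw 100,000 samples of size n and take the mean of each sample.
sample_means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

standard_error = sigma / np.sqrt(n)          # theoretical sqrt(sigma^2 / n)
z = (sample_means - mu) / standard_error      # standardized sample means

print("std of the sample means   :", sample_means.std())
print("theoretical standard error:", standard_error)
print("mean and std of Z         :", z.mean(), z.std())
```

If the theory holds, the first two printed values should be close to each other, and Z should have a mean near 0 and a standard deviation near 1.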

First, compute the population mean and variance, and the sample mean and variance of a sample. The sample size is 20.

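np.random.seed(0)
n = 20
sample = np.random.choice(score, n)  # random sample of size n (with replacement by default)

p_mean = np.mean(score)
p_var = np.var(score)

s_mean = np.mean(sample)
s_var = np.var(sample, ddof=1)  # unbiased variance

print("Population mean: {}, population variance: {}".format(p_mean, p_var))
print("Sample mean: {}, sample variance (unbiased variance): {}".format(s_mean, s_var))

Population mean: 69.25875, population variance: 651.2042984374999
Sample mean: 68.8, sample variance (unbiased variance): 451.26000000000005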

Next, I will use this sample mean to calculate a confidence interval for the population mean (assuming the population variance is known). Consider finding a 95% confidence interval for the population mean from the sample mean.

First, standardizing the sample mean $\bar{X}$ yields $Z = (\bar{X}-μ)/\sqrt{σ^2/n}$, so let's start with the 95% interval for $Z$. Writing $z_{α}$ for the upper $100α$% point of the standard normal distribution, we have

P(z_{0.975} ≦ (\bar{X}-μ)/\sqrt{σ^2/n} ≦ z_{0.025}) = 0.95 …①

This expression says that the random variable $Z = (\bar{X}-μ)/\sqrt{σ^2/n}$ falls in the interval $[z_{0.975}, z_{0.025}]$ with probability 95%. Rearranging ① into an inequality for the population mean $μ$ gives

P(\bar{X} - z_{0.025} * \sqrt{σ^2/n} ≦ μ ≦ \bar{X} - z_{0.975} * \sqrt{σ^2/n}) = 0.95

Therefore, when the population variance is known, the 95% confidence interval is

[\bar{X} - z_{0.025} * \sqrt{σ^2/n} , \bar{X} - z_{0.975} * \sqrt{σ^2/n}]

I tried to implement it.

from scipy import stats

rv = stats.norm()
# rv.isf(0.025) is the upper 2.5% point of the standard normal distribution. Multiply it by the standard error.
lcl = s_mean - rv.isf(0.025) * np.sqrt(p_var/n)
ucl = s_mean - rv.isf(0.975) * np.sqrt(p_var/n)
lcl , ucl

(57.616, 79.984)

From the above, we found that the 95% confidence interval for the population mean is (57.616, 79.984). Since the population mean obtained earlier was 69.25875, we can see that the population mean is included in the confidence interval.
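As a cross-check, the same interval can be reproduced from the numbers reported above without loading the dataset. This sketch hard-codes the sample mean and population variance from the article and compares the manual calculation with SciPy's `stats.norm.interval`; the variable names `manual` and `builtin` are my own.

```python
import numpy as np
from scipy import stats

# Values reported in the article.
s_mean = 68.8
p_var = 651.2042984374999
n = 20

se = np.sqrt(p_var / n)  # standard error with a known population variance

# isf(q) is the upper-tail point and ppf(q) the lower-tail point,
# so isf(0.025) and ppf(0.975) are the same number (about 1.96).
z_upper = stats.norm.isf(0.025)
same_point = np.isclose(z_upper, stats.norm.ppf(0.975))

# The normal distribution is symmetric, so isf(0.975) == -isf(0.025).
manual = (s_mean - z_upper * se, s_mean + z_upper * se)
builtin = stats.norm.interval(0.95, loc=s_mean, scale=se)

print(same_point)
print(manual)
print(builtin)
```

Both tuples should agree with the (57.616, 79.984) interval shown above.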

The meaning of this confidence interval is that if we repeat the same sampling many times and compute an interval each time, 95% of those intervals contain the population mean. Put simply, if we estimate the interval 100 times, about 95 of the intervals will contain the population mean and about 5 will not.
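This interpretation is easy to check by simulation. The sketch below is not the article's data: it assumes a hypothetical normal population (`mu` and `sigma` are made-up values), repeats the sampling 10,000 times, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

# Hypothetical normal population (NOT the Pokemon HP data); values chosen for illustration.
rng = np.random.default_rng(1)
mu, sigma, n = 69.0, 25.0, 20
trials = 10_000

z = stats.norm.isf(0.025)      # upper 2.5% point of the standard normal distribution
se = sigma / np.sqrt(n)        # standard error with a known population variance

# One row per repetition: draw a sample of size n and build its 95% interval.
means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
lcl = means - z * se
ucl = means + z * se

coverage = np.mean((lcl <= mu) & (mu <= ucl))
print("fraction of intervals containing mu:", coverage)
```

The printed fraction should land close to 0.95, not exactly on it, since the count itself is random.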

Interval estimation of population variance

We will estimate the interval of the population variance. Let's consider the case where a normal distribution is assumed for the population and the population mean is not known.

Just as we standardized the sample mean into a random variable that follows the standard normal distribution when calculating the confidence interval for the population mean, we need to transform the unbiased variance $s^2$ into a random variable that follows a well-known probability distribution. The distribution used here is the chi-square distribution. It is known that transforming the unbiased variance $s^2$ into $Y = (n-1)s^2/σ^2$ gives a variable $Y$ that follows a chi-square distribution with $n-1$ degrees of freedom.
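The claim about the chi-square distribution can be checked numerically. The following is a sketch under the assumption of a hypothetical normal population (`mu` and `sigma` are illustrative values, not taken from the article): it compares the simulated mean and variance of Y with those of a chi-square distribution with n-1 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical normal population (NOT the Pokemon HP data); values chosen for illustration.
rng = np.random.default_rng(2)
mu, sigma, n = 69.0, 25.0, 20
trials = 100_000

samples = rng.normal(mu, sigma, size=(trials, n))
s2 = samples.var(axis=1, ddof=1)      # unbiased variance of each sample
y = (n - 1) * s2 / sigma**2           # the transformed variable Y

chi2 = stats.chi2(df=n - 1)
print("mean of Y    :", y.mean(), " theory:", chi2.mean())   # theory: n - 1
print("variance of Y:", y.var(), " theory:", chi2.var())     # theory: 2 * (n - 1)

# Fraction of Y values between the lower and upper 2.5% points.
inside = np.mean((chi2.isf(0.975) <= y) & (y <= chi2.isf(0.025)))
print("fraction inside the 95% range:", inside)
```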

Now, let's find the confidence interval for the population variance. First, find the 95% interval for $\chi^2(n-1)$.

P(\chi{}^2_{0.975}(n-1) ≦ (n-1)s^2/σ^2 ≦\chi{}^2_{0.025}(n-1)) = 0.95

Since we want the confidence interval of the population variance, rearrange the inequality so that $σ^2$ is in the middle.

P((n-1)s^2/\chi{}^2_{0.025}(n-1) ≦ σ^2 ≦(n-1)s^2/\chi{}^2_{0.975}(n-1)) = 0.95

From this, the 95% confidence interval for the population variance $σ^2$ is

[(n-1)s^2/\chi{}^2_{0.025}(n-1) , (n-1)s^2/\chi{}^2_{0.975}(n-1)]

rv = stats.chi2(df=n-1)
lcl = (n-1) * s_var / rv.isf(0.025)
hcl = (n-1) * s_var / rv.isf(0.975)

lcl , hcl

(260.984, 962.659)

The confidence interval for the population variance is (260.984, 962.659). Since the population variance was 651.204, we can see that it is included in the interval.
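The same calculation can be wrapped in a small function so that other confidence levels are easy to try. `variance_interval` is a hypothetical helper name of my own, and the inputs are the sample values reported earlier in the article.

```python
from scipy import stats

def variance_interval(s_var, n, confidence=0.95):
    """Confidence interval for the variance of a normal population (hypothetical helper)."""
    alpha = 1 - confidence
    rv = stats.chi2(df=n - 1)
    lower = (n - 1) * s_var / rv.isf(alpha / 2)        # divide by the upper chi-square point
    upper = (n - 1) * s_var / rv.isf(1 - alpha / 2)    # divide by the lower chi-square point
    return lower, upper

# Sample values reported in the article.
s_var, n = 451.26000000000005, 20

lcl, hcl = variance_interval(s_var, n)
print(lcl, hcl)

# The chi-square distribution is skewed, so the interval is not symmetric around s_var.
print("distance below s_var:", s_var - lcl)
print("distance above s_var:", hcl - s_var)

# A higher confidence level gives a wider interval.
lcl99, hcl99 = variance_interval(s_var, n, confidence=0.99)
print(lcl99, hcl99)
```

Unlike the interval for the mean, this one extends much further above the unbiased variance than below it.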

Interval estimation of population mean (when population variance is not known)

Earlier, I found the confidence interval for the population mean under the assumption that the population variance was known. However, situations where the population variance is known while the population mean is unknown are rare. Therefore, I will now estimate the confidence interval of the population mean when the population variance is unknown.

When the population variance was known, the interval was estimated using the standard error $\sqrt{σ^2/n}$ of the sample mean $\bar{X}$. Since we do not know the population variance $σ^2$ this time, we use $\sqrt{s^2/n}$ as the standard error instead, replacing $σ^2$ with its estimator, the unbiased variance $s^2$.

First, transform the sample mean $\bar{X}$ using $\sqrt{s^2/n}$, in the same way as when the population variance was known.

t = (\bar{X} - μ) / \sqrt{s^2/n}

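This $t$ can be rewritten using the two variables introduced earlier,

Y = (n-1)s^2/σ^2

Z = (\bar{X}-μ)/\sqrt{σ^2/n}

as

t = Z / \sqrt{Y/(n-1)}

Here $Z$ follows the standard normal distribution, $Y$ follows a chi-square distribution with $n-1$ degrees of freedom, and for a normal population the two are independent. Therefore, this $t$ follows a t distribution with $n-1$ degrees of freedom.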
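Both the algebraic identity and the resulting t distribution can be checked by simulation. This sketch assumes a hypothetical normal population (`mu` and `sigma` are illustrative values, not from the article) and compares the tail probability of the simulated statistic against the t and normal critical values.

```python
import numpy as np
from scipy import stats

# Hypothetical normal population (NOT the Pokemon HP data); values chosen for illustration.
rng = np.random.default_rng(3)
mu, sigma, n = 69.0, 25.0, 20
trials = 100_000

samples = rng.normal(mu, sigma, size=(trials, n))
x_bar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)

z = (x_bar - mu) / np.sqrt(sigma**2 / n)   # standard normal
y = (n - 1) * s2 / sigma**2                # chi-square with n-1 degrees of freedom

t_direct = (x_bar - mu) / np.sqrt(s2 / n)
t_from_zy = z / np.sqrt(y / (n - 1))
print("identity holds:", np.allclose(t_direct, t_from_zy))

# With the t critical value, about 5% of the statistics fall outside.
t_crit = stats.t(df=n - 1).isf(0.025)
tail = np.mean(np.abs(t_direct) > t_crit)

# With the normal critical value (about 1.96), more than 5% fall outside.
z_crit = stats.norm.isf(0.025)
tail_if_normal = np.mean(np.abs(t_direct) > z_crit)

print("tail fraction with t critical value     :", tail)
print("tail fraction with normal critical value:", tail_if_normal)
```

The second tail fraction exceeds 5%, which shows why the normal critical value would give intervals that are too narrow when the variance is estimated from the sample.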

Now that we know $t = (\bar{X}-μ)/\sqrt{s^2/n}$ follows a t distribution with $n-1$ degrees of freedom, we can derive the 95% confidence interval for the population mean.

P(t_{0.975}(n-1)≦ (\bar{X} - μ) / \sqrt{s^2/n} ≦ t_{0.025}(n-1)) = 0.95

This equation is transformed so that the population mean μ is in the middle.

P(\bar{X} - t_{0.025}(n-1) * \sqrt{s^2/n}≦ μ ≦ \bar{X} - t_{0.975}(n-1)*\sqrt{s^2/n}) = 0.95

This gives the 95% confidence interval for the population mean:

[ \bar{X} - t_{0.025}(n-1) * \sqrt{s^2/n} , \bar{X} - t_{0.975}(n-1)*\sqrt{s^2/n} ]

rv = stats.t(df=n-1)
lcl = s_mean - rv.isf(0.025) * np.sqrt(s_var/n)
ucl = s_mean - rv.isf(0.975) * np.sqrt(s_var/n)

lcl , ucl

(58.858, 78.742)

We found that the 95% confidence interval for the population mean is (58.858, 78.742). Since the population mean obtained earlier was 69.25875, we can see that the population mean is included in the confidence interval.
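As a cross-check, the t-based interval can be reproduced from the sample values reported earlier and compared with SciPy's `stats.t.interval`. This is a sketch with the numbers hard-coded from the article; the names `manual` and `builtin` are my own.

```python
import numpy as np
from scipy import stats

# Sample values reported in the article.
s_mean = 68.8
s_var = 451.26000000000005
n = 20

se = np.sqrt(s_var / n)                 # standard error estimated from the sample
t_crit = stats.t(df=n - 1).isf(0.025)   # upper 2.5% point of t(n-1)

# The t distribution is symmetric, so isf(0.975) == -isf(0.025).
manual = (s_mean - t_crit * se, s_mean + t_crit * se)
builtin = stats.t.interval(0.95, n - 1, loc=s_mean, scale=se)

print(manual)
print(builtin)

# The t critical value is larger than the normal one (about 1.96).
print(t_crit, stats.norm.isf(0.025))
```

Both tuples should agree with the (58.858, 78.742) interval computed above.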

Summary

This time I tried interval estimation. Actually implementing things by hand deepened my understanding, so I think it is worth working through the code and writing up the results yourself!


Reference materials
Basics of statistical analysis understood by python
