[PYTHON] I tried interval estimation.

Interval estimation

This post continues the topic of estimation from last time. Last time I covered point estimation, so please take a look if you are interested.

About the data

This time, I tried interval estimation using the Pokemon dataset. The data is as below. As before, I use the HP column of this dataset as the population for interval estimation.

First, I would like to find the average and variance of HP.

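import numpy as np

# df is the Pokemon DataFrame loaded beforehand
score = np.array(df['HP'])
mean = np.mean(score)
var = np.var(score)  # ddof=0, i.e. the variance of the whole population
print("HP average: {}, variance: {}".format(mean, var))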

HP average: 69.25875, variance: 651.2042984374999

I will treat these as the population mean and population variance.

Interval estimation of the population mean of the normal distribution (if the population variance is known)

Here, we treat the HP data from earlier as the population, assume that it follows a normal distribution, and assume that the population variance is known.

Since we assume a normal distribution for the population, the sample mean $\bar{X}$ follows $N(\mu, \sigma^2/n)$. In other words, the sample mean, as an estimator, has expected value equal to the population mean $\mu$ and standard deviation $\sqrt{\sigma^2/n}$. The standard deviation of an estimator like this is called the "standard error".

Also, since the sample mean $\bar{X}$ follows $N(\mu, \sigma^2/n)$, it can be standardized as $Z = (\bar{X} - \mu) / \sqrt{\sigma^2/n}$, and $Z$ follows the standard normal distribution. The benefit of this standardization is that it makes the confidence interval easier to calculate.
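To get a feel for the standard error, here is a minimal simulation sketch. It does not use the Pokemon data: the values of `mu`, `sigma2`, and the number of repetitions are illustrative assumptions, and the population is assumed to be exactly normal. It checks that the spread of many sample means is close to $\sqrt{\sigma^2/n}$ and that the standardized values have mean near 0 and standard deviation near 1.

```python
import numpy as np

# Assumed illustrative values (not taken from the Pokemon data)
mu, sigma2, n = 70.0, 650.0, 20

rng = np.random.default_rng(0)
# Draw 100,000 samples of size n from N(mu, sigma2) and take each sample mean
samples = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=(100_000, n))
sample_means = samples.mean(axis=1)

standard_error = np.sqrt(sigma2 / n)   # theoretical standard error
empirical_sd = sample_means.std()      # spread of the simulated sample means

# Standardize each sample mean; Z should look like a standard normal variable
z = (sample_means - mu) / standard_error

print("theoretical standard error:", standard_error)
print("empirical sd of sample means:", empirical_sd)
print("mean and sd of Z:", z.mean(), z.std())
```

The two standard error figures should agree closely, which is what the statement $\bar{X} \sim N(\mu, \sigma^2/n)$ predicts.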

First, compute the population mean and variance, along with the mean and variance of a sample. The sample size is 20.

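np.random.seed(0)
n = 20
# Draw a sample of size n from the population (with replacement by default)
sample = np.random.choice(score , n)

# Population mean and population variance
p_mean = np.mean(score)
p_var = np.var(score)

# Sample mean and unbiased sample variance (ddof=1)
s_mean = np.mean(sample)
s_var = np.var(sample , ddof = 1)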

Population mean: 69.25875, population variance: 651.2042984374999
Sample mean: 68.8, sample variance (unbiased variance): 451.26000000000005

Next, I use this sample mean to calculate a confidence interval for the population mean, assuming the population variance is known. Consider a 95% confidence interval. Standardizing the sample mean $\bar{X}$ gives $Z = (\bar{X} - \mu) / \sqrt{\sigma^2/n}$, so let's first consider the 95% interval for $Z$. Writing $z_{\alpha}$ for the upper $100\alpha$% point of the standard normal distribution, we get the inequality

$$P(z_{0.975} \le (\bar{X} - \mu) / \sqrt{\sigma^2/n} \le z_{0.025}) = 0.95 \quad \cdots \text{①}$$

This says that the random variable $Z = (\bar{X} - \mu) / \sqrt{\sigma^2/n}$ falls in the interval $[z_{0.975}, z_{0.025}]$ with probability 95%. Rearranging ① into an inequality for the population mean $\mu$ gives

$$P(\bar{X} - z_{0.025} \sqrt{\sigma^2/n} \le \mu \le \bar{X} - z_{0.975} \sqrt{\sigma^2/n}) = 0.95$$

Therefore, the 95% confidence interval when the population variance is known is

$$[\bar{X} - z_{0.025} \sqrt{\sigma^2/n} , \ \bar{X} - z_{0.975} \sqrt{\sigma^2/n}]$$

Since $z_{0.975} = -z_{0.025}$, the upper limit is the same as $\bar{X} + z_{0.025} \sqrt{\sigma^2/n}$.

Here is the implementation.

from scipy import stats

rv = stats.norm()
# rv.isf(0.025) is the upper 2.5% point of the standard normal distribution.
# Multiplying it by the standard error gives the half-width of the interval.
lcl = s_mean - rv.isf(0.025) * np.sqrt(p_var/n)
ucl = s_mean - rv.isf(0.975) * np.sqrt(p_var/n)
lcl , ucl

(57.616, 79.984)

From the above, we found that the 95% confidence interval for the population mean is (57.616, 79.984). Since the population mean obtained earlier was 69.25875, we can see that the population mean is included in the confidence interval.

A 95% confidence interval means that if we repeat the sampling many times in the same way and compute the interval each time, about 95% of those intervals contain the population mean. Put simply, if we estimate the interval 100 times, roughly 95 of the intervals will contain the population mean and roughly 5 will not.
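This interpretation can be checked with a small simulation. The sketch below is only an illustration under assumptions: the population is a synthetic normal distribution with made-up `mu` and `sigma2` (not the actual HP data), and `z_interval` is a hypothetical helper name. It uses `statistics.NormalDist` from the standard library to get the upper 2.5% point.

```python
import numpy as np
from statistics import NormalDist

def z_interval(sample, pop_var, conf=0.95):
    """Confidence interval for the mean when the population variance is known."""
    n = len(sample)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # upper 2.5% point when conf=0.95
    half_width = z * np.sqrt(pop_var / n)
    m = np.mean(sample)
    return m - half_width, m + half_width

# Assumed normal population (illustrative values, not the actual HP data)
mu, sigma2, n, trials = 70.0, 650.0, 20, 10_000
rng = np.random.default_rng(0)

hits = 0
for _ in range(trials):
    sample = rng.normal(mu, np.sqrt(sigma2), size=n)
    lcl, ucl = z_interval(sample, sigma2)
    hits += int(lcl <= mu <= ucl)  # count intervals that contain the true mean

coverage = hits / trials
print("share of intervals containing mu:", coverage)
```

The printed share should land near 0.95, although it will not be exactly 0.95 because the simulation itself is random.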

Interval estimation of population variance

Next, let's find a confidence interval for the population variance. We consider the case where the population is assumed to be normally distributed and the population mean is unknown.

Just as we standardized the sample mean into a random variable that follows the standard normal distribution when computing the confidence interval for the population mean, we need to transform the unbiased variance $s^2$ into a random variable that follows a well-known probability distribution. The distribution used here is the chi-square distribution. It is known that the variable $Y = (n-1)s^2/\sigma^2$ follows a chi-square distribution with $n-1$ degrees of freedom.
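As a rough sanity check of this claim, the following sketch simulates $Y = (n-1)s^2/\sigma^2$ many times and compares its mean and variance with those of a chi-square distribution with $n-1$ degrees of freedom, which are $n-1$ and $2(n-1)$. The population parameters are illustrative assumptions, not the HP data.

```python
import numpy as np

# Assumed illustrative values (not taken from the Pokemon data)
mu, sigma2, n = 70.0, 650.0, 20
rng = np.random.default_rng(0)

# 100,000 samples of size n from a normal population
samples = rng.normal(mu, np.sqrt(sigma2), size=(100_000, n))
s2 = samples.var(axis=1, ddof=1)   # unbiased variance of each sample
y = (n - 1) * s2 / sigma2          # Y = (n-1) s^2 / sigma^2

# A chi-square distribution with k degrees of freedom has mean k and variance 2k
print("mean of Y:", y.mean(), "expected:", n - 1)
print("variance of Y:", y.var(), "expected:", 2 * (n - 1))
```

Matching the first two moments does not prove the distribution is chi-square, but it is a quick way to see that the transformation behaves as stated.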

Now let's find the confidence interval for the population variance. First, consider the 95% interval for $Y$, which follows $\chi^2(n-1)$. Writing $\chi^2_{\alpha}(n-1)$ for the upper $100\alpha$% point of that distribution:

$$P(\chi^2_{0.975}(n-1) \le (n-1)s^2/\sigma^2 \le \chi^2_{0.025}(n-1)) = 0.95$$

Since we want the confidence interval for the population variance, rearrange the inequality so that $\sigma^2$ is in the middle.

$$P((n-1)s^2/\chi^2_{0.025}(n-1) \le \sigma^2 \le (n-1)s^2/\chi^2_{0.975}(n-1)) = 0.95$$

From this, the 95% confidence interval for the population variance $\sigma^2$ is

$$[(n-1)s^2/\chi^2_{0.025}(n-1) , \ (n-1)s^2/\chi^2_{0.975}(n-1)]$$

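rv = stats.chi2(df=n-1)
# Dividing by the upper 2.5% point gives the lower limit,
# and dividing by the lower 2.5% point gives the upper limit.
lcl = (n-1) * s_var / rv.isf(0.025)
hcl = (n-1) * s_var / rv.isf(0.975)

lcl , hcl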

(260.984, 962.659)

The confidence interval for the population variance is (260.984, 962.659). Since the population variance was 651.204, we can see that it is included in the interval.
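The same calculation can be wrapped in a small function so that the confidence level becomes a parameter. This is only a sketch: `variance_interval` is a hypothetical helper name, and the inputs are the unbiased variance 451.26 and sample size 20 reported above.

```python
from scipy import stats

def variance_interval(s2, n, conf=0.95):
    """Confidence interval for the population variance of a normal population."""
    alpha = 1 - conf
    rv = stats.chi2(df=n - 1)
    lower = (n - 1) * s2 / rv.isf(alpha / 2)      # divide by the upper point
    upper = (n - 1) * s2 / rv.isf(1 - alpha / 2)  # divide by the lower point
    return lower, upper

# Values taken from the sample above: unbiased variance 451.26, n = 20
lcl, hcl = variance_interval(451.26, 20)
print("95%:", lcl, hcl)

# A higher confidence level gives a wider interval
for conf in (0.90, 0.99):
    print(conf, variance_interval(451.26, 20, conf=conf))
```

Note that the interval is not symmetric around the unbiased variance, because the chi-square distribution itself is skewed.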

Interval estimation of population mean (when population variance is not known)

Earlier, I computed the confidence interval for the population mean assuming the population variance was known. In practice, however, it is rare to know the population variance while not knowing the population mean. So here I estimate the confidence interval for the population mean when the population variance is unknown.

When the population variance was known, the interval was built from the standard error $\sqrt{\sigma^2/n}$ of the sample mean $\bar{X}$. This time we do not know the population variance $\sigma^2$, so we use $\sqrt{s^2/n}$ as the standard error instead, replacing $\sigma^2$ with its estimator, the unbiased variance $s^2$.

First, transform the sample mean $\bar{X}$ using $\sqrt{s^2/n}$, just as we did when the population variance was known.

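$$t = (\bar{X} - \mu) / \sqrt{s^2/n}$$

Using the two variables

$$Y = (n-1)s^2/\sigma^2, \quad Z = (\bar{X} - \mu) / \sqrt{\sigma^2/n}$$

this $t$ can be written as

$$t = Z / \sqrt{Y/(n-1)}$$

Here $Z$ follows the standard normal distribution, $Y$ follows a chi-square distribution with $n-1$ degrees of freedom, and for a normal population the two are independent. Therefore, we can see that this $t$ follows a t distribution with $n-1$ degrees of freedom.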
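The identity $t = Z / \sqrt{Y/(n-1)}$ is pure algebra (the $\sigma^2$ terms cancel), so it can be checked numerically on any single sample. The sketch below does that with a synthetic normal sample; the values of `mu` and `sigma2` are illustrative assumptions.

```python
import numpy as np

# Assumed illustrative values (not taken from the Pokemon data)
mu, sigma2, n = 70.0, 650.0, 20
rng = np.random.default_rng(0)
sample = rng.normal(mu, np.sqrt(sigma2), size=n)

x_bar = sample.mean()
s2 = sample.var(ddof=1)  # unbiased variance

z = (x_bar - mu) / np.sqrt(sigma2 / n)     # standardized with the true variance
y = (n - 1) * s2 / sigma2                  # chi-square variable
t_direct = (x_bar - mu) / np.sqrt(s2 / n)  # definition of t
t_from_zy = z / np.sqrt(y / (n - 1))       # t rebuilt from Z and Y

print(t_direct, t_from_zy)
```

Both printed values agree up to floating-point error, which shows that $t$ does not depend on the unknown $\sigma^2$ even though $Z$ and $Y$ individually do.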

Now that we know $t = (\bar{X} - \mu) / \sqrt{s^2/n}$ follows a t distribution with $n-1$ degrees of freedom, let's find the 95% confidence interval for the population mean.

$$P(t_{0.975}(n-1) \le (\bar{X} - \mu) / \sqrt{s^2/n} \le t_{0.025}(n-1)) = 0.95$$

Rearrange this so that the population mean $\mu$ is in the middle.

$$P(\bar{X} - t_{0.025}(n-1) \sqrt{s^2/n} \le \mu \le \bar{X} - t_{0.975}(n-1) \sqrt{s^2/n}) = 0.95$$

This gives the 95% confidence interval for the population mean:

$$[\bar{X} - t_{0.025}(n-1) \sqrt{s^2/n} , \ \bar{X} - t_{0.975}(n-1) \sqrt{s^2/n}]$$

rv = stats.t(df=n-1)
# Same form as before, but with t-distribution points and the unbiased variance
lcl = s_mean - rv.isf(0.025) * np.sqrt(s_var/n)
ucl = s_mean - rv.isf(0.975) * np.sqrt(s_var/n)

lcl , ucl

(58.858, 78.742)

We found that the 95% confidence interval for the population mean is (58.858, 78.742). Since the population mean obtained earlier was 69.25875, we can see that the population mean is included in the confidence interval.
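To see why the t distribution matters when the variance is estimated, the sketch below compares two intervals built from the same estimated standard error $\sqrt{s^2/n}$: one using the standard normal point and one using the t point. The population is a synthetic normal distribution with assumed parameters, not the HP data, and the number of trials is arbitrary.

```python
import numpy as np
from scipy import stats

# Assumed illustrative values (not taken from the Pokemon data)
mu, sigma2, n, trials = 70.0, 650.0, 20, 20_000
rng = np.random.default_rng(0)

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
x_bar = samples.mean(axis=1)
se = np.sqrt(samples.var(axis=1, ddof=1) / n)  # estimated standard error

z_q = stats.norm().isf(0.025)       # upper 2.5% point of the standard normal
t_q = stats.t(df=n - 1).isf(0.025)  # upper 2.5% point of t with n-1 df

# Share of intervals x_bar +/- q * se that contain the true mean
z_cover = np.mean(np.abs(x_bar - mu) <= z_q * se)
t_cover = np.mean(np.abs(x_bar - mu) <= t_q * se)

print("coverage with normal point:", z_cover)
print("coverage with t point:", t_cover)
```

With a sample size as small as 20, the interval based on the normal point tends to cover the true mean less often than 95%, while the t-based interval stays close to 95%.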

Summary

This time I tried interval estimation. Actually writing the code myself deepened my understanding, so I want to keep implementing things hands-on and writing them up!

Reference materials
Basics of statistical analysis understood by python