[PYTHON] I tried to confirm whether the unbiased estimator of standard deviation is really unbiased by "throwing a coin 10,000 times"

When I was a student [^ major], when I touched the field of statistics in mathematics, I learned something like this.

When you pick some samples from the population and infer population trends from them ** (sampling) **

Probabilistically ** the sample mean represents the population mean **, but

A measure of ** "variation", such as standard deviation, ** tends to be stochastically less than that of the population ** of the sample.

in short,

** The standard deviation of the sample is smaller than the population **! ΩΩΩ

about it.

Immediately after that, I was taught the formula of ** "unbiased variance" ** by dividing by $ n -1 $ instead of the number of samples $ n $. Have you ever wanted to make sure?

Even if the proof of proper algebra says that "the expected value of unbiased variance is equal to the variance of the population", that is certainly the case, but after all ** I do not see concrete data [^ abstraction]. Haven't you experienced that it's not so refreshing **?

However, ** there is no data to confirm it **, and considering the bias of the sample, ** I can't be sure if I calculate it once or twice ** …….

wait a minute. Isn't the box (or board) in front of you ** for that **? ~~ I feel different. But at least it shouldn't be there to instigate each other on ** 5ch **. ~~ If you have a computer, you can throw coins 10,000 times ** because it's ** before breakfast **.

A simple question that was stolen in childhood, ** Let's hit it with computer science **.

[^ major]: My major was ghost studies. [^ abstraction]: Mathematics is a discipline that emphasizes abstraction (generalization) rather than concreteness (example), so it can't be helped.

Live in the virtual world

This time, we will use ** Python + JupyterLab ** to run the simulation. However, it doesn't matter what you use **. For the time being ** I'm using it for business ** Recently I've been posting strangely ** Python articles to Qiita **, so it doesn't mean anything more than that. So ** Stop mount battles and other crazy things in programming languages right now **. Especially in natural language [^ English].

`In[]`


import math

import matplotlib.pyplot as plt
import numpy as np

This time, the population is assumed to be a population that follows a normal distribution with a mean of 50 and a standard deviation of 10. As an aside, this $ \ mathcal {N} (50, 10 ^ 2) $ is also a ** definition of deviation value **. Generally, only the part that "transformed so that the average becomes 50" is known, but the part that ** standard deviation 10 ** is ** quite important . Thanks to this, " how outstanding the results are **" can be discussed in a unified manner. In addition, there are many people who make a ridiculous misunderstanding ** because the lower limit is 0 and the upper limit is 100 ** because they choose a meaningful number such as 50 [^ deviation].

`In[]`


mu = 50.
sigma = 10.

The number of specimens to be extracted is ** 10 **. It depends on the size of the population, but I think this is ** quite small ** for the survey. At this time, it was said that there were at least 2000 people even in trivia seeds ** Hey, is this article different from too many derailments **? Does even such a small sample properly match the population when viewed at the expected value? I'm looking forward to it.

`In[]`


samples_num = 10

[^ English]: I wonder why there are a certain number of people who diss the irregularity of English as if they were the heads of demons. It's unpleasant. [^ deviation]: When I was a kid, I was worried that if I transformed the below average from 0 to 50 and the above average from 50 to 100, it would be a very distorted distribution. I'm alone.

average

First, let's take a look at the ** average **, which is an index that is said to be "no problem (probabilistically) used as it is in a sample survey".

`In[]`


np.random.seed(555)

mean_list = []
for _ in range(10000):
    samples = np.random.normal(mu, sigma, samples_num)
    mean_list.append(np.mean(samples))
mean = np.array(mean_list)

plt.hist(mean, bins=30, density=True, histtype='stepfilled')
plt.title(F'Mean: $\\mu = {np.mean(mean):.3f}$, $\\sigma = {np.std(mean):.3f}$')
plt.show()

As a result, the average of the results of 10000 sample surveys was ** 50.022 **. Since the average of the population is ** 50 **, it can be estimated ** properly **. I don't know if the standard deviation 3.218 is more or less, but it's probably ** more samples and less variability **.

standard deviation

Next is today's protagonist, ** Standard Deviation **.

`In[]`


np.random.seed(913)

std_list = []
for _ in range(10000):
    samples = np.random.normal(mu, sigma, samples_num)
    std_list.append(np.std(samples))
std = np.array(std_list)

plt.hist(std, bins=30, density=True, histtype='stepfilled')
plt.title(F'Standard Deviation: $\\mu = {np.mean(std):.3f}$, $\\sigma = {np.std(std):.3f}$')
plt.show()

As a result, the average of 10000 sample surveys was ** 9.235 **. The standard deviation of the population is ** 10 **, so it is certainly ** underestimated **.

Bias of an estimator of standard deviation [^ unbiased]

The ** standard deviation ** is the ** (non-negative) ** square root ** of the variance. ** Variance ** is the ** average ** of the ** "difference from the mean squared" of each data. $ s = \sqrt{s^2} = \sqrt{\frac{1}{n}\sum_{i = 1}^n(x_i - \overline{x})^2} $ When randomly selecting a sample, ** it is unlikely that a large or small value that deviates too much from the mean will be selected ** (aside from variability such as uniform distribution and shit distribution). It will be underestimated.

The reason for this is the value obtained by dividing by ** $ n -1 $ instead of dividing by $ n $ to obtain the average, assuming that you refer to books and websites that are written more seriously about various statistics. ** matches the expected value ** of the population variance **. $ u^2 = \frac{1}{n - 1}\sum_{i = 1}^{n}(x_i - \overline{x})^2 = \frac{n}{n - 1}s^2 $

E(u^2) = \sigma^2

Then ** does the expected value of the square root ** match the standard deviation of the population **? ** Slightly different **, it seems to be calculated by the following formula. $ D = \sqrt{\frac{n - 1}{2}}\frac{\Gamma\left(\frac{n - 1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}u = \sqrt{\frac{n}{2}}\frac{\Gamma\left(\frac{n - 1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}s $

E(D) = \sigma

** What is this formula **. By the way, it seems that $ \ Gamma (n) = (n -1)! $ In the range of natural numbers. ** I don't know **.

Anyway, referring to this description, I will write a function to find the coefficient of $ s $.

`In[]`


def ustd_coefficient(n):
    try:
        return math.sqrt(n / 2) * math.gamma((n - 1) / 2) / math.gamma(n / 2)
    except OverflowError:
        return math.sqrt(n / (n - 1.5))

`In[]`


np.random.seed(333)

D_list = []
for _ in range(10000):
    samples = np.random.normal(mu, sigma, samples_num)
    D_list.append(np.std(samples) * ustd_coefficient(len(samples)))
D = np.array(D_list)

plt.hist(D, bins=30, density=True, histtype='stepfilled')
plt.title(F'Unbiased Standard Deviation: $\\mu = {np.mean(D):.3f}$, $\\sigma = {np.std(D):.3f}$')
plt.show()

As a result, the average of 10000 sample surveys was ** 10.024 **. The standard deviation of the population is ** 10 **, so it seems that you can certainly estimate the standard deviation of the population **.

[^ unbiased]: This ridiculously lengthy phrase is due to the confusion of terms in this area. Is the sample variance (standard deviation) called the "sample variance (sample standard deviation)", or is the unbiased estimator called the "sample variance (sample standard deviation)" because it is used for sample surveys? Is the unbiased standard deviation the "unbiased estimator of the standard deviation of the population" or the "square root of the unbiased estimator of the population variance"? Ah, it's messed up.

Median

From here on, there is a ** bonus **.

People who bite a little bit of statistics seem to be dyed with ideas like ** "The average is shit! Let's get out of average addiction!" ** ^{[citation needed] </ sup>, but such people have been decided The ** median ** is what you bring out. At this time, let's try this too.}

** In the normal distribution, the mean and median match **, so the median of the population is also 50 **.

I will try to predict what will happen to me. First, ** median ** is the value that divides all values into 1: 1 **. So, when taking a sample, the probability that it will be ** below the median of the population ** and the probability that it will be ** above ** is clearly ** equal ** by definition. .. So the value that divides the sample into 1: 1 would still ** converge to that of the population **.

Let's check it.

`In[]`


np.random.seed(753)

median_list = []
for _ in range(10000):
    samples = np.random.normal(mu, sigma, samples_num)
    median_list.append(np.median(samples))
median = np.array(median_list)

plt.hist(median, bins=30, density=True, histtype='stepfilled')
plt.title(F'Median: $\\mu = {np.mean(median):.3f}$, $\\sigma = {np.std(median):.3f}$')
plt.show()

As a result, the average of 10000 sample surveys was ** 50.040 **. Since the median of the population is ** 50 **, as expected, it can be estimated correctly from the ** sample **.

Regular interquartile range

Finally, let's look at the ** regular interquartile range **. As confirmed in Previous article, ** the standard deviation and the normal interquartile range match in the normal distribution **, so ** population The normal interquartile range of is also 10 **.

Ah, NumPy doesn't have a function to calculate the normal interquartile range, so I'll make it up.

`In[]`


def np_iqr(a, axis=None, out=None, overwrite_input=False, keepdims=False):
    q1 = np.quantile(a, q=0.25, axis=axis, overwrite_input=overwrite_input, keepdims=keepdims)
    q3 = np.quantile(a, q=0.75, axis=axis, overwrite_input=overwrite_input, keepdims=keepdims)
    return np.subtract(q3, q1, out=out)

def np_niqr(a, axis=None, out=None, overwrite_input=False, keepdims=False):
    return np.divide(np.iqr(a, axis=axis, overwrite_input=overwrite_input, keepdims=keepdims), 1.3489795003921634, out=out)

np.iqr = np_iqr
np.niqr = np_niqr

I will try to predict what will happen to me. First, the ** regular interquartile range ** is the ** difference between the top and bottom quantiles **. As the extreme of the quantiles, when considering the ** minimum and maximum **, the ** sample minimum ** is often ** greater than the population minimum ** * * Probably, the ** sample maximum ** will most likely be ** smaller ** than the population maximum **. That is, ** sample quantiles tend to be biased inward of that of the population ** (the median was just the middle quantile and was unaffected). As a result, we can expect that the ** regular interquartile range is underestimated ** than that of the population **.

Let's check it.

`In[]`


np.random.seed(315)

niqr_list = []
for _ in range(10000):
    samples = np.random.normal(mu, sigma, samples_num)
    niqr_list.append(np.niqr(samples))
niqr = np.array(niqr_list)

plt.hist(niqr, bins=30, density=True, histtype='stepfilled')
plt.title(F'Normalized Interquartile Range: $\\mu = {np.mean(niqr):.3f}$, $\\sigma = {np.std(niqr):.3f}$')
plt.show()

As a result, the average of 10000 sample surveys was ** 8.640 **. Since the normal interquartile range of the population is ** 10 **, it is still ** underestimated ** as expected **.

Afterword

How do you find the unbiased estimator for the normal interquartile range?