[PYTHON] A commentary on unbiasedness and consistency, for those who struggle with them

Now, on to the main text.

On Twitter I saw something like, "Unbiased estimators are hard... and there are also consistent estimators, but what's the difference?" When I started studying statistics, I had the same question for a long time, so it brought back memories [^1].

When I asked Google for "unbiased estimator, consistent estimator", it returned fewer than 40,000 hits. I am not sure my own poor article adds much to that pile, but I will write it anyway.

The flavor is a little different from the explanations on other sites, so some readers may be disappointed by this article.

So let's get started.

■ Why think about the properties of estimators in the first place?

For example, when estimating a mean, there are many options: the sample mean, the median, the trimmed mean, and so on. In general, when you want to estimate the value of some parameter (such as a mean or a variance), there are many candidate estimators.

So which of these options should you use? In other words: what makes an estimator good?

Unbiasedness and consistency are two of the criteria used when thinking about this question. [^2]

■ Let's look at the definitions...

By the time you reached this article you have probably already seen the definitions of consistency and unbiasedness, and may feel you have had enough of definitions. Still, let me state them here, partly as a memo for myself...

◆ Consistency

There are actually several kinds of consistency; here I will introduce weakly and strongly consistent estimators. For now it is fine if you don't know the difference; just keep in mind that both exist.

・ Weakly consistent estimator

If the sequence of estimators $\{T_n\}$ satisfies, for every $\varepsilon > 0$ and every $\theta \in \Theta$,

\lim_{n\to\infty}P\,(\,|\,T_n-\theta\,|\geq\varepsilon\,)=0

then $T_n$ is called a weakly consistent estimator of $\theta$.
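To make the definition concrete, here is a minimal sketch (my own illustration, not part of the original argument) that estimates $P(|T_n-\theta|\geq\varepsilon)$ by simulation, using the sample mean of $N(10, 5^2)$ draws as $T_n$; the probability should shrink toward $0$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

theta, eps, trials = 10, 0.5, 1000

for n in [10, 100, 1000, 10000]:
    # draw `trials` independent datasets of size n; T_n = sample mean of each
    T = rng.normal(theta, 5, size=(trials, n)).mean(axis=1)
    # Monte Carlo estimate of P(|T_n - theta| >= eps); should shrink toward 0
    print(n, np.mean(np.abs(T - theta) >= eps))
```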

・ Strongly consistent estimator

If the sequence of estimators $\{T_n\}$ satisfies, for every $\theta \in \Theta$,

P\,(\,\lim_{n\to\infty}T_n=\theta\,)=1

then $T_n$ is called a strongly consistent estimator of $\theta$.
If you are curious about the difference between the two, you should read **Section 4.2, "The concept of stochastic convergence and its strength" (p. 125), of "Probability theory for statistics, beyond"**.
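Strong consistency is about individual sample paths: along (almost) every single sequence of data, $T_n$ itself converges. As a minimal sketch of that idea (again, my own illustration), the running mean of a single stream of $N(10, 5^2)$ draws settles down to $10$:

```python
import numpy as np

rng = np.random.default_rng(1)

# one single path: T_n = mean of the first n observations of one data stream
X = rng.normal(10, 5, size=100000)
T = np.cumsum(X) / np.arange(1, len(X) + 1)

for n in [10, 100, 1000, 10000, 100000]:
    print(n, T[n - 1])  # T_n along this one realization approaches 10
```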

◆ Unbiasedness

Unbiased literally means "not biased", but what exactly is the "bias" here? In statistics it is defined as follows.

・ Bias

Bias measures **how far the mean of the estimator is from the true value**. Written as a formula:

if $T(X)$ is an estimator of the parameter $\theta$, the bias is defined as

E\,[\,T(X)\,]-\theta

In words: the expected value of the estimator minus the true value.

Several words are used for this quantity, but here I will stick with "bias", so that it matches "unbiased".

・ Unbiasedness

As above, the bias is defined as $E\,[\,T(X)\,]-\theta$. Being unbiased means having no bias, that is,

E\,[\,T(X)\,]-\theta=0

Moving $\theta$ to the right-hand side,

E\,[\,T(X)\,]=\theta

An estimator $T(X)$ that satisfies this is called an unbiased estimator of the parameter $\theta$.
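As a quick illustration (a minimal sketch I added, using the sample mean as $T(X)$, which is known to be unbiased for the population mean): averaging many independent realizations of the estimator approximates $E[T(X)]$, and the result should land near $\theta$.

```python
import numpy as np

rng = np.random.default_rng(2)

theta, n, repeats = 10, 20, 100000

# draw `repeats` datasets of size n and compute the sample mean of each
T = rng.normal(theta, 5, size=(repeats, n)).mean(axis=1)

# Monte Carlo estimate of E[T(X)]; the bias E[T(X)] - theta should be ~0
print(T.mean(), T.mean() - theta)
```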

■ If you are short on time, start here.

(If you prefer the details, please read from the definitions above.)

Probably the most familiar example of an unbiased estimator is the unbiased variance, and I suspect you have already computed the expected value of the sample variance by hand at some point.

Here, we will first confirm that the sample variance is biased by simulation, and then confirm that the unbiased variance is unbiased by simulation.

◆ Sample variance

The sample variance is defined by

\frac{1}{N} \sum_{i=1}^{N} (X_i-\bar{X})^{\,2}

As you know, the sample variance is not an unbiased estimator. In other words, estimating the variance with the sample variance is, on average, biased.
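(A small sanity check I added: the formula above is exactly what `np.var` computes with `ddof=0`, which is the function used in the simulations below.)

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(10, 5, size=50)

by_hand = np.mean((X - X.mean()) ** 2)  # (1/N) * sum of (X_i - Xbar)^2
print(by_hand, np.var(X, ddof=0))       # identical values
```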

Let's experience this.

★ Experiencing the bias of the sample variance

Suppose we draw $N$ observations independently from a normal distribution with true mean $10$ and true variance $25$:

X_1,X_2,\ldots,X_N \stackrel{\text{i.i.d.}}{\sim} N(10,5^2)

We estimate the variance from these $N$ observations using the sample variance. Then, for each $N$, we repeat "draw data → compute the sample variance" 10000 times, and take the mean of those 10000 sample variances as the estimated **expected value of the sample variance** for that $N$.

If that description is hard to follow, this figure may help:

(Figure: 図112.png — schematic of the simulation procedure)

And if the meaning of the balloon in that figure is unclear, see this one:

(Figure: 図1.png — for each N, repeat "draw N observations → compute the sample variance" K times and average)

The $K$ in these figures is 10000 this time.

Written as code, it looks like this. If you don't care about the code, just look at the graph below it.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

M_var0 = 200      # maximum sample size N
K_var0 = 10000    # number of repetitions per N
mu_var0 = 10      # true mean
std_var0 = 5      # true standard deviation
EV_var0 = []

for N in range(1, M_var0):
    V_var0 = []
    for i in range(K_var0):
        X = np.random.normal(mu_var0, std_var0, N)
        V_var0.append(np.var(X, ddof=0))  # ddof=0: sample variance
                                          # ddof=1: unbiased variance
    EV_var0.append(np.mean(V_var0))

EV_var00 = np.array(EV_var0)
plt.figure(figsize=(10, 3.5))
plt.axhline(25, ls="--", color="red", label="True")
plt.plot(EV_var00, color="blue", linewidth=0.5, label="Mean of Estimator")
plt.xlabel("N")
plt.ylabel("Variance")
plt.legend(loc="lower right")
plt.show()
```

(Figure: 標本分散.png — mean of the sample variance for each N, with the true variance 25 as a dashed line)

The horizontal axis is $N$ and the vertical axis is the variance.

The mean of the sample variance (blue line) appears to be smaller than the true variance $25$ (red line). It is hard to see near the true value, so let's zoom in a little.

```python
plt.figure(figsize=(10, 3.5))
plt.axhline(25, ls="--", color="red", label="True")
plt.plot(EV_var00[10:], color="blue", linewidth=0.5, label="Mean of Estimator")
plt.xlabel("N")
plt.ylabel("Variance")
plt.legend(loc="lower right")
plt.show()
```

(I am only plotting from $N = 10$ onward, so the horizontal-axis numbers are shifted; I pasted the tick labels back on by hand afterwards. Sorry for the hack.)

(Figure: 標本分散10から.png — mean of the sample variance, zoomed in, from N = 10)

As expected, the mean of the sample variance is smaller than the true variance.

This is the so-called bias.

In other words: fix the number of observations $N$, draw data over and over, compute the sample variance each time, and average the results. That average is not equal to the true variance (it is smaller), and that discrepancy is the bias.

In fact, let the true variance of $X$ be $\sigma^2$ and compute the expected value of the sample variance:

\begin{eqnarray}
E\biggl[\frac{1}{N} \sum_{i=1}^{N} (X_i-\bar{X})^{\,2}\biggr] &=& \frac{N-1}{N} \sigma^2\\
\\
&=& \sigma^2-\frac{1}{N}\sigma^2
\end{eqnarray}

Therefore, the sample variance is theoretically biased by $-\frac{1}{N}\sigma^2$ relative to the true variance $\sigma^2$.
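For reference, here is one standard route to the first equality (a sketch I added; it uses the identity $\frac{1}{N}\sum_i(X_i-\bar{X})^2 = \frac{1}{N}\sum_i(X_i-\mu)^2-(\bar{X}-\mu)^2$ together with $\mathrm{Var}(\bar{X})=\sigma^2/N$):

\begin{eqnarray}
E\biggl[\frac{1}{N}\sum_{i=1}^{N}(X_i-\bar{X})^{\,2}\biggr] &=& E\biggl[\frac{1}{N}\sum_{i=1}^{N}(X_i-\mu)^{\,2}\biggr]-E\bigl[(\bar{X}-\mu)^{\,2}\bigr]\\
&=& \sigma^2-\frac{\sigma^2}{N}\\
&=& \frac{N-1}{N}\sigma^2
\end{eqnarray}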

Let's compare the bias of the sample variance obtained by the simulation with the theoretical bias obtained by the calculation.

```python
Bias_var0 = EV_var00 - 25   # simulated bias: mean of estimates - true variance
n = np.arange(1, M_var0)
True_bias = -25 / n         # theoretical bias: -sigma^2 / N

plt.figure(figsize=(10, 3.5))
plt.plot(Bias_var0[10:], color="blue", linewidth=0.5, label="Bias of Estimator")
plt.plot(True_bias[10:], color="red", linewidth=0.5, label="True Bias")
plt.axhline(0, ls="--", color="red")
plt.xlabel("N")
plt.ylabel("Bias")
plt.legend(loc="lower right")
plt.show()
```

(Since it was hard to see, this plot starts from $N = 10$.)

(Figure: 標本分散のバイアス.png — simulated vs. theoretical bias of the sample variance)

The blue line is the simulated bias, and the red solid line is the theoretical bias.

The simulated bias tracks the theoretical bias almost exactly.

As an aside, as the calculation above shows, the bias of the sample variance depends on $N$: as $N$ grows, the bias approaches $0$.
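(A tiny numerical sketch I added: the theoretical bias $-\sigma^2/N$ for a few values of $N$, with $\sigma^2 = 25$.)

```python
for N in [10, 100, 1000, 10000]:
    print(N, -25 / N)  # theoretical bias -sigma^2 / N shrinks toward 0
```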


★ Is the unbiased variance really unbiased?

Now, having seen all this, you may be thinking:

"So if I use the unbiased variance, my estimate won't be biased?"

In other words:

"If I draw data over and over, compute the unbiased variance each time, and average the results, will that equal the true variance?"

Exactly. Let's run the same experiment we did for the sample variance, this time with the unbiased variance.

```python
np.random.seed(42)

M_var1 = 200
K_var1 = 10000
mu_var1 = 10
std_var1 = 5
EV_var1 = []

for N in range(2, M_var1):  # ddof=1 needs N >= 2
    V_var1 = []
    for i in range(K_var1):
        X = np.random.normal(mu_var1, std_var1, N)
        V_var1.append(np.var(X, ddof=1))  # ddof=1: unbiased variance
    EV_var1.append(np.mean(V_var1))

plt.figure(figsize=(10, 3.5))
plt.plot(EV_var1, color="blue", linewidth=0.5, label="Mean of Estimator")
plt.axhline(25, ls="--", color="red", label="True")
plt.xlabel("N")
plt.ylabel("Variance")
plt.legend(loc="lower right")
plt.show()
```

(Figure: 不偏分散.png — mean of the unbiased variance for each N)

It may be hard to see because the vertical-axis range is narrow, but the mean of the unbiased variance stays close to the true variance regardless of the value of $N$.

Let's overlay the "mean of the sample variance" and the "mean of the unbiased variance" to see that the unbiased variance really is unbiased.

```python
plt.figure(figsize=(10, 3.5))
plt.plot(EV_var1[10:], color="blue", linewidth=0.5, label="Unbiased Variance")
plt.plot(EV_var00[10:], color="green", linewidth=0.5, label="Sample Variance")
plt.axhline(25, ls="--", color="red", label="True")
plt.xlabel("N")
plt.ylabel("Variance")
plt.legend(loc="lower right")
plt.show()
```

(Since it was hard to see, this plot also starts from $N = 10$.)

(Figure: 標本分散と不偏分散の比較.png — mean of the sample variance vs. mean of the unbiased variance)

The green line is the mean of the sample variance and the blue line is the mean of the unbiased variance. You can see that the unbiased variance is far less biased than the sample variance.

At the risk of being repetitive: the picture is that we fix the number of observations $N$, estimate the variance over and over, and the average of the resulting estimates comes out equal to the true value.


★ Consistency

I hope bias now makes some sense. So how is consistency different?

Consistency is not about averages over repetitions: it is simply about a single estimate matching the true value as $N$ is taken to infinity. So for the experiment we only need to keep increasing $N$.

The bottom line is that the sample variance is consistent, which we will confirm by simulation. (For a proof, see the links in the references.)

For consistency we watch the behavior as the data set grows: estimate with $N=1$ → estimate with $N=2$ → ⋯ (estimating only once for each $N$).

```python
np.random.seed(42)

M_consistent = 10000
mu_consistent = 10
std_consistent = 5
V_consistent = []

for N in range(1, M_consistent):
    X_consistent = np.random.normal(mu_consistent, std_consistent, N)
    V_consistent.append(np.var(X_consistent, ddof=0))

plt.figure(figsize=(10, 3.5))
plt.plot(V_consistent, color="blue", linewidth=0.3, label="Estimator")
plt.axhline(25, ls="--", color="red", label="True")
plt.xlabel("N")
plt.ylabel("Variance")
plt.legend(loc="lower right")
plt.show()
```

(Figure: 標本分散の一致性.png — a single run of the sample variance as N grows)

You can see that as $N$ grows, the estimate approaches the true variance.

The unbiased variance is almost the same.

Since the only difference is the divisor ($N$ versus $N-1$), you can see that the difference becomes negligible once $N$ is reasonably large.
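(A quick numerical sketch I added to check this claim: for the same sample, the unbiased variance is exactly $N/(N-1)$ times the sample variance, and this factor tends to $1$.)

```python
import numpy as np

rng = np.random.default_rng(4)

for N in [2, 10, 100, 10000]:
    X = rng.normal(10, 5, size=N)
    # the ratio of unbiased to sample variance is exactly N / (N - 1)
    print(N, np.var(X, ddof=1) / np.var(X, ddof=0), N / (N - 1))
```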

```python
np.random.seed(42)

M_consistent1 = 10000
mu_consistent1 = 10
std_consistent1 = 5
V_consistent1 = []

for N in range(2, M_consistent1):  # ddof=1 needs N >= 2
    X_consistent1 = np.random.normal(mu_consistent1, std_consistent1, N)
    V_consistent1.append(np.var(X_consistent1, ddof=1))

plt.figure(figsize=(10, 3.5))
plt.plot(V_consistent1, color="blue", linewidth=0.3, label="Estimator")
plt.axhline(25, ls="--", color="red", label="True")
plt.xlabel("N")
plt.ylabel("Variance")
plt.legend(loc="lower right")
plt.show()
```

(Figure: 不偏分散の一致性.png — a single run of the unbiased variance as N grows)


■ Summary

I am quite tired even though I didn't do all that much. The second half may look a bit rushed, but concise is better... right...?

Roughly speaking,

**Unbiasedness**: with the number of observations fixed, if you draw data, estimate, and repeat over and over, does the average of the estimates equal the true value?

**Consistency**: in a single estimation, if you make the number of observations extremely large, does the estimate equal the true value?

Something like that.


★ References ★

[1] Noda, Miyaoka: Basics of Mathematical Statistics (1992)
[2] Shimizu: Probability theory for statistics, beyond (2019)
[3] Show the consistency of the unbiased sample variance (link)
[4] Unbiased variance (link)
[5] The sample variance and the unbiased estimator are consistent estimators (link)

[^1]: "So, do you understand it now?" — if you asked me that, I couldn't stop sweating. I really need to study asymptotic theory...

[^2]: Of course, there are many other criteria besides these two.
