[PYTHON] Carefully derive the interquartile range of the standard normal distribution from the beginning

Target

Derivation of the interquartile range of the standard normal distribution Exact calculation method of the numerical value of "standard normal distribution table" which is often attached to the appendix of the book of statistics

Calculation procedure

For the time being, if the first quartile is $ Q_1 $ and the third quartile is $ Q_3 , it is a probability density function of the standard normal distribution. $ f(x)= \frac{1}{\sqrt{2\pi}}exp\left(-\frac{x^2}{2}\right)$$ For, the following relationship holds.

\int_{-\infty}^{Q_1} f(x) \ dx=\frac{1}{4}
\int_{-\infty}^{Q_3} f(x) \ dx=\frac{3}{4}

And if you take the difference between these two formulas,


\begin{align}
\int_{-\infty}^{Q_3} f(x) \ dx-\int_{-\infty}^{Q_1} f(x) \ dx&=\int_{Q_1}^{Q_3} f(x) \ dx \\
&=\frac{1}{2}
\end{align}

Here, since it is a standard normal distribution, the mean $ \ mu = 0 $ holds. Using the constant c,

Q_1=\mu-c=-c
Q_3=\mu+c=c

Because it can be expressed as

\begin{align}
\int_{Q_1}^{Q_3} f(x) \ dx&=\int_{-c}^{c} f(x) \ dx\\ &=\frac{1}{2}
\end{align}

Will be. Here, using the error function erf $ (x) $,

\begin{align}
\int_{-c}^{c} exp\left(-\frac{x^2}{2}\right) \ dx&=\left[\frac{\pi}{2} erf\left(\frac{x}{\sqrt{2}}\right)\right]_{-c}^{c}\\ 
&= \sqrt{2\pi} erf\left(\frac{c}{\sqrt{2}}\right)
\end{align}

(Thanks to Wolfram Alpha), so the following equation holds. $erf\left(\frac{c}{\sqrt{2}}\right)=\frac{1}{2}$

I would like to find c included in the error function here, but analysis approaches such as integration by parts are completely inconvenient. This is where you get stuck in the swamp of integration. What kind of policy should you ask for?

Programmatic equations for error functions

This time we will use Numpy as a module. Fortunately, numpy already has an error function built in, so it's really easy to use. (It seems that Scipy can do the same.)

I will also visualize the function later, so I will also include matplotlib.

import numpy as np
import matplotlib.pyplot as plt

Now, from here, I would like to find $ c $ for which the following equation holds. $ erf \ left (\ frac {c} {\ sqrt {2}} \ right) = \ frac {1} {2} $

By the way, this error function $erf(x)=\frac{2}{\sqrt{\pi}}\int_{0}^{x}exp\left(-t^2\right)dt$ It is expressed in the form of. If you visualize what kind of function this error function is

fig = plt.figure(figsize = (6, 4))
ax = fig.add_subplot(111)

x=np.linspace(-2.0,2.0,2000000)
y=np.erf(x/np.sqrt(2))

ax.set_xlabel("x", fontsize = 14)
ax.set_ylabel("erf(x)", fontsize = 14)

plt.plot(x, y, ls="-", label="erf(x)", color="red")
plt.title("Error function")
plt.legend()
plt.show()

erf(x).png

It's an outline like the Sigmoid function, and it increases monotonously in a narrow sense.

So, solve c with the following approach. ① Since it is a standard normal distribution, the range of c from the standard deviation feature (USE OF STANDARD DEVIATION) is [-1,1]. Then, increase by $ 10 ^ {-6} $ according to the order of the distribution table. ② $ erf (\ frac {c} {\ sqrt {2}}) <\ frac {1} {2} While it is less than $, execute the calculation and increase the integral value, and $ erf (\ frac) {c} {\ sqrt {2}}) ≒ \ frac {1} {2} Find the last-minute value c that is $.

Code this in the same way as when programming the error function.

for c in np.arange(-1.000000,1.000000):
    while np.erf(c/np.sqrt(2)) < 1/2:
        c=c+0.000001
    break
print(c)
0.6744900000064764

When I try to visualize it,

fig = plt.figure(figsize = (6, 4))
ax = fig.add_subplot(111)

x=np.linspace(-1.0,1.0,1000000)
y=np.erf(x/np.sqrt(2))

ax.set_xlabel("c", fontsize = 14)
ax.set_ylabel("erf score", fontsize = 14)

plt.plot(x, y, ls="-", label="erf(c/√2)")
ax.axhline(0.5, ls = "-.", color = "green", label="erf score = 1/2")
plt.title("Error function")
plt.legend()
plt.show()
![erf(x)2.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/223549/d2d3d5cf-d0f5-e72e-06c8-3e9c4e61e40a.png) The value of c at the intersection of the two lines corresponds to the previous output result of 0.67449 ...

last

From the program result, $ c = 0.674490 $ ..., so the interquartile range is

\begin{align}
c-(-c)&=2c \\ 
&= 2*(0.674490...) \\ 
&≒ 1.349
\end{align}

Therefore, the interquartile range of the standard normal distribution is $ 1.349 $.

Recommended Posts