1. Statistics learned with Python 1-3: Calculating various statistics (statistics)

statistics is a package in the Python standard library for statistical calculations. We will use it here to compute various statistics. (As a general rule, write the code and check the results on Google Colaboratory.)

**⑴ Import the library used for numerical calculation**

Normally you would install a package on Colaboratory with the `!pip install xxxx` command. Since statistics is part of the Python standard library, installing it is not strictly necessary; importing it is enough.

!pip install statistics
import statistics as stat  # import statistics

**⑵ Prepare data**

data = [12, 3, 5, 2, 6, 7, 9, 6, 4, 11]

**⑶ Calculate the population standard deviation and the unbiased standard deviation**

stat.pstdev(data)

The name of the pstdev function in statistics comes from **p**opulation **st**andard **dev**iation; that is, it computes the **population standard deviation**.
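As a minimal sketch of what pstdev computes (assuming the data list from step ⑵), we can reproduce the population standard deviation from its definition and confirm that the two values agree:

```python
import math
import statistics as stat

data = [12, 3, 5, 2, 6, 7, 9, 6, 4, 11]

mu = sum(data) / len(data)                               # mean of the data
pop_var = sum((x - mu) ** 2 for x in data) / len(data)   # divide by N (population variance)

print(math.sqrt(pop_var))  # population standard deviation by hand, ~3.1385
print(stat.pstdev(data))   # the same value from the library
```

Both lines print the same value, since pstdev divides the sum of squared deviations by N.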

Then calculate the unbiased standard deviation.

stat.stdev(data)

The name of the stdev function likewise comes from **st**andard **dev**iation; in statistics, stdev computes the **unbiased standard deviation** (the sample standard deviation).

**Variance**

The standard deviation is the square root of the variance. The **variance** is an index of how much the data deviate from the mean, and it is written as follows.

σ^2 = \frac{1}{N} \sum_{i=1}^{N} (x_{i} - μ)^2

Here $N$ is the total number of data points. The term $(x_{i} - μ)$, the $i$-th value of the data $x$ minus the mean $μ$, is called the **deviation**. The squared deviation $(x_{i} - μ)^2$ is summed over all data from the $i = 1$st to the $N$th; this is what $\sum_{i=1}^{N}$ means, and the result is called the **sum of squared deviations**. The variance is the sum of squared deviations multiplied by $\frac{1}{N}$, that is, divided by the number of data points $N$.

Incidentally, if you regard the deviation as the distance between a data point and the mean, the variance can be seen as the average of those squared distances over the whole data set. It shows how far the data as a whole are from the mean, in other words how much they vary. Strictly speaking, this quantity is called the **sample variance**. The sample variance is known to be biased, and it is common to use the **unbiased variance**, which corrects this shortcoming.
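The steps in the formula can be traced directly in Python. This is a sketch using the data list from step ⑵: compute the mean, the deviations, the sum of squared deviations, and finally divide by $N$:

```python
data = [12, 3, 5, 2, 6, 7, 9, 6, 4, 11]

N = len(data)
mu = sum(data) / N                      # mean
deviations = [x - mu for x in data]     # x_i - mu for each data point
ssd = sum(d ** 2 for d in deviations)   # sum of squared deviations
sample_variance = ssd / N               # divide by N

print(sample_variance)  # 9.85
```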

**⑷ Calculate the unbiased variance**

stat.variance(data)


Just in case, let's take the square root of the unbiased variance and check it.

import numpy as np  # import NumPy
data_2 = stat.variance(data)  # store the unbiased variance in the variable data_2
np.sqrt(data_2)  # take the square root of data_2


The square root of the unbiased variance indeed matches the unbiased standard deviation.

**Unbiased variance**

The formula for calculating the unbiased variance is shown below.

σ^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_{i} - μ)^2

The difference from the sample variance formula above is that $\frac{1}{N}$ becomes $\frac{1}{N-1}$. Because the denominator is smaller by 1, the unbiased variance is slightly larger than the sample variance. Why is this done? Recall that the mean must be computed beforehand in order to compute the variance. Ideally we would use the population mean, but we do not know it, so we have no choice but to use the sample mean. Since the sample mean is only the mean of a part of the population, it is natural to expect it to differ slightly from the true mean of the population (the population mean).

Let us consider the mechanism behind this deviation. Because the sample mean is used, the sample variance always comes out smaller than the true variance it should be. To correct this bias, we use the **unbiased variance**, and the **unbiased standard deviation**, which is its square root.
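As a rough check of the $N-1$ correction (again assuming the data list from step ⑵), we can compare the hand-computed sample and unbiased variances with the value from stat.variance:

```python
import statistics as stat

data = [12, 3, 5, 2, 6, 7, 9, 6, 4, 11]

N = len(data)
mu = sum(data) / N
ssd = sum((x - mu) ** 2 for x in data)  # sum of squared deviations

print(ssd / N)              # sample variance (divide by N)
print(ssd / (N - 1))        # unbiased variance (divide by N-1): slightly larger
print(stat.variance(data))  # matches the N-1 version
```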
