[Algorithm x Python] Calculation of basic statistics Part3 (range, variance, standard deviation, coefficient of variation)

This series covers algorithms and Python. This time, I show how to calculate each statistic not only with a ready-made function but also without one.

Table of contents

  0. Find the range
  1. Find the variance
     1-0. Find the population variance
     1-1. Find the unbiased variance
  2. Find the standard deviation
     2-0. Find the population standard deviation
     2-1. Find the unbiased standard deviation
  3. Find the coefficient of variation
  Finally

0. Find the range

◯ The range is the simplest **quantity that represents the spread of data**. It is easily calculated as (maximum value) - (minimum value). However, if the data contains **extreme values**, the range can become too wide to characterize the data. (Variance solves this problem.)

How to find the range using the maximum and minimum values

◯ Consider the range using the math test scores of a class of 30 students.

test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
#Find the maximum and minimum values
max_score = max(test_score)
min_score = min(test_score)
#Find the range
score_range = max_score-min_score

print('max_score = ',max_score)
print('min_score = ',min_score)
print('score_range = ',score_range)
max_score =  97
min_score =  9
score_range =  88
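
◯ To illustrate how a single extreme value stretches the range, here is a minimal sketch. The small data set, the outlier 100, and the helper name value_range are made up for illustration and are not part of the test data above.

```python
# Hypothetical data: five similar scores, then the same scores plus one outlier
scores = [48, 50, 52, 49, 51]
scores_with_outlier = scores + [100]

def value_range(data):
    """Return (maximum - minimum) of a non-empty sequence."""
    return max(data) - min(data)

range_plain = value_range(scores)
range_outlier = value_range(scores_with_outlier)

print('range without outlier =', range_plain)
print('range with outlier    =', range_outlier)
```

Only one of the six values changed, yet the range is determined entirely by it.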

1. Find the variance

◯ Variance is used as a **statistic showing data variability**. Variance is not as sensitive to extreme values as the range, because the contribution of any one element is small. However, since the deviations are squared in the calculation, **the unit is different from that of the original data**. (The standard deviation solves this problem.)

◯ If you want to know the variability of some data, the **population variance** is what you need, and **it is the best measure**. However, in many cases it is not possible to know all the elements of the population, so **the population variance cannot be obtained directly**. The unbiased variance is used in such cases.

| Type of variance | Intended use | Feature |
|---|---|---|
| Population variance | Find the variance of the population | Available only if all the elements of the population are known |
| Sample variance | Find the variance of the sample | Not an unbiased estimate of the population variance |
| Unbiased variance | Estimate the population variance from a sample | An unbiased estimate of the population variance |

Reference: Why the variance is squared


Formula for finding the population variance

S^2 = \frac{1}{n} [(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2 ] = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2

Equation for unbiased variance

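U^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i-\bar{x})^2 = \frac{m}{m-1}s^2

Here m is the sample size, \bar{x} is the sample mean, and s^2 is the sample variance (the sum of squared deviations divided by m).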


◯ The sample size is m, but the degrees of freedom of the unbiased variance are m-1. This is because the m deviations in the equation for unbiased variance are not completely independent of each other: **one of the observations can be obtained from the other m-1 observations and the sample mean**.
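
◯ The dependence among the deviations can be checked numerically. The following is only a sketch with a made-up sample of three values: the deviations from the sample mean sum to zero, so the last observation can be recovered from the other m-1 observations and the mean.

```python
import statistics

# Hypothetical sample of size m = 3
sample = [3, 5, 10]
m = len(sample)
sample_mean = statistics.mean(sample)

# Deviations from the sample mean always sum to zero
deviations = [x - sample_mean for x in sample]
deviation_sum = sum(deviations)

# So the last observation is fixed by the other m-1 values and the mean
recovered_last = m * sample_mean - sum(sample[:-1])

print('deviations =', deviations)
print('sum of deviations =', deviation_sum)
print('recovered last value =', recovered_last)
```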

◯ See also: Mathematical explanation that the degree of freedom is m-1 / Relationship between sample variance and unbiased variance

◯ When comparing the formulas for the population variance and the unbiased variance, it may seem strange that only the denominator differs. However, in the limit where **the population size n is considerably large and the sample size m is as large as n**, **the unbiased variance almost matches the population variance**. It has been proven to be **a good estimate of the population variance**.
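
◯ The effect of the m-1 denominator can be seen on a toy example. The sketch below assumes a made-up population of four values and samples of size 2 drawn with replacement (so that the draws are independent). It enumerates every possible sample and averages the two kinds of variance.

```python
import itertools
import statistics

# Hypothetical tiny population
population = [1, 2, 3, 4]
population_variance = statistics.pvariance(population)

# Every possible sample of size m = 2, drawn with replacement
samples = list(itertools.product(population, repeat=2))

# Average of the unbiased variance (divide by m-1) over all samples
mean_unbiased = statistics.mean(statistics.variance(s) for s in samples)
# Average of the sample variance (divide by m) over all samples
mean_biased = statistics.mean(statistics.pvariance(s) for s in samples)

print('population_variance =', population_variance)
print('mean of unbiased variances =', mean_unbiased)
print('mean of sample variances   =', mean_biased)
```

Averaged over all samples, dividing by m-1 reproduces the population variance, while dividing by m underestimates it.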


1-0. Find the population variance

◯ Population variance is the variance of the population. It is used when all the elements of the population are known.

◯ The population variance is the sum of the squared deviations of each data point from the mean, divided by the number of data points. In other words, **the more the elements of the data deviate from the mean, the greater the population variance**.


How to find the population variance using the pvariance() function

◯ Calculate the population variance, treating the math test scores of one class as the population. In other words, find the degree of variability in the test scores for this class.

import statistics
#List of test scores
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]

population_variance = statistics.pvariance(test_score)
print('population_variance =',population_variance)
population_variance = 638.6455555555556

How to find the population variance using deviation

Formula for finding the population variance

S^2 = \frac{1}{n} [(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2 ] = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2

◯ **(data value) - (mean)** is called the **deviation**. In the above formula, the deviations are calculated, their squares are summed, and the sum is divided by the number of elements.

import statistics

test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
#Find the number of elements in the list
n = len(test_score)
#Find the average score of the test
score_mean  = statistics.mean(test_score)
#Make a list of squares of deviations
squared_deviation_list = [(score-score_mean)**2 for score in test_score]
#Population variance = sum of squared_deviation_list / number of elements
population_variance = sum(squared_deviation_list)/n

print('population_variance = ',population_variance)
population_variance =  638.6455555555556

How to find the population variance using the mean

◯ The population variance can be rewritten in the following form. We will use this to find the population variance.

S^2 = \frac{1}{n} (x_1^2+x_2^2+...+x_n^2)-\bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2-\bar{x}^2 
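
◯ This form follows by expanding the square in the definition and using \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}:

\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - 2\bar{x}^2 + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2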

① Find the number of elements
② Find the average value
③ Square the elements and list them
④ Find the sum of the list of squared elements and divide it by the number of elements
⑤ Subtract the square of the average value from the result of ④

import statistics

#List of test scores
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
#Find the number of elements
n = len(test_score)
#Find the mean of the test scores
score_mean = statistics.mean(test_score)
#Square each element of the list
squared_test_score = [score**2 for score in test_score]
#(sum of the squared elements / number of elements) - (square of the mean)
population_variance = sum(squared_test_score)/n - (score_mean)**2
print('score_mean = ',score_mean)
print('population_variance = ',population_variance)
score_mean =  51.766666666666666
population_variance =  638.6455555555558
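
◯ The result of this form differs from the deviation-based result only in the last digits (...558 versus ...556), which is floating-point rounding rather than a real difference. As a sketch, the comparison can be repeated in exact rational arithmetic with fractions.Fraction from the standard library. The variable names below are chosen for illustration.

```python
from fractions import Fraction

test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
n = len(test_score)

# Exact mean as a rational number
mean_exact = Fraction(sum(test_score), n)

# Definition: mean of the squared deviations
variance_by_deviation = sum((x - mean_exact) ** 2 for x in test_score) / n
# Rewritten form: mean of the squares minus the square of the mean
variance_by_mean = Fraction(sum(x ** 2 for x in test_score), n) - mean_exact ** 2

print('equal =', variance_by_deviation == variance_by_mean)
print('exact variance =', variance_by_deviation)
print('as float =', float(variance_by_deviation))
```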

1-1. Find unbiased variance

◯ Unbiased variance is used when estimating the variance of the population from a sample. It lets you grasp the nature of the population without knowing all of its elements.

Equation for unbiased variance

U^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i-\bar{x})^2 = \frac{m}{m-1}s^2

How to find unbiased variance using the variance() function

import statistics

#Sample of test scores
test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63, 61, 22, 27, 61, 22, 61, 73, 61, 27, 73]

#Find the unbiased variance (= unbiased_distribution) from the sample test_score_sample
unbiased_distribution = statistics.variance(test_score_sample)
print('unbiased_distribution =',unbiased_distribution)
unbiased_distribution = 434.2

How to find unbiased variance using the pvariance() function

◯ This approach can be used only when the mean of the population is known.

◯ If you do not know the population mean and run this function on the sample data alone, you obtain the sample variance (divisor m, the sample size). That is, the result is not an unbiased estimate of the population variance.

import statistics

#Population average
score_mean = 51.766666666666666

#Sample of test scores
test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63, 61, 22, 27, 61, 22, 61, 73, 61, 27, 73]

#Find the unbiased variance (= unbiased_distribution) from the sample test_score_sample
#The known population mean is passed as the second argument
unbiased_distribution = statistics.pvariance(test_score_sample,score_mean)
print('unbiased_distribution =',unbiased_distribution)
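unbiased_distribution = 412.49
#Note: 412.49 equals the mean squared deviation about the sample mean (46.1).
#About the supplied population mean (51.77) it would be about 444.60,
#so the printed value may depend on how the Python version in use handles the second argument.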

How to find unbiased variance using deviation

◯ Divide the sum of the squared deviations by (sample size - 1), as shown in the formula below, to obtain the unbiased variance.

Equation for unbiased variance

U^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i-\bar{x})^2 = \frac{m}{m-1}s^2

① Find the sample size
② Find the mean, which is needed for the deviations
③ Make a list whose elements are the **squared deviations**
④ Divide the sum of this list by m-1 (= sample size - 1)

import statistics

#Sample of test scores
test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63, 61, 22, 27, 61, 22, 61, 73, 61, 27, 73]

#Find the sample size
m = len(test_score_sample)
#Find the sample mean, which is needed for the deviations
score_mean = statistics.mean(test_score_sample)
#Make a list of squared deviations (deviation = score - score_mean)
squared_deviation_list = [(score-score_mean)**2 for score in test_score_sample]

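#Unbiased variance = sum of squared_deviation_list / (sample size - 1)
#The parentheses are required: /m-1 would divide by m and then subtract 1
unbiased_distribution = sum(squared_deviation_list)/(m-1)

print('unbiased_distribution = ',unbiased_distribution)
#Agrees with statistics.variance() (434.2), up to floating-point rounding in the last digits
unbiased_distribution =  434.2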
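
◯ Because / binds more tightly than -, the divisor m-1 has to be written in parentheses. As a sanity check, the sketch below compares the deviation-based calculation with statistics.variance() and also shows what the expression evaluates to without the parentheses. The names with_parentheses and without_parentheses are chosen for illustration.

```python
import math
import statistics

test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63, 61, 22, 27, 61, 22, 61, 73, 61, 27, 73]

m = len(test_score_sample)
score_mean = statistics.mean(test_score_sample)
squared_deviation_sum = sum((score - score_mean) ** 2 for score in test_score_sample)

# Correct: divide by (m - 1)
with_parentheses = squared_deviation_sum / (m - 1)
# Incorrect: divides by m first, then subtracts 1
without_parentheses = squared_deviation_sum / m - 1

library_value = statistics.variance(test_score_sample)
print('matches statistics.variance() =', math.isclose(with_parentheses, library_value))
print('with parentheses    =', with_parentheses)
print('without parentheses =', without_parentheses)
```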

2. Find the standard deviation

◯ Standard deviation includes population standard deviation and unbiased standard deviation.

◯ The population standard deviation is the square root of the population variance and aligns the unit with the data to make it easier to understand the spread of the data.

◯ The unbiased standard deviation is an unbiased estimator of the population standard deviation.

2-0. Find the population standard deviation

◯ The population standard deviation is the square root of the population variance.

Find using the pstdev() function

import statistics

#Test score(population)
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]

population_standard_deviation = statistics.pstdev(test_score)
print('population_standard_deviation =',population_standard_deviation)
population_standard_deviation = 25.27143754430198

Find using population variance

◯ The square root of the population variance is the population standard deviation, so use that.

① Find the population variance with the pvariance() function
② Take the square root of the population variance

import statistics
import sympy

test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
#Find the population variance
population_variance = statistics.pvariance(test_score)
#Take the square root of the population variance
#sympy.root(x, n) returns the n-th root of x
population_standard_deviation = sympy.root(population_variance,2)

print('population_variance = ',population_variance)
print('population_standard_deviation = ',population_standard_deviation)
population_variance =  638.6455555555556
population_standard_deviation =  25.2714375443020
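
◯ If SymPy is not installed, the same square root can be taken with math.sqrt from the standard library. This is only an alternative sketch, not the method used in the listing above. It returns an ordinary Python float.

```python
import math
import statistics

test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]

# Find the population variance
population_variance = statistics.pvariance(test_score)
# Take the square root with the standard library
population_standard_deviation = math.sqrt(population_variance)

print('population_standard_deviation = ', population_standard_deviation)
```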

Calculated using the mean of the squares of the deviations

◯ The population standard deviation is equivalent to **the root mean square of the deviations**.

⓪ Find the mean of the data, which is needed for the deviations
① Make a list whose elements are the **deviations**
② Make a list whose elements are the **squared deviations** (square each element)
③ Find the average value of this list
④ Take the square root of the average value

◯ **The root mean square is calculated by squaring the numbers you want to average, adding them up, dividing by the number of elements n, and then taking the square root of the result**. It is used, for example, when you want to measure how far actual arrival times deviate from a transportation timetable.

◯ Arriving 2 minutes late and arriving 2 minutes early are the same size of time lag. However, with the **arithmetic mean**, positive and negative **errors** cancel each other out. So **the errors are squared to eliminate the minus signs** before averaging.
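
◯ The timetable example can be sketched as follows. The arrival errors are made-up numbers (in minutes, positive for late and negative for early). The arithmetic mean cancels to zero, while the root mean square still reports the typical size of the error.

```python
import math
import statistics

# Hypothetical arrival errors in minutes (+ is late, - is early)
arrival_errors = [2, -2, 1, -1]

# Positive and negative errors cancel in the arithmetic mean
arithmetic_mean = statistics.mean(arrival_errors)

# Square first, then average, then take the square root
mean_square = statistics.mean(e ** 2 for e in arrival_errors)
root_mean_square = math.sqrt(mean_square)

print('arithmetic mean  =', arithmetic_mean)
print('mean square      =', mean_square)
print('root mean square =', root_mean_square)
```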


import sympy
import statistics

#Test score data(population)
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
#Find the mean of the population, which is needed for the deviations
score_mean = statistics.mean(test_score)

#Make a list of deviations (difference between each score and the mean)
deviation_list = [score-score_mean for score in test_score]

#Square each element of the list to make a new list squared_deviation_list
squared_deviation_list = [i**2 for i in deviation_list]
#Find the average value of squared_deviation_list (sum / number of elements)
mean_square = sum(squared_deviation_list)/len(squared_deviation_list)
#Take the square root of mean_square
root_mean_square = sympy.root(mean_square,2)

print('RMS = population standard deviation = ',root_mean_square)
#Root mean square of the deviations = population standard deviation = 25.2714375443020
RMS = population standard deviation = 25.2714375443020

2-1. Find the unbiased standard deviation

◯ The unbiased standard deviation is **an unbiased estimator of the population standard deviation**.

◯ It is not the square root of the unbiased variance itself, but **the square root of the unbiased variance with a correction applied**.

◯ Because the unbiased variance is an unbiased estimator of the population variance, it is often assumed that its square root is the unbiased standard deviation, that is, an unbiased estimator of the population standard deviation. However, **the square root of the unbiased variance is not the unbiased standard deviation**.

The unbiased standard deviation U_s, as an unbiased estimator of the population standard deviation, is the square root of the unbiased variance U^2 divided by the coefficient C_4. Here m is the sample size, and the correction assumes a normally distributed population.

U^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i-\bar{x})^2 = \frac{m}{m-1}s^2
C_4 = \sqrt{\frac{2}{m-1}} \frac{\Gamma(\frac{m}{2})}{\Gamma(\frac{m-1}{2})}

Unbiased standard deviation formula

U_s = \frac{\sqrt{U^2}}{C_4}

References:
Inflating coefficient and discount coefficient: unbiased standard deviation and control chart coefficient
Is the square root of the unbiased variance an unbiased estimator of the standard deviation?
What is unbiased standard deviation? : For those who do not understand statistical tests

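Find the square root of the unbiased variance using the stdev() function

◯ Note that statistics.stdev() returns the square root of the unbiased variance. It does not apply the C_4 correction described above, so its result is slightly smaller than the unbiased standard deviation U_s.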

import statistics

#Sample of test scores
test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63, 61, 22, 27, 61, 22, 61, 73, 61, 27, 73]

unbiased_standard_deviation = statistics.stdev(test_score_sample)
print('unbiased_standard_deviation = ',unbiased_standard_deviation)
unbiased_standard_deviation =  20.837466256721328
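
◯ The C_4 correction can be sketched with math.gamma from the standard library. The helper name c4 is hypothetical (it is not part of the statistics module), and the correction assumes a normally distributed population, which test scores may only roughly satisfy.

```python
import math
import statistics

def c4(m):
    """Correction factor C_4 for sample size m (m >= 2), assuming a normal population."""
    return math.sqrt(2 / (m - 1)) * math.gamma(m / 2) / math.gamma((m - 1) / 2)

test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63, 61, 22, 27, 61, 22, 61, 73, 61, 27, 73]
m = len(test_score_sample)

# Square root of the unbiased variance (what statistics.stdev() returns)
sqrt_unbiased_variance = statistics.stdev(test_score_sample)
# Divide by C_4 to obtain the corrected value U_s
unbiased_standard_deviation = sqrt_unbiased_variance / c4(m)

print('c4 =', c4(m))
print('sqrt of unbiased variance   =', sqrt_unbiased_variance)
print('unbiased standard deviation =', unbiased_standard_deviation)
```

Since C_4 is slightly below 1 and approaches 1 as m grows, the correction matters mostly for small samples.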

3. Find the coefficient of variation

◯ The coefficient of variation is the standard deviation divided by the average value.

◯ It is a dimensionless number (it has no unit). It is used to compare the variation of data sets with different units, or to evaluate the variation relative to the mean. The coefficient of variation is often abbreviated as CV.

Formula for calculating the Coefficient of Variation

CV = \frac{S}{\bar{x}}

How to find the coefficient of variation using the population standard deviation and mean

import statistics

#Ten weight measurements each for a human and a mouse
#The unit is kg
human_data = [75,77,75,76,78,76,75,76,77,75]
mouse_data = [0.04,0.05,0.02,0.03,0.02,0.03,0.05,0.06,0.07,0.03]

#Use the pstdev() function to find each population standard deviation
human_pstdev = statistics.pstdev(human_data)
mouse_pstdev = statistics.pstdev(mouse_data)

#Use the mean() function to find each mean
human_mean = statistics.mean(human_data)
mouse_mean = statistics.mean(mouse_data)

#Find the coefficient of variation
#(the ratio of the population standard deviation to the mean)
human_cv = human_pstdev/human_mean
mouse_cv = mouse_pstdev/mouse_mean

print('human_pstdev = ',human_pstdev)
print('mouse_pstdev = ',mouse_pstdev)
print('human_cv = ',human_cv)
print('mouse_cv = ',mouse_cv)
#The population standard deviation is larger for humans (the number representing the variation is larger)
#It expresses the variation in kg
human_pstdev =  1.0
mouse_pstdev =  0.0161245154965971

#The coefficient of variation is larger for mice (the relative variation is larger)
human_cv =  0.013157894736842105
mouse_cv =  0.40311288741492746
Finally

Thank you for reading. We would appreciate it if you could point out any mistakes or improvements. I look forward to working with you.
