I will write about algorithms and Python. This time, I show how to compute some simple statistics both with library functions and without them.
◯ The range is the simplest **quantity that represents the spread of data**. It is easy to compute: maximum value minus minimum value. However, if the data contains **extreme values**, the range becomes so wide that it no longer characterizes the data well. (Variance addresses this problem.)
◯ Consider the range of the math test scores of a class of 30 students.
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
#Find the maximum and minimum values
max_score = max(test_score)
min_score = min(test_score)
#Find the range
score_range = max_score-min_score
print('max_score = ',max_score)
print('min_score = ',min_score)
print('score_range = ',score_range)
max_score = 97
min_score = 9
score_range = 88
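To illustrate how a single extreme value widens the range, here is a minimal sketch. The helper `value_range` and the score of 970 (imagined as a data-entry error for 97) are assumptions made for this illustration and are not part of the original data.

```python
def value_range(data):
    """Return the range (maximum minus minimum) of a list of numbers."""
    return max(data) - min(data)


# First ten scores from the class data above
scores = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63]
# Hypothetical: one extra value entered as 970 instead of 97
scores_with_outlier = scores + [970]

print('range without outlier =', value_range(scores))            # 88
print('range with outlier =', value_range(scores_with_outlier))  # 961
```

A single value moves the range from 88 to 961, even though the other ten scores are unchanged.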
◯ Variance is used as a **statistic that shows the variability of data**. Variance is less sensitive to extreme values than the range, because the contribution of any single element is small. However, since the deviations are squared in the calculation, **the unit differs from that of the original data**. (The standard deviation solves this problem.)
◯ If you want to know the variability of some data, the **population variance** is what you want, and **it is the best choice when it is available**. However, in many cases you cannot observe every element of the population, so the **population variance cannot be computed directly**. The unbiased variance is used in such cases.
Type of variance | Intended use | Feature |
---|---|---|
Population variance | Find the variance of the population | Available only if you know all the elements of the population |
Sample variance | Find the variance of the sample | Not an unbiased estimate of the population variance |
Unbiased variance | Estimate the population variance from a sample | An unbiased estimate of the population variance |
Formula for the population variance (n is the number of elements in the population)
S^2 = \frac{1}{n} [(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2] = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
Formula for the unbiased variance (m is the sample size, s^2 is the sample variance)
U^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i-\bar{x})^2 = \frac{m}{m-1}s^2
◯ The sample size is m, but the unbiased variance has m-1 degrees of freedom, because the m deviations in the formula for the unbiased variance are not completely independent of each other. **Any one of the observations can be obtained from the other m-1 observations and the sample mean**.
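The dependence between the observations can be checked with a short sketch. The five-element sample below is made up for illustration, and `Fraction` is used only so that the arithmetic is exact.

```python
from fractions import Fraction

# Hypothetical small sample, chosen only for illustration
sample = [27, 22, 73, 56, 61]
m = len(sample)
sample_mean = Fraction(sum(sample), m)

# Deviations from the sample mean always sum to zero,
# so only m-1 of them can vary freely
deviations = [x - sample_mean for x in sample]
print('sum of deviations =', sum(deviations))  # 0

# The last observation follows from the other m-1 and the sample mean
recovered_last = m * sample_mean - sum(sample[:-1])
print('recovered last observation =', recovered_last)  # 61
```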
◯ Further reading: "Mathematical explanation that the degree of freedom is m-1"; "Relationship between sample variance and unbiased variance"
◯ Comparing the formulas for the population variance and the unbiased variance, it may seem strange that only the denominator differs. However, **when the population size n is very large and the sample size m approaches n**, the unbiased variance **almost matches the population variance**. It has been proven to be **a good estimate of the population variance**.
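As a rough check of why the m-1 denominator gives a good estimate, the sketch below enumerates every possible sample of size 2 from a tiny made-up population and averages both estimators. It assumes sampling with replacement, which the original text does not specify; the population values are hypothetical and `Fraction` keeps the arithmetic exact.

```python
import itertools
import statistics
from fractions import Fraction

# Hypothetical tiny population; Fractions avoid floating-point rounding
population = [Fraction(x) for x in (2, 4, 6, 8)]
m = 2  # sample size

# Every possible sample of size m, drawn with replacement
samples = list(itertools.product(population, repeat=m))

# Average of each estimator over all possible samples
mean_unbiased = statistics.mean([statistics.variance(s) for s in samples])
mean_sample = statistics.mean([statistics.pvariance(s) for s in samples])

print('population variance =', statistics.pvariance(population))  # 5
print('mean of unbiased variances =', mean_unbiased)               # 5
print('mean of sample variances =', mean_sample)                   # 5/2
```

On average the unbiased variance reproduces the population variance, while the sample variance comes out smaller by the factor (m-1)/m.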
◯ Population variance is the variance of the population. It is used when all the elements of the population are known.
◯ The population variance is the sum of the squared deviations of each data point from the mean, divided by the number of data points. In other words, **the further the elements deviate from the mean, the greater the population variance**.
◯ Calculate the population variance, treating the math test scores of one class as the population. In other words, find how much the test scores vary within this class.
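A minimal sketch of this point, using two made-up data sets that share the same mean of 50 (both lists are hypothetical):

```python
import statistics

# Two hypothetical data sets with the same mean (50)
narrow_scores = [49, 49, 51, 51]  # every value is 1 away from the mean
wide_scores = [30, 30, 70, 70]    # every value is 20 away from the mean

print('narrow:', statistics.pvariance(narrow_scores))  # 1
print('wide:', statistics.pvariance(wide_scores))      # 400
```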
import statistics
#List of test scores
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
population_variance = statistics.pvariance(test_score)
print('population_variance =',population_variance)
population_variance = 638.6455555555556
Formula for finding the population variance
S^2 = \frac{1}{n} [(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2] = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
◯ **(data value) - (mean)** is called the **deviation**. The formula above computes each deviation, sums the squared deviations, and divides by the number of elements.
import statistics
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
#Find the number of elements in the list
n = len(test_score)
#Find the average score of the test
score_mean = statistics.mean(test_score)
#Make a list of squares of deviations
squared_deviation_list = [(score-score_mean)**2 for score in test_score]
#Population variance = sum of squared_deviation_list / number of elements
population_variance = sum(squared_deviation_list)/n
print('population_variance = ',population_variance)
population_variance = 638.6455555555556
◯ The population variance can be rewritten in the following form. We will use this form to find the population variance.
S^2 = \frac{1}{n} (x_1^2 + x_2^2 + \cdots + x_n^2) - \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2
① Find the number of elements ② Find the mean ③ Square each element and put the results in a list ④ Sum the list of squared elements and divide by the number of elements ⑤ Subtract the square of the mean from the result
#List of test scores
import statistics
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
#Find the number of elements
n = len(test_score)
#Find the average of the tests
score_mean = statistics.mean(test_score)
#Square each element of the list
squared_test_score = [score**2 for score in test_score]
#(Sum of the squared elements / number of elements) - (square of the mean)
population_variance = sum(squared_test_score)/n - (score_mean)**2
print('score_mean = ',score_mean)
print('population_variance = ',population_variance)
score_mean = 51.766666666666666
population_variance = 638.6455555555558
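For reference, the rewritten form follows from expanding the square in the definition and using the fact that the mean satisfies \frac{1}{n}\sum x_i = \bar{x}. A sketch of the derivation:

```latex
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
&= \frac{1}{n}\sum_{i=1}^{n}x_i^2 - 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i + \frac{1}{n}\cdot n\bar{x}^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}x_i^2 - 2\bar{x}^2 + \bar{x}^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2
\end{aligned}
```

The two forms are mathematically identical, but in floating-point arithmetic they can differ in the last digits, which is why this result ends in ...58 while the earlier results end in ...56.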
◯ The unbiased variance is used to estimate the variance of the population from a sample. It is convenient because it lets you grasp the nature of the population without observing all of its elements.
Equation for unbiased variance
U^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i-\bar{x})^2 = \frac{m}{m-1}s^2
import statistics
#Sample of test scores
test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63, 61, 22, 27, 61, 22, 61, 73, 61, 27, 73]
#Compute the unbiased variance (unbiased_distribution) from the sample test_score_sample
unbiased_distribution = statistics.variance(test_score_sample)
print('unbiased_distribution =',unbiased_distribution)
unbiased_distribution = 434.2
◯ statistics.pvariance() can also be applied to a sample, but only when the mean of the population is known; the population mean is passed as the second argument.
◯ If you do not know the population mean and call this function on the sample data alone, you get the sample variance with n degrees of freedom (the divisor is the sample size). That is, the result is not an unbiased estimate of the population variance.
import statistics
#Population mean
score_mean = 51.766666666666666
#Sample of test scores
test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63, 61, 22, 27, 61, 22, 61, 73, 61, 27, 73]
#Estimate the population variance (unbiased_distribution) from the sample test_score_sample
#The population mean is passed as the second argument
unbiased_distribution = statistics.pvariance(test_score_sample,score_mean)
print('unbiased_distribution =',unbiased_distribution)
unbiased_distribution = 412.49
◯ As in the formula below, divide the sum of the squared deviations by (sample size - 1) to obtain the unbiased variance.
Formula for the unbiased variance
U^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i-\bar{x})^2 = \frac{m}{m-1}s^2
① Find the sample size ② Find the mean, which is needed for the deviations ③ Make a list whose elements are the **squared deviations** ④ Divide the sum of this list by m-1 (= sample size - 1)
import statistics
#Sample of test scores
test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63, 61, 22, 27, 61, 22, 61, 73, 61, 27, 73]
#Find the sample size
m = len(test_score_sample)
#Find the average of the samples to find the deviation
score_mean = statistics.mean(test_score_sample)
#Make a list of squared deviations (deviation = score - score_mean)
squared_deviation_list = [(score-score_mean)**2 for score in test_score_sample]
#Unbiased variance = sum of squared_deviation_list / (sample size - 1)
#Note the parentheses: /m-1 would divide by m and then subtract 1
unbiased_distribution = sum(squared_deviation_list)/(m-1)
print('unbiased_distribution = ',unbiased_distribution)
unbiased_distribution = 434.2 (the last digits may differ slightly due to floating-point rounding)
◯ There are two kinds of standard deviation: the population standard deviation and the unbiased standard deviation.
◯ The population standard deviation is the square root of the population variance. Taking the square root brings the unit back in line with the data, which makes the spread easier to interpret.
◯ The unbiased standard deviation is an unbiased estimator of the population standard deviation.
◯ The population standard deviation is the square root of the population variance.
import statistics
#Test score(population)
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
population_standard_deviation = statistics.pstdev(test_score)
print('population_standard_deviation =',population_standard_deviation)
population_standard_deviation = 25.27143754430198
◯ The population standard deviation is the square root of the population variance, so compute it that way.
① Find the population variance with the pvariance() function ② Take the square root of the population variance
import statistics
import sympy
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
#Find the population variance
population_variance = statistics.pvariance(test_score)
#Take the square root of the population variance
#sympy.root(x, n) returns the n-th root of x; n = 2 gives the square root
population_standard_deviation = sympy.root(population_variance,2)
print('population_variance = ',population_variance)
print('population_standard_deviation = ',population_standard_deviation)
population_variance = 638.6455555555556
population_standard_deviation = 25.2714375443020
◯ The population standard deviation is equivalent to the **root mean square (RMS) of the deviations**.
⓪ Find the mean of the data, which is needed for the deviations ① Make a list whose elements are the **deviations** ② Make a list whose elements are the **squared deviations** (square each element) ③ Find the mean of this list ④ Take the square root of the mean
◯ The root mean square is computed by **squaring the numbers you want to average, adding them up, dividing by the number of elements n, and taking the square root of the result**. It is used, for example, when you want to measure how far actual arrival times deviate from a transportation timetable.
◯ Arriving 2 minutes late and arriving 2 minutes early are the same size of deviation from the timetable. With the **arithmetic mean**, however, positive and negative **errors** cancel each other out. So the values are **squared to eliminate the minus signs** before they are averaged.
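The cancellation described above can be sketched with a few made-up arrival errors. The values and variable names are hypothetical and are not taken from the original data.

```python
import math
import statistics

# Hypothetical arrival errors in minutes (+ means late, - means early)
arrival_errors = [2, -2, 2, -2]

# The arithmetic mean lets positive and negative errors cancel out
arithmetic_mean = statistics.mean(arrival_errors)

# The root mean square squares each error first, so the signs disappear
rms = math.sqrt(sum(e**2 for e in arrival_errors) / len(arrival_errors))

print('arithmetic mean =', arithmetic_mean)  # 0
print('root mean square =', rms)             # 2.0
```

The arithmetic mean suggests the timetable is kept perfectly, while the RMS reports the typical 2-minute deviation.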
import sympy
import statistics
#Test score data(population)
test_score = [42, 66, 39, 27, 9, 97, 48, 13, 39, 63, 78, 93, 91, 86, 69, 56, 39, 23, 11, 48, 34, 56, 73, 89, 68, 24, 22, 61, 49, 40]
#Find the mean of the population to find the deviation
score_mean = statistics.mean(test_score)
#Make a list of deviations (difference between each score and the mean); the RMS of this list is computed below
deviation_list = [score-score_mean for score in test_score]
#Square each element of the list to make the new list squared_deviation_list
squared_deviation_list = [i**2 for i in deviation_list]
#Find the mean of squared_deviation_list (sum / number of elements)
mean_square = sum(squared_deviation_list)/len(squared_deviation_list)
#Take the square root of mean_square
root_mean_square = sympy.root(mean_square,2)
print('RMS = population standard deviation = ',root_mean_square)
#RMS of the deviations = population standard deviation = 25.2714375443020
RMS = population standard deviation = 25.2714375443020
◯ The unbiased standard deviation **is an unbiased estimator of the population standard deviation**.
◯ Note that the unbiased standard deviation is not the square root of the unbiased variance itself, but the **corrected square root of the unbiased variance**.
◯ Because the unbiased variance is an unbiased estimator of the population variance, it is often assumed that its square root is the unbiased standard deviation, that is, an unbiased estimator of the population standard deviation. However, **the square root of the unbiased variance is not the unbiased standard deviation.**
The unbiased standard deviation U_s, as an unbiased estimator of the population standard deviation, is the square root of the unbiased variance U^2 divided by the coefficient C_4.
U^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i-\bar{x})^2 = \frac{m}{m-1}s^2
C_4 = \frac{\sqrt{\frac{2}{m-1}}\,\Gamma(\frac{m}{2})}{\Gamma(\frac{m-1}{2})}
Unbiased standard deviation formula
U_s = \frac{\sqrt{U^2}}{C_4}
Further reading: "Inflating coefficient and discount coefficient: unbiased standard deviation and control chart coefficient"; "Is the square root of the unbiased variance an unbiased estimator of the standard deviation?"; "What is unbiased standard deviation?: For those who do not understand statistical tests"
import statistics
#Sample of test scores
test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63, 61, 22, 27, 61, 22, 61, 73, 61, 27, 73]
#statistics.stdev() returns the square root of the unbiased variance; the C4 correction is not applied
unbiased_standard_deviation = statistics.stdev(test_score_sample)
print('unbiased_standard_deviation = ',unbiased_standard_deviation)
unbiased_standard_deviation = 20.837466256721328
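The value above is the plain square root of the unbiased variance. The following is a minimal sketch of applying the C_4 correction from the formula U_s = \sqrt{U^2} / C_4. The helper name `c4` is an assumption for this illustration, and the coefficient itself is derived under the assumption that the population is normally distributed, which the test scores may not satisfy.

```python
import math
import statistics


def c4(m):
    """Correction coefficient C4 for sample size m (m >= 2).

    Uses log-gamma so that the ratio of gamma functions
    does not overflow for large m.
    """
    log_ratio = math.lgamma(m / 2) - math.lgamma((m - 1) / 2)
    return math.sqrt(2 / (m - 1)) * math.exp(log_ratio)


# Sample of test scores (same sample as above)
test_score_sample = [27, 22, 22, 73, 56, 61, 61, 22, 27, 63,
                     61, 22, 27, 61, 22, 61, 73, 61, 27, 73]
m = len(test_score_sample)

# Square root of the unbiased variance (what statistics.stdev() returns)
sqrt_unbiased_variance = statistics.stdev(test_score_sample)

# Corrected value: divide by C4, which is slightly less than 1
corrected_standard_deviation = sqrt_unbiased_variance / c4(m)

print('C4 =', c4(m))
print('sqrt of unbiased variance =', sqrt_unbiased_variance)
print('corrected standard deviation =', corrected_standard_deviation)
```

Because C_4 is below 1 and approaches 1 as m grows, the corrected value is slightly larger than the plain square root, and the difference shrinks for larger samples.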
◯ The coefficient of variation is the standard deviation divided by the mean.
◯ It is a dimensionless number (it has no unit), used to compare the variability of data measured in different units or on different scales, and to evaluate variability relative to the mean. The coefficient of variation is often abbreviated as CV.
Formula for calculating the Coefficient of Variation
CV = \frac{S}{\bar{x}}
import statistics
#10 weight measurements each for humans and mice
#The unit is kg
human_data = [75,77,75,76,78,76,75,76,77,75]
mouse_data = [0.04,0.05,0.02,0.03,0.02,0.03,0.05,0.06,0.07,0.03]
#Use the pstdev() function to find each population standard deviation
human_pstdev = statistics.pstdev(human_data)
mouse_pstdev = statistics.pstdev(mouse_data)
#Use the mean() function to find each mean
human_mean = statistics.mean(human_data)
mouse_mean = statistics.mean(mouse_data)
#Find the coefficient of variation
#(ratio of the population standard deviation to the mean)
human_cv = human_pstdev/human_mean
mouse_cv = mouse_pstdev/mouse_mean
print('human_pstdev = ',human_pstdev)
print('mouse_pstdev = ',mouse_pstdev)
print('human_cv = ',human_cv)
print('mouse_cv = ',mouse_cv)
#The population standard deviation is larger for humans (the number expressing the variation is larger)
#It expresses the variation in kg
human_pstdev = 1.0
mouse_pstdev = 0.0161245154965971
#The coefficient of variation is larger for mice (the variation relative to the mean is larger)
human_cv = 0.013157894736842105
mouse_cv = 0.40311288741492746
Thank you for reading. I would appreciate it if you could point out any mistakes or possible improvements.