[PYTHON] Variance, statistics up to standard deviation

First, create two types of data with different distributions.

Data generation


import numpy as np
np.random.seed(seed=32)
groupA = np.random.normal(100, 20, 100000)  # 100000 random values with mean 100 and standard deviation 20
groupB = np.random.normal(100, 50, 100000)  # 100000 random values with mean 100 and standard deviation 50

print("groupA sample: {}".format(groupA[0:5]))
print("groupB sample: {}".format(groupB[0:5]))
groupA sample: [  93.02211098  119.67406867  111.61845661  101.40568882  115.55065353]
groupB sample: [ 122.71129387  134.76068077   88.52182502   92.52887435  107.2470646 ]

average

Looking at the mean first, both groups average about 100, as shown below (naturally, since 100 was specified as the mean when the data was generated).

mean


meanA, meanB = np.mean(groupA), np.mean(groupB)
print("group A average= {},group B average= {}".format(meanA, meanB))
group A average= 100.0881959959255,group B average= 100.13663565328969

However, plotting histograms as follows shows that the distributions are clearly different. Group B's curve has wider tails than group A's and a gentler slope (naturally, since the standard deviations differ).

import matplotlib.pyplot as plt
import seaborn as sns
# histplot is used here because distplot has been deprecated since seaborn 0.11
sns.histplot(groupA, bins=100, label='groupA')
sns.histplot(groupB, bins=100, label='groupB')
plt.legend()
plt.show()

[Image: overlaid histograms of groupA and groupB]

So if we compare only the means, group A and group B look equivalent, and the information that group B's values vary more than group A's is lost. The summary statistics that express this variation as a number are the variance and the standard deviation.

Variance

The variance is calculated with the following formula:

S^2 = \frac{1}{n}{\sum_{i=1}^n(x_i-\bar{x})^2}

In other words, (each value - mean) is squared, the squares are added together, and the total is divided by the number of data points. Subtracting the mean from each value gives a measure of how far that value lies from the mean, but the result can be negative, so it is squared. Adding the squares up and dividing by the number of data points then expresses the degree of variation in the data. The result of (each value - mean) is called the deviation, and the total of the squared deviations is called the sum of squared deviations.

Written in Python, it looks like this.

deviation


groupA[0] - groupA.mean()

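Sum of squared deviations


mean_a = groupA.mean()  # compute the mean once rather than on every iteration
s = sum((xi - mean_a)**2 for xi in groupA)  # built-in sum, since passing a generator to np.sum is deprecated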
print(s)
40178555.707617663

Variance


var = s / len(groupA)  # divide the sum of squared deviations by the number of data points
print(var)
401.78555707618096

Compute the variance of both groupA and groupB


a = ((groupA - groupA.mean())**2).sum()/len(groupA)
b = ((groupB - groupB.mean())**2).sum()/len(groupB)
print("Variance of groupA:{:.2f}\nVariance of groupB:{:.2f}".format(a, b))
Variance of groupA:401.79
Variance of groupB:2496.21

Comparing the two variances, group B's is larger than group A's, which means its values are less concentrated around the mean. In other words, group B's values vary more.
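To check the formula by hand, it can help to run the same steps on a tiny data set. The sketch below uses a made-up array (`data` is a hypothetical example, not part of the groups above) whose mean is exactly 5, so every intermediate value is easy to verify.

```python
import numpy as np

# Hypothetical toy data, chosen so the arithmetic is easy to follow.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

mean = data.mean()                       # 40 / 8 = 5.0
deviations = data - mean                 # [-3, -1, -1, -1, 0, 0, 2, 4]
squared_sum = (deviations ** 2).sum()    # 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32.0
variance = squared_sum / len(data)       # 32 / 8 = 4.0

print(deviations.sum())  # plain deviations cancel out: 0.0
print(variance)          # 4.0
```

The first print shows why the deviations are squared: without squaring, the positive and negative deviations cancel and the total is always zero.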

standard deviation

We now know that B varies more, but a variance of 2496 is hard to interpret intuitively when both group A and group B are sets of numbers averaging around 100. In such cases, the standard deviation is more useful.

S=\sqrt{S^2}

The standard deviation is the square root of the variance. Because the deviations were squared when calculating the variance, taking the square root brings the value back to the same units as the original data. This makes it easier to understand how much the numbers vary.

standard deviation


import math
print("Standard deviation of groupA:{:.2f}\nStandard deviation of groupB:{:.2f}".format(math.sqrt(a), math.sqrt(b)))
Standard deviation of groupA:20.04
Standard deviation of groupB:49.96
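One way to see why the standard deviation is easier to interpret: for normally distributed data, roughly two thirds of the values fall within one standard deviation of the mean. The following is a minimal sketch of that check. It generates its own sample (the name `sample` and the seed are arbitrary choices for illustration) rather than reusing the groups above.

```python
import numpy as np

# Arbitrary seed and a fresh sample; parameters mirror group A (mean 100, std 20).
rng = np.random.default_rng(0)
sample = rng.normal(100, 20, 100_000)

mean = sample.mean()
std = sample.std()

# Boolean mask: True where a value lies within one standard deviation of the mean.
within_one_std = np.abs(sample - mean) <= std
fraction = within_one_std.mean()  # mean of booleans = share of True values

print("std = {:.2f}, fraction within one std = {:.3f}".format(std, fraction))
```

With a standard deviation of about 20, this says that most values sit between about 80 and 120, which is a statement in the same units as the data itself.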

Finally

NumPy is convenient: the mean, variance, and standard deviation can each be calculated with a single method call.

mean = groupA.mean()
var = groupA.var()
std = groupA.std()
print("Average:{:.2f} Variance:{:.2f} Standard deviation:{:.2f}".format(mean, var, std))
Average:100.09 Variance:401.79 Standard deviation:20.04
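These results match the manual calculation because NumPy's `var()` and `std()` divide by n by default (`ddof=0`), the same as the formula above. Passing `ddof=1` divides by n - 1 instead, which gives the sample (unbiased) variance. A small sketch on hypothetical toy data shows the difference:

```python
import numpy as np

# Hypothetical toy data with mean 5 and sum of squared deviations 32.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)
n = len(data)
squared_sum = ((data - data.mean()) ** 2).sum()

population_var = data.var()      # default ddof=0: divides by n
sample_var = data.var(ddof=1)    # ddof=1: divides by n - 1

print(np.isclose(population_var, squared_sum / n))        # True
print(np.isclose(sample_var, squared_sum / (n - 1)))      # True
print(population_var, sample_var)
```

For 100000 data points the two differ only slightly, but for small data sets the choice of divisor matters.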
