Calculate mean, median, mode, variance, standard deviation in Python

Calculate sum, mean, median, mode, variance, standard deviation in Python

This article looks at what the sum, mean, median, mode, variance, and standard deviation represent, and at the processing Python performs to calculate each of them.

Environment

macOS 10.10.5 (OS X Yosemite), Python 3.6.1 |Anaconda 4.4.0 (x86_64)|

Reference books

Introduction to mathematics starting from python
Introduction to Complete Self-study Statistics

I recommend the second book because it makes statistics very easy to understand.

Statistical calculation functions

If you only want the answers rather than the calculation method, the functions in the statistics module give them quickly, as shown below. The format specification {0:.2f} displays the value with two digits after the decimal point.


from statistics import mean, median, variance, stdev

data = [100,200,300,400,500,500,600,700,800,800]

# Store the results under names that do not shadow the imported functions
m = mean(data)
med = median(data)
# variance() and stdev() are the sample versions (they divide by N - 1)
var = variance(data)
sd = stdev(data)
print('Mean: {0:.2f}'.format(m))
print('Median: {0:.2f}'.format(med))
print('Variance: {0:.2f}'.format(var))
print('Standard deviation: {0:.2f}'.format(sd))

There are other ways to do this as well, such as using NumPy (see "Introduction to Python Numerical Library NumPy").

Below, we will take a closer look at the calculation method.

Average

The average is the total of the values divided by how many values there are.

data = [800,200,700,300,100,400,500,500,600,800]
s = sum(data)
N = len(data)
mean = s / N
print('average:{0:.2f}'.format(mean))

The average can be calculated with the code above. First, the sum function adds up all the numbers in the data list, and the len function counts how many numbers the list contains. Dividing the total by that count gives the average.

Median

The median is the value in the middle of a collection of numbers. In other words, the median has the same rank whether you count from the top or from the bottom. If the number of values is odd, there is a single middle value, and that value is the median. If the number of values is even, there are two middle values, and the median is the average of the two.

Example:

Consider the test scores of three people (Mr. A, Mr. B, and Mr. C). If Mr. A has 80 points, Mr. B has 60 points, and Mr. C has 100 points, the median is Mr. A's score (80 points), because Mr. A has the same rank whether you count from the top or from the bottom. If Mr. D (70 points) is added, Mr. A (80 points) and Mr. D (70 points) are now the two middle scores. The median is the average of the two, (80 + 70) / 2 = 75 points.
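The example can be checked with a short sketch. The variable names here are made up for illustration, and the standard library's statistics.median function is used in place of the manual method described next.

```python
from statistics import median

# Scores of Mr. A, Mr. B, and Mr. C (hypothetical variable names)
scores_three = [80, 60, 100]
# The same scores after Mr. D (70 points) is added
scores_four = scores_three + [70]

# Odd count: the single middle value of [60, 80, 100]
median_three = median(scores_three)
# Even count: the average of the two middle values of [60, 70, 80, 100]
median_four = median(scores_four)

print(median_three)  # 80
print(median_four)   # 75.0
```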

The data list below has an even number of values, but the code uses an if statement to branch so that the median is also found correctly when the count is odd. Calculating the median requires the data to be in ascending order, so the sort() method is used to sort the list first.

data = [100,200,300,400,500,500,600,700,800,800]
N = len(data)
data.sort()
# If the number of values is even
if N % 2 == 0:
    median1 = N/2
    median2 = N/2 + 1
    # Python counts elements from 0, so subtract 1 from each position
    # Also, the / operator returns a float even when the result is a whole number (6 / 3 = 2.0), so convert it to an integer with the int function
    median1 = int(median1) - 1
    median2 = int(median2) - 1
    median = (data[median1] + data[median2]) / 2
    print('The median of the data is:',median)
# If the number of values is odd
else:
    median = (N + 1) / 2
    # Python counts elements from 0, so subtract 1
    median = int(median) - 1
    median = data[median]
    print('The median of the data is:',median)
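The same logic can also be sketched more compactly. This version is only one possible arrangement: calculate_median is a hypothetical helper name, floor division (//) is used so that no int() conversion is needed, and sorted() is used so that the original list is left unchanged.

```python
def calculate_median(data):
    """Return the median of a non-empty list of numbers (hypothetical helper)."""
    ordered = sorted(data)   # sorted() returns a new list in ascending order
    n = len(ordered)
    middle = n // 2          # floor division gives an integer index directly
    if n % 2 == 0:
        # Even count: average the two middle values
        return (ordered[middle - 1] + ordered[middle]) / 2
    # Odd count: the single middle value
    return ordered[middle]


data = [100, 200, 300, 400, 500, 500, 600, 700, 800, 800]
print('The median of the data is:', calculate_median(data))  # 500.0
```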

Mode

The mode is the value that appears most often. For example, in the list [1,1,1,1,2,2,3,4], 1 appears four times, so 1 is the mode. A convenient way to find the most frequent elements is the most_common() method of the Counter class.

>>> from collections import Counter
>>> numbers = [1,1,2,2,3,4,5,5,5]
>>> c = Counter(numbers)
>>> c.most_common()
[(5, 3), (1, 2), (2, 2), (3, 1), (4, 1)]

Each pair is (element, count), listed from the highest count to the lowest: 5 appears 3 times, 1 and 2 appear 2 times each, and 3 and 4 appear once each.

If you want only the most frequent element, pass 1 as the argument:

>>> c.most_common(1)
[(5, 3)]

To extract only the element or only its number of appearances:

>>> mode = c.most_common(1)
>>> mode[0]
(5, 3)
>>> mode[0][0]
5
>>> mode[0][1]
3

The data list used in this article has more than one mode (500 and 800 each appear twice), so the following code handles the case where there are multiple modes.

from collections import Counter

def calculate_mode(data):
    c = Counter(data)
    # Extract all elements and their number of occurrences, most frequent first
    freq_scores = c.most_common()
    # The first entry ([0]) is the most frequent element, and its count is the second item ([1]), so use [0][1]
    max_count = freq_scores[0][1]

    modes = []
    # Keep every element whose number of appearances equals the maximum
    for num in freq_scores:
        if num[1] == max_count:
            modes.append(num[0])
    return modes

if __name__ == '__main__':
    data = [100,200,300,400,500,500,600,700,800,800]
    modes = calculate_mode(data)

    print('The most frequent number is:')
    for mode in modes:
        print(mode)
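For comparison, the same idea can be sketched without most_common(), by taking the largest count directly and keeping every value that reaches it. The name calculate_modes is hypothetical, and the sketch assumes the list is not empty (max() raises an error on an empty sequence).

```python
from collections import Counter


def calculate_modes(data):
    """Return every value that shares the highest count (hypothetical helper)."""
    counts = Counter(data)
    max_count = max(counts.values())
    # Keep each value whose count equals the maximum
    return [value for value, count in counts.items() if count == max_count]


data = [100, 200, 300, 400, 500, 500, 600, 700, 800, 800]
print(calculate_modes(data))
```

With this data, both 500 and 800 are returned, because each appears twice.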

Variance and standard deviation

Understanding the variance and standard deviation requires the idea of mean and deviation, so I will explain them together.

Name     Mathematics (score)
Mr. A    60
Mr. B    80
Mr. C    90
Mr. D    40
Mr. E    70

Based on the above five math scores, consider the mean, deviation, variance, and standard deviation.

Value to find         Formula
Average score         Total of the 5 math scores ÷ number of people
Deviation             Each individual's score - average score
Variance              Total of the squared deviations ÷ number of people
Standard deviation    Square root of the variance

First, find the average score from the test results of the 5 people above.

Average

(60 + 80 + 90 + 40 + 70) ÷ 5 = 340 ÷ 5 = 68, so the average score is 68. This is the total score of the 5 people divided by the number of people who took the test.

Deviation

The deviation is each individual's score minus the average score.

Name     Formula (score - average score)    Deviation
Mr. A    60 - 68                            -8
Mr. B    80 - 68                            12
Mr. C    90 - 68                            22
Mr. D    40 - 68                            -28
Mr. E    70 - 68                            2

The deviation can be calculated with the formula above. Because each deviation is the difference from the average, adding up all the deviations gives 0.
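The deviation table can be reproduced with a few lines of Python. This is a minimal sketch using the five scores from the table; the variable names are chosen only for illustration.

```python
# Math scores of Mr. A to Mr. E, in table order
scores = [60, 80, 90, 40, 70]

mean = sum(scores) / len(scores)
# Deviation: each score minus the average score
deviations = [score - mean for score in scores]

print(mean)             # 68.0
print(deviations)       # [-8.0, 12.0, 22.0, -28.0, 2.0]
print(sum(deviations))  # 0.0
```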

Variance

Variance is a measure of how widely the data is scattered. It might seem that the deviations (each score minus the average) could show the scatter directly, but adding up all the deviations always gives 0. Instead, square each deviation and take the average of the squared values. That average is the variance.

Name        Formula (deviation squared)    Result
Mr. A       (-8)²                          64
Mr. B       12²                            144
Mr. C       22²                            484
Mr. D       (-28)²                         784
Mr. E       2²                             4
Total                                      1480
Variance    1480 ÷ 5                       296

The sum of the squared deviations of the 5 people (1480) ÷ the number of people (5) = 296, which is the variance.

Standard deviation

Because the variance is built from squared values, it becomes very large, which makes it hard to read as a measure of how the data is scattered. Taking the square root of the variance brings the value back to the scale of the original data. This easier-to-read value is the standard deviation. The square root of 296 is approximately 17.20, so the standard deviation is about 17.20.

The code for finding the variance and standard deviation in Python is as follows.
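The worked figures for the five math scores can be checked with a short sketch. It follows the same steps as the explanation (square the deviations, average them, take the square root); the variable names are illustrative only.

```python
import math

# Math scores of Mr. A to Mr. E
scores = [60, 80, 90, 40, 70]
mean = sum(scores) / len(scores)

# Square each deviation from the average
squared_deviations = [(score - mean) ** 2 for score in scores]
# Variance: total of the squared deviations divided by the number of people
variance = sum(squared_deviations) / len(scores)
# Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

print(squared_deviations)         # [64.0, 144.0, 484.0, 784.0, 4.0]
print(variance)                   # 296.0
print('{0:.2f}'.format(std_dev))  # 17.20
```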

def calculate_mean(data):
    s = sum(data)
    N = len(data)
    mean =s/N

    return mean

#Find the deviation from the mean
def find_difference(data):
    mean = calculate_mean(data)
    diff = []

    for num in data:
        diff.append(num-mean)
    return diff

def calculate_variance(data):
    diff = find_difference(data)
    #Find the square of the difference
    squared_diff = []
    for d in diff:
        squared_diff.append(d**2)

    #Find the variance (dividing by the number of values gives the population variance)
    sum_squared_diff = sum(squared_diff)
    variance = sum_squared_diff/len(data)
    return variance

if __name__ == '__main__':
    data = [100,200,300,400,500,500,600,700,800,800]
    variance = calculate_variance(data)
    print('The value of the variance is:{0}'.format(variance))

    std = variance**0.5
    print('The standard deviation is:{0}'.format(std))
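Note that this manual calculation divides by the number of values, so it gives the population variance, whereas the variance and stdev functions of the statistics module used at the start of the article divide by N - 1 (the sample variance). That is why the two approaches print different numbers for the same data. The sketch below puts them side by side, using pvariance and pstdev from the same module, which divide by N.

```python
from statistics import pvariance, pstdev, variance, stdev

data = [100, 200, 300, 400, 500, 500, 600, 700, 800, 800]

population_variance = pvariance(data)  # divides by N, like the manual code
sample_variance = variance(data)       # divides by N - 1

print('Population variance: {0:.2f}'.format(population_variance))            # 52900.00
print('Population standard deviation: {0:.2f}'.format(pstdev(data)))         # 230.00
print('Sample variance: {0:.2f}'.format(sample_variance))                    # 58777.78
print('Sample standard deviation: {0:.2f}'.format(stdev(data)))              # 242.44
```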

That's it.
