[PYTHON] [Statistics for programmers] Mean, median, mode

table of contents

Statistics for Programmers-Table of Contents

Overview

A representative value is a single number that summarizes a set of numerical data. The three most common representative values are the mean, the median, and the mode. Which one best represents the data depends on the shape of its distribution.

Average value

The average value (mean) is the sum of all data values divided by the number of values.

\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}
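
As a minimal sketch of the formula above in Python, the mean can be computed by hand or with the standard library's `statistics.mean`. The list `data` is an arbitrary sample chosen for illustration.

```python
import statistics

# Arbitrary sample data, used only for illustration
data = [1, 3, 4, 5, 7]

# Sum of all values divided by the number of values
mean_by_hand = sum(data) / len(data)

# The standard library computes the same quantity
mean_stdlib = statistics.mean(data)

print(mean_by_hand, mean_stdlib)
```

Both approaches give the same value here. `statistics.mean` raises `StatisticsError` on empty input, while the hand-written version raises `ZeroDivisionError`.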

For a frequency distribution table, you can compute the average value from the class values and frequencies. If there are n classes with class values v_1, ..., v_n and frequencies f_1, ..., f_n, the formula is as follows. The result is an estimate, because every observation in a class is treated as if it were equal to the class value.

\bar{x} = \frac{f_1v_1 + f_2v_2 + \cdots + f_nv_n}{f_1 + f_2 + \cdots + f_n}

As an example, let's calculate the average value based on the frequency distribution table of the test scores of 10 students.

Class | Class value | Frequency
0 points or more and less than 25 points | 12.5 | 1
25 points or more and less than 50 points | 37.5 | 3
50 points or more and less than 75 points | 62.5 | 4
75 points or more | 87.5 | 2

The average score for this test is calculated below.

\bar{x} = \frac{(1\times12.5) + (3\times37.5) + (4\times62.5) + (2\times87.5)}{1+3+4+2} = \frac{550}{10} = 55
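
The same calculation can be sketched in Python. The class values and frequencies come from the table above. The helper name `grouped_mean` is made up for this example.

```python
# Frequency table from the example: (class value, frequency)
table = [(12.5, 1), (37.5, 3), (62.5, 4), (87.5, 2)]


def grouped_mean(rows):
    """Mean estimated from class values weighted by their frequencies."""
    weighted_total = sum(value * freq for value, freq in rows)
    count = sum(freq for _, freq in rows)
    return weighted_total / count


print(grouped_mean(table))  # 55.0
```

Keeping the table as a list of pairs means the same function works for any number of classes.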

As a side note, there is more than one way to calculate an average, depending on the application. See the related article: There is more than one way to calculate the average value.

Median

The median is the middle value when the data is sorted in ascending or descending order. If the number of values is even, there are two middle values, and the median is their sum divided by two.

When the number of data is odd

1, 3, 4, 5, 7

In this case, the median is 4.

When the number of data is even

1, 3, 4, 5, 7, 10

In this case, the two middle values are 4 and 5, so the median is 4.5, calculated as follows.

\frac{4+5}{2} = 4.5
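
The odd and even cases can be sketched as one function. The name `median_of` is made up for this example, and it assumes a non-empty list. The standard library's `statistics.median` does the same job.

```python
import statistics


def median_of(values):
    """Middle value of the sorted data (assumes at least one value).

    With an even count, returns the average of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([1, 3, 4, 5, 7]))      # 4
print(median_of([1, 3, 4, 5, 7, 10]))  # 4.5

# The standard library agrees with the hand-written version
print(statistics.median([1, 3, 4, 5, 7, 10]))  # 4.5
```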

Mode

The mode is the value that appears most often in the data.

1, 3, 4, 5, 7, 7, 10

For example, the mode in the above case would be 7.

In a frequency distribution table, the mode is the class value of the class with the highest frequency. In the test score table from earlier, the class with the highest frequency (4) is "50 points or more and less than 75 points", so the mode is its class value, 62.5.

Class | Class value | Frequency
0 points or more and less than 25 points | 12.5 | 1
25 points or more and less than 50 points | 37.5 | 3
50 points or more and less than 75 points | 62.5 | 4
75 points or more | 87.5 | 2
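
As a rough sketch, the modal class can be picked out of the table in Python by taking the row with the largest frequency. The tuple layout and variable names are assumptions made for this example.

```python
# Rows of the frequency table: (class label, class value, frequency)
table = [
    ("0 to under 25", 12.5, 1),
    ("25 to under 50", 37.5, 3),
    ("50 to under 75", 62.5, 4),
    ("75 or more", 87.5, 2),
]

# Pick the row whose frequency (index 2) is largest
modal_label, modal_value, modal_freq = max(table, key=lambda row: row[2])

print(modal_label, modal_value, modal_freq)  # 50 to under 75 62.5 4
```

Note that `max` returns only the first row it finds when several classes share the highest frequency, so a table with ties needs extra handling.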

If 5 and 7 appear the same number of times, and more often than any other value, then both 5 and 7 are modes, as shown below.

1, 3, 4, 5, 5, 7, 7, 10

In the following case, every value appears exactly once, so there is no mode.

1, 3, 4, 5, 7, 10
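
The three cases above (one mode, several modes, no mode) can be sketched with `collections.Counter`. The function name `modes` is made up for this example, and returning an empty list when no value repeats is a choice that follows the convention used in this article.

```python
from collections import Counter


def modes(values):
    """Return all values tied for the highest count, in ascending order.

    Returns [] when the input is empty or when no value appears more
    than once (the "mode does not exist" convention used in this article).
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(value for value, count in counts.items() if count == top)


print(modes([1, 3, 4, 5, 7, 7, 10]))     # [7]
print(modes([1, 3, 4, 5, 5, 7, 7, 10]))  # [5, 7]
print(modes([1, 3, 4, 5, 7, 10]))        # []
```

The standard library's `statistics.multimode` (Python 3.8+) uses a different convention for the last case: it returns every value when all of them appear once.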

Relationship between histogram distribution and mean / median / mode

When a histogram has a single peak, the following relationships often hold. This is called Pearson's rule of thumb.

Of the three cases below, the symmetrical case always holds, while the other two are empirical rules that do not always hold.

When the distribution is symmetrical

If the histogram is symmetrical, as shown below, the mean, median, and mode all coincide at the position of the red line.

graph_1.png

If the distribution is biased to the left

If the distribution is not symmetrical but concentrated on the left, with a long tail to the right, the mode, median, and mean often line up in that order from left to right, as shown below. (The lines are drawn at approximate positions.)

graph_2.png

If the distribution is biased to the right

If the distribution is not symmetrical but concentrated on the right, with a long tail to the left, the mean, median, and mode often line up in that order from left to right, as shown below. (The lines are drawn at approximate positions.)

graph_3.png
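
To make the two skewed cases concrete, here is a small sketch using the standard `statistics` module. Both data sets are invented for illustration (the second is a mirror image of the first) and were chosen so that the rule of thumb holds. Other single-peaked data may not follow it.

```python
import statistics


def summarize(values):
    """Return (mode, median, mean) for data with a single mode."""
    return (
        statistics.mode(values),
        statistics.median(values),
        statistics.mean(values),
    )


# Invented data: most values are small, with a long tail to the right
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

# Mirror image of the data above: a long tail to the left
left_tailed = [16 - x for x in right_tailed]

r_mode, r_median, r_mean = summarize(right_tailed)
l_mode, l_median, l_mean = summarize(left_tailed)

print(r_mode < r_median < r_mean)  # True: mode, median, mean
print(l_mean < l_median < l_mode)  # True: mean, median, mode
```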

Which should be the representative value

Which of the mean, median, and mode should be the representative value depends on the distribution of the data. The advantages and disadvantages of each are summarized.

Representative value | Merit | Demerit
Average value | Reflects all of the data | Gets pulled toward extreme values
Median | Less susceptible to extreme values | Hard to notice changes away from the middle value
Mode | Less susceptible to extreme values | Hard to rely on when the number of data points is small

As a rule of thumb, if the difference between the average value and the median is small, I think it is fine to use the average value as the representative value. If the difference between the two is large, I think it is safer to look at the median and mode as well.
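
The effect of an extreme value on the gap between the average value and the median can be sketched as follows. The numbers are invented for illustration, and no particular threshold for a "large" difference is assumed.

```python
import statistics

# Invented sample without extreme values
typical = [300, 320, 350, 380, 400]

# The same sample with one extreme value added
with_outlier = typical + [5000]

for label, values in [("typical", typical), ("with outlier", with_outlier)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    # A large gap suggests the mean is being pulled by extreme values
    print(label, "mean:", mean, "median:", median, "gap:", mean - median)
```

Without the extreme value the mean and median agree (both 350). With it, the mean jumps to 1125 while the median only moves to 365, which is the situation where looking at the median as well is safer.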

The example histograms above all had a single peak, but a histogram can have multiple peaks. In that case it is hard to choose one representative value, and you may need to rethink how the data was collected in the first place.

That's all.

Related article

- There is more than one way to calculate the average value

