[PYTHON] [Statistics for programmers] Box plot

What is a boxplot?

A box plot is a graph that makes it easy to see how data is distributed and how much it varies. For example, suppose ten people have the test scores below.

No.  Math score  National language score
1    74          81
2    65          62
3    40          32
4    62          67
5    85          41
6    67          50
7    82          85
8    71          70
9    60          67
10   99          97

Let's draw a box plot from this data using matplotlib.

# Jupyter Notebook magic; remove the next line when running as a plain script
%matplotlib inline
import matplotlib.pyplot as plt

# Math scores
math = [74, 65, 40, 62, 85, 67, 82, 71, 60, 99]
# National language scores
literature = [81, 62, 32, 67, 41, 50, 85, 70, 67, 97]
# Tuple of the two data series, one box per series
points = (math, literature)

# Create the box plot
fig, ax = plt.subplots()

bp = ax.boxplot(points)
ax.set_xticklabels(['math', 'literature'])

plt.title('Box plot')
plt.xlabel('exams')
plt.ylabel('point')
# Y-axis scale range
plt.ylim([0, 100])
plt.grid()

# Draw the figure
plt.show()

Running this code produces the graph below.

[Figure: box plots of the math and national language scores]

The blue part is called the box, and the vertical black lines are called whiskers. In this basic plot, the whiskers extend from the box to the minimum and maximum values. The box spans the first quartile to the third quartile, and the line inside it marks the median.

The next section describes what a quartile is.

What is a quartile?

There are a few terms you need to know in order to understand quartiles, so I'll walk through them.

Percentile

When the data are sorted in ascending order, the value located N% of the way from the start is called the Nth percentile. For example, the 30th percentile is the value 30% of the way from the smallest value. The 0th percentile is the minimum value and the 100th percentile is the maximum value.
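As a rough illustration of this definition, the sketch below computes percentiles in pure Python. The helper name `percentile` and the use of linear interpolation between neighbouring values are assumptions made for this example; other conventions exist and can give slightly different results on small data sets.

```python
def percentile(data, p):
    """Return the p-th percentile (0 <= p <= 100) of data.

    Hypothetical helper for illustration: sorts the data and linearly
    interpolates between the two values surrounding the target position.
    """
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    xs = sorted(data)
    rank = (len(xs) - 1) * p / 100   # fractional index into the sorted list
    lo = int(rank)                   # index of the value at or below the position
    hi = min(lo + 1, len(xs) - 1)    # index of the next value (clamped at the end)
    frac = rank - lo                 # how far the position lies between lo and hi
    return xs[lo] + (xs[hi] - xs[lo]) * frac


# Math scores from the table above
math = [74, 65, 40, 62, 85, 67, 82, 71, 60, 99]

print(percentile(math, 0))     # 40.0 (the minimum)
print(percentile(math, 50))    # 69.0 (the median)
print(percentile(math, 100))   # 99.0 (the maximum)
print(percentile(math, 30))    # a value between 62 and 65
```

Linear interpolation is also the default of `numpy.percentile`, so this sketch should agree with NumPy on the same input.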

Quartile

Now for the main topic: quartiles. Quartiles are the three percentiles listed below. Splitting the sorted data at these three positions divides it into four groups of roughly equal size.

Percentile        Alias
25th percentile   First quartile
50th percentile   Second quartile (median)
75th percentile   Third quartile

Interquartile range / IQR

The value calculated by the following formula is called the interquartile range, or IQR.

Interquartile range (IQR) = 75th percentile (third quartile) - 25th percentile (first quartile)
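The quartiles and the IQR can be computed with the standard library, as in the minimal sketch below. It assumes Python 3.8 or later for `statistics.quantiles`, and it chooses `method="inclusive"`, which interpolates linearly between data points; the default `"exclusive"` method gives different values on a sample this small.

```python
from statistics import quantiles

# Math scores from the table at the start of the article
math = [74, 65, 40, 62, 85, 67, 82, 71, 60, 99]

# n=4 returns the three cut points: first, second (median), and third quartile
q1, q2, q3 = quantiles(math, n=4, method="inclusive")

# Interquartile range: third quartile minus first quartile
iqr = q3 - q1

print(q1, q2, q3)  # 62.75 69.0 80.0
print(iqr)         # 17.25
```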

Box plot and quartile

[Figure: box plot and its quartiles]

Because the quartiles divide the data into four parts by count, each of the following sections contains the same number of data points.

- Minimum value -> First quartile
- First quartile -> Second quartile
- Second quartile -> Third quartile
- Third quartile -> Maximum value

However, in the graph above, the sections have different lengths. This shows how unevenly the data is spread. The section from the first quartile to the second quartile is very short, which means that a quarter of the scores are packed into a narrow range around 40 points.

In addition, the two sections between the first quartile and the third quartile (first quartile -> second quartile -> third quartile) together contain half of all the data. In other words, half of the people have test scores between 39 and 70 points.

Box plot with outliers

The median and the quartiles are hardly affected even when the maximum and minimum values are extreme, but the maximum and minimum themselves obviously are. Extreme values may come from abnormal data such as measurement errors. This section therefore explains how to draw a box plot that treats such extreme values as outliers.

In the box plot described so far, the whiskers become long when the maximum or minimum value is extreme. In a box plot that accounts for outliers, each whisker is limited to at most 1.5 times the length of the box (that is, 1.5 × IQR) beyond the box edge, on both the maximum side and the minimum side. Any data point beyond that limit is treated as an outlier.
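The 1.5 × IQR rule can be sketched by hand as follows. The variable names (`lower_fence`, `upper_fence`, and so on) are chosen for this example only, the quartiles again assume linear interpolation via `method="inclusive"`, and the sample list is the one used in the next plot of this article.

```python
from statistics import quantiles

# National language scores including two extreme values (170 and 190)
literature = [81, 62, 32, 67, 41, 50, 85, 100, 170, 190]

q1, q2, q3 = quantiles(literature, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which a point counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in literature if x < lower_fence or x > upper_fence]
inliers = [x for x in literature if lower_fence <= x <= upper_fence]

# Each whisker ends at the most extreme data point that is still inside the fences
whisker_low, whisker_high = min(inliers), max(inliers)

print(lower_fence, upper_fence)   # -11.875 161.125
print(outliers)                   # [170, 190]
print(whisker_low, whisker_high)  # 32 100
```

Note that the whiskers stop at actual data points (32 and 100 here), not at the fence values themselves.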

matplotlib's boxplot applies this rule by default (its whis parameter defaults to 1.5), so outliers are detected automatically. In the code below, the national language scores include 170 points and 190 points. The test is out of 100, so these two should be outliers. The y-axis range has been extended to 200. Now let's create the graph.

# Jupyter Notebook magic; remove the next line when running as a plain script
%matplotlib inline
import matplotlib.pyplot as plt

# National language scores, including the extreme values 170 and 190
literature = [81, 62, 32, 67, 41, 50, 85, 100, 170, 190]
# A single data series; (literature) would be the same list, not a tuple
points = literature

# Create the box plot
fig, ax = plt.subplots()

bp = ax.boxplot(points)
ax.set_xticklabels(['literature'])

plt.title('Box plot')
plt.xlabel('exams')
plt.ylabel('point')
# Y-axis scale range
plt.ylim([0, 200])
plt.grid()

# Draw the figure
plt.show()

[Figure: box plot of the national language scores with outliers]

There are two + markers near the top of the graph (newer matplotlib versions draw outliers as circles by default). These are the 170-point and 190-point scores, which were treated as outliers. matplotlib identifies them automatically without any explicit outlier definition, which is very convenient.
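The outliers that matplotlib detected can also be read back from the dictionary returned by `boxplot`, as in the sketch below. The `"Agg"` backend is selected only so that the example runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen for this sketch only
import matplotlib.pyplot as plt

literature = [81, 62, 32, 67, 41, 50, 85, 100, 170, 190]

fig, ax = plt.subplots()
bp = ax.boxplot(literature)  # default whis=1.5, i.e. the 1.5 x IQR rule

# bp['fliers'] holds one Line2D per box; its y data are the outlier values
outliers = sorted(float(y) for y in bp['fliers'][0].get_ydata())
print(outliers)  # [170.0, 190.0]

plt.close(fig)
```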

That's all.
