[PYTHON] [Statistics] Understand the mechanism of Q-Q plot by animation.

Following the previous article on the ROC curve, this time I will use animated graphs to explain the meaning of the **Q-Q plot**, which appears in the official textbook of Statistics Test Level 2. This is another slightly quirky graph that takes a few tricks to understand, so I would like to try explaining it. In R you can draw a Q-Q plot with qqnorm, but as a black box it does not show how the plot works, so I wrote it myself in Python.

1. About the data used this time

The data used this time is the condominium rent data from the textbook. Here is what it looks like.

Mansion2.data

     Walk_min  distance  Price  Type  Area   Direction  Year
0    8         B         7900   1K    30.03  South      3
1    9         B         8500   1K    21.9   South      5
2    10        B         10800  1K    27.05  South      4
3    10        B         10800  1K    29.67  South      4
...  ...       ...       ...    ...   ...    ...        ...
185  11        B         8600   1K    20.79  Northeast  0
186  8         B         7100   1K    22     West       17
187  9         B         18400  1LDK  54.68  West       10

You can download this data from the "Data for download" link partway down the page for the official textbook of Statistics Test Level 2. Unzip the downloaded zip file; **Mansion2.data** in the [Chapter 2]-[Body] folder is the data used this time.

Now that we have the data, let's first draw some graphs to get a picture of it.

mansion-plot-compressor.png Fig.1

The prices are concentrated toward the left, giving a histogram with a long tail on the right. There also appears to be a correlation between price and area.

Since this Q-Q plot focuses on price, let's go one step further with price and interpret the graph. The question is whether or not this distribution follows a normal distribution.

In fact, when I overlay the density function of the normal distribution with the mean and standard deviation obtained from this data, as shown below, it clearly does not match. I will proceed anyway without worrying about it.
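As a rough sketch of the "normal distribution with the data's mean and standard deviation" idea, the snippet below fits a normal distribution to only the seven prices visible in the table excerpt (not the full 188 rows) and evaluates its density. Using the sample standard deviation is my assumption; the plotting itself is left out, and only the Python standard library is used.

```python
from statistics import NormalDist, mean, stdev

# Only the seven prices visible in the table excerpt, NOT the full 188 rows.
prices = [7900, 8500, 10800, 10800, 8600, 7100, 18400]

mu = mean(prices)
sigma = stdev(prices)  # assumption: sample standard deviation (n - 1 denominator)
fitted = NormalDist(mu, sigma)

# Density of the fitted normal at a few price levels.
# These are the values one would overlay on a (density-scaled) histogram.
for x in (6000, 10000, 14000, 18000):
    print(x, fitted.pdf(x))
```

With the full data you would read the Price column instead of typing the values in by hand.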

qq-_02-compressor.png Fig.2

cum_hist_norm-compressor.png Fig.3

The Python code for drawing the above set of graphs is here.

2. What is a Q-Q plot?

First of all, let's take a look at the Q-Q plot itself.

Q-Q_plot-compressor.png Fig.4

This is the Q-Q plot for the "Price" column, that is, the rent data we are looking at this time. At first glance it is not obvious what the graph shows. The textbook explains that "a Q-Q plot is a graph for comparing the obtained data with a theoretical distribution and examining how similar they are." **If they are similar, the plotted points line up on a straight line.**

So how should we interpret the graph above? Fig.4 can be thought of as a transformed version of Fig.2. In other words, it lets you judge visually how similar the rent data is to the normal distribution, the theoretical distribution here, by whether or not the points fall on a straight line.

3. The origin of Q-Q plot

This graph measures similarity to the theoretical distribution by how straight the points are, but to see why that works you need to understand how the graph is drawn, so I will explain that next.

Let's use the rent data again. This is the shape of its distribution. price_hist-compressor.png Fig.5

Two intermediate graphs are used to build a Q-Q plot from here.

The first is made by sorting the rent data in ascending order and plotting each value as a dot. There are 188 data points in total, spaced evenly between 0 and 1 along the quantile axis. House_price_sorted-compressor.png Fig.6
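A minimal sketch of this first intermediate graph, again using only the seven prices visible in the table excerpt. The rule p_i = (i + 0.5) / n for spreading the points between 0 and 1 is my own assumption; it keeps every position strictly inside (0, 1), and the exact spacing used for Fig.6 may differ.

```python
# Only the seven prices visible in the table excerpt, NOT the full 188 rows.
prices = [7900, 8500, 10800, 10800, 8600, 7100, 18400]

n = len(prices)
sorted_prices = sorted(prices)  # ascending order

# Assumed plotting positions: the i-th smallest value sits at (i + 0.5) / n,
# so the points are evenly spaced and never land exactly on 0 or 1.
positions = [(i + 0.5) / n for i in range(n)]

for p, x in zip(positions, sorted_prices):
    print(f"{p:.3f}  {x}")
```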

The second is a graph of the cumulative distribution function of the normal distribution, since we assume a normal distribution as the theoretical distribution this time. It is also drawn with 188 points, the same number as the rent data. cumulative_norm-compressor.png Fig.7
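A sketch of the second intermediate graph: 188 points on the normal cumulative distribution function, one per quantile position. I assume a standard normal (mean 0, standard deviation 1) and the same hypothetical (i + 0.5) / n positions; `inv_cdf` from the standard library's `statistics.NormalDist` turns each quantile into the value where the CDF reaches it.

```python
from statistics import NormalDist

n = 188                  # same number of points as the rent data
dist = NormalDist(0, 1)  # assumption: standard normal as the theoretical distribution

# Assumed quantile positions, evenly spaced strictly inside (0, 1).
positions = [(i + 0.5) / n for i in range(n)]

# Value of the normal variable at each quantile (inverse of the CDF).
theoretical = [dist.inv_cdf(p) for p in positions]

# Each pair (value, quantile) is one point on the CDF curve.
for z, p in list(zip(theoretical, positions))[:3]:
    print(f"value={z:+.3f}  cdf={p:.4f}")
```

Using a normal distribution fitted to the data instead of the standard normal only rescales the value axis linearly, so whether the final points form a straight line is unaffected.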

Combining these two graphs gives the Q-Q plot. Let's watch it as an animation.

Q-Q_plot_House_price-compressor.gif Fig.8

The intermediate graph Fig.6 is the upper-right panel and Fig.7 is the lower-left panel; the upper-left panel is the Q-Q plot we are after. The horizontal axis of the rent data graph (upper right) represents the quantile, and so does the vertical axis of the normal cumulative distribution function (lower left). The quantile is moved from 0 to 1 in both panels at the same time, as shown by the black lines. The point where each black line crosses its curve is marked with a red dot, and the Q-Q plot is obtained by plotting the values of the two red dots against each other, as the dotted lines show. The "Q" in Q-Q plot stands for Quantile, and I think the name comes from moving the quantile in the upper-right and lower-left graphs together.
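The pairing that the animation performs can be sketched in a few lines. `qq_points` is a hypothetical helper of mine (not taken from the linked code): for each quantile it reads the data value and the normal value and pairs them. The standard normal and the (i + 0.5) / n positions are assumptions, and the input is only the seven prices visible in the table excerpt.

```python
from statistics import NormalDist


def qq_points(data):
    """Return (theoretical, observed) pairs for a normal Q-Q plot."""
    n = len(data)
    observed = sorted(data)        # the "upper right" graph: data by quantile
    std_normal = NormalDist(0, 1)  # the "lower left" graph: normal CDF
    # Same quantile position for both graphs; assumed to be (i + 0.5) / n.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


# Only the seven prices visible in the table excerpt, NOT the full 188 rows.
prices = [7900, 8500, 10800, 10800, 8600, 7100, 18400]
points = qq_points(prices)

for z, x in points:
    print(f"{z:+.3f}  {x}")
```

Plotting the first element of each pair on the horizontal axis and the second on the vertical axis gives the upper-left panel.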

(Python code is here)

4. Q-Q plot of random numbers following a normal distribution

If the data and the theoretical distribution are the same, the Q-Q plot should be a straight line, so let's try that too by using random numbers that follow a normal distribution. Here is a histogram of 188 such random numbers. Norm_hist-compressor.png

Drawing the Q-Q plot ... it is indeed very nearly a straight line. Q-Q_plot_Norm-compressor.gif
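One way to put a number on "nearly a straight line" is the correlation coefficient between the two coordinates of the Q-Q points; this check is my own addition, not something from the textbook. The sketch draws 188 normal random numbers with the standard library (seed chosen arbitrarily) and uses hypothetical helpers `qq_points` and `pearson`.

```python
import random
from statistics import NormalDist, mean


def qq_points(data):
    """Return (theoretical, observed) pairs for a normal Q-Q plot."""
    n = len(data)
    std_normal = NormalDist(0, 1)
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(data)))


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)  # arbitrary seed, only for reproducibility
sample = [random.gauss(0.0, 1.0) for _ in range(188)]

pairs = qq_points(sample)
r = pearson([z for z, _ in pairs], [x for _, x in pairs])
print(r)  # close to 1 when the points lie near a straight line
```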

5. Q-Q plot of random numbers following an exponential distribution

Next is the exponential distribution, which has a long tail on the right. Exp_hist-compressor.png

For a shape like this, the normal Q-Q plot is convex toward the lower right. Q-Q_plot_Exp_Dist-compressor.gif
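"Convex toward the lower right" means the curve gets steeper as the quantile increases. As a sketch, the snippet below compares the slope of the Q-Q curve below and above the median for 188 exponential random numbers. `qq_slope` is a hypothetical helper, the rate 1.0 and the seed are arbitrary, and the sample quantiles are taken crudely as single order statistics.

```python
import random
from statistics import NormalDist


def qq_slope(sorted_data, p_lo, p_hi):
    """Approximate slope of the normal Q-Q curve between two quantile levels."""
    n = len(sorted_data)
    std_normal = NormalDist(0, 1)
    # Crude sample quantiles: the order statistic at index int(p * n).
    x_lo = sorted_data[int(p_lo * n)]
    x_hi = sorted_data[int(p_hi * n)]
    return (x_hi - x_lo) / (std_normal.inv_cdf(p_hi) - std_normal.inv_cdf(p_lo))


random.seed(0)  # arbitrary seed, only for reproducibility
sample = sorted(random.expovariate(1.0) for _ in range(188))

lower = qq_slope(sample, 0.1, 0.5)  # slope below the median
upper = qq_slope(sample, 0.5, 0.9)  # slope above the median
print(lower, upper)  # a right-tailed sample gives upper > lower
```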

6. Q-Q plot of random numbers following the F distribution

This is an F distribution with a slightly long tail on the right. F_hist-compressor.png Its Q-Q plot is also slightly convex toward the lower right. Q-Q_plot_F_Dist-compressor.gif

7. Q-Q plot of random numbers following a beta distribution

Next, let's draw a Q-Q plot for a distribution with a long tail on the left: the beta distribution with $\alpha = 6, \beta = 2$. Beta_hist-compressor.png This time the Q-Q plot is, conversely, convex toward the upper left. Q-Q_plot_Beta_Dist-compressor.gif

A slightly different case is the beta distribution with $\alpha = 0.5, \beta = 0.5$, which has peaks at both ends. In this case the Q-Q plot is convex toward the lower right in the first half and convex toward the upper left in the second half. Beta_hist2-compressor.png Q-Q_plot_Beta_Dist2-compressor.gif
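Both beta shapes can be checked numerically by looking at how the slope of the Q-Q curve changes. This is only a sketch: `qq_slope` is a hypothetical helper, the seed is arbitrary, and I use 2000 random numbers rather than 188 so that sampling noise does not blur the curvature.

```python
import random
from statistics import NormalDist


def qq_slope(sorted_data, p_lo, p_hi):
    """Approximate slope of the normal Q-Q curve between two quantile levels."""
    n = len(sorted_data)
    std_normal = NormalDist(0, 1)
    # Crude sample quantiles: the order statistic at index int(p * n).
    x_lo = sorted_data[int(p_lo * n)]
    x_hi = sorted_data[int(p_hi * n)]
    return (x_hi - x_lo) / (std_normal.inv_cdf(p_hi) - std_normal.inv_cdf(p_lo))


random.seed(0)  # arbitrary seed, only for reproducibility
n = 2000        # assumption: larger than the article's 188 for a stable shape

left_tailed = sorted(random.betavariate(6, 2) for _ in range(n))
two_peaked = sorted(random.betavariate(0.5, 0.5) for _ in range(n))

# Beta(6, 2): the slope shrinks as the quantile grows (convex toward upper left).
skew_lower = qq_slope(left_tailed, 0.1, 0.5)
skew_upper = qq_slope(left_tailed, 0.5, 0.9)
print(skew_lower, skew_upper)

# Beta(0.5, 0.5): the slope grows in the first half and shrinks in the second.
levels = [(0.1, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 0.9)]
u_slopes = [qq_slope(two_peaked, a, b) for a, b in levels]
print(u_slopes)
```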

The full text of the Python code for drawing the graphs on this page is here
