[PYTHON] Let's look at the scatter plot before data analysis

When you start analyzing data, you will probably check [summary statistics](https://ja.wikipedia.org/wiki/%E8%A6%81%E7%B4%84%E7%B5%B1%E8%A8%88%E9%87%8F) such as the mean and variance of the data. However, checking the summary statistics alone is sometimes not enough.

For example, consider data like the following [^ 1].

import pandas as pd
import seaborn as sns

# Load the data
df = pd.read_csv('https://git.io/vD7ui')

# Scatter plot, one panel per data set
sns.lmplot(x='x', y='y', col='data', hue='data', col_wrap=2, fit_reg=False, data=df)

[Figure: scatter plot]

The scatter plots show that the data sets are clearly different from one another, yet their means and standard deviations are almost the same.

# Mean of each data set
df.groupby('data').mean()

data	x	y
0	9	7.500909
1	9	7.500909
2	9	7.500000
3	9	7.500909

# Standard deviation of each data set
df.groupby('data').std()

data	x	y
0	3.316625	2.031568
1	3.316625	2.031657
2	3.316625	2.030424
3	3.316625	2.030579

The values differ in the lower decimal places, but they are almost the same.
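To make the point concrete, here is a minimal sketch that recomputes the mean and standard deviation of x using only the standard library. The x values are typed in by hand and are assumed to match the CSV loaded above (data sets 0 to 2 are assumed to share the same x values); the variable names are only illustrative.

```python
import statistics

# x values typed in by hand (assumption: they match the CSV used in this article).
x_0 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # assumed shared by data sets 0, 1 and 2
x_3 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]       # data set 3

x_values = {"data 0": x_0, "data 3": x_3}

# statistics.stdev divides by n - 1, like the default of pandas' std().
summary = {
    name: (statistics.mean(xs), statistics.stdev(xs))
    for name, xs in x_values.items()
}

for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean}, std={std:.6f}")
```

Both lists give a mean of 9 and a standard deviation of about 3.316625, even though one list is evenly spread and the other is ten copies of 8 plus a single 19.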

The regression lines are also practically identical.

# Scatter plot with regression line
sns.lmplot(x='x', y='y', col='data', hue='data', col_wrap=2, fit_reg=True, data=df)

[Figure: scatter plot + regression line]
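If you want to check the regression lines numerically rather than by eye, the following sketch fits an ordinary least-squares line to each data set by hand. It assumes that the plotted line is a plain least-squares fit, and the data values are typed in by hand on the assumption that they match the CSV; fit_line is a helper name made up for this example.

```python
# Data typed in by hand (assumption: these values match the CSV used in this article).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    0: (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    1: (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    2: (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    3: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    return slope, mean_y - slope * mean_x


coefficients = {key: fit_line(xs, ys) for key, (xs, ys) in datasets.items()}
for key, (slope, intercept) in coefficients.items():
    print(f"data {key}: y = {intercept:.2f} + {slope:.2f} x")
```

Rounded to two decimal places, all four data sets give y = 3.00 + 0.50 x. The coefficients only start to differ around the third or fourth decimal place.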

In pandas, the describe method displays the summary statistics all at once.

# Summary statistics of each data set
df.groupby('data').describe()

data	statistic	x	y
0	count	11.000000	11.000000
	mean	9.000000	7.500909
	std	3.316625	2.031568
	min	4.000000	4.260000
	25%	6.500000	6.315000
	50%	9.000000	7.580000
	75%	11.500000	8.570000
	max	14.000000	10.840000
1	count	11.000000	11.000000
	mean	9.000000	7.500909
	std	3.316625	2.031657
	min	4.000000	3.100000
	25%	6.500000	6.695000
	50%	9.000000	8.140000
	75%	11.500000	8.950000
	max	14.000000	9.260000
2	count	11.000000	11.000000
	mean	9.000000	7.500000
	std	3.316625	2.030424
	min	4.000000	5.390000
	25%	6.500000	6.250000
	50%	9.000000	7.110000
	75%	11.500000	7.980000
	max	14.000000	12.740000
3	count	11.000000	11.000000
	mean	9.000000	7.500909
	std	3.316625	2.030579
	min	8.000000	5.250000
	25%	8.000000	6.170000
	50%	8.000000	7.040000
	75%	8.000000	8.190000
	max	19.000000	12.500000

The mean and standard deviation are as shown earlier, but the quartiles differ somewhat. Data set 3 in particular is very different: the minimum and all three quartiles of x are 8, while the maximum is 19.
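As a small check of the quartiles of x, here is a sketch using statistics.quantiles from the standard library. The x values are again typed in by hand and assumed to match the CSV, and method="inclusive" is assumed to correspond to the linear interpolation that describe uses by default.

```python
import statistics

# x values typed in by hand (assumption: they match the CSV used in this article).
x_0 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x_3 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

# n=4 returns the three cut points: the 25%, 50% and 75% quantiles.
# method="inclusive" treats the data as including its own minimum and maximum
# and interpolates linearly between neighbouring data points.
quartiles_0 = statistics.quantiles(x_0, n=4, method="inclusive")
quartiles_3 = statistics.quantiles(x_3, n=4, method="inclusive")

print("data 0:", quartiles_0)
print("data 3:", quartiles_3)
```

For data set 0 the quartiles are 6.5, 9.0 and 11.5, while for data set 3 all three are 8.0. That difference is invisible in the mean and standard deviation but obvious here.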

Data sets like these, which look different in a scatter plot but share nearly the same statistics and regression line, are known as Anscombe's example (also called Anscombe's quartet). This is why it is important to draw a scatter plot in addition to computing statistics.

However, real data is rarely two-dimensional. In that case, you need to devise an approach such as using [Principal Component Analysis (PCA)](https://ja.wikipedia.org/wiki/%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90) to reduce the data to two dimensions before visualizing it.
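As a rough illustration of that idea, the sketch below projects higher-dimensional data onto its first two principal components using NumPy's SVD. The function name pca_2d and the random 5-dimensional data are made up for this example and do not come from the data used in this article; a library implementation such as scikit-learn's PCA class would be another option.

```python
import numpy as np


def pca_2d(data):
    """Project rows of data (n_samples x n_features) onto the first two principal components.

    Hypothetical helper written for this illustration.
    """
    # PCA works on centered data, so subtract the mean of each column.
    centered = data - data.mean(axis=0)
    # The rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Made-up data for illustration only: 100 samples with 5 features.
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 5))

projected = pca_2d(samples)
print(projected.shape)  # (100, 2)
```

The two columns of projected can then be drawn as an ordinary scatter plot, in the same way as x and y above.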

[^ 1]: Rows with the same value in the data column belong to the same data set.