[PYTHON] Let's look at the scatter plot before data analysis

When you start a data analysis, you will probably check [summary statistics](https://ja.wikipedia.org/wiki/%E8%A6%81%E7%B4%84%E7%B5%B1%E8%A8%88%E9%87%8F) such as the mean and variance of the data. However, checking the summary statistics alone is sometimes not enough.

For example, consider the following data [^ 1]:

import pandas as pd
import seaborn as sns

# Load the data
df = pd.read_csv('https://git.io/vD7ui')

# Scatter plot, one panel per dataset
sns.lmplot(x='x', y='y', col='data', hue='data', col_wrap=2, fit_reg=False, data=df)

Figure: scatter plot of each dataset

The scatter plots show that the datasets are clearly different from one another, yet their means and standard deviations are almost the same.

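# Mean of each dataset
df.groupby('data').mean()

| data | x | y |
|---|---|---|
| 0 | 9 | 7.500909 |
| 1 | 9 | 7.500909 |
| 2 | 9 | 7.500000 |
| 3 | 9 | 7.500909 |

# Standard deviation of each dataset
df.groupby('data').std()

| data | x | y |
|---|---|---|
| 0 | 3.316625 | 2.031568 |
| 1 | 3.316625 | 2.031657 |
| 2 | 3.316625 | 2.030424 |
| 3 | 3.316625 | 2.030579 |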

The values differ slightly in the lower decimal places, but they are nearly identical.

The fitted regression lines are also virtually identical.

# Scatter plot with regression line
sns.lmplot(x='x', y='y', col='data', hue='data', col_wrap=2, fit_reg=True, data=df)

Figure: scatter plot of each dataset with its regression line
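The claim about the regression lines can also be checked numerically rather than by eye. The following is a minimal sketch using only the standard library. It assumes that the CSV loaded above contains the standard published values of Anscombe's quartet, which are typed in by hand here so that the snippet runs offline; `quartet` and `fit_line` are names made up for this illustration.

```python
from statistics import mean

# Anscombe's quartet, typed in by hand so the snippet runs offline.
# Assumption: these standard published values match the CSV loaded above.
x_common = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    0: (x_common, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    1: (x_common, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    2: (x_common, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    3: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Fit each dataset separately and print the resulting line.
fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept) in fits.items():
    print(f"data {name}: y = {slope:.3f} * x + {intercept:.3f}")
```

Under that assumption, every dataset should give a line close to y = 0.5x + 3, even though the point clouds look nothing alike.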

In pandas, the describe method displays the main summary statistics all at once.

# Summary statistics of each dataset
df.groupby('data').describe()

| data | | x | y |
|---|---|---|---|
| 0 | count | 11.000000 | 11.000000 |
| | mean | 9.000000 | 7.500909 |
| | std | 3.316625 | 2.031568 |
| | min | 4.000000 | 4.260000 |
| | 25% | 6.500000 | 6.315000 |
| | 50% | 9.000000 | 7.580000 |
| | 75% | 11.500000 | 8.570000 |
| | max | 14.000000 | 10.840000 |
| 1 | count | 11.000000 | 11.000000 |
| | mean | 9.000000 | 7.500909 |
| | std | 3.316625 | 2.031657 |
| | min | 4.000000 | 3.100000 |
| | 25% | 6.500000 | 6.695000 |
| | 50% | 9.000000 | 8.140000 |
| | 75% | 11.500000 | 8.950000 |
| | max | 14.000000 | 9.260000 |
| 2 | count | 11.000000 | 11.000000 |
| | mean | 9.000000 | 7.500000 |
| | std | 3.316625 | 2.030424 |
| | min | 4.000000 | 5.390000 |
| | 25% | 6.500000 | 6.250000 |
| | 50% | 9.000000 | 7.110000 |
| | 75% | 11.500000 | 7.980000 |
| | max | 14.000000 | 12.740000 |
| 3 | count | 11.000000 | 11.000000 |
| | mean | 9.000000 | 7.500909 |
| | std | 3.316625 | 2.030579 |
| | min | 8.000000 | 5.250000 |
| | 25% | 8.000000 | 6.170000 |
| | 50% | 8.000000 | 7.040000 |
| | 75% | 8.000000 | 8.190000 |
| | max | 19.000000 | 12.500000 |

The means and standard deviations are as shown earlier, but the quartiles differ somewhat. Dataset 3 stands out in particular: its x value is 8 at the minimum and at every quartile, and only the maximum is 19.

Datasets like these, which look different in a scatter plot but share nearly the same summary statistics and regression line, are known as Anscombe's example (Anscombe's quartet). This is why it is important to draw a scatter plot in addition to checking the statistics.

However, real data is rarely two-dimensional. In that case, you need some extra step, such as using [Principal Component Analysis (PCA)](https://ja.wikipedia.org/wiki/%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90) to reduce the data to two dimensions before visualizing it.
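As a rough illustration of that idea, here is a minimal sketch of PCA done directly with NumPy's singular value decomposition. The input `X` is hypothetical random data generated only for this example (it does not come from the article), and `pca_2d` is a name made up here. In practice a library implementation such as scikit-learn's `PCA` is the more common choice.

```python
import numpy as np

# Hypothetical stand-in data: 200 samples with 5 features, generated only
# for illustration (not taken from the article).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))      # two underlying factors
mixing = rng.normal(size=(2, 5))        # spread them over 5 features
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))


def pca_2d(data):
    """Project the rows of `data` onto their first two principal components."""
    centered = data - data.mean(axis=0)  # PCA works on mean-centered columns
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:2].T, explained[:2]


points, explained_ratio = pca_2d(X)
print(points.shape)      # one (PC1, PC2) pair per sample
print(explained_ratio)   # share of the variance kept by each component
```

The two columns of `points` can then be drawn as an ordinary scatter plot. The explained-variance ratios give a hint of how much structure is lost by looking at only two dimensions.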

[^ 1]: Rows with the same value in the data column belong to the same dataset.
