[PYTHON] [Updated from time to time] Organizing the basic visualization methods

This is an article from a data scientist working in the manufacturing industry.
This time, I have organized the visualization methods that are often used in business.
I plan to update it from time to time.

Introduction

This time I will use the Auto MPG dataset. It records the fuel economy (miles per gallon) of automobiles with model years from 1970 to 1982.

Data confirmation

# Import the required libraries
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import os

file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
file_name = os.path.splitext(os.path.basename(file_path))[0]
column_names = ['MPG','Cylinders', 'Displacement', 'Horsepower', 'Weight',
                  'Acceleration', 'Model Year', 'Origin'] 

df = pd.read_csv(
    file_path, # Path (URL) of the data file
    names = column_names, # Specify the column names
    na_values ='?', # Read '?' as a missing value
    comment = '\t', # Ignore everything after a TAB (the car name)
    sep = ' ', # Use a space as the delimiter
    skipinitialspace = True, # Skip the extra spaces after a delimiter
    encoding = 'utf-8'
) 
df.head()
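
To see what each read_csv option does without downloading the file, the following minimal sketch parses two lines laid out like the raw file: values separated by runs of spaces, followed by a TAB and a quoted car name. The two sample lines and the names raw and sample are assumptions made for illustration; they are not taken from the article.

```python
import io
import pandas as pd

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

# Two illustrative lines in the same layout as the raw file (assumed sample).
raw = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"car a"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"car b"\n'
)

sample = pd.read_csv(
    io.StringIO(raw),
    names=column_names,
    na_values='?',          # '?' becomes NaN
    comment='\t',           # drops the TAB and the car name after it
    sep=' ',                # a single space is the delimiter
    skipinitialspace=True,  # swallows the extra spaces between values
)
print(sample)
print(sample.dtypes)
```

The car name never reaches the DataFrame because everything from the TAB onward is treated as a comment, which is why only eight column names are needed.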


Check the number of records and columns

#Check the number of records and columns
df.shape

Confirmation of missing values

#Check the number of missing values
df.isnull().sum()
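
Besides the raw count, it can help to look at the share of missing values and at the affected rows themselves. The sketch below uses a small stand-in DataFrame with made-up values (demo is a hypothetical name) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical values) with two missing entries.
demo = pd.DataFrame({
    'MPG': [18.0, 25.0, 21.0, 30.0],
    'Horsepower': [130.0, np.nan, 95.0, np.nan],
})

missing_count = demo.isnull().sum()    # missing entries per column
missing_ratio = demo.isnull().mean()   # share of missing entries per column
rows_with_missing = demo[demo.isnull().any(axis=1)]  # rows with at least one NaN

print(missing_count)
print(missing_ratio)
print(rows_with_missing)
```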

Check the data type of each column of the DataFrame

# Check the data type of each column of the DataFrame
df.dtypes

Visualization of missing values

This is useful for checking whether the missing values follow a pattern. It is also handy because it is easy to understand when explaining the data to people on the shop floor.

plt.figure(figsize=(14,7))
sns.heatmap(df.isnull())

[Figure: heatmap of missing values]

Checking summary statistics

#Summary statistics
df.describe()


Histogram creation

#histogram
df['MPG'].plot(kind='hist', bins=12)

[Figure: histogram of MPG]
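
The bins argument decides how the values are grouped, so the same data can look quite different. As a rough sketch, np.histogram returns the counts that such a plot would draw. The values here are randomly generated stand-ins, not the Auto MPG data.

```python
import numpy as np

# Hypothetical MPG-like values, generated with a fixed seed for repeatability.
rng = np.random.default_rng(0)
values = rng.normal(loc=23.0, scale=7.0, size=200)

# The same data binned coarsely and finely.
counts_coarse, edges_coarse = np.histogram(values, bins=5)
counts_fine, edges_fine = np.histogram(values, bins=12)

print(counts_coarse)  # 5 counts that add up to 200
print(counts_fine)    # 12 counts that add up to 200
```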

Creating a kernel density estimate

A histogram can look quite different depending on the bin width, so a graph created by kernel density estimation is often used instead.

# Kernel density estimation (seaborn 0.11 and later use fill=True; shade=True is deprecated)
sns.kdeplot(data=df['MPG'], fill=True)

[Figure: kernel density estimate of MPG]
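
A kernel density estimate places a small bump on every data point and averages the bumps. It has no bins, but it does have a bandwidth that plays a similar role. The following is a minimal sketch with a Gaussian kernel written in plain NumPy. The function name gaussian_kde_1d, the bandwidth values, and the random data are assumptions for illustration; this is not how seaborn is implemented internally.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centered on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
samples = rng.normal(loc=23.0, scale=7.0, size=200)  # hypothetical MPG-like values
grid = np.linspace(-40.0, 90.0, 1301)

density_narrow = gaussian_kde_1d(samples, grid, bandwidth=1.0)  # bumpier curve
density_wide = gaussian_kde_1d(samples, grid, bandwidth=5.0)    # smoother curve

# Each curve is a density, so the area under it is approximately 1.
step = grid[1] - grid[0]
area_narrow = density_narrow.sum() * step
area_wide = density_wide.sum() * step
print(area_narrow, area_wide)
```

In seaborn the bandwidth is chosen automatically, and the bw_adjust parameter of sns.kdeplot scales it if the curve looks too bumpy or too smooth.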

Creating a scatter plot

Scatter plot + histogram

#Scatter plot+histogram
sns.jointplot(x='Model Year', y='MPG', data=df, alpha=0.3)

[Figure: scatter plot with marginal histograms]

Hexagonal scatter plot

A scatter plot with a somewhat more modern, stylish look. Points are grouped into hexagonal bins, and each bin is colored by the number of points it contains.

# Hexagonal bins
sns.jointplot(x='Model Year', y='MPG', data=df, kind='hex')

[Figure: hexagonal-bin plot of Model Year and MPG]

Joint plot with kernel density estimation

Generates a contour-like graph of the estimated joint density.

# Density estimates (seaborn 0.11 and later use fill=True; shade=True is deprecated)
sns.jointplot(x='Model Year', y='MPG', data=df, kind='kde', fill=True)

Scatterplot matrix

#Scatterplot matrix
sns.pairplot(df[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")

[Figure: scatterplot matrix]

Creating a boxplot

Visualize the variability of the data.

Count plot

# Count plot by model year
ax = sns.countplot(x='Model Year', data=df, color='cornflowerblue')

Box plot

#Box plot(boxplot)
sns.boxplot(x='Model Year', y='MPG', data=df.sort_values('Model Year'), color='cornflowerblue')
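
To make clear what the box and whiskers encode, the sketch below computes the same kind of numbers by hand: the quartiles, the interquartile range (IQR), and the points beyond 1.5 times the IQR, which is the default whisker rule in seaborn and matplotlib. The data and the helper name box_stats are made up for illustration.

```python
import pandas as pd

# Hypothetical MPG values for two model years.
demo = pd.DataFrame({
    'Model Year': [70, 70, 70, 70, 70, 82, 82, 82, 82, 82],
    'MPG': [10.0, 14.0, 15.0, 16.0, 30.0, 28.0, 31.0, 32.0, 34.0, 36.0],
})

def box_stats(values):
    """Quartiles, IQR and the number of points outside the 1.5 * IQR fences."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return pd.Series({'q1': q1, 'median': median, 'q3': q3,
                      'iqr': iqr, 'n_outliers': len(outliers)})

# One row of statistics per model year.
stats = pd.DataFrame(
    {year: box_stats(group) for year, group in demo.groupby('Model Year')['MPG']}
).T
print(stats)
```

A tall box (large IQR) means the fuel economy varies a lot within that model year, and points outside the fences are drawn as individual markers.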

Violin plot

A graph that lets you check the density of the data distribution.

# violin plot 
sns.violinplot(x='Model Year', y='MPG', data=df.sort_values('Model Year'), color='cornflowerblue')

[Figure: violin plot]

Swarm plot

A graph that shows the data distribution as individual points.

# swarm plot
fig, ax = plt.subplots(figsize=(20, 5))
ax.tick_params(labelsize=20)
sns.swarmplot(x='Model Year', y='MPG', data=df.sort_values('Model Year'))

[Figure: swarm plot]

Heat map

Correlation coefficient matrix

# Correlation coefficient matrix (after dropping rows that contain a 0 in any column)
df = df[(df!=0).all(axis=1)]
corr = df.corr()
corr
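
The two steps above can be sketched on a tiny stand-in DataFrame: the filter keeps only rows in which no column equals 0, and each entry of the matrix is a Pearson correlation coefficient, which can be reproduced by hand. The values and the name demo are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'MPG':    [18.0, 25.0, 30.0, 0.0, 36.0],
    'Weight': [3500.0, 2600.0, 2200.0, 3000.0, 1800.0],
})

# Keep only rows in which no column equals 0 (NaN is not equal to 0, so NaN rows stay).
filtered = demo[(demo != 0).all(axis=1)]

# Pearson correlation by hand: covariance divided by the product of the spreads.
x = filtered['MPG'] - filtered['MPG'].mean()
y = filtered['Weight'] - filtered['Weight'].mean()
r_manual = (x * y).sum() / np.sqrt((x**2).sum() * (y**2).sum())

# The same value taken from the correlation matrix.
r_pandas = filtered.corr().loc['MPG', 'Weight']
print(r_manual, r_pandas)
```

Note that the filter does not remove missing values. df.corr() simply ignores them pair by pair when computing each coefficient.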


Heat map of correlation coefficient matrix

I personally like the 'coolwarm' color scheme for cmap. If you do not specify a colormap, the default colors are subdued and hard to read in presentation materials.

#Correlation coefficient heat map
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

[Figure: heatmap of the correlation coefficient matrix]
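
Two optional tweaks can make such a heatmap easier to read in materials: fixing the color scale to the full range of a correlation coefficient (vmin=-1, vmax=1) so that 0 falls on the neutral middle color, and hiding the upper triangle, which only mirrors the lower one. This is one possible sketch on randomly generated stand-in data, not the article's own settings.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: column C is built to correlate negatively with A.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=['A', 'B', 'C'])
demo['C'] = demo['A'] * -0.8 + demo['C'] * 0.2
corr = demo.corr()

# True marks the cells to hide: the upper triangle above the diagonal.
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', vmin=-1, vmax=1, ax=ax)
```

With a fixed scale, the same color means the same correlation in every heatmap you put side by side.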

Conclusion

Thank you for reading to the end. This time, I organized the basic visualization methods. I will keep updating this article as my own memo.

If you find anything that needs correcting, I would appreciate it if you could let me know.