[PYTHON] Basic visualization techniques learned from Kaggle Titanic data


[Updated from time to time] Using mainly the snippets from "EDA / Feature Engineering Snippets Used in Kaggle Table Data Competition", this post visualizes the basics of the Kaggle Titanic data (https://www.kaggle.com/c/titanic/data).


import numpy as np 
import pandas as pd
import pandas_profiling as pdp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

cmap = plt.get_cmap("tab10")
%matplotlib inline

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option("display.max_colwidth", 10000)
target_col = "Survived"
data_dir = "/kaggle/input/titanic/"

Check the folder

!ls -GFlash /kaggle/input/titanic/
total 100K
4.0K drwxr-xr-x 2 nobody 4.0K Jan  7  2020 ./
4.0K drwxr-xr-x 5 root   4.0K Jul 12 00:15 ../
4.0K -rw-r--r-- 1 nobody 3.2K Jan  7  2020 gender_submission.csv
 28K -rw-r--r-- 1 nobody  28K Jan  7  2020 test.csv
 60K -rw-r--r-- 1 nobody  60K Jan  7  2020 train.csv

Read data

train = pd.read_csv(data_dir + "train.csv")
test = pd.read_csv(data_dir + "test.csv")
submit = pd.read_csv(data_dir + "gender_submission.csv")

Check the data

[Screenshot: preview of the loaded data]
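
A minimal sketch of how a DataFrame can be previewed, assuming `head()` and `dtypes` are the kind of check wanted here. The small `sample` frame is a made-up stand-in for train.csv so that the snippet runs without the Kaggle files.

```python
import pandas as pd

# Hypothetical stand-in for train.csv: a few made-up rows with Titanic-style columns.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, None, 35.0],
})

preview = sample.head(3)  # first three rows
dtypes = sample.dtypes    # data type of each column

print(preview)
print(dtypes)
```

On the real data, the same calls would be `train.head()` and `train.dtypes`.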

Check the number of records and columns

print("{} rows and {} features in train set".format(train.shape[0], train.shape[1]))
print("{} rows and {} features in test set".format(test.shape[0], test.shape[1]))
print("{} rows and {} features in submit set".format(submit.shape[0], submit.shape[1]))
891 rows and 12 features in train set
418 rows and 11 features in test set
418 rows and 2 features in submit set

Check the number of missing values for each column

Count how many values are missing in each column.

train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
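
Raw counts are easier to judge next to the share of rows they represent. In the output above, for example, 687 of 891 Cabin values are missing (about 77%). The sketch below shows one way to get both numbers at once. The `sample` frame and the `summary` name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv with some missing values.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_count = sample.isnull().sum()   # number of missing values per column
missing_ratio = sample.isnull().mean()  # fraction of missing values per column

# Put both in one table, worst column first.
summary = pd.DataFrame({"count": missing_count, "ratio": missing_ratio})
summary = summary.sort_values("count", ascending=False)
print(summary)
```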

Visualization of missing values

Check whether the missing values follow a pattern.

sns.heatmap(train.isnull(), cbar=False)

Check the summary statistics for each column

Check the summary statistics such as mean, standard deviation, maximum, minimum, and mode for each column to get a rough idea of the data.

[Screenshot: table of summary statistics for each column]
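
A minimal sketch of one way to get such a table with pandas, assuming `DataFrame.describe` is the tool in use. By default it covers only numeric columns. Passing `include="all"` adds rows such as `top` (the mode) and `freq` for non-numeric columns. The `sample` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for train.csv.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "male", "male"],
})

numeric_stats = sample.describe()          # count, mean, std, min, quartiles, max
all_stats = sample.describe(include="all")  # also unique, top (mode), freq

print(numeric_stats)
print(all_stats)
```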

Count the number (frequency) of records

Check the target distribution

sns.countplot(x=target_col, data=train)
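
`sns.countplot` draws raw counts, so the share of each class has to be computed separately. A minimal sketch using `value_counts`, with a made-up `survived` series in place of `train[target_col]`:

```python
import pandas as pd

# Hypothetical stand-in for train["Survived"].
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

counts = survived.value_counts()                # number of rows per class
shares = survived.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```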

Check the distribution of category values

col = "Pclass"
sns.countplot(x=col, data=train)

Check the distribution of a column for each target value

col = "Pclass"
sns.countplot(x=col, hue=target_col, data=train)
col = "Sex"
sns.countplot(x=col, hue=target_col, data=train)
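
The same comparison can be read as a number. Because the target is coded 0/1, its mean within each category is that category's survival rate. A minimal sketch on made-up rows (the `sample` frame is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for train[["Sex", "Survived"]].
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1],
})

# Mean of a 0/1 target per group = share of rows with target 1 in that group.
rate = sample.groupby("Sex")["Survived"].mean()
print(rate)
```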

Histogram

The vertical axis is frequency and the horizontal axis is the bin (value range), which visualizes the distribution of the data. Try several bin sizes, because different bin sizes bring out different characteristics of the data.

col = "Age"
train[col].plot(kind="hist", bins=10, title='Distribution of {}'.format(col))
col = "Fare"
train[col].plot(kind="hist", bins=50, title='Distribution of {}'.format(col))
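
The effect of the bin size can also be checked without plotting. The sketch below bins the same made-up `ages` array with `np.histogram` using 3 and then 10 bins. The total count stays the same, only the resolution changes.

```python
import numpy as np

# Hypothetical ages, made up for illustration.
ages = np.array([2, 4, 18, 21, 22, 25, 29, 31, 35, 40, 58, 64], dtype=float)

# np.histogram returns the count per bin and the bin edges (one more edge than bins).
coarse_counts, coarse_edges = np.histogram(ages, bins=3)
fine_counts, fine_edges = np.histogram(ages, bins=10)

print(coarse_counts, coarse_edges)
print(fine_counts, fine_edges)
```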

Histogram by category

f, ax = plt.subplots(1, 3, figsize=(15, 4))
# sns.distplot is deprecated in recent seaborn; histplot with kde=True gives a comparable histogram plus density curve
sns.histplot(train[train['Pclass']==1]["Fare"], kde=True, stat="density", ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.histplot(train[train['Pclass']==2]["Fare"], kde=True, stat="density", ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.histplot(train[train['Pclass']==3]["Fare"], kde=True, stat="density", ax=ax[2])
ax[2].set_title('Fares in Pclass 3')

Histogram of columns by target category

col = "Age"
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
train[train[target_col]==1][col].plot(kind="hist", bins=50, title='{} - {} 1'.format(col, target_col), color=cmap(0), ax=ax[0])
train[train[target_col]==0][col].plot(kind="hist", bins=50, title='{} - {} 0'.format(col, target_col), color=cmap(1), ax=ax[1])

Histogram of a column for each target value (when overlapping)

col = "Age"
train[train[target_col]==1][col].plot(kind="hist", bins=50, alpha=0.3, color=cmap(0))
train[train[target_col]==0][col].plot(kind="hist", bins=50, alpha=0.3, color=cmap(1))
plt.title("Histogram for {}".format(col))

Kernel density estimation

Roughly speaking, it is a smoothed histogram: a continuous curve that gives an estimated density y for any value x.

# fill=True replaces the deprecated shade=True in recent seaborn
sns.kdeplot(data=train["Age"], label="Age", fill=True)
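
To make "a y for any x" concrete, here is a from-scratch sketch of a one-dimensional Gaussian kernel density estimate in NumPy. It is not what seaborn runs internally: the function name `gaussian_kde_1d`, the fixed `bandwidth=5.0`, and the `ages` values are assumptions for illustration, whereas seaborn chooses a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, x, bandwidth):
    """Estimate the density at each point of x by averaging Gaussian bumps
    of width `bandwidth` centered on every sample."""
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(x, dtype=float)
    # z[i, j] = standardized distance between grid point i and sample j
    z = (x[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical ages, made up for illustration.
ages = np.array([22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0])
grid = np.linspace(-40.0, 100.0, 1401)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# A density should integrate to about 1 (simple Riemann sum over the grid).
area = density.sum() * (grid[1] - grid[0])
print(area)
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve, which plays the same role as the bin size of a histogram.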

Cross tabulation

Count the number of occurrences of each combination of categories in categorical data.

pd.crosstab(train["Sex"], train["Pclass"])
pd.crosstab([train["Sex"], train["Survived"]], train["Pclass"])
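
`pd.crosstab` can also add totals or convert counts to shares. The sketch below uses its `margins` and `normalize` parameters on a made-up `sample` frame: `margins=True` appends an "All" row and column, and `normalize="index"` makes each row sum to 1.

```python
import pandas as pd

# Hypothetical stand-in for train[["Sex", "Pclass"]].
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 3, 1, 2],
})

counts = pd.crosstab(sample["Sex"], sample["Pclass"], margins=True)
row_share = pd.crosstab(sample["Sex"], sample["Pclass"], normalize="index")

print(counts)
print(row_share)
```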

Pivot table

Average of quantitative data by category

pd.pivot_table(index="Pclass", columns="Sex", data=train[["Age", "Fare", "Survived", "Pclass", "Sex"]])

Minimum value of quantitative data for each category

# aggfunc="min" avoids the warning that recent pandas raises for aggfunc=np.min
pd.pivot_table(index="Pclass", columns="Sex", data=train[["Age", "Fare", "Pclass", "Sex"]], aggfunc="min")
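
A pivot table is a group-by followed by a reshape. The sketch below, on a made-up `sample` frame, shows that `pd.pivot_table` with `aggfunc="mean"` gives the same numbers as `groupby(...).mean().unstack()`.

```python
import pandas as pd

# Hypothetical stand-in for train[["Pclass", "Sex", "Fare"]].
sample = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male"],
    "Fare": [80.0, 50.0, 8.0, 7.0, 9.0],
})

# Mean fare per (Pclass, Sex), with Sex spread across the columns.
pivot = pd.pivot_table(sample, values="Fare", index="Pclass", columns="Sex", aggfunc="mean")

# The same result written as group-by plus unstack.
grouped = sample.groupby(["Pclass", "Sex"])["Fare"].mean().unstack()

print(pivot)
print(grouped)
```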

Scatter plot

Check the relationship between two columns.

Scatter plot

sns.scatterplot(x="Age", y="Fare", data=train)

Scatter plot (color-coded by category)

sns.scatterplot(x="Age", y="Fare", hue=target_col, data=train)

Scatterplot matrix

sns.pairplot(data=train[["Fare", "Survived", "Age", "Pclass"]], hue="Survived", dropna=True)

Box plot

Visualize data variability.

Box plot by category

Check the variation of data for each category.

sns.boxplot(x='Pclass', y='Age', data=train)
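
The box in a box plot spans the first to third quartile, with a line at the median. A minimal sketch of computing those numbers per category with pandas, on made-up ages (the `sample` frame and the `IQR` column name are illustrative):

```python
import pandas as pd

# Hypothetical stand-in for train[["Pclass", "Age"]].
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "Age": [30.0, 40.0, 50.0, 60.0, 70.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# Quartiles per class: one row per Pclass, columns 0.25 / 0.5 / 0.75.
q = sample.groupby("Pclass")["Age"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range = height of the box.
q["IQR"] = q[0.75] - q[0.25]
print(q)
```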

Strip chart

A plot that represents each data point as a dot. It is used when one of the two variables is categorical.

sns.stripplot(x="Survived", y="Fare", data=train)
sns.stripplot(x='Pclass', y='Age', data=train)

Heat map

Heat map of correlation coefficient for each column

# numeric_only=True skips text columns such as Name, which pandas 2.x no longer drops silently
sns.heatmap(train.corr(numeric_only=True), annot=True)
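
The heat map is only a colored view of the correlation matrix, so the matrix can be inspected directly. A minimal sketch on a made-up `sample` frame, assuming pandas 1.5 or later for the `numeric_only` parameter:

```python
import pandas as pd

# Hypothetical stand-in for train, including one text column.
sample = pd.DataFrame({
    "Fare": [10.0, 20.0, 30.0, 40.0],
    "Pclass": [3, 3, 2, 1],
    "Survived": [0, 0, 1, 1],
    "Name": ["a", "b", "c", "d"],
})

# Pearson correlation between numeric columns; the text column is left out.
corr = sample.corr(numeric_only=True)
print(corr)
```

In this toy data a higher fare goes with a lower class number, so the Fare/Pclass entry is negative.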

