[Python] EDA memo

Purpose of this article

This memo collects code that I frequently use for EDA (Exploratory Data Analysis), the step performed at the beginning of a data analysis. Here it assumes a classification problem, such as predicting passenger survival on the Titanic.

Code commentary

Loading libraries

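import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)  # show up to 100 columns when displaying a DataFrame

import seaborn as sns
import matplotlib.pyplot as plt
import japanize_matplotlib  # lets matplotlib render Japanese text

import os
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')  # suppress all warnings (this also hides deprecation notices)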

Loading the data

df = sns.load_dataset("titanic")
df = df.replace(float("nan"), np.nan)  # normalize NaN values for the unique() counts used later
df


Check the contents of each variable

for colname in df.columns:
    uni = len(df[colname].unique())
    print("{0:<20} : {1}".format(colname, uni))

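The same per-column counts can also be collected into a single table together with each column's dtype. The following is only a minimal sketch: `summarize_columns` is a hypothetical helper that is not part of the original memo, and it runs on a small made-up frame instead of the real Titanic data so that it stays self-contained.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "pclass": [3, 1, 3, 1, 2],
    "age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "embarked": ["S", "C", "S", np.nan, "Q"],
})


def summarize_columns(data):
    """Return one row per column with its dtype and unique count (NaN counted once)."""
    return pd.DataFrame({
        "dtype": data.dtypes.astype(str),
        "n_unique": data.nunique(dropna=False),
    })


summary = summarize_columns(toy)
print(summary)
```

Passing `dropna=False` makes NaN count as one value, which matches what `len(df[colname].unique())` reports in the loop above.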

Setting the target variable and defining variable types

target = "survived"
cate_list = ["pclass", "sex", "sibsp", "parch", "embarked", "class",
             "who", "adult_male", "deck", "embark_town", "alone"]  # "alive" is ignored (it duplicates the target)
num_list = ["age", "fare"]
all_list = cate_list + num_list
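Because these lists are typed by hand, it can help to confirm that every column is assigned exactly once. The sketch below is one possible check; `check_column_lists` and its `ignored` parameter are hypothetical names introduced here for illustration, and the column names are the ones of seaborn's titanic dataset written out as a plain list so that no download is needed.

```python
# Column names of seaborn's "titanic" dataset, written out by hand.
titanic_columns = ["survived", "pclass", "sex", "age", "sibsp", "parch", "fare",
                   "embarked", "class", "who", "adult_male", "deck",
                   "embark_town", "alive", "alone"]

target = "survived"
cate_list = ["pclass", "sex", "sibsp", "parch", "embarked", "class",
             "who", "adult_male", "deck", "embark_town", "alone"]
num_list = ["age", "fare"]


def check_column_lists(columns, target, cate_list, num_list, ignored=()):
    """Compare the hand-written lists with the actual column names.

    Returns three lists: names that are not real columns, columns that were
    not assigned to any list, and names that were assigned more than once.
    """
    assigned = [target] + list(cate_list) + list(num_list) + list(ignored)
    unknown = [c for c in assigned if c not in columns]
    unassigned = [c for c in columns if c not in assigned]
    duplicated = sorted({c for c in assigned if assigned.count(c) > 1})
    return unknown, unassigned, duplicated


# "alive" is skipped on purpose, so it is passed as an ignored column.
result_ok = check_column_lists(titanic_columns, target, cate_list, num_list,
                               ignored=["alive"])
# Without the ignore list, "alive" is reported as unassigned.
result_missing = check_column_lists(titanic_columns, target, cate_list, num_list)
print(result_ok)
print(result_missing)
```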


Checking for NaN, etc.

See the article here

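The linked article is not reproduced here. As a rough stand-in, a missing-value table can be built with pandas alone. This is only a sketch under stated assumptions: `missing_table` is a hypothetical helper, not the method from the linked article, and the frame below is made-up toy data.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (made up for illustration).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})


def missing_table(data):
    """Return the NaN count and NaN ratio of every column that has missing values."""
    n_missing = data.isna().sum()
    table = pd.DataFrame({
        "n_missing": n_missing,
        "ratio": n_missing / len(data),
    })
    # Keep only columns with at least one NaN, most affected first.
    return table[table["n_missing"] > 0].sort_values("n_missing", ascending=False)


table = missing_table(toy)
print(table)
```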

Categorical data

For a quick check, either of the following two plots is enough.

sns.countplot(x="pclass", hue=target, data=df)


sns.catplot(x="pclass", hue=target, data=df, kind="count")


If you also want to see the NaN counts and the average count per category, define and use the following function.

def category_plot(x, hue, data, order=None):
    # Work on a copy so that the caller's DataFrame is not modified
    data = data.copy()

    # Convert the values to strings so that NaN becomes its own "NaN" category
    data[x] = data[x].astype("str").replace("nan", "NaN")
    flag_nan = "NaN" in data[x].values

    x_unique_list = sorted(data[x].unique())
    x_unique_len = len(x_unique_list)
    # Number of categories, not counting NaN
    x_unique_len_dropna = x_unique_len - 1 if flag_nan else x_unique_len

    hue_unique_list = sorted(data[hue].unique())

    if not order:
        # Default order: sorted categories, with "NaN" placed last
        order = [v for v in x_unique_list if v != "NaN"]
        if flag_nan:
            order.append("NaN")

    colors = plt.get_cmap("tab10").colors

    sns.countplot(x=x, hue=hue, data=data, order=order, hue_order=hue_unique_list)

    # Dashed line per hue value: its total row count divided by the number of non-NaN categories
    for i, ui in enumerate(hue_unique_list):
        h = data.loc[data[hue] == ui, :].shape[0] / x_unique_len_dropna
        plt.plot([-0.5, x_unique_len_dropna - 0.5], [h, h], color=colors[i],
                 linestyle="dashed", label="{0} (average)".format(ui))
    plt.legend()
    plt.show()

category_plot(x="pclass", hue=target, data=df)


category_plot(x="embarked", hue=target, data=df)


category_plot(x="deck", hue=target, data=df)

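The counts drawn by these plots can also be read as a table, which makes it easy to add the mean of the target per category. The following is a minimal sketch rather than part of the original memo: `category_table` is a hypothetical helper, it assumes the target is coded as 0/1, and the data is a made-up toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame: a categorical column with NaN and a 0/1 target (made-up values).
toy = pd.DataFrame({
    "deck": ["C", np.nan, "C", np.nan, "E", np.nan],
    "survived": [1, 0, 1, 0, 1, 1],
})


def category_table(data, x, target):
    """Count rows per (category, target value) and add the target mean per category.

    NaN in the categorical column is kept as its own "NaN" row.
    """
    labels = data[x].map(lambda v: "NaN" if pd.isna(v) else str(v)).rename(x)
    counts = pd.crosstab(labels, data[target])
    # For a 0/1 target the mean is the share of rows with target == 1.
    counts["target_mean"] = data[target].groupby(labels).mean()
    return counts


table = category_table(toy, x="deck", target="survived")
print(table)
```

Keeping NaN as a separate row follows the same idea as `category_plot`: a missing value can itself carry information about the target.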

Numerical data

Looking at the following two types of plots is usually enough.

sns.catplot(x=target, y="age", data=df, kind="swarm")


sns.catplot(x=target, y="age", data=df, kind="violin")

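A numeric summary per target value can complement these plots. The sketch below is an addition for illustration only: `numeric_summary` is a hypothetical helper, and the frame is made-up toy data with the same column names as `num_list`.

```python
import numpy as np
import pandas as pd

# Toy frame with a 0/1 target and two numeric columns (made-up values).
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1, 1],
    "age": [20.0, 40.0, 10.0, np.nan, 30.0],
    "fare": [7.0, 9.0, 50.0, 30.0, 70.0],
})


def numeric_summary(data, target, num_cols):
    """Return count, mean and median of each numeric column per target value.

    "count" ignores NaN, so it also shows how many values are present per group.
    """
    return data.groupby(target)[num_cols].agg(["count", "mean", "median"])


summary = numeric_summary(toy, target="survived", num_cols=["age", "fare"])
print(summary)
```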

References

seaborn: seaborn.catplot
seaborn: seaborn.countplot
