[Python] Memo: first things to do after loading a df (checking for NaN, etc.)

Purpose of this article

Using the Titanic data as an example, this memo records the first things I do to get a feel for a dataset. In most cases pandas-profiling is probably the better choice, since it reports much more detail.

Loading libraries

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)

import warnings
warnings.filterwarnings('ignore')
import collections

Data preparation

!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

Reading the data & light preprocessing

filename = "/content/titanic.csv"
df = pd.read_csv(filename, encoding='utf-8')

# Randomly set some values to NaN (for demonstration)
df["Name"] = [di if np.random.rand()>0.1 else float("nan") for di in df["Name"]]
df["Sex"] = [di if np.random.rand()>0.01 else float("nan") for di in df["Sex"]]
df["Age"] = [di if np.random.rand()>0.05 else float("nan") for di in df["Age"]]

#family name
df["f_Name"] = [str(di).split(" ")[-1] if len(str(di).split(" "))>1 else float("nan") for di in df["Name"]]

This produces a data frame like the following.

[Figure: screenshot of the resulting data frame]
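The random NaN injection and family-name extraction can also be written without list comprehensions. The following is a minimal sketch on a toy frame, not the original notebook's code: the helper name inject_nan, the seed, and the sample rows are illustrative assumptions. It uses Series.where, which keeps values where the mask is True and puts NaN elsewhere.

```python
import numpy as np
import pandas as pd

def inject_nan(series, rate, rng):
    """Return a copy of `series` with roughly `rate` of its values set to NaN."""
    keep = rng.random(len(series)) >= rate  # True where the value is kept
    return series.where(keep)

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility

# Toy rows standing in for the Titanic data (assumption)
toy = pd.DataFrame({
    "Name": ["Mr. Owen Harris Braund",
             "Miss. Laina Heikkinen",
             "Mr. William Henry Allen"] * 100,
    "Age": [22.0, 26.0, 35.0] * 100,
})
toy["Name"] = inject_nan(toy["Name"], 0.1, rng)
toy["Age"] = inject_nan(toy["Age"], 0.05, rng)

# Family name = last space-separated token; NaN names stay NaN
toy["f_Name"] = toy["Name"].str.split(" ").str[-1]

print(toy.isna().mean())  # share of NaN per column
```

One difference from the list-comprehension version: a name consisting of a single token becomes its own family name here, whereas the original code turns it into NaN.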

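collections.Counter is used later, but it does not aggregate NaN values correctly when they are float("nan"), so replace them with np.nan. NaN never compares equal to itself, and every float("nan") call creates a separate object, so Counter stores each one under its own key; np.nan is a single shared object, so its occurrences are counted together. For details, see the Stack Overflow question listed under Reference.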

df = df.replace(float("nan"), np.nan)
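The difference between the two kinds of NaN can be checked directly on plain Python lists. This is a small sketch of the Counter behavior only; it does not show what df.replace does internally, and the variable names are illustrative.

```python
import collections
import numpy as np

# Each float("nan") call creates a new object, and NaN != NaN,
# so Counter stores every one of them under its own key.
fresh = [float("nan") for _ in range(3)]
c_fresh = collections.Counter(fresh)

# np.nan is one shared object; dict lookup checks identity before equality,
# so repeated np.nan entries land on the same key.
shared = [np.nan for _ in range(3)]
c_shared = collections.Counter(shared)

print(len(c_fresh))   # 3 keys, each with count 1
print(len(c_shared))  # 1 key with count 3
```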

Specify the type of each column (categorical or numeric) by hand.

target = "Survived"
cate_list = ["Pclass", "Name", "f_Name", "Sex", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
num_list = ["Age", "Fare"]

all_list = cate_list+num_list
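One reason to list the columns by hand is that the dtype alone can mislead: Pclass is stored as an integer but is really a category. The sketch below shows one possible workflow, assuming a toy frame with a few Titanic-like columns: take a dtype-based first guess with select_dtypes, correct it manually, then check the lists for typos. The names num_cols, cate_cols, and manual_cate are hypothetical.

```python
import pandas as pd

# Toy frame with a few Titanic-like columns (assumption)
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# dtype-based first guess: every numeric dtype is treated as numeric
num_guess = toy.select_dtypes(include="number").columns.tolist()
cate_guess = [c for c in toy.columns if c not in num_guess]

# Manual correction: Pclass is an integer column but really a category
manual_cate = ["Pclass"]
num_cols = [c for c in num_guess if c not in manual_cate]
cate_cols = manual_cate + cate_guess

# Guard against typos: every listed column must exist in the frame
missing = [c for c in cate_cols + num_cols if c not in toy.columns]
assert not missing, f"unknown columns: {missing}"

print(cate_cols, num_cols)
```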

The following is the main process.

n = df.shape[0]
max_n_unique = 10

n_unique_list=[]
min_data_list=[]
max_data_list=[]
major_data_rate_list=[]

# Loop over every column; categorical and numeric columns get different statistics
for colname in all_list:

    if colname in cate_list:  # categorical column
        n_unique = len(df[colname].unique())  # NaN counts as one distinct value
        min_data = np.nan
        max_data = np.nan

        if n_unique > max_n_unique:  # many categories
            # Share of rows covered by the (max_n_unique - 1) most frequent values
            c = collections.Counter(df[colname])
            c_dict = dict(c.most_common(max_n_unique-1))
            v_list = [v/n for k,v in c_dict.items()]
            major_data_rate = np.sum(v_list)
        else:
            major_data_rate = np.nan

    else:  # numeric column
        n_unique = np.nan
        major_data_rate = np.nan
        min_data = df[colname].min()  # NaN is skipped by default
        max_data = df[colname].max()


    n_unique_list.append(n_unique)
    major_data_rate_list.append(major_data_rate)
    min_data_list.append(min_data)
    max_data_list.append(max_data)

have_nan = df.loc[:,all_list].isnull().any(axis=0)
nan_rate = df.loc[:,all_list].isnull().sum(axis=0)/n

summary_df = pd.DataFrame({"colname":all_list,
                           "have_nan":have_nan.values,
                           "nan_rate":nan_rate.values,
                           "n_unique":n_unique_list,
                           "major_data_rate":major_data_rate_list,
                           "min_data":min_data_list,
                           "max_data":max_data_list
                           })

This produces a data frame that summarizes the characteristics of each variable.

[Figure: screenshot of summary_df]
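The NaN-related columns and the unique count can also be computed without an explicit loop. The following is a minimal sketch on a toy frame (the data and the name summary are assumptions); it covers only have_nan, nan_rate, and n_unique, and unlike the loop above it reports n_unique for numeric columns too.

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numeric column (assumption)
toy = pd.DataFrame({
    "Sex": ["male", "female", np.nan, "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
})

summary = pd.DataFrame({
    "have_nan": toy.isna().any(),
    "nan_rate": toy.isna().mean(),          # mean of booleans = share of NaN rows
    "n_unique": toy.nunique(dropna=False),  # NaN counted as one distinct value
})
print(summary)
```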

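major_data_rate is computed only for categorical columns with more than max_n_unique distinct values. It treats the most frequent values as major and reports the share of rows they cover. With max_n_unique = 10, the code takes the top max_n_unique - 1 = 9 values, presumably leaving one slot for everything else. (The assumption is that values outside the top ones will be lumped into a category such as `others` in later processing.)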
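The later lumping step is not shown in this memo. The following is one way it might look, sketched under assumptions: the helper name lump_rare and the label "others" are hypothetical, and NaN is left untouched instead of being counted as a category. Since value_counts drops NaN by default, the rate computed here can differ from the Counter-based one when NaN is among the most frequent entries.

```python
import pandas as pd

def lump_rare(series, max_n_unique=10, other_label="others"):
    """Keep the (max_n_unique - 1) most frequent values; relabel the rest."""
    top = series.value_counts().index[: max_n_unique - 1]  # NaN excluded by default
    keep = series.isin(top) | series.isna()                 # leave NaN untouched
    return series.where(keep, other_label)

s = pd.Series(["a", "a", "a", "b", "b", "c", "d", None])

# With max_n_unique=3, the top 2 values ("a", "b") are kept
lumped = lump_rare(s, max_n_unique=3)

# Share of rows covered by those top 2 values
major_data_rate = s.isin(s.value_counts().index[:2]).mean()

print(lumped.tolist())
print(major_data_rate)  # 5 of 8 rows -> 0.625
```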

Reference

Stack Overflow: Why does collections.Counter treat numpy.nan as equal?
CS109: A Titanic Probability
GitHub: pandas-profiling
