A function that easily calculates a listwise deletion tree (Python)

About this article

What is a listwise deletion tree in the first place?

Something like this ↓ It took me about half an hour to build it in Excel with clunky formulas. (It is hard on the eyes and the back.)

[Figure: risttttt.png]

(Please let me know if there is a formal name for this kind of figure.)

What makes it a hassle?

If you just want the number of missing values for each variable, **df.isnull().sum()** is all you need, but ...

  1. **x1** was missing for ● people.
  2. In the data excluding the rows missing **x1**, **x2** was missing for ▲ people.
  3. In the data excluding the rows missing **x1** or **x2**, **x3** was missing for ■ people.
  4. ···

To get counts like these, you need to write something like an analytic function in SQL.

Ah, what a hassle (in Python).
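To make the steps above concrete, here is a minimal sketch on a made-up DataFrame. The column names `x1` to `x3` follow the list above; the values and the names `toy`, `remaining`, and `step_missing` are assumptions for illustration only. It counts the missing values of each variable among the rows that survived the previous steps, and prints the plain per-column counts for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 people, 3 variables with some missing values.
toy = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "x2": [np.nan, np.nan, 1.0, np.nan, 2.0, 3.0],
    "x3": [np.nan, 2.0, np.nan, 4.0, 5.0, 6.0],
})

remaining = toy       # rows that are still complete so far
step_missing = {}     # variable name -> number of rows dropped at that step

for col in toy.columns:
    is_missing = remaining[col].isnull()
    step_missing[col] = int(is_missing.sum())
    remaining = remaining.loc[~is_missing]  # drop those rows before the next step

print(step_missing)                    # stepwise counts
print(toy.isnull().sum().to_dict())    # plain per-column counts
```

The stepwise counts differ from the plain counts whenever a row is missing more than one variable, because such a row is only counted at the first step where it drops out.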


On to the main subject

python


import pandas as pd
import numpy as np

def caluculate_missing_tree(df):
    # d[i] holds the rows in which columns 0..i are all non-missing
    d ={}
    d[0]= df.loc[df[df.columns[0]].isnull() != True]
    for i in range(len(df.columns)-1):
        d[1+i]= d[i].loc[d[i][d[i].columns[1+i]].isnull() != True]

    le = []        # number of remaining rows at each step
    colnames = []  # variable name for each step
    missing_tree = pd.DataFrame()

    for i in range(len(df.columns)):
        le.append(len(d[i]))
    for i in range(len(df.columns)):
        colnames.append(df.columns[i])

    missing_tree['col_name'] = colnames
    missing_tree['Size'] = le

    return missing_tree


Just pass **caluculate_missing_tree()** a DataFrame whose columns are in the order in which you want the tree drawn.
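The effect of column order can be seen in a small sketch. The function body is a condensed copy of the one above, repeated only so that the block runs on its own; the toy data and the names `toy`, `tree_abc`, and `tree_cba` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def caluculate_missing_tree(df):
    # Condensed copy of the article's function so this block is self-contained.
    d = {}
    d[0] = df.loc[df[df.columns[0]].isnull() != True]
    for i in range(len(df.columns) - 1):
        d[1 + i] = d[i].loc[d[i][d[i].columns[1 + i]].isnull() != True]
    missing_tree = pd.DataFrame()
    missing_tree['col_name'] = list(df.columns)
    missing_tree['Size'] = [len(d[i]) for i in range(len(df.columns))]
    return missing_tree

# Hypothetical toy data with missing values in different rows.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Reorder the columns by selecting them in the desired order.
tree_abc = caluculate_missing_tree(toy[["a", "b", "c"]])
tree_cba = caluculate_missing_tree(toy[["c", "b", "a"]])

print(tree_abc)
print(tree_cba)
```

The intermediate sizes depend on the order, while the final size (rows with no missing value in any selected column) is the same either way.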

For example, let's try it with the **titanic** data.

python



import pandas as pd
import numpy as np

df = pd.read_csv("train.csv")
df.shape  # (891, 12)

df.isnull().sum()  # Number of missing values for each variable

--------------------------------
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

If you pass this DataFrame to the function ...

python




caluculate_missing_tree(df)

--------------------------------------

       col_name  Size
0   PassengerId   891
1      Survived   891
2        Pclass   891
3          Name   891
4           Sex   891
5           Age   714
6         SibSp   714
7         Parch   714
8        Ticket   714
9          Fare   714
10        Cabin   185
11     Embarked   183


It was calculated in an instant. I'm happy.
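As a cross-check, the same sizes can also be obtained without building intermediate data frames. The sketch below is not the method used in this article; it is an alternative that relies on NumPy's `np.logical_and.accumulate`, and the helper name `sizes_by_cumulative_mask` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def sizes_by_cumulative_mask(df):
    # Hypothetical helper, not part of the original article.
    present = df.notnull().to_numpy()  # True where a value exists
    # Running AND across columns: True only while every column so far is present.
    complete_so_far = np.logical_and.accumulate(present, axis=1)
    return pd.DataFrame({
        "col_name": list(df.columns),
        "Size": complete_so_far.sum(axis=0),  # surviving rows per step
    })

# Hypothetical toy data.
toy = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "x2": [np.nan, np.nan, 1.0, np.nan, 2.0, 3.0],
    "x3": [np.nan, 2.0, np.nan, 4.0, 5.0, 6.0],
})

result = sizes_by_cumulative_mask(toy)
print(result)
```

Each value in `Size` should equal the number of rows left after dropping rows with a missing value in any of the columns up to and including that one, which is what the tree reports.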

How it works

The idea is to use **.loc** to create one dataset after another, each keeping only the rows that meet the condition (the value is not missing).

python


df                                          # Original data
df1 = df.loc[df['x1'].isnull() != True]     # Rows missing x1 removed
df2 = df1.loc[df1['x2'].isnull() != True]   # Rows missing x1 or x2 removed
df3 = df2.loc[df2['x3'].isnull() != True]   # Rows missing x1, x2, or x3 removed
...
...

Like this.

Next, when I think about writing this as a **for statement**, it looks like this.

python


d[0]= df.loc[df[df.columns[0]].isnull() != True]  # This is outside the for statement

# --- What the for statement does from here on ---
d[1]= d[0].loc[d[0][d[0].columns[1]].isnull() != True]
d[2]= d[2-1].loc[d[2-1][d[2-1].columns[2]].isnull() != True]
d[3]= d[3-1].loc[d[3-1][d[3-1].columns[3]].isnull() != True]

However, it was a little difficult to automate the creation of separately named variables such as df1, df2, ... with a **for statement**.

So I created a dictionary to hold the data frames, and stored the data frame corresponding to each variable in it.

python


    d ={}
    d[0]= df.loc[df[df.columns[0]].isnull() != True]
    for i in range(len(df.columns)-1):
        d[1+i]= d[i].loc[d[i][d[i].columns[1+i]].isnull() != True]

Like this. For example, in the **titanic** data:

・ **d[0]** is the data with rows missing PassengerId removed
・ **d[1]** is the data with rows missing PassengerId or Survived removed
・ **d[2]** is the data with rows missing PassengerId, Survived, or Pclass removed
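A small sketch of what the dictionary ends up holding, using made-up data (the DataFrame `toy` and its columns `id`, `age`, and `city` are assumptions, not the titanic data). It builds `d` the same way as above and prints which rows survive at each key.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "id":   [1, 2, 3, 4],
    "age":  [20.0, np.nan, 35.0, 41.0],
    "city": ["A", "B", None, "D"],
})

# Same construction as in the article, applied to the toy data.
d = {}
d[0] = toy.loc[toy[toy.columns[0]].isnull() != True]
for i in range(len(toy.columns) - 1):
    d[1 + i] = d[i].loc[d[i][d[i].columns[1 + i]].isnull() != True]

for key, frame in d.items():
    # Every frame keeps all columns; only the set of rows shrinks.
    print(key, toy.columns[key], list(frame.index), frame.shape)
```

Because the keys are plain integers, each step can look up the previous frame as `d[i]`, which is what makes the loop possible without separately named variables.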

After that, the **natural** next step is to create a data frame with the **variable name** and the **sample size** as its columns, so that the result is easy to check.

python



    le = []
    colnames = []
    missing_tree = pd.DataFrame()

    for i in range(len(df.columns)):
        le.append(len(d[i]))
    for i in range(len(df.columns)):
        colnames.append(df.columns[i])


    missing_tree['col_name'] = colnames
    missing_tree['Size'] = le

    return missing_tree

The number of rows, len(d[i]), of each data frame from which the missing values of each variable were removed is stored in **le**. Similarly, the variable name corresponding to each data frame is stored in **colnames**, and the two lists are put together into a table.

python




caluculate_missing_tree(df)

--------------------------------------

       col_name  Size
0   PassengerId   891
1      Survived   891
2        Pclass   891
3          Name   891
4           Sex   891
5           Age   714
6         SibSp   714
7         Parch   714
8        Ticket   714
9          Fare   714
10        Cabin   185
11     Embarked   183


Ta-da! (second time)
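The table reports how many rows remain after each step. If the number of rows dropped at each step is also wanted, it can be derived from the `Size` column. The following is a minimal sketch: the sizes and the total of 891 rows are copied from the output above, while the column name `Dropped` and the variable names `total_rows` and `previous_size` are assumptions for illustration.

```python
import pandas as pd

# Sizes copied from the output shown above.
missing_tree = pd.DataFrame({
    "col_name": ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
                 "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"],
    "Size": [891, 891, 891, 891, 891, 714, 714, 714, 714, 714, 185, 183],
})
total_rows = 891  # row count of the original data (df.shape[0])

# Rows dropped at a step = size before the step minus size after it.
previous_size = missing_tree["Size"].shift(1, fill_value=total_rows)
missing_tree["Dropped"] = previous_size - missing_tree["Size"]  # hypothetical column

print(missing_tree)
```

For Cabin this gives 529 rather than the 687 reported by df.isnull().sum(), because some of the rows missing Cabin had already been removed at the Age step.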

Implementation method in R
