[Python] Delete columns whose missing rate exceeds a certain percentage (Pandas)

Background

When a feature has missing values (such as NaN), a common approach is to fill them with the mean or with 0. However, as the proportion of missing values grows, so does the proportion of filled-in values, and most of the feature ends up being the same value. The problems that come to mind are as follows.

  - If every value is missing, every value is filled with the same constant and the variance is 0, so the feature cannot be standardized.
  - Even if not every value is missing, a cross-validation split can leave a fold in which all values are the same.
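The first problem can be reproduced in a few lines. The following is a minimal sketch on made-up data (the DataFrame and its column names are assumptions for illustration, not from the original post), filling with 0 and standardizing by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is entirely missing.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [np.nan] * 4})

# Fill missing values with 0, so "b" becomes a constant column.
filled = df.fillna(0)

# Population variance per column; the constant column has variance 0.
variances = filled.var(ddof=0)

# Standardizing divides by the standard deviation, so "b" becomes 0 / 0 = NaN.
standardized = (filled - filled.mean()) / filled.std(ddof=0)

print(variances)
print(standardized)
```

Column "a" is standardized normally, while column "b" turns into NaN again, which is why such a feature is better removed before modeling.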

Remove features from the model that contain more than a certain percentage of missing values

Therefore, we avoid the problems above by simply dropping the features whose proportion of missing values exceeds a certain threshold. Assuming the data is given as a pd.DataFrame, the procedure is as follows.

  1. Get the number of missing values in each column
  2. Divide the result of 1. by the number of rows to get a ratio
  3. Add the positions of the columns whose missing ratio exceeds the given rate to a list
  4. Pass the columns from 3. to pd.DataFrame.drop() and delete them all at once
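To see what each step produces, here is a rough walk-through on a small made-up DataFrame. The data, the 0.5 threshold, and the variable names are assumptions for illustration; this sketch also collects column labels rather than column positions in step 3.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 4 rows.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # no missing values
    "b": [np.nan, np.nan, np.nan, 1.0],  # 3 of 4 missing
    "c": [np.nan, 5.0, 6.0, 7.0],        # 1 of 4 missing
})

nan_sum = df.isnull().sum()                          # step 1: a=0, b=3, c=1
nan_rate = nan_sum / len(df)                         # step 2: a=0.0, b=0.75, c=0.25
drop_cols = nan_rate[nan_rate > 0.5].index.tolist()  # step 3: ["b"]
result = df.drop(columns=drop_cols)                  # step 4: columns "a" and "c" remain

print(nan_rate)
print(result.columns.tolist())
```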

main_in


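import pandas as pd

def drop_manyNuNcolumns(df, rate):
    """Return df without the columns whose missing rate exceeds rate."""
    NaN_sum = df.isnull().sum()  # 1. number of missing values per column
    rate_NaN = NaN_sum / df.shape[0]  # 2. missing rate per column
    drop_list = list()  # 3. positions of the columns to drop
    for i in range(rate_NaN.shape[0]):  # visit every column, including the last one
        if rate_NaN.iloc[i] > rate:  # positional access, independent of column labels
            drop_list.append(i)
    return df.drop(df.columns[drop_list], axis=1)  # 4. drop them all at once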
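As an alternative to the explicit loop, the same selection can probably be written with a boolean mask. This is only a sketch: the function name `drop_high_nan_columns` and the sample data are hypothetical, and it relies on `isnull().mean()` giving the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

def drop_high_nan_columns(df, rate):
    """Keep only the columns whose missing rate does not exceed rate."""
    # The mean of a boolean column is the fraction of True values,
    # so this is the missing rate of each column.
    nan_rate = df.isnull().mean()
    too_sparse = nan_rate > rate
    return df.loc[:, ~too_sparse]

# Hypothetical usage: "a" is 50% missing, "b" is 100% missing, "c" is complete.
df = pd.DataFrame({
    "a": [1.0, np.nan],
    "b": [np.nan, np.nan],
    "c": [1.0, 2.0],
})
reduced = drop_high_nan_columns(df, 0.5)
print(reduced.columns.tolist())
```

With a rate of 0.5, column "a" is kept because its missing rate is exactly 0.5 and the comparison is strict, matching the `>` used in the loop version.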
