Background

When a feature contains missing values (such as NaN), a common fix is to fill them with the column mean or with 0. However, as the ratio of missing values grows, so does the share of entries that hold the fill value, until most of the feature is the same value. The problems that come to mind are as follows.

- If every value is missing, every entry is filled with the same value and the variance becomes 0, so the feature cannot be standardized.
- Even if not every value is missing, all values within a fold can end up identical, depending on how cross-validation splits the data.
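Both problems can be reproduced on a small example. The following is a minimal sketch, assuming a hypothetical toy DataFrame (the column names `a`, `b`, `c` and the values are made up for illustration) and assuming that missing values are filled with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is entirely missing, "c" has one observed value.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan] * 4,
    "c": [np.nan, np.nan, np.nan, 5.0],
})

# Problem 1: an all-missing column becomes constant after filling.
filled = df.fillna(0)
std = filled.std(ddof=0)  # population standard deviation per column
standardized = (filled - filled.mean()) / std  # "b" becomes 0 / 0
print(std["b"])                        # 0.0
print(standardized["b"].isna().all())  # True

# Problem 2: a fold that happens to contain only the missing rows of "c"
# (here the first three rows) is also constant after filling.
fold = df.iloc[:3].fillna(0)
fold_std = fold.std(ddof=0)
print(fold_std["c"])                   # 0.0
```

Column `b` has zero variance, so standardizing it yields only NaN. Column `c` looks usable on the full data but is constant inside the fold.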

Remove features from the model that contain more than a certain percentage of missing values

Therefore, we avoid the problems above by simply dropping any feature whose ratio of missing values exceeds a given threshold. Assuming the data is given as a pd.DataFrame, the procedure is as follows.

1. Get the number of missing values in each column.
2. Divide 1. by the number of rows to turn it into a ratio.
3. Add the positions of the columns whose missing ratio exceeds the given rate to a list.
4. Pass the columns listed in 3. to pd.DataFrame.drop() and delete them all at once.

`main_in`


import pandas as pd

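def drop_manyNuNcolumns(df, rate):
    """Drop the columns of df whose ratio of missing values exceeds rate."""
    NaN_sum = df.isnull().sum()  # 1. number of missing values per column
    rate_NaN = NaN_sum / df.shape[0]  # 2. ratio of missing values per column
    drop_list = []  # 3. positions of the columns to drop
    for i in range(rate_NaN.shape[0]):  # visit every column, including the last
        if rate_NaN.iloc[i] > rate:  # positional access, independent of labels
            drop_list.append(i)
    return df.drop(df.columns[drop_list], axis=1)  # 4. drop them all at once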
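The same filtering can also be written without an explicit loop. The following is a minimal sketch, not the method of this article: the function name `drop_high_nan_columns` and the sample data are hypothetical, and the threshold is compared with `<=` so that, as in the loop version, only columns whose ratio strictly exceeds `rate` are dropped.

```python
import numpy as np
import pandas as pd

def drop_high_nan_columns(df: pd.DataFrame, rate: float) -> pd.DataFrame:
    """Return df without the columns whose missing ratio exceeds rate."""
    # The mean of a boolean mask is the fraction of True values,
    # i.e. the per-column ratio of missing values.
    nan_ratio = df.isnull().mean()
    # Keep only the columns whose ratio does not exceed the threshold.
    return df.loc[:, nan_ratio <= rate]

# Hypothetical sample data for illustration.
sample = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # 0% missing
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "c": [1.0, np.nan, 3.0, 4.0],        # 25% missing
})
result = drop_high_nan_columns(sample, 0.5)
print(list(result.columns))  # ['a', 'c']
```

With a threshold of 0.5, only column `b` (75% missing) is removed. Because the mask is built from column labels, this version does not depend on column positions.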

[Python] Delete columns whose missing rate exceeds a certain percentage (Pandas)
