If a feature has missing values (such as NaN), a common remedy is to fill them with the mean or with 0. However, as the proportion of missing values grows, more and more entries are filled with that single value, until most of the feature takes the same value. The problems that come to mind are as follows.
- If every value is missing, the whole column is filled with the same fill value and its variance becomes 0, so the feature cannot be standardized.
- Even if not every value is missing, depending on the split used in cross-validation, all values within a fold can still end up identical.
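The first problem can be seen directly on a toy DataFrame (the column names here are illustrative): filling an all-NaN column with 0 leaves it with zero variance, so standardizing it would mean dividing by zero.

```python
import pandas as pd

# Toy frame: column "b" is entirely NaN
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [float("nan")] * 3})

filled = df.fillna(0)     # impute missing values with 0
std = filled.std(ddof=0)  # population standard deviation per column
print(std["b"])           # 0.0 -- standardization would divide by zero
```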
Therefore, we avoid these problems by simply deleting every feature whose ratio of missing values is above a certain threshold. Assuming the data is given as a pd.DataFrame, the procedure is as follows.
```python
import pandas as pd

def drop_manyNaN_columns(df, rate):
    NaN_sum = df.isnull().sum()         # 1. count NaNs in each column
    rate_NaN = NaN_sum / df.shape[0]    # 2. NaN ratio per column
    drop_list = list()                  # 3. collect indices of columns to drop
    for i in range(rate_NaN.shape[0]):  # iterate over every column
        if rate_NaN.iloc[i] > rate:
            drop_list.append(i)
    return df.drop(df.columns[drop_list], axis=1)  # 4. drop the selected columns
```
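The same threshold filtering can also be sketched without an explicit loop, using boolean masking on the per-column missing ratio. The column names and the 0.5 threshold below are illustrative assumptions, not part of the procedure above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "keep": [1.0, 2.0, np.nan, 4.0],        # 25% missing
    "drop": [np.nan, np.nan, np.nan, 4.0],  # 75% missing
})

rate = 0.5  # illustrative threshold
# isnull().mean() gives the NaN ratio per column; keep columns at or below the threshold
filtered = df.loc[:, df.isnull().mean() <= rate]
print(list(filtered.columns))  # ['keep']
```

Because `df.isnull().mean()` is a Series indexed by column name, the mask selects columns directly, with no index bookkeeping.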