Introduction

I got stuck trying to impute missing values in a DataFrame that contains string data, so I am making a note of it here.

What is missing value imputation?

Filling in missing data by using the surrounding data.

Missing value example

For example, suppose you have the following data. The fifth row of the gender column is unknown. There are various imputation methods, such as using the mean or the median, but this time I will simply fill it in with male, which is the majority gender.
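As a rough illustration of filling in with the majority value, here is a minimal pandas-only sketch. The five values are made up to match the description above (they are not the contents of sample.csv), and the names gender, known, majority, and filled are chosen only for this example.

```python
import pandas as pd

# Hypothetical gender column: the fifth row is unknown
gender = pd.Series(['male', 'female', 'male', 'male', 'unknown'])

# Find the most frequent value among the known entries
known = gender[gender != 'unknown']
majority = known.mode()[0]

# Replace the unknown entry with the majority value
filled = gender.mask(gender == 'unknown', majority)
print(filled.tolist())
```

With this made-up data the majority value is male, so the fifth row is filled with male.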

What I did

To feed data into a machine learning model, string data has to be converted into numerical data. So I converted male to 0 and female to 1, and then wrote code to impute the missing value. That is, the code should replace unknown with 0 (male).

import pandas as pd
from sklearn.preprocessing import Imputer

df_sample = pd.read_csv('sample.csv')
gender_map = {'male': 0, 'female': 1}
df_sample['gender'] = df_sample['gender'].map(gender_map)  # Replace male with 0 and female with 1.
imr = Imputer(missing_values='unknown', strategy='most_frequent', axis=0)  # Create the imputer object
imputed_data = imr.fit_transform(df_sample)  # Apply the imputation

missing_values specifies which value is treated as missing, and strategy specifies the imputation method ('mean' for the mean, 'median' for the median, 'most_frequent' for the most frequent value).

Here, I encountered the following error.

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Looking at the DataFrame after replacing the character string with a numerical value, it was as follows.

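This was because each column of a DataFrame is a Series with a single dtype, and after the mapping the gender column was numeric. Because 'unknown' is not a key in gender_map, map() had already turned it into NaN, so the string unknown and the numerical values 0 and 1 did not coexist in the column, and there was no 'unknown' left for the imputer to find.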
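To check what the column holds after the mapping, a minimal sketch like the following can help. The Series is a hypothetical stand-in for the gender column of sample.csv, and the name mapped is used only for this example.

```python
import pandas as pd

# Hypothetical stand-in for the gender column of sample.csv
gender = pd.Series(['male', 'female', 'male', 'male', 'unknown'])
gender_map = {'male': 0, 'female': 1}

# Values that are not keys of the dict become NaN
mapped = gender.map(gender_map)

print(mapped.dtype)            # float64
print(mapped.isna().tolist())  # [False, False, False, False, True]
```

The column becomes float64 and the fifth row is NaN, so the string 'unknown' no longer appears anywhere in it.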

Solution

I changed missing_values in the code from 'unknown' to 'NaN', and it worked.

import pandas as pd
from sklearn.preprocessing import Imputer

df_sample = pd.read_csv('sample.csv')
gender_map = {'male': 0, 'female': 1}
df_sample['gender'] = df_sample['gender'].map(gender_map)  # Replace male with 0 and female with 1.
imr = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)  # Create the imputer object
imputed_data = imr.fit_transform(df_sample)  # Apply the imputation
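Note that Imputer was removed from sklearn.preprocessing in scikit-learn 0.22, so the import above fails on recent versions. A minimal sketch of the same step with sklearn.impute.SimpleImputer follows, assuming a recent scikit-learn. The inline DataFrame is a hypothetical stand-in for sample.csv, and SimpleImputer takes no axis argument because it always works column by column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for sample.csv: the fifth row is unknown
df_sample = pd.DataFrame({'gender': ['male', 'female', 'male', 'male', 'unknown']})

gender_map = {'male': 0, 'female': 1}
df_sample['gender'] = df_sample['gender'].map(gender_map)  # unknown becomes NaN

# Treat NaN as missing and fill it with the most frequent value of each column
imr = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputed_data = imr.fit_transform(df_sample)

print(imputed_data.ravel())
```

With this made-up data the most frequent value is 0 (male), so the NaN in the fifth row is replaced with 0.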