I got stuck trying to impute missing values in a DataFrame that contains string data, so I'm making a note of it.
Missing value imputation means filling in data that does not exist by using the surrounding data.
For example, suppose you have the following data. The gender in the fifth row is unknown. There are various ways to impute a value, such as using the mean or the median, but this time I would like to simply fill it in with male, which is the majority gender.
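As a rough illustration of the majority idea on its own, before any numeric encoding, here is a minimal pandas sketch. The small DataFrame is made-up stand-in data (sample.csv itself is not reproduced in this post), and it assumes the missing entry is stored as the string 'unknown'. It is not the method used in the rest of this note, just plain pandas.

```python
import pandas as pd

# Hypothetical stand-in for sample.csv; the real file is not shown here.
df_demo = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'male', 'unknown', 'female'],
})

# Treat 'unknown' as missing (NaN), then find the most frequent remaining value.
known = df_demo['gender'].mask(df_demo['gender'] == 'unknown')
majority = known.mode().iloc[0]

# Fill the missing entry with the majority value.
df_demo['gender'] = known.fillna(majority)
print(df_demo)
```

With this toy data the fifth row ends up as male, because male appears three times and female twice.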
To feed data to a learning algorithm, string data has to be converted into numerical data. So I wrote code that converts male to 0 and female to 1 and then imputes the missing value. In other words, it is meant to replace unknown with 0 (male).
import pandas as pd
from sklearn.preprocessing import Imputer
df_sample = pd.read_csv('sample.csv')
gender_map = {'male': 0, 'female': 1}
df_sample['gender'] = df_sample['gender'].map(gender_map)  # Replace male with 0 and female with 1
imr = Imputer(missing_values='unknown', strategy='most_frequent', axis=0)  # Create the imputer
imputed_data = imr.fit_transform(df_sample)  # Apply the imputation
missing_values specifies which value is treated as missing, and strategy specifies the imputation method (mean for the mean, median for the median, most_frequent for the most frequent value).
Here, I encountered the following error.
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Looking at the DataFrame after replacing the strings with numerical values, it was as follows.
This was because each column in a DataFrame is a Series object, and the mapped column holds a single numeric type. map() turns every value that is not in the dictionary into NaN, so the string unknown and the numerical values 0 and 1 cannot coexist. By the time the Imputer runs, the unknown entry is already NaN.
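The behaviour described above can be checked with a short sketch. The Series is a made-up stand-in for the gender column, and the last part is my understanding of where the error comes from: as far as I can tell, the old Imputer passes missing_values to np.isnan, which does not accept a string.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the gender column of sample.csv.
gender = pd.Series(['male', 'female', 'male', 'male', 'unknown'])
gender_map = {'male': 0, 'female': 1}

# Values missing from the mapping ('unknown' here) come back as NaN.
mapped = gender.map(gender_map)
print(mapped)
print(mapped.dtype)  # float64, because NaN is a float

# Calling np.isnan on a string raises a TypeError like the one above.
try:
    np.isnan('unknown')
    isnan_failed = False
except TypeError as exc:
    isnan_failed = True
    print(exc)
```

So after the mapping there is no 'unknown' left to search for, and the value that actually needs to be imputed is NaN.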
I changed the 'unknown' part of the code to 'NaN' and it worked.
import pandas as pd
from sklearn.preprocessing import Imputer
df_sample = pd.read_csv('sample.csv')
gender_map = {'male': 0, 'female': 1}
df_sample['gender'] = df_sample['gender'].map(gender_map)  # Replace male with 0 and female with 1
imr = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)  # Create the imputer
imputed_data = imr.fit_transform(df_sample)  # Apply the imputation
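Note that sklearn.preprocessing.Imputer was deprecated in scikit-learn 0.20 and removed in 0.22, so the code above only runs on older versions. On newer versions the replacement is SimpleImputer in sklearn.impute. The following is a minimal sketch of the same idea, assuming a recent scikit-learn; the inline DataFrame (including the invented age column) is a stand-in for sample.csv, which is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for sample.csv; 'age' is an invented second column.
df_sample = pd.DataFrame({
    'age': [23, 35, 41, 29, 52],
    'gender': ['male', 'female', 'male', 'male', 'unknown'],
})

gender_map = {'male': 0, 'female': 1}
df_sample['gender'] = df_sample['gender'].map(gender_map)  # 'unknown' becomes NaN

# SimpleImputer always works column by column, so there is no axis argument.
imr = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputed_data = imr.fit_transform(df_sample)  # Returns a NumPy array
print(imputed_data)
```

Here missing_values is np.nan rather than the string 'NaN', and the result is a NumPy array rather than a DataFrame, so the column names have to be reattached if they are needed afterwards.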
Python Machine Learning Programming Theory and Implementation by Expert Data Scientists (Impress)