[PYTHON] A note on getting stuck with sklearn's missing value imputation (Imputer)

Introduction

I got stuck trying to impute missing values in a DataFrame that contains string data, so I am making a note of it.

What is missing value imputation?

Filling in missing data by using the surrounding data.

Missing value example

For example, suppose you have the following data, in which the fifth row of the gender column is unknown. There are various imputation methods, such as using the mean or the median, but this time I simply want to fill it in with male, the majority gender. [Figure: sample.png]
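The idea of filling in with the majority value can be sketched in plain Python. This is only an illustration of the concept, not the code used in this post: the list `genders` is made-up data standing in for the gender column, and `fill_with_most_frequent` is a hypothetical helper name.

```python
from collections import Counter

# Made-up data standing in for the gender column; None marks the
# unknown entry in the fifth row.
genders = ["male", "female", "male", "male", None, "female", "male"]


def fill_with_most_frequent(values):
    """Replace None entries with the most common non-missing value."""
    observed = [v for v in values if v is not None]
    most_common_value, _count = Counter(observed).most_common(1)[0]
    return [most_common_value if v is None else v for v in values]


filled = fill_with_most_frequent(genders)
print(filled)
```

Here the missing fifth entry ends up as male, because male is the most frequent observed value.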

What I did

To feed data to an estimator, string data has to be converted to numeric data. So, after converting male to 0 and female to 1, I wrote code to impute the missing value, that is, to replace unknown with 0 (male).

import pandas as pd
from sklearn.preprocessing import Imputer 

df_sample = pd.read_csv('sample.csv')
gender_map = {'male':0, 'female':1}
df_sample['gender'] = df_sample['gender'].map(gender_map)  # Replace male with 0 and female with 1
imr = Imputer(missing_values='unknown', strategy='most_frequent', axis=0)  # Create the imputer object
imputed_data = imr.fit_transform(df_sample)  # Apply the imputation

missing_values specifies the placeholder that marks a missing value, and strategy specifies the imputation method ('mean' for the mean, 'median' for the median, 'most_frequent' for the most frequent value).

Here, I encountered the following error.

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

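Looking at the DataFrame after replacing the strings with numbers, it was as follows. [Figure: sample2.png]

The cause was that, after the mapping, the gender column is a numeric Series. map converts any value that is not a key in the dictionary, here unknown, to NaN, so the string unknown no longer exists in the column alongside the numbers 0 and 1. Passing 'unknown' as missing_values therefore does not match anything in the data, and the imputer fails when it checks that string against NaN.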
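The effect of the mapping on the unknown entry can be checked directly with pandas. This is a minimal sketch: the Series `gender` is made-up data standing in for the gender column of sample.csv, whose actual contents are not shown in this post.

```python
import pandas as pd

# Made-up stand-in for the gender column of sample.csv.
gender = pd.Series(["male", "female", "male", "male", "unknown"])
gender_map = {"male": 0, "female": 1}

# Values that are not keys in the dictionary become NaN.
mapped = gender.map(gender_map)

print(mapped.dtype)            # a float dtype, because NaN is a float
print(mapped.isna().tolist())  # only the last entry is missing
```

After the mapping there is no string left in the column, only numbers and NaN, which is why NaN is the value the imputer has to look for.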

Solution

I changed the 'unknown' part of the code to 'NaN', and it worked.

import pandas as pd
from sklearn.preprocessing import Imputer 

df_sample = pd.read_csv('sample.csv')
gender_map = {'male':0, 'female':1}
df_sample['gender'] = df_sample['gender'].map(gender_map)  # Replace male with 0 and female with 1
imr = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)  # Create the imputer object
imputed_data = imr.fit_transform(df_sample)  # Apply the imputation
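Note that sklearn.preprocessing.Imputer was deprecated in scikit-learn 0.20 and removed in 0.22, so the code above only runs on older versions. On a current version, roughly the same thing can be written with sklearn.impute.SimpleImputer, which takes np.nan instead of the string 'NaN' and has no axis parameter (it always works column by column). The following is a sketch under assumptions: the DataFrame is made-up data standing in for sample.csv, and the age column is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-in for sample.csv; the "age" column is hypothetical.
df_sample = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "gender": ["male", "female", "male", "male", "unknown"],
})

gender_map = {"male": 0, "female": 1}
# unknown is not a key in gender_map, so it becomes NaN here.
df_sample["gender"] = df_sample["gender"].map(gender_map)

# Fill NaN in each column with that column's most frequent value.
imr = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputed_data = imr.fit_transform(df_sample)

print(imputed_data)
```

With this data the missing gender in the fifth row is filled with 0 (male), the most frequent value in that column.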

References

Python Machine Learning Programming Theory and Implementation by Expert Data Scientists (Impress)
