[PYTHON] Data preprocessing (2): Converting categorical data to numerical data

Need for OneHotEncoder

Why is OneHotEncoder important when using scikit-learn?

When doing classification or regression in machine learning, the computer basically treats numbers as continuous, ordered quantities. In other words, given the numbers from 1 to 10, 10 is always recognized as being larger than 1.

You might be thinking, "What are you talking about?"

But think about it.

For example, if animals are converted to numbers as shown below, does it really mean that Human is numerically larger than Tiger? And based on the table below, if you take the average of Tiger and Cat, you end up with Panda, which makes no sense.

Animal    Converted to number
Tiger     0
Panda     1
Cat       2
Human     3
Python    4

OneHotEncoder is used in cases like this, where neither the order of the assigned numbers nor the distance between them carries any meaning.
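The averaging problem can be checked in a few lines. This is only a minimal sketch: the dictionary `animal_to_number` is the table above written out by hand, not the output of any library.

```python
# Integer codes copied by hand from the table above (illustration only).
animal_to_number = {"Tiger": 0, "Panda": 1, "Cat": 2, "Human": 3, "Python": 4}
number_to_animal = {number: animal for animal, number in animal_to_number.items()}

# Averaging two codes is valid arithmetic, but the result has no real meaning.
average = (animal_to_number["Tiger"] + animal_to_number["Cat"]) / 2
print(average)                         # 1.0
print(number_to_animal[int(average)])  # Panda
```

A model that receives these integers has no way of knowing that this arithmetic is meaningless.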

How does OneHotEncoder deal with the problems mentioned above?

To make this easier to understand, here is what the table above looks like after it is processed by OneHotEncoder. Each row corresponds to one animal, in the same order as before.

Tiger  Panda  Cat  Human  Python
1      0      0    0      0
0      1      0    0      0
0      0      1    0      0
0      0      0    1      0
0      0      0    0      1

In other words, each distinct category gets its own column, and a 1 marks the category that a row belongs to. By doing this, each animal can be treated as an independent value, not as a point on a continuous number line.
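To see what the encoding does mechanically, here is a minimal sketch that builds the same table by hand with NumPy. It only illustrates the idea and is not a description of how OneHotEncoder is implemented internally; the variable names are made up for this example.

```python
import numpy as np

animals = ["Tiger", "Panda", "Cat", "Human", "Python"]
codes = np.array([0, 1, 2, 3, 4])  # integer codes from the first table

# Row i of the identity matrix has a single 1 in column i,
# so indexing it with the codes yields one one-hot row per animal.
one_hot = np.eye(len(animals), dtype=int)[codes]
print(one_hot)

# Every row contains exactly one 1, so no animal is "larger" than another.
print(one_hot.sum(axis=1))  # [1 1 1 1 1]
```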

code

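>>> # Note: this session and its output come from an older scikit-learn release.
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> # Number of distinct values found in each of the three input columns.
>>> # (n_values_ and feature_indices_ were removed in later releases.)
>>> enc.n_values_
array([2, 3, 4])
>>> # Start offset of each input column's block in the encoded output.
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> # Encode one sample; toarray() converts the sparse result to a dense array.
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])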

Source: the official scikit-learn documentation.
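As a hedged sketch, the same example on a recent scikit-learn release might look like the following. It assumes a version where the fitted values are exposed through the `categories_` attribute, which takes over the role of `n_values_` and `feature_indices_` from the session above.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# categories_ holds one array of observed values per input column.
sizes = [len(c) for c in enc.categories_]
print(sizes)  # [2, 3, 4]

# transform returns a sparse matrix; toarray() makes it dense.
encoded = enc.transform([[0, 1, 1]]).toarray()
print(encoded)  # [[1. 0. 0. 1. 0. 0. 1. 0. 0.]]
```

The nine output columns are the 2 + 3 + 4 per-column blocks laid side by side, which is why the sample `[0, 1, 1]` lights up one position in each block.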

Need for LabelEncoder

What is LabelEncoder?

Let's look again at the table used above. It replaces animal names such as Tiger and Panda with numbers. LabelEncoder performs this kind of replacement. So, apply LabelEncoder first, and then apply OneHotEncoder.

Animal    Converted to number
Tiger     0
Panda     1
Cat       2
Human     3
Python    4
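As a small sketch of that replacement step, LabelEncoder can be fitted on the animal names directly. One caveat worth knowing: it numbers the labels in sorted order, so the integers it produces differ from the illustrative table above.

```python
from sklearn.preprocessing import LabelEncoder

animals = ["Tiger", "Panda", "Cat", "Human", "Python"]

le = LabelEncoder()
codes = le.fit_transform(animals)

# classes_ holds the labels in sorted order; a label's position is its code.
print(le.classes_.tolist())  # ['Cat', 'Human', 'Panda', 'Python', 'Tiger']
print(codes.tolist())        # [4, 2, 0, 1, 3]
```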

code

import pandas as pd
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

le.transform(["tokyo", "tokyo", "paris"])
# array([2, 2, 1]...)

list(le.inverse_transform([2, 2, 1]))
# ['tokyo', 'tokyo', 'paris']

# To encode a DataFrame column, apply LabelEncoder to that column.
# Here df is a small example DataFrame with a column called City.
df = pd.DataFrame({"City": ["paris", "tokyo", "amsterdam"]})

df["City"] = le.fit_transform(df["City"])
# Or, equivalently, pass the underlying NumPy array:
# df["City"] = le.fit_transform(df["City"].values)

# To undo the encoding:
df["City"] = le.inverse_transform(df["City"])


Source: the official scikit-learn documentation.
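The two-step flow described earlier, LabelEncoder first and OneHotEncoder second, can be sketched as follows. The reshape is there because OneHotEncoder expects a 2-D array with one column per feature. Recent scikit-learn releases also let OneHotEncoder take string columns directly, so the LabelEncoder step is optional there.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

cities = ["paris", "paris", "tokyo", "amsterdam"]

# Step 1: strings -> integer codes (amsterdam=0, paris=1, tokyo=2).
codes = LabelEncoder().fit_transform(cities)

# Step 2: integer codes -> one-hot rows.
# OneHotEncoder expects a 2-D array, hence the reshape to a single column.
one_hot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()
print(one_hot)
```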

Pandas pd.get_dummies function

get_dummies is like OneHotEncoder.

I was able to use LabelEncoder without trouble, but I couldn't get OneHotEncoder to work well. After some research, I learned that pandas get_dummies does almost the same thing. By the way, if anyone can explain OneHotEncoder, or knows a site that explains it clearly, please let me know. In short, get_dummies plays the same role as OneHotEncoder: it creates one column for each distinct value of a categorical variable. The result looks roughly like this.

Tiger  Panda  Cat  Human  Python
1      0      0    0      0
0      1      0    0      0
0      0      1    0      0
0      0      0    1      0
0      0      0    0      1



import pandas as pd  # df is assumed to be a DataFrame with a column called animal

df = pd.get_dummies(df, columns=['animal'])

# Creates one column for each distinct value of animal, marked with 0 or 1.

Source: the official pandas documentation.
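For a version that runs on its own, here is a minimal sketch using the animal list from this article. The DataFrame is made up for illustration, and `dtype=int` is passed explicitly because some pandas versions return True/False columns by default instead of 0/1.

```python
import pandas as pd

df = pd.DataFrame({"animal": ["Tiger", "Panda", "Cat", "Human", "Python"]})

# dtype=int forces 0/1 output regardless of the pandas default.
dummies = pd.get_dummies(df, columns=["animal"], dtype=int)

# New columns are named <original column>_<value>, in sorted order.
print(dummies.columns.tolist())
print(dummies)
```

Unlike the hand-drawn table, the columns come out in alphabetical order (animal_Cat first), but each row still has exactly one 1.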

An article that nicely summarizes the differences (English)

This article nicely summarizes the differences between LabelEncoder and OneHotEncoder.
