[Python] Feature Engineering Traveling with Pokemon - Categorical Variables Edition -

In machine learning, string data such as categorical data cannot be fed into a model unless it is first converted to numerical data. Conversely, numerical data that is not on an ordinal scale should also be treated as a categorical variable. In this article, I'll show how to convert categorical variables into a form a machine learning model can understand.

This time, as in Feature Engineering Traveling with Pokemon - Numerical Edition -, we use the [Pokemon Dataset](https://www.kaggle.com/abcsds/pokemon).

Loading the libraries

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

Reading the data

df = pd.read_csv('./data/121_280_bundle_archive.zip')
df.head()

data

| # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|------|--------|--------|-------|----|--------|---------|---------|---------|-------|------------|-----------|
| 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |

Dummy encoding

Dummy encoding is the most popular and most frequently seen technique for handling categorical variables in feature engineering. Each category value is represented with 0/1 bits: the bit of the column that corresponds to the category value is 1, and the bits of the columns that do not correspond are 0.

pandas has a simple function for dummy encoding. Let's take a look at the code.

# One-hot Encoding
gdm = pd.get_dummies(df['Type 1'])
gdm = pd.concat([df['Name'], gdm], axis=1)

You can see that the bit corresponding to the Grass type of Bulbasaur is 1.

| Name | Bug | Dark | Dragon | Electric | Fairy | Fighting | Fire | Flying | Ghost | Grass | Ground | Ice | Normal | Poison | Psychic | Rock | Steel | Water |
|------|-----|------|--------|----------|-------|----------|------|--------|-------|-------|--------|-----|--------|--------|---------|------|-------|-------|
| Bulbasaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ivysaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Venusaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| VenusaurMega Venusaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Charmander | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
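Incidentally, pd.get_dummies can also encode several columns at once, and drop_first=True drops one level per column if you want to avoid redundant columns. Here is a minimal sketch under those assumptions (the prefix names are just for illustration); rows with NaN in Type 2 simply get all zeros.

# Dummy-encode Type 1 and Type 2 together (prefixes are illustrative)
type_dummies = pd.get_dummies(df[['Type 1', 'Type 2']], prefix=['Type1', 'Type2'], drop_first=True)
type_dummies = pd.concat([df['Name'], type_dummies], axis=1)
type_dummies.head()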

Label encoding

While dummy encoding represents a category with 0/1 bits, label encoding represents each category value with a single integer.
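As a minimal sketch, label encoding can be done with sklearn's LabelEncoder (the Type1_label column name is only illustrative).

from sklearn.preprocessing import LabelEncoder

# assign one integer per category value in 'Type 1'
le = LabelEncoder()
df['Type1_label'] = le.fit_transform(df['Type 1'])
df[['Name', 'Type 1', 'Type1_label']].head()

Note that the integers only reflect the sorted order of the category values and have no meaning as magnitudes, so tree-based models usually cope with them, while linear models generally should not be fed label-encoded features as-is.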

Feature hashing

The difference between feature hashing and the conversions above is that feature hashing produces fewer columns than the number of categories. It is enough to picture a magic function called a hash function that shrinks the number of input features. ~~It's easier to remember anything when you have an image to go with it.~~

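To make the image concrete, here is a toy sketch of the idea: a deterministic hash function maps each category name to one of a small, fixed number of columns. This is only an illustration of the trick, not the exact hash FeatureHasher uses internally.

import hashlib

def bucket(category, n_features=5):
    # map a category name to one of n_features columns deterministically
    digest = hashlib.md5(category.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_features

for t in ['Grass', 'Fire', 'Water']:
    print(t, '->', 'column', bucket(t))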

Let's take a look at the code. sklearn provides a FeatureHasher class, so let's use it. Here we compress the Pokemon types down to 5 features with feature hashing.

You might think, "When would I actually use this?" Keep it in mind for when a categorical variable has far too many distinct values.

# with input_type='string', each sample is an iterable of tokens,
# so a bare string like 'Grass' is hashed character by character
fh = FeatureHasher(n_features=5, input_type='string')
hash_table = pd.DataFrame(fh.transform(df['Type 1']).todense())

Features after conversion

| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 2 | 0 | 0 | 0 | -1 |
| 2 | 0 | 0 | 0 | -1 |
| 2 | 0 | 0 | 0 | -1 |
| 2 | 0 | 0 | 0 | -1 |
| 1 | -1 | 0 | -1 | 1 |
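By the way, since each sample passed to FeatureHasher should be an iterable of tokens, you can hash each type name as a single token by wrapping every value in a list. A minimal sketch (its output will of course differ from the table above):

# one token per Pokemon: the whole 'Type 1' string
hash_tokens = pd.DataFrame(fh.transform([[t] for t in df['Type 1']]).todense())
hash_tokens.head()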

In closing

So when should you choose which of these methods? One answer is to decide based on the specs of the machine you run the analysis on. Dummy encoding and label encoding are simple, but when there are too many category values they can cause memory errors. In that case, consider feature hashing, which compresses the features.

However, high-spec machines have recently become freely available, and GBDT, the decision-tree-based model commonly used on Kaggle, can handle label-encoded features as they are, so I suspect feature hashing does not get that many chances to shine.
