[PYTHON] Feature Engineering Traveling with Pokemon: Numerical Data Edition

Recently I have had the chance to take part in data analysis competitions, and in many cases I could not get good results unless I extracted features suited to the data set provided. To organize my own thoughts, I am writing this article in the hope that it helps others who struggle with feature extraction as I do. This time I cover several feature engineering techniques for numerical data.

For the sample data I used the Pokemon Dataset, because trends are easier to picture with familiar data. It covers Pokemon up to the 6th generation (X / Y) and contains their types, base stats, generation, and a flag for legendary Pokemon.

Library loading

import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

#For feature engineering
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import PowerTransformer

figsize  = (10, 7)

Loading the data

df = pd.read_csv('./data/121_280_bundle_archive.zip')

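Data

| # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |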

Scaling

Scaling squeezes the distribution of a feature into a certain range without changing its shape (see the figure). There are two typical scaling methods: standardization and Min-Max scaling.

Standardization

Standardization rescales a feature so that it has mean 0 and variance 1. It does not change the shape of the distribution, so the result is a standard normal distribution only if the feature was normally distributed to begin with. The standardization formula is as follows.

$$x' = \frac{x - \mu}{\sigma}$$

Here $\mu$ is the mean of the feature and $\sigma$ is its standard deviation (how far the values spread from the mean). In other words, subtract the mean from each value and divide the result by the standard deviation. It is easier to picture by looking at the code and the result after scaling.

#Standardization
scaler = StandardScaler()
scaled_hp = scaler.fit_transform(df[['HP']])
scaled_hp = pd.DataFrame(scaled_hp, columns=['HP'])


#Compare before and after standardization
#(sns.histplot replaces sns.distplot, which is deprecated since seaborn 0.11)
fig, ax = plt.subplots(1, 2, figsize=figsize)
sns.histplot(df['HP'], kde=True, ax=ax[0])
ax[0].set_title('Untransformed')
sns.histplot(scaled_hp['HP'], kde=True, ax=ax[1])
ax[1].set_title('Transformed')

plt.savefig('Standardization.png')

Standardization.png
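As a quick check of the formula, the following sketch standardizes a few values by hand and compares the result with `StandardScaler`. The HP values are hypothetical and chosen only for illustration; they are not read from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical HP values used only for illustration
hp = np.array([45.0, 60.0, 80.0, 80.0, 39.0])

# Manual standardization: subtract the mean, divide by the standard deviation.
# np.std defaults to ddof=0 (population standard deviation),
# which is also what StandardScaler uses.
manual = (hp - hp.mean()) / hp.std()

# StandardScaler expects a 2-D array of shape (n_samples, n_features)
scaled = StandardScaler().fit_transform(hp.reshape(-1, 1)).ravel()

print(np.allclose(manual, scaled))
```

Note that `pandas` computes the standard deviation with `ddof=1` by default, so `df['HP'].std()` would give a slightly different result from `StandardScaler`.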

Min-Max scaling

Min-Max scaling squeezes a feature into the range 0 to 1. The Min-Max scaling formula is as follows.

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Because it uses the maximum and minimum of the feature, it is easily affected by outliers. Here again are the code and the scaled result. You can see that the shape of the distribution stays the same while the values are squeezed between 0 and 1.

# Min-Max scaling
scaler = MinMaxScaler()
scaled_hp = pd.DataFrame(scaler.fit_transform(df[['HP']]), columns=['HP'])

#Compare before and after scaling
#(sns.histplot replaces sns.distplot, which is deprecated since seaborn 0.11)
fig, ax = plt.subplots(1, 2, figsize=figsize)
sns.histplot(df['HP'], kde=True, ax=ax[0])
ax[0].set_title('Untransformed')
sns.histplot(scaled_hp['HP'], kde=True, ax=ax[1])
ax[1].set_title('Transformed')

plt.savefig('./Min-Max scaling.png')

Min-Max scaling.png
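The sensitivity to outliers can be seen on a tiny example. This is a minimal sketch with hypothetical values (not taken from the dataset): the same numbers are scaled once as they are and once with a single large value appended.

```python
import numpy as np

def min_max(x):
    """Scale x into [0, 1] using its own minimum and maximum."""
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical HP values used only for illustration
hp = np.array([45.0, 60.0, 80.0, 80.0, 39.0])
scaled = min_max(hp)

# Append one hypothetical outlier and scale again
hp_with_outlier = np.append(hp, 255.0)
scaled_with_outlier = min_max(hp_with_outlier)

print(scaled)
print(scaled_with_outlier)
```

With the outlier present, the original five values no longer span 0 to 1; they are all pushed into the lower part of the range, because the outlier alone defines the maximum.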

Should I choose standardization or Min-Max scaling?

When you are unsure which of the two to use in feature engineering, a basic rule is: Min-Max scaling for image data, standardization for other numerical values. Image data has fixed lower and upper limits of 0 and 255, whereas other data may take a wide range of values. If you apply Min-Max scaling to data with many outliers, the scaled distribution is distorted by those outliers.
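For image data the bounds are known in advance, so the scaling can use the fixed limits instead of the observed minimum and maximum. The following is a small sketch; the 2x3 grayscale array is hypothetical and stands in for a real 8-bit image.

```python
import numpy as np

# Hypothetical 2x3 grayscale image with 8-bit pixel values (0 to 255)
image = np.array([[0, 64, 128],
                  [192, 255, 32]], dtype=np.uint8)

# The limits are fixed at 0 and 255, so divide by the known maximum
# rather than by the maximum of this particular image.
scaled = image.astype(np.float32) / 255.0

print(scaled.min(), scaled.max())
```

Dividing by the fixed limit keeps the scaling identical across all images, even for an image that happens to contain no pure black or pure white pixels.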

Non-linear transformation

Non-linear transformations change the shape of a feature's distribution, whereas standardization and Min-Max scaling are linear transformations that preserve it. I think looking at the transformed distribution is quicker than reading an explanation. I will introduce two non-linear transformations. Use them to bring data that does not follow a normal distribution closer to one.

Power transformation

A typical non-linear transformation is the power transformation, and the example here is the logarithmic transformation, which is commonly treated as a member of that family. The plain logarithm is undefined when the data contains 0s, so log (x + 1) is usually used instead. Below is the transformed Pokemon data.

#Power transformation (log(x + 1))
log_hp = np.log1p(df['HP'])

#Compare before and after the log transformation
#(sns.histplot replaces sns.distplot, which is deprecated since seaborn 0.11)
fig, ax = plt.subplots(1, 2, figsize=figsize)
sns.histplot(df['HP'], kde=True, ax=ax[0])
ax[0].set_title('Untransformed')
sns.histplot(log_hp, kde=True, ax=ax[1])
ax[1].set_title('Transformed')

plt.savefig('Power conversion.png')

Power conversion.png
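The behavior of log (x + 1) at zero, and how to undo the transformation, can be sketched as follows. The values are hypothetical and include a 0 on purpose.

```python
import numpy as np

# Hypothetical values used only for illustration; note the 0
values = np.array([0.0, 1.0, 9.0, 99.0, 999.0])

# np.log1p computes log(1 + x), which is defined at x = 0
log_values = np.log1p(values)

# np.expm1 computes exp(y) - 1, the inverse of log1p
restored = np.expm1(log_values)

print(log_values)
print(np.allclose(restored, values))
```

The inverse is useful when the transformed column is a prediction target and the predictions need to be converted back to the original scale.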

Box-Cox transformation

The Box-Cox transformation can only be applied when every value is strictly positive. It brings data that does not follow a normal distribution closer to one. The difference is small for this data, but here are the Box-Cox code and a comparison of the distribution before and after the transformation.

# Box-Cox transformation
#Box-Cox requires strictly positive values, so check the minimum first
check = df['Speed'].min()
if check > 0:
    pt = PowerTransformer(method='box-cox')
    boxcox = pd.DataFrame(pt.fit_transform(df[['Speed']]), columns=['Speed'])

    #Compare before and after the Box-Cox transformation
    #(sns.histplot replaces sns.distplot, which is deprecated since seaborn 0.11)
    fig, ax = plt.subplots(1, 2, figsize=figsize)
    sns.histplot(df['Speed'], kde=True, ax=ax[0])
    ax[0].set_title('Untransformed')
    sns.histplot(boxcox['Speed'], kde=True, ax=ax[1])
    ax[1].set_title('Transformed')

    #Comparison of skewness and kurtosis
    print(f'Skewness before conversion: {df["Speed"].skew(): .2f},kurtosis: {df["Speed"].kurtosis(): .2f}')
    print(f'Skewness after conversion: {boxcox["Speed"].skew(): .2f},kurtosis: {boxcox["Speed"].kurtosis(): .2f}')

    #Output result
    #Skewness before conversion:  0.36,kurtosis: -0.24
    #Skewness after conversion: -0.05,kurtosis: -0.40
    
    plt.savefig('Box-Cox conversion.png')

Box-Cox conversion.png
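Because the Speed column is already close to normal, the effect is easier to see on clearly skewed data. The sketch below uses synthetic log-normal data (an assumption for illustration, not part of the Pokemon dataset), prints the exponent that `PowerTransformer` estimated, and also shows the Yeo-Johnson method, which `PowerTransformer` offers for data containing zero or negative values. The `skewness` helper is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(x):
    """Population skewness: third central moment divided by std cubed."""
    x = np.asarray(x, dtype=float).ravel()
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

# Synthetic right-skewed, strictly positive data (illustration only)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=500).reshape(-1, 1)

# Box-Cox: the exponent lambda is estimated from the data during fit
pt = PowerTransformer(method='box-cox')
transformed = pt.fit_transform(skewed)
print('estimated lambda:', pt.lambdas_[0])
print('skewness before:', skewness(skewed))
print('skewness after: ', skewness(transformed))

# Yeo-Johnson also accepts zero and negative values
shifted = skewed - 1.0  # now contains negative values
yj = PowerTransformer(method='yeo-johnson').fit_transform(shifted)
```

By default `PowerTransformer` also standardizes its output (`standardize=True`), so the transformed column has mean 0 and variance 1.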

Clipping

Clipping, as the name suggests, caps data at given thresholds. Values above the upper threshold or below the lower threshold (outliers) are replaced with the threshold itself, so no rows are removed. Let's take Pokemon as an example: Speed values below 10 are set to 10 and values above 130 are set to 130.

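# Clipping: values below 10 become 10, values above 130 become 130
clipped_speed = df['Speed'].clip(10, 130)

#Compare before and after the clip
#(sns.histplot replaces sns.distplot, which is deprecated since seaborn 0.11)
fig, ax = plt.subplots(1, 2, figsize=figsize)
sns.histplot(df['Speed'], kde=True, ax=ax[0])
ax[0].set_title('Unclipped')
sns.histplot(clipped_speed, kde=True, ax=ax[1])
ax[1].set_title('Clipped')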

plt.savefig('Clipping.png')

Clipping.png
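What `clip` does to individual values is easiest to see on a short series. The values below are hypothetical. The second half of the sketch derives the thresholds from percentiles instead of fixed numbers; the 1st and 99th percentiles are an assumption chosen for illustration, not something the analysis above prescribes.

```python
import pandas as pd

# Hypothetical Speed values used only for illustration
speed = pd.Series([5, 45, 60, 80, 180])

# Fixed thresholds: values outside [10, 130] are replaced, not dropped
clipped = speed.clip(10, 130)
print(clipped.tolist())  # [10, 45, 60, 80, 130]

# Percentile-based thresholds (assumed choice: 1st and 99th percentiles)
lower = speed.quantile(0.01)
upper = speed.quantile(0.99)
clipped_pct = speed.clip(lower, upper)
print(clipped_pct.tolist())
```

The series keeps its length in both cases, which is the difference between clipping and filtering outliers out.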

Binning

Binning replaces data that falls within a certain interval with a value that represents that interval. For example, Speed could be divided into the intervals ~ 50, 50 ~ 100, 100 ~ 150, and 150 ~, each replaced with a different value (here, 0, 1, 2, 3). Let's take a look at the code. Passing 4 to pd.cut splits the observed range of Speed into four equal-width bins, so the actual bin edges are computed from the data rather than fixed at 50, 100, and 150. Either way, the data in each interval is replaced with another variable (categorical data in this case).

# Binning
binned = pd.cut(df['Speed'], 4, labels=False)

print('---------Before Binning----------')
print(df['Speed'].head())
print('---------After Binning----------')
print(binned.head())

#Execution result
# ---------Before Binning----------
# 0    45
# 1    60
# 2    80
# 3    80
# 4    65
# Name: Speed, dtype: int64
# ---------After Binning----------
# 0    0
# 1    1
# 2    1
# 3    1
# 4    1
# Name: Speed, dtype: int64
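To bin on the round-number intervals described in the text (~ 50, 50 ~ 100, 100 ~ 150, 150 ~) rather than on equal-width bins, `pd.cut` also accepts an explicit list of edges. This is a minimal sketch with hypothetical Speed values.

```python
import numpy as np
import pandas as pd

# Hypothetical Speed values used only for illustration
speed = pd.Series([45, 60, 80, 120, 160])

# Explicit edges; infinite outer edges catch everything below 50 and above 150.
# Intervals are right-closed by default: (-inf, 50], (50, 100], (100, 150], (150, inf)
bins = [-np.inf, 50, 100, 150, np.inf]
binned = pd.cut(speed, bins=bins, labels=False)

print(binned.tolist())  # [0, 1, 1, 2, 3]
```

With the default `right=True`, a value of exactly 50 falls into bin 0; pass `right=False` to make the intervals left-closed instead.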

In closing

The transformation to apply depends on whether the data is image data and whether it follows a normal distribution. It can be interesting to experiment with different data and see how it is transformed. In the Pokemon data there were not many features that deviated from the normal distribution, but you may want to try these techniques on the [house price prediction data set](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)!
