Recently, I have had several opportunities to take part in data analysis competitions, and in many cases the results were poor unless appropriate features were extracted for the data set provided. To organize my own thinking, I am posting this article in the hope that it helps both myself and anyone struggling with feature extraction. This time, I will cover some transformations for numerical features.
For the data, I used the Pokemon Dataset, since trends are easier to picture with familiar data. It covers Pokemon up to the 6th generation (X/Y) and contains type flags, base stats, the generation, and a legendary Pokemon flag.
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# For feature engineering
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import PowerTransformer
figsize = (10, 7)
df = pd.read_csv('./data/121_280_bundle_archive.zip')
df.head()
| # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|------|--------|--------|-------|----|--------|---------|---------|---------|-------|------------|-----------|
| 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
Think of scaling as pushing the distribution of a feature into a certain range without changing its shape (see the plots below). There are two typical scaling methods: standardization and Min-Max scaling.
Standardization rescales the feature distribution to mean 0 and variance 1. The formula is as follows:

$$
z = \frac{x - \mu}{\sigma}
$$

Concretely, the mean $\mu$ is computed from the feature and subtracted from each value, and the result is divided by the standard deviation $\sigma$; each value is thus expressed as how far it lies from the mean. It is easier to get the picture from the code and the result after scaling.
# Standardization
scaler = StandardScaler()
scaled_hp = scaler.fit_transform(df[['HP']])
scaled_hp = pd.DataFrame(scaled_hp, columns=['HP'])
# Compare before and after standardization
fig, ax = plt.subplots(1, 2, figsize=figsize)
sns.distplot(df['HP'], ax=ax[0])
ax[0].set_title('Untransformed')
sns.distplot(scaled_hp, ax=ax[1])
ax[1].set_title('Transformed')
plt.savefig('Standardization.png')
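As a quick sanity check, the standardized column should come out with mean 0 and standard deviation 1. Here is a minimal sketch reusing the `scaled_hp` DataFrame from above (`ddof=0` matches the population standard deviation that `StandardScaler` uses):

# Sanity check: mean ~0 and population std ~1 after standardization
print(f'mean: {scaled_hp["HP"].mean(): .4f}, std: {scaled_hp["HP"].std(ddof=0): .4f}')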
Min-Max scaling squeezes a feature into the range 0 to 1. Its formula is as follows:

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

Because the maximum and minimum of the feature are used, it is easily affected by outliers. Here are the code and the scaled result; you can see that the shape of the distribution stays the same while the values are pushed between 0 and 1.
# Min-Max scaling
scaler = MinMaxScaler()
scaled_hp = pd.DataFrame(scaler.fit_transform(df[['HP']]))
# Compare before and after scaling
fig, ax = plt.subplots(1, 2, figsize=figsize)
sns.distplot(df['HP'], ax=ax[0])
ax[0].set_title('Untransformed')
sns.distplot(scaled_hp, ax=ax[1])
ax[1].set_title('Transformed')
plt.savefig('./Min-Max scaling.png')
If you are wondering whether to choose standardization or Min-Max scaling in feature engineering, which should it be? As a rule of thumb, Min-Max scaling is used for image data and standardization for other numerical values. Image data has fixed lower and upper bounds of 0 and 255, but other data can take a wide range of values. If you Min-Max scale data that contains many outliers, the scaled distribution is dominated by those outliers.
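To make the outlier problem concrete, here is a minimal sketch with made-up numbers: a single extreme value determines the Min-Max range, so the remaining values get squeezed into a tiny corner of the [0, 1] interval.

# A single outlier (1000, made up for illustration) dominates the Min-Max range
toy = pd.DataFrame({'x': [10, 20, 30, 40, 1000]})
print(MinMaxScaler().fit_transform(toy).ravel())
# -> roughly [0, 0.01, 0.02, 0.03, 1]: the first four values are crushed near 0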
Whereas standardization and Min-Max scaling are linear transformations that preserve the shape of a feature's distribution, nonlinear transformations change that shape. Looking at a transformed distribution is quicker than reading an explanation, so I will introduce two nonlinear transformations. Use them when the data does not follow a normal distribution and you want to bring it closer to one.
A typical nonlinear transformation is the power transformation, the most common example being the logarithm. A plain logarithm fails when the data contains zeros, so log(x + 1) is usually used instead. Here is the transformed Pokemon data.
# Log transformation (log(x + 1))
log_hp = pd.DataFrame(np.log1p(df['HP']))
# Compare before and after the log transformation
fig, ax = plt.subplots(1, 2, figsize=figsize)
sns.distplot(df['HP'], ax=ax[0])
ax[0].set_title('Untransformed')
sns.distplot(log_hp, ax=ax[1])
ax[1].set_title('Transformed')
plt.savefig('Power conversion.png')
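As a numerical check, the same one the Box-Cox section below uses, you can compare the skewness before and after the log transformation; a value closer to 0 means a more symmetric, more normal-looking distribution. A minimal sketch reusing `log_hp` from above:

# Skewness closer to 0 indicates a more symmetric distribution
print(f'Skewness before: {df["HP"].skew(): .2f}, after: {log_hp["HP"].skew(): .2f}')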
The Box-Cox transformation is only applicable when the data is strictly positive, and it forces data that does not follow a normal distribution toward one. The difference is not large here, but below are the Box-Cox code and a comparison of the distribution before and after the transformation.
# Box-Cox transformation
# Check that the data contains no non-positive values
if df['Speed'].min() > 0:
    pt = PowerTransformer(method='box-cox')
    boxcox = pd.DataFrame(pt.fit_transform(df[['Speed']]), columns=['Speed'])
    # Compare before and after the Box-Cox transformation
    fig, ax = plt.subplots(1, 2, figsize=figsize)
    sns.distplot(df['Speed'], ax=ax[0])
    ax[0].set_title('Untransformed')
    sns.distplot(boxcox, ax=ax[1])
    ax[1].set_title('Transformed')
    # Compare skewness and kurtosis
    print(f'Skewness before conversion: {df["Speed"].skew(): .2f}, kurtosis: {df["Speed"].kurtosis(): .2f}')
    print(f'Skewness after conversion: {boxcox["Speed"].skew(): .2f}, kurtosis: {boxcox["Speed"].kurtosis(): .2f}')
    # Output:
    # Skewness before conversion:  0.36, kurtosis: -0.24
    # Skewness after conversion: -0.05, kurtosis: -0.40
    plt.savefig('Box-Cox conversion.png')
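Box-Cox cannot be used when the data contains zeros or negative values. In that case, `PowerTransformer` also supports the Yeo-Johnson transformation, which accepts any real values; here is a minimal sketch on the same Speed column:

# Yeo-Johnson transformation: works with zero and negative values as well
pt = PowerTransformer(method='yeo-johnson')
yeojohnson = pd.DataFrame(pt.fit_transform(df[['Speed']]), columns=['Speed'])
print(f'Skewness after Yeo-Johnson: {yeojohnson["Speed"].skew(): .2f}')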
Clipping is, literally, a technique for clipping data at certain thresholds: values above the upper threshold or below the lower one (outliers) are replaced with the threshold values themselves. Taking Pokemon as the example, the code below clips Speed to the range 10 to 130.
# Clipping
clipped_speed = df['Speed'].clip(10, 130)
# Compare before and after clipping
fig, ax = plt.subplots(1, 2, figsize=figsize)
sns.distplot(df['Speed'], ax=ax[0])
ax[0].set_title('Unclipped')
sns.distplot(clipped_speed, ax=ax[1])
ax[1].set_title('Clipped')
plt.savefig('Clipping.png')
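The thresholds 10 and 130 above were picked by eye. A common alternative, added here as a sketch of my own rather than part of the original workflow, is to derive the thresholds from percentiles so that they adapt to the data:

# Derive clipping thresholds from the 1st and 99th percentiles
low, high = df['Speed'].quantile([0.01, 0.99])
clipped_q = df['Speed'].clip(low, high)
print(f'thresholds: {low:.1f}, {high:.1f}')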
Binning is a technique that replaces values falling within a certain interval with a value that represents that interval. As an example, the Speed data can be divided into the intervals ~50, 50~100, 100~150, and 150~, with each interval replaced by a different value (here 0, 1, 2, 3). Let's look at the code. Note that passing the bin count 4 to pd.cut splits the observed range into four equal-width bins rather than at exactly those boundaries; a version with explicit edges follows the output below. Either way, the data in each interval is replaced with another variable (categorical data in this case).
# Binning
binned = pd.cut(df['Speed'], 4, labels=False)
print('---------Before Binning----------')
print(df['Speed'].head())
print('---------After Binning----------')
print(binned.head())
#Execution result
# ---------Before Binning----------
# 0 45
# 1 60
# 2 80
# 3 80
# 4 65
# Name: Speed, dtype: int64
# ---------After Binning----------
# 0 0
# 1 1
# 2 1
# 3 1
# 4 1
# Name: Speed, dtype: int64
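As noted above, pd.cut with a bin count uses equal-width bins. To bin at exactly the ~50, 50~100, 100~150, 150~ boundaries, pass explicit bin edges instead (the edge list here is my own choice):

# Binning with explicit interval edges: ~50, 50~100, 100~150, 150~
binned_explicit = pd.cut(df['Speed'], bins=[0, 50, 100, 150, np.inf], labels=False)
print(binned_explicit.head())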
Which transformation to apply depends on whether the data is image data and whether it follows a normal distribution. It can be interesting to experiment with different data sets and watch how they are transformed. The Pokemon data did not deviate much from a normal distribution, so you may want to try these techniques on the [data set for housing price prediction](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)!