[Python] Feature Engineering for Machine Learning, Part 2 (Google Colaboratory): Logarithmic Transformation and Box-Cox Transformation

Introduction

This article explains the logarithmic transformation and the Box-Cox transformation. It is mainly based on the book "Feature Engineering for Machine Learning", so please check that out if you are interested.

Also, the content of this article is explained in more detail on YouTube, so if you are interested, please check it out.

What is logarithmic transformation?

Logarithmic transformation is mainly used for the following purposes.

- Making the data follow a normal distribution
- Reducing variance

The logarithmic function is shown in the figure below. Since the range [1, 10] is mapped to [0, 1] and the range [10, 100] is mapped to [1, 2], small values of x are spread over a wide range, while large values of x are compressed into a narrow range.
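This mapping can be checked directly with NumPy (a minimal sketch, not from the book):

```python
import numpy as np

# log10 maps [1, 10] onto [0, 1] and [10, 100] onto [1, 2]
print(np.log10([1, 10, 100]))  # [0. 1. 2.]
```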

By applying the logarithmic transformation, the upper tail of a heavy-tailed distribution can be compressed and the lower side stretched, bringing the distribution closer to normal, as shown below. Many machine learning methods are **nonparametric models** that make no assumptions about the population distribution, so normality is not required. However, **parametric models**, which do assume a distribution for the statistical population, require the data to be normally distributed.

Furthermore, applying the logarithmic transformation to data with large variance, such as the following, reduces the variance.

** Before applying the logarithmic transformation (variance: 5.0e+06) **

** After applying the logarithmic transformation (variance: 0.332007) **

** Sample code for the logarithmic transformation **

log.py


import numpy as np
import pandas as pd


# Fix the random seed for reproducibility
np.random.seed(100)

# Generate right-skewed count data
data_array = []
for i in range(1, 10000):
    max_num = i if i > 3000 else 1000
    s = np.random.randint(0, max_num, 10)
    data_array.extend(s)

data = pd.DataFrame({'Listen Count': data_array})

data_log = pd.DataFrame()
# Add 1 before taking the log to avoid log(0)
data_log['Listen Count'] = np.log10(data['Listen Count'] + 1)

What is the Box-Cox transformation?

The Box-Cox transformation is defined by the following formula:

y=\begin{cases}
\dfrac{x^\lambda - 1}{\lambda} & (\lambda \neq 0) \\
\log(x) & (\lambda = 0)
\end{cases}
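The piecewise formula can be written out directly (a minimal sketch; `box_cox` is an illustrative helper, not from the book):

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transformation following the formula above."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1) / lam

x = np.array([1.0, 2.0, 5.0, 10.0])
print(box_cox(x, 0.5))  # square-root-like compression
print(box_cox(x, 0.0))  # identical to np.log(x)
print(box_cox(x, 1.0))  # just x - 1: a shift, no change of shape
```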

The Box-Cox transformation can be used to bring data closer to a normal distribution. (Note, however, that it can only be applied when the data are positive.)

The graph below shows this transformation. The value of λ must be determined before applying the Box-Cox transformation; here, maximum likelihood estimation is used, choosing the λ for which the transformed data is closest to a normal distribution.
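The maximum likelihood fit can be observed directly in SciPy (a minimal sketch with assumed lognormal sample data; the variable names are illustrative). When `lmbda` is omitted, `stats.boxcox` returns the estimated λ as its second value, and `stats.boxcox_normmax` performs the same search on its own:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# Positive, right-skewed sample data (lognormal), assumed for illustration
x = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)

# When lmbda is omitted, scipy.stats.boxcox estimates lambda by maximum likelihood
transformed, lam_mle = stats.boxcox(x)
print(lam_mle)

# stats.boxcox_normmax exposes the same maximum likelihood search directly
lam_direct = stats.boxcox_normmax(x, method='mle')
print(lam_direct)
```

For lognormal data the estimated λ should come out near 0, since λ = 0 corresponds exactly to the log transform.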

Applying the Box-Cox transformation to data distributed as in the figure below yields a distribution that looks close to normal.

** Before applying the Box-Cox transformation **

** After applying the Box-Cox transformation **

** Sample code for the Box-Cox transformation **

from scipy import stats
import numpy as np
import pandas as pd

# Fix the random seed for reproducibility
np.random.seed(100)

# Generate right-skewed data (Box-Cox requires positive values)
data_array = []
for i in range(1, 1000):
  s = np.random.randint(1, i * 100, 10)
  data_array.extend(s)
data = pd.DataFrame({'Listen Count': data_array})

# Box-Cox transformation; the second return value is the estimated lambda
rc_bc, bc_params = stats.boxcox(data['Listen Count'])
print(bc_params)  # 0.3419237117680786

Q-Q plot

A Q-Q plot plots the quantiles of the measured data against those of an ideal distribution (here, the normal distribution). If the points form a straight line, the measured data can be considered normally distributed. Below are Q-Q plots of the original data, the data after the logarithmic transformation, and the data after the Box-Cox transformation.

** Raw data **

** After logarithmic transformation **

** After Box-Cox transformation **
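Q-Q plots like the three above can be drawn with `scipy.stats.probplot` (a sketch with assumed exponential sample data, not the article's dataset). The `r` value it returns measures how close the points are to a straight line:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in Colab
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(100)
# Right-skewed sample data, assumed for illustration
raw = np.random.exponential(scale=100.0, size=1000)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# probplot returns ((quantiles, ordered values), (slope, intercept, r))
_, (_, _, r_raw) = stats.probplot(raw, dist='norm', plot=axes[0])
axes[0].set_title('Raw data')

_, (_, _, r_log) = stats.probplot(np.log10(raw + 1), dist='norm', plot=axes[1])
axes[1].set_title('After log transformation')

bc, _ = stats.boxcox(raw)
_, (_, _, r_bc) = stats.probplot(bc, dist='norm', plot=axes[2])
axes[2].set_title('After Box-Cox transformation')

plt.tight_layout()
fig.savefig('qq_plots.png')
print(r_raw, r_log, r_bc)  # r closer to 1 means closer to normal
```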

These results show that the Box-Cox transformation brings the data closest to a normal distribution.

Finally

I'm planning to post reviews and explanation videos of technical books on YouTube, focusing on machine learning. I also introduce companies worth knowing about if you work in IT. Likes, channel subscriptions, and high ratings motivate my YouTube and Qiita updates.
YouTube: https://www.youtube.com/channel/UCywlrxt0nEdJGYtDBPW-peg
Twitter: https://twitter.com/tatelabo

