[Python] Feature Engineering for Machine Learning, Starting with Google Colaboratory, Part 3: Scaling

Introduction

This article discusses scaling and normalization. It is mainly based on the book "Feature Engineering for Machine Learning", so please check that out if you are interested.

What is scaling?

Some numerical features have a fixed range of possible values and some do not. Count data, in particular, has no fixed upper bound, and a model that is sensitive to the scale of its features, such as linear regression, may fail to learn well because of outliers or because features sit on very different scales. Unifying the scale of features in such cases is called scaling. Scaling methods include Min-Max scaling, standardization, and L2 normalization, which are introduced in order below. If you want to know more about scaling, please see the article here.
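As a quick illustration (a minimal sketch with made-up values, not taken from the book), two features on very different scales can make one of them dominate a scale-sensitive model until both are brought onto a common scale:

import numpy as np

## Hypothetical features on very different scales (made-up values)
age = np.array([22.0, 35.0, 47.0, 58.0, 63.0])                 # tens
play_count = np.array([120.0, 8400.0, 310.0, 56000.0, 95.0])   # up to tens of thousands

## The count feature dominates by several orders of magnitude
print(age.std(), play_count.std())

## After Min-Max scaling, both features lie in [0, 1]
def min_max(x):
    return (x - x.min()) / (x.max() - x.min())

print(min_max(age))
print(min_max(play_count))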

Min-Max scaling

Min-Max scaling maps the minimum value to 0 and the maximum value to 1. If outliers are present, they can squash the range occupied by the normal values into a narrow band, so standardization is usually preferred in that case.

\tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)}
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

## Fix the random seed for reproducibility
np.random.seed(100)
data_array = []
for i in range(1, 100):
  s = np.random.randint(0, i * 10, 10)
  data_array.extend(s)
data_array.extend(np.zeros(100))
data = pd.DataFrame({'Listen Count': data_array})

print(data.max()) # 977.0
print(data.min()) # 0

## Scale 'Listen Count' into the [0, 1] range
scaler = MinMaxScaler()
data_n = scaler.fit_transform(data)
data_n = pd.DataFrame(data_n)

print(data_n.max()) ## 1.0
print(data_n.min()) ## 0
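The same result can also be computed directly from the formula above (a small sketch reusing the data and data_n defined above), which makes explicit what MinMaxScaler is doing:

## Manual Min-Max scaling following the formula above
x = data['Listen Count']
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_minmax.min())  # 0.0
print(x_minmax.max())  # 1.0

## Matches the MinMaxScaler output (column 0 of data_n)
print(np.allclose(x_minmax.to_numpy(), data_n[0].to_numpy()))  # True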

Standardization

Standardization rescales a feature to have mean 0 and variance 1. If the original feature follows a normal distribution, the standardized feature follows the standard normal distribution.

\tilde{x} = \frac{x - \mathrm{mean}(x)}{\sqrt{\mathrm{var}(x)}}
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

## Fix the random seed for reproducibility
np.random.seed(100)
data_array = []
for i in range(1, 100):
  s = np.random.randint(0, i * 10, 10)
  data_array.extend(s)
data_array.extend(np.zeros(100))
data = pd.DataFrame({'Listen Count': data_array})

## Standardize to mean 0 and variance 1
scaler = StandardScaler()
data_n = scaler.fit_transform(data)
data_n = pd.DataFrame({'Listen Count': data_n.ravel()})

print(data_n.var()) ##1.000918
print(data_n.mean()) ##6.518741e-17
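Note that the printed variance is not exactly 1: pandas' var() uses the unbiased estimator (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). A quick check, reusing data_n from above:

## Population variance (ddof=0, what StandardScaler targets) is exactly 1
print(np.var(data_n['Listen Count']))           # 1.0
## Sample variance (ddof=1, what pandas .var() reports) is slightly larger
print(np.var(data_n['Listen Count'], ddof=1))   # 1.000918...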

L2 normalization

L2 normalization divides a feature vector by its L2 norm, so the resulting vector has unit length.

\tilde{x} = \frac{x}{\|x\|_2} \\
\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_m^2}
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

## Fix the random seed for reproducibility
np.random.seed(100)
data_array = []
for i in range(1, 100):
  s = np.random.randint(0, i * 10, 10)
  data_array.extend(s)
data_array.extend(np.zeros(100))
data = pd.DataFrame({'Listen Count': data_array})

## L2 normalization: treat the whole column as one vector and divide it by its L2 norm
## (normalize works row-wise, so the column is passed as a single row)
data_l2_normalized = normalize([data['Listen Count']], norm='l2')
data_l2 = pd.DataFrame({'Listen Count': data_l2_normalized.ravel()})

## The norm of the normalized vector is 1 (up to floating-point error)
print(np.linalg.norm(data_l2_normalized, ord=2)) ## 0.999999999
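Equivalently, the formula can be applied directly with NumPy (a small sketch reusing data and data_l2_normalized from above): dividing the whole column by its own L2 norm gives the same result as normalize.

## Manual L2 normalization following the formula above
x = data['Listen Count'].to_numpy()
l2_norm = np.sqrt(np.sum(x ** 2))   # ||x||_2
x_l2 = x / l2_norm

## Matches sklearn's normalize output, and the normalized vector has unit norm
print(np.allclose(x_l2, data_l2_normalized.ravel()))  # True
print(np.linalg.norm(x_l2))                           # 1.0 (up to floating-point error)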

Finally

I'm thinking of posting videos about IT on YouTube. Likes, channel subscriptions, and high ratings motivate me to keep updating both YouTube and Qiita. YouTube: https://www.youtube.com/channel/UCywlrxt0nEdJGYtDBPW-peg Twitter: https://twitter.com/tatelabo
