[PYTHON] Feature engineering for machine learning with Google Colaboratory, Part 1: Binarization and discretization of count data

Introduction

This article describes binarization and discretization, two preprocessing techniques for count data. It is mainly based on the book "Feature Engineering for Machine Learning". Please check it out if you are interested.

Also, the content of this article is explained in more detail on YouTube, so if you are interested, please check it out.

What is binarization?

As the name implies, binarization converts a value into one of two values (0 or 1). Consider the following example.

Example
I want to build a system that recommends songs to users.
I want to use the number of times a user has listened to a song as a feature, but how should I format the data?

Suppose we extract the data for one user and it looks like the following. The first column is the song ID, and the second column is the number of times the song has been played.

image.png

The histogram of this data is as follows.

image.png

Now, to recommend songs to a user, it is important to know whether the user was interested in a song. If the raw counts are used as they are, however, a song played 20 times tells the model that the user likes it 20 times as much as a song played only once. Therefore, assuming that playing a song even once indicates interest, songs played at least once are binarized to 1, and songs never played are binarized to 0. This removes the differences in play counts and splits the songs into those the user was interested in and those the user was not.

This is represented in a graph as follows.

image.png

The implemented code is shown below.

binary.py


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Fix the random seed for reproducibility
np.random.seed(100)

# Generate pseudo data
data_array = []
for i in range(1, 1000):
  s = np.random.randint(0, i * 10, 10)
  data_array.extend(s)
data_array.extend(np.zeros(9000))
data = pd.DataFrame({'Listen Count': data_array})

data_binary = pd.DataFrame()
# Multiplying by 1 converts True to 1 and False to 0
data_binary['Listen Count'] = (data['Listen Count'] > 0) * 1
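
As a small, self-contained check of the idea, the following sketch binarizes a hand-made array of listen counts. The values are made up for illustration and are not the article's data.

```python
import numpy as np

# Hypothetical listen counts for five songs (illustrative values only)
listen_counts = np.array([0, 1, 20, 0, 3])

# Any song played at least once becomes 1; never-played songs stay 0
binarized = (listen_counts > 0).astype(int)

print(binarized)  # [0 1 1 0 1]
```

The song played 20 times and the song played once end up with the same value, which is exactly the information loss that binarization accepts on purpose.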

What is discretization?

Discretization by fixed width

Discretization lets continuous data be handled as a small number of groups, which has the following merits:

-- The effect of scale can be removed
-- The influence of outliers can be reduced

For example, if a person's age is given as numerical data, the ages can be divided into groups: 0 to 9 as group 1, 10 to 19 as group 2, ..., and 80 or older as group 9. You may feel that the numerical data could be left as it is, but if, for example, several people lived to 110, the model may be pulled toward those large values and the influence of other factors may be reduced. Placing 80-year-olds and 110-year-olds in the same group of elderly people avoids this problem.
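
The age grouping can be sketched as follows. This is a minimal illustration with made-up ages; numbering the groups from 1 and capping at group 9 are assumptions that follow the description in the text.

```python
import numpy as np

# Hypothetical ages, including one extreme value (110)
ages = np.array([5, 10, 35, 80, 110])

# Width-10 bins numbered from 1; everything from 80 upward is capped at group 9
age_group = np.minimum(ages // 10, 8) + 1

print(age_group)  # [1 2 4 9 9]
```

After grouping, 80 and 110 are indistinguishable, so the extreme value can no longer dominate the feature.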

Here the ages are split every 10 years, but depending on the purpose they can also be split by life stage, for example 0 to 12 (early childhood through elementary school) as group 1 and 12 to 17 (junior high and high school students) as group 2.
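
Bins with uneven widths like these can be expressed with `np.digitize`. The sketch below is only one possible reading: the text leaves it ambiguous which group age 12 belongs to, so placing 12 in group 2 and adding a third group for ages 18 and over are assumptions.

```python
import numpy as np

# Hypothetical ages and life-stage boundaries (both are assumptions)
ages = np.array([3, 11, 12, 17, 18, 40])
edges = [12, 18]  # under 12 -> group 1, 12 to 17 -> group 2, 18+ -> group 3

# np.digitize returns the 0-based index of the bin each value falls into
life_stage = np.digitize(ages, edges) + 1

print(life_stage)  # [1 1 2 2 3 3]
```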

Also, if the counts span several orders of magnitude, they may be grouped by powers of 10, such as 0-9, 10-99, 100-999, and so on.

When dividing by 10

discretization.py


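import numpy as np

# Generate 20 random counts between 0 and 99
small_counts = np.random.randint(0, 100, 20)
print(small_counts)

# Integer division by 10 maps each count to a bin from 0 to 9
print(np.floor_divide(small_counts, 10))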

Execution result

image.png

image.png

When grouping by a power of 10

discretization.py


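import numpy as np

# Generate random counts spanning several orders of magnitude
large_counts = []
for i in range(1, 100, 10):
  tmp = np.random.randint(0, i * 1000, 5)
  large_counts.extend(tmp)
large_counts = np.array(large_counts)

print(large_counts)
# log10(0) is -inf, so treat 0 as 1 to place it in the 0-9 bin
print(np.floor(np.log10(np.maximum(large_counts, 1))))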
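
Because the listing uses random data, a deterministic example may make the mapping easier to follow. The counts below are made up, and treating 0 as 1 so that it falls into the 0-9 bin is an assumption chosen to match the grouping 0-9, 10-99, 100-999 described earlier.

```python
import numpy as np

# Hypothetical counts spanning several orders of magnitude
counts = np.array([0, 7, 42, 980, 12345])

# log10(0) is undefined, so zeros are treated as 1 and land in the 0-9 bin
magnitude = np.floor(np.log10(np.maximum(counts, 1))).astype(int)

print(magnitude)  # [0 0 1 2 4]
```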

image.png

image.png

Discretization by quantile

Discretization with a fixed width is convenient, but if the count data has large gaps, many of the groups will contain no data. In such cases, use quantiles. Quantiles are the values that split the data into groups containing equal numbers of points: the median splits the data into two halves, and splitting each half again at its own median gives the quartiles. Quartiles thus divide the data into four groups, and deciles divide it into ten.
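
The difference between the two approaches can be seen on a tiny made-up data set with a large gap. This is only a sketch: the counts and the fixed width of 100 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical counts with a large gap between 9 and 900
counts = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 900, 950, 1000])

# Fixed-width bins of width 100: the bins between the two clusters stay empty
fixed_bins = counts // 100
fixed_sizes = np.bincount(fixed_bins)
print(fixed_sizes)  # [9 0 0 0 0 0 0 0 0 2 1]

# Quartile bins: the edges adapt to the data, so every bin is populated
quartile_bins = pd.qcut(counts, 4, labels=False)
quartile_sizes = np.bincount(quartile_bins)
print(quartile_sizes)  # [3 3 3 3]
```

Eight of the eleven fixed-width bins are empty, while the four quartile bins each hold three values.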

For example, the deciles of the following distribution are shown in the table below.

image.png

image.png

Plotting these values on the graph gives the figure below, and you can see that the bin widths are chosen so that each bin holds the same amount of data.

image.png

The implemented program is shown below.

When grouping by quantiles (graph)

quantile.py


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Fix the random seed for reproducibility
np.random.seed(100)

# Generate pseudo data
data_array = []
for i in range(1, 1000):
  s = np.random.randint(0, i * 10, 10)
  data_array.extend(s)
data_array.extend(np.zeros(2000))
data = pd.DataFrame({'Listen Count': data_array})

# Compute the nine decile boundaries (10%, 20%, ..., 90%)
deciles = data['Listen Count'].quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])

print(deciles)

# Draw each decile as a dashed vertical line
plt.vlines(deciles, 0, 5500, "blue", linestyles='dashed')

image.png
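
To confirm numerically that decile boundaries give equally populated bins, here is a small sketch on deterministic, skewed data. The data (the squares of 1 to 100) and the use of `np.quantile` with `np.digitize` are assumptions made for illustration, not the article's method.

```python
import numpy as np

# Hypothetical skewed data: the squares of 1..100
values = np.arange(1, 101) ** 2

# The nine decile boundaries (10th, 20th, ..., 90th percentiles)
deciles = np.quantile(values, np.linspace(0.1, 0.9, 9))

# Assign each value to the decile bin it falls in (0 to 9)
decile_bins = np.digitize(values, deciles, right=True)

print(np.bincount(decile_bins))  # [10 10 10 10 10 10 10 10 10 10]
print(np.diff(deciles).round(1))  # bin widths grow as the data thins out
```

Each bin holds 10 values even though the bin widths differ greatly, which is the property the dashed lines in the figure illustrate.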

When grouping by quantiles

quantile.py


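import numpy as np
import pandas as pd

# Generate random counts spanning several orders of magnitude
large_counts = []
for i in range(1, 100, 10):
  tmp = np.random.randint(0, i * 1000, 5)
  large_counts.extend(tmp)
large_counts = np.array(large_counts)

# Convert to quartile bin numbers (0 to 3)
print(pd.qcut(large_counts, 4, labels=False))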

image.png

Finally

I plan to post reviews and explanatory videos of technical books on YouTube, focusing on machine learning. I also introduce companies worth knowing about if you are heading into the IT industry. Likes, channel subscriptions, and high ratings motivate me to keep updating YouTube and Qiita, so please consider them.

YouTube: https://www.youtube.com/channel/UCywlrxt0nEdJGYtDBPW-peg Twitter: https://twitter.com/tatelabo
