[PYTHON] Try to process Titanic data with preprocessing library DataLiner (Encoding)

Introduction

This is a continuation of the previous article. https://qiita.com/shallowdf20/items/eb35a9cf3c24403debb1

This time, I will introduce DataLiner's encoding transformers.

Installation

! pip install -U dataliner

Data preparation

Prepare Titanic data as before.

import pandas as pd
import dataliner as dl

df = pd.read_csv('train.csv')
target_col = 'Survived'

X = df.drop(target_col, axis=1)
y = df[target_col]


Encoding categorical variables such as Sex and Name converts strings, which formulas and models cannot use as they are, into numbers that they can handle. Let's take a look.

OneHotEncoding

This is the most common encoding method. It replaces each categorical variable with 0/1 dummy variables. Let's try it first.

trans = dl.OneHotEncoding()
trans.fit_transform(X)


The number of columns has jumped. For example, Name has a distinct value for each of the 891 passengers in the data, so that column alone adds 891 columns. You could drop variables afterwards with DropLowAUC and the like, but it is better to apply GroupRareCategory or DropHighCardinality beforehand. Let's try encoding after applying DropHighCardinality in a pipeline.

from sklearn.pipeline import make_pipeline

process = make_pipeline(
    dl.DropHighCardinality(),
    dl.OneHotEncoding(),
)
process.fit_transform(X)

The number of features is now reasonable. When a variable that takes only two values, such as Sex in the Titanic data (male or female), is turned into dummy variables, it is common to drop one of the columns to avoid collinearity. DataLiner provides an argument called drop_first for this; it defaults to True, so one column is dropped automatically.
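
As a rough illustration of what dropping one dummy column means, here is a sketch that uses pandas' get_dummies instead of DataLiner. The toy DataFrame is made up for this example, and DataLiner's internal behavior may differ.

```python
import pandas as pd

# Toy data invented for illustration (not the Titanic file).
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

# One dummy column per category.
full = pd.get_dummies(toy, columns=["Sex", "Embarked"])

# drop_first=True removes the first category of each variable,
# so a two-valued variable like Sex becomes a single 0/1 column.
reduced = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With the first category dropped, Sex is represented by a single column, which carries the same information as the pair of columns without the redundancy.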

CountEncoding

This encoding replaces each category with the number of times it occurs. First, let's look at the number of Titanic passengers by gender.

df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

There are 577 men and 314 women.

Now, let's do CountEncoding.

trans = dl.CountEncoding()
trans.fit_transform(X)


If you look at the Sex column, you can see that it has indeed been replaced by the counts. Because categorical columns are detected automatically, other categorical columns such as Name and Embarked are also replaced with numbers.
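
The idea behind count encoding can be sketched in plain pandas as below. This is only an illustration on a synthetic Sex column built from the counts shown above, not DataLiner's actual implementation.

```python
import pandas as pd

# Synthetic column with the same gender counts as the Titanic training data.
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

# Count how often each category occurs, then map every row to its count.
counts = sex.value_counts()
encoded = sex.map(counts)

print(encoded.iloc[0], encoded.iloc[-1])
```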

RankedCountEncoding

CountEncoding is a simple yet powerful technique, but it is vulnerable to outliers, for example when one category occurs an unusually large number of times. Also, categories with the same number of occurrences are replaced with the same number and become indistinguishable. (Of course, that is the right behavior if they should not be distinguished in the first place.)

Therefore, instead of using the raw counts, this encoder ranks the categories in descending order of occurrence and replaces each category with its rank. In the previous example, male (577 passengers) is ranked first in Sex, and female (314 passengers) is ranked second.

trans = dl.RankedCountEncoding()
trans.fit_transform(X)


Male and female are replaced by 1 and 2, respectively. If you check Name, you can see that every name occurs exactly once, yet each is encoded with a different number. Because the encoding uses the position in the ranking, different categories always receive different numbers even when their counts are equal.
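
A minimal pandas sketch of ranking by count is shown below. The article does not say how DataLiner orders tied categories, so the tie order here simply follows whatever order value_counts returns, which is an assumption of this example.

```python
import pandas as pd

# Toy column: "A" occurs twice, "B" and "C" once each (a tie).
names = pd.Series(["A", "B", "C", "A"], name="Name")

# value_counts sorts categories by count in descending order.
counts = names.value_counts()

# Assign 1, 2, 3, ... by position in that order, so tied categories
# still receive different numbers.
ranks = pd.Series(range(1, len(counts) + 1), index=counts.index)
encoded = names.map(ranks)

print(ranks.to_dict())
```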

FrequencyEncoding

CountEncoding encodes by the number of occurrences, whereas FrequencyEncoding encodes by the relative frequency of occurrence. It is easy to handle because the result always falls between 0 and 1.

For example, for Sex, male is encoded as 577 / (577 + 314) = 0.647... and female as 314 / (577 + 314) = 0.352...

trans = dl.FrequencyEncoding()
trans.fit_transform(X)


As with CountEncoding, categories with the same number of occurrences still cannot be distinguished, but the vulnerability to outliers is reduced. RankedFrequencyEncoding is not provided because it would give the same result as RankedCountEncoding.
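
The arithmetic above can be checked with a short pandas sketch. It reuses a synthetic Sex column with the same counts and is not DataLiner's own code.

```python
import pandas as pd

# Synthetic column with the same gender counts as the Titanic training data.
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

# normalize=True returns each category's share of the rows instead of its count.
freq = sex.value_counts(normalize=True)
encoded = sex.map(freq)

print(round(freq["male"], 3), round(freq["female"], 3))
```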

TargetMeanEncoding

The name became well known through Kaggle and similar venues, but the idea itself is basic data analysis. Specifically, it replaces each category with the mean of the target variable for that category.

Since the target variable here is whether a passenger survived, Sex, for example, is replaced by the survival rate for each gender. However, because this uses information about the target being predicted, the encoded value can closely track the target when a category has few rows, and leakage occurs easily. To mitigate this, DataLiner's implementation takes a Bayesian-style approach: the mean of the entire target variable is used as the prior for each category, and the encoded value is weighted by the number of rows in the category.

trans = dl.TargetMeanEncoding()
trans.fit_transform(X, y)
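
The article does not give DataLiner's exact formula, so the following is only one common way to blend a category mean with a global prior, written as a sketch. The function name smoothed_target_mean, the parameter prior_weight, and the toy data are all hypothetical.

```python
import pandas as pd

def smoothed_target_mean(categories, target, prior_weight=1.0):
    """Blend each category's target mean with the global mean.

    prior_weight (hypothetical) acts like a number of pseudo-rows that
    carry the global mean, so small categories are pulled toward it.
    """
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    smoothed = (stats["sum"] + prior_weight * global_mean) / (
        stats["count"] + prior_weight
    )
    return categories.map(smoothed)

# Toy data invented for illustration; the global mean of Survived is 0.5.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "other"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

encoded = smoothed_target_mean(toy["Sex"], toy["Survived"])
print(encoded.tolist())
```

In this toy data the single "other" row has a raw mean of 1.0 but is encoded as 0.75, because the prior pulls a one-row category strongly toward the global mean of 0.5.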


RankedTargetMeanEncoding

This encoder ranks the results of TargetMeanEncoding and replaces each category with its rank. For example, if women have the highest survival rate and men the second highest, women are encoded as 1 and men as 2. Because TargetMeanEncoding uses a prior, categories with very few rows relative to the whole dataset are encoded with nearly identical values (close to the prior) even though they are different categories. RankedTargetMeanEncoding encodes them as clearly distinct values.

trans = dl.RankedTargetMeanEncoding()
trans.fit_transform(X, y)
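
The ranking step can be sketched in pandas as follows. For brevity this example ranks plain per-category means, whereas according to the article DataLiner ranks its prior-weighted values; the toy data is made up.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Mean of the target per category, highest survival rate first.
means = toy.groupby("Sex")["Survived"].mean().sort_values(ascending=False)

# Replace each category with its position in that ordering.
ranks = pd.Series(range(1, len(means) + 1), index=means.index)
encoded = toy["Sex"].map(ranks)

print(ranks.to_dict())
```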


Conclusion

This time I introduced the encoding-related items of DataLiner. Next time, I will introduce the conversion-related items.

DataLiner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/
