Introduction

This article continues from the previous one: https://qiita.com/shallowdf20/items/eb35a9cf3c24403debb1

This time, I would like to introduce DataLiner's encoding-related preprocessing.

Installation

! pip install -U dataliner

Data preparation

Prepare Titanic data as before.

import pandas as pd
import dataliner as dl

df = pd.read_csv('train.csv')
target_col = 'Survived'

X = df.drop(target_col, axis=1)
y = df[target_col]

Categorical variables such as Sex and Name are strings, which formulas and models cannot handle as they are. Encoding them into numbers makes them usable. Let's take a look.

OneHotEncoding

This is the most common encoding method. It replaces each categorical variable with 0/1 dummy variables. Let's take a look first.

trans = dl.OneHotEncoding()
trans.fit_transform(X)

The number of columns has exploded. For example, Name has roughly as many distinct values as there are rows (891 passengers), so that column alone adds about 891 columns. You can drop variables afterwards with DropLowAUC and the like, but it is better to apply GroupRareCategory or DropHighCardinality beforehand. Let's try encoding after applying DropHighCardinality in a pipeline.

from sklearn.pipeline import make_pipeline

process = make_pipeline(
    dl.DropHighCardinality(),
    dl.OneHotEncoding(),
)
process.fit_transform(X)

The number of features is now reasonable. When a variable takes only two values, such as Sex (male or female) in the Titanic data, it is common to drop one of the dummy columns to avoid collinearity. DataLiner provides an argument called drop_first for this. It defaults to True, so one column is dropped automatically.
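To illustrate what dropping one dummy column means, here is a minimal sketch in plain pandas rather than DataLiner. The toy DataFrame is invented for illustration, and `pd.get_dummies` is used only as an analogy; DataLiner's internals may differ.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Without drop_first: one dummy column per category
full = pd.get_dummies(toy, columns=["Sex"], dtype=int)

# With drop_first=True: the first category (female) is dropped,
# because Sex_female is fully determined by Sex_male
reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With two categories, the single remaining column carries the same information as the pair, which is why dropping one avoids collinearity without losing anything.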

CountEncoding

This encoding replaces each category with the number of times it occurs. First, let's look at the number of Titanic passengers by gender.

df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

There are 577 men and 314 women.

Now, let's do CountEncoding.

trans = dl.CountEncoding()
trans.fit_transform(X)

If you look at the Sex column, you can see that it has indeed been replaced by the counts. Categorical columns are detected automatically, so other categorical columns such as Name and Embarked are also replaced with numbers.
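The idea behind count encoding can be sketched in a few lines of plain pandas. This is only an illustration on an invented toy Series, not DataLiner's actual implementation.

```python
import pandas as pd

# Toy column, invented for illustration
sex = pd.Series(["male", "male", "female", "male", "female"], name="Sex")

# Number of occurrences per category: male -> 3, female -> 2
counts = sex.value_counts()

# Replace each category with its occurrence count
encoded = sex.map(counts)
print(encoded.tolist())
```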

RankedCountEncoding

CountEncoding is a simple yet powerful technique, but it is vulnerable to outliers, for example when one category occurs an unusually large number of times. Also, categories with the same number of occurrences are replaced with the same number and become indistinguishable. (Of course, that is fine if they should not be distinguished in the first place.)

Therefore, instead of using the raw count, this encoder ranks the categories in descending order of occurrence count and replaces each category with its rank. In the previous example, male (577) ranks first in Sex and female (314) ranks second.

trans = dl.RankedCountEncoding()
trans.fit_transform(X)

Male and female are replaced by 1 and 2, respectively. If you check Name, you can also see that the values are encoded with different numbers even though each occurs only once. Because the encoding uses the position in the ranking, different categories always receive different numbers, even when their occurrence counts are the same.
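A minimal pandas sketch of this ranking idea follows. The toy data and the name `rank_of` are invented for illustration, and the order among tied categories here simply follows whatever order `value_counts` returns; how DataLiner breaks ties is not covered in this article.

```python
import pandas as pd

# Toy column: a occurs 3 times, b twice, c and d once each
values = pd.Series(["a", "b", "a", "c", "a", "b", "d"], name="Category")

# value_counts sorts categories by count in descending order
counts = values.value_counts()

# Rank = 1-based position in that ordering, so tied categories
# (c and d) still receive different numbers
rank_of = {category: position
           for position, category in enumerate(counts.index, start=1)}

encoded = values.map(rank_of)
print(rank_of)
```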

FrequencyEncoding

CountEncoding encodes by the number of occurrences, whereas FrequencyEncoding encodes by the relative frequency of occurrence. It is easy to handle because the result always falls between 0 and 1.

For example, for Sex, male is encoded as 577 / (577 + 314) = 0.647... and female as 314 / (577 + 314) = 0.352...

trans = dl.FrequencyEncoding()
trans.fit_transform(X)

As with CountEncoding, categories with the same number of occurrences still cannot be distinguished, but the vulnerability to outliers is reduced. RankedFrequencyEncoding is not provided because it would give the same result as RankedCountEncoding.
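The arithmetic for Sex can be checked with a short pandas sketch. The Series below is rebuilt from the counts quoted earlier (577 male, 314 female) rather than read from train.csv, and the code is an illustration of the technique, not DataLiner's implementation.

```python
import pandas as pd

# Rebuild a Sex column from the counts quoted in the article
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

# normalize=True returns count / total for each category
freq = sex.value_counts(normalize=True)

# Replace each category with its relative frequency
encoded = sex.map(freq)
print(freq)
```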

TargetMeanEncoding

The name became well known through Kaggle and similar venues, but the idea itself is a basic one in data analysis. Specifically, it replaces each category with the mean of the target variable for that category.

Since the target here is whether each passenger survived, Sex, for example, is replaced with the survival rate for each gender. However, because this uses information from the very target being predicted, when a category has few samples the encoded value closely tracks the target itself and leakage occurs easily. To mitigate this, DataLiner's implementation takes a Bayesian-style approach: the overall mean of the target is used as the prior for every category, and the encoded value is weighted by the number of samples in the category.

trans = dl.TargetMeanEncoding()
trans.fit_transform(X, y)
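One common way to blend a prior with per-category means is sketched below in plain pandas. The function name `smoothed_target_mean`, the weight parameter `k`, and the exact formula are assumptions chosen for illustration; DataLiner's actual weighting may differ.

```python
import pandas as pd

def smoothed_target_mean(categories: pd.Series, target: pd.Series,
                         k: float = 10.0) -> pd.Series:
    """Blend each category's mean with the global mean, weighted by count.

    k is a hypothetical prior weight: the larger k is, the more samples
    a category needs before its own mean dominates the prior.
    """
    prior = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + k * prior) / (stats["count"] + k)
    return categories.map(smoothed)

# Toy data, invented for illustration
toy_cat = pd.Series(["a", "a", "a", "a", "b"])
toy_y = pd.Series([1, 1, 0, 1, 0])

encoded = smoothed_target_mean(toy_cat, toy_y, k=2.0)
print(encoded.tolist())
```

In this toy example, category b has a single sample with target 0, yet its encoded value is pulled toward the global mean instead of being exactly 0, which is the leakage-reducing effect described above.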

RankedTargetMeanEncoding

This ranks the results of TargetMeanEncoding and replaces each category with its rank. For example, if women have the highest survival rate and men the second highest, women are encoded as 1 and men as 2. Because TargetMeanEncoding uses a prior, categories with few samples relative to the whole dataset are encoded with nearly identical values (close to the prior) even though they are different categories. RankedTargetMeanEncoding encodes them as clearly distinct values.

trans = dl.RankedTargetMeanEncoding()
trans.fit_transform(X, y)
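The ranking step can be sketched in plain pandas as follows. For brevity this sketch ranks raw per-category means on invented toy data, whereas DataLiner ranks the output of its TargetMeanEncoding; the name `rank_of` is hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration
cat = pd.Series(["female", "male", "female", "male", "male", "female"])
y = pd.Series([1, 0, 1, 0, 1, 1])

# Per-category target mean, highest first
means = y.groupby(cat).mean().sort_values(ascending=False)

# Rank = 1-based position in that ordering
rank_of = {category: position
           for position, category in enumerate(means.index, start=1)}

encoded = cat.map(rank_of)
print(encoded.tolist())
```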

In conclusion

This time I introduced DataLiner's encoding-related features. Next, I would like to introduce its transformation-related features.

Dataliner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/

[PYTHON] Try to process Titanic data with preprocessing library DataLiner (Encoding)
