[PYTHON] DataLiner 1.2.0 released: introducing the newly added preprocessing

Introduction

We have released DataLiner 1.2.0, a preprocessing library for machine learning. This release adds six new preprocessing classes, which I would like to introduce here.

GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html

Installation

Install using pip.

! pip install -U dataliner

Data preparation

As usual, we use the Titanic dataset.

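import pandas as pd
import dataliner as dl
from sklearn.pipeline import make_pipeline  # used by the pipelines below

# Load the Titanic training data and separate the target column
df = pd.read_csv('train.csv')
target_col = 'Survived'

X = df.drop(target_col, axis=1)
y = df[target_col]
X.head()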
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.250 NaN S
2 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.283 C85 C
3 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 NaN S
4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.100 C123 S
5 3 Allen, Mr. William Henry male 35 0 0 373450 8.050 NaN S

Let's go through the new classes one by one.

AppendArithmeticFeatures

Applies one of the four arithmetic operations to pairs of features in the data and appends the result as a new feature whenever it scores higher on the evaluation metric than the features it was computed from. Evaluation is done with logistic regression. By default the operation is multiplication and the metric is AUC; addition, subtraction, and division, as well as accuracy as the metric, are also available. Missing values must be filled in before use.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.AppendArithmeticFeatures(metric='roc_auc', operation='multiply')
)
process.fit_transform(X, y)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked PassengerId_multiply_Age PassengerId_multiply_SibSp PassengerId_multiply_Parch Pclass_multiply_Age
1 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.250 B96 B98 S 22 1 0 66
2 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.283 C85 C 76 2 0 38
3 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 B96 B98 S 78 0 0 78
4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.100 C123 S 140 4 0 35
5 3 Allen, Mr. William Henry male 35 0 0 373450 8.050 B96 B98 S 175 0 0 105

In this way, new features are added.
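The selection rule described above can be sketched without DataLiner. The following is only an illustration of the idea, not the library's implementation: it scores each feature by its raw rank-based AUC against the target (a simplification, since DataLiner evaluates with logistic regression) and keeps a product feature only when it beats both parent columns. The function names and the toy data are hypothetical.

```python
from itertools import combinations

def auc(scores, labels):
    """Rank-based AUC: chance that a positive row outranks a negative row."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def append_products(columns, labels):
    """Return product features that score higher than both of their parents."""
    added = {}
    for a, b in combinations(columns, 2):
        product = [x * y for x, y in zip(columns[a], columns[b])]
        parent_best = max(auc(columns[a], labels), auc(columns[b], labels))
        if auc(product, labels) > parent_best:
            added[f"{a}_multiply_{b}"] = product
    return added

# Toy data (hypothetical): the product separates the classes better
# than either column does on its own.
columns = {"u": [1, 2, 3, 4], "v": [4, 3, 1, 2]}
labels = [0, 1, 0, 1]
new_features = append_products(columns, labels)
print(new_features)
```

Here the product column is kept because its AUC exceeds that of both u and v; a pair whose product does not improve on its parents would simply be skipped.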

RankedEvaluationMetricEncoding

Each category is first turned into a dummy variable, and a logistic regression is fitted between each dummy column and the target variable. The categories are then ranked by the resulting metric (AUC by default), and the original category is encoded with its rank. Because a 5-fold logistic regression is fitted for every category, the computation becomes very expensive for high-cardinality features, so it is recommended to reduce the cardinality beforehand with DropHighCardinality or GroupRareCategory.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedEvaluationMetricEncoding()
)
process.fit_transform(X, y)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 3 640 2 22 1 0 288 7.250 1 1
2 1 554 1 38 1 0 284 71.283 77 2
3 3 717 1 26 0 0 256 7.925 1 1
4 1 803 1 35 1 0 495 53.100 112 1
5 3 602 2 35 0 0 94 8.050 1 1

You can also check how important each category in the categorical variable is by outputting the ranking.

process['rankedevaluationmetricencoding'].dic_corr_['Embarked']
Category Rank Evaluation_Metric
S 1 0.5688
C 2 0.5678
Q 3 0.4729
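To make the ranking step concrete, here is a small dependency-free sketch. It is not DataLiner's code: instead of a 5-fold logistic regression per dummy column, it scores each dummy column by its raw rank-based AUC, which is an assumption made to keep the example short. The function names and data are hypothetical.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a positive row outranks a negative row."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ranked_metric_encoding(values, labels):
    """Encode each category with its rank (1 = best) by dummy-column AUC."""
    metric = {}
    for category in set(values):
        dummy = [1 if v == category else 0 for v in values]
        metric[category] = auc(dummy, labels)
    # Sort by metric (descending); break ties by category name
    ordered = sorted(metric, key=lambda c: (-metric[c], c))
    rank = {category: i + 1 for i, category in enumerate(ordered)}
    return [rank[v] for v in values], rank, metric

# Toy data (hypothetical)
values = ["C", "C", "S", "S", "Q", "Q"]
labels = [1, 1, 1, 0, 0, 0]
encoded, rank, metric = ranked_metric_encoding(values, labels)
print(encoded, rank)
```

The rank dictionary plays the same role as the ranking table shown above: it tells you which categories carry the most signal about the target.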

AppendClassificationModel

Trains a classifier on the input data and appends its predictions as a new feature. Any sklearn-compatible model can be used. If the model implements predict_proba, passing probability=True appends a score instead of a label. Because a model is trained, missing-value imputation and categorical-variable encoding are generally required beforehand.

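from sklearn.ensemble import RandomForestClassifier

process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    # Any sklearn-compatible classifier can be passed as the model
    dl.AppendClassificationModel(model=RandomForestClassifier(n_estimators=300, max_depth=5),
                                 probability=False)
)
process.fit_transform(X, y)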
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Predicted_RandomForestClassifier
1 3 0.3838 0.1889 22 1 0 0.3838 7.250 0.3039 0.3390 0
2 1 0.3838 0.7420 38 1 0 0.3838 71.283 0.3838 0.5536 1
3 3 0.3838 0.7420 26 0 0 0.3838 7.925 0.3039 0.3390 1
4 1 0.3838 0.7420 35 1 0 0.4862 53.100 0.4862 0.3390 1
5 3 0.3838 0.1889 35 0 0 0.3838 8.050 0.3039 0.3390 0

With probability=True, the result looks like this: the predicted score for class 1 is appended instead of the label.

PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Predicted_RandomForestClassifier
1 3 0.3838 0.1889 22 1 0 0.3838 7.250 0.3039 0.3390 0.1497
2 1 0.3838 0.7420 38 1 0 0.3838 71.283 0.3838 0.5536 0.8477
3 3 0.3838 0.7420 26 0 0 0.3838 7.925 0.3039 0.3390 0.5401
4 1 0.3838 0.7420 35 1 0 0.4862 53.100 0.4862 0.3390 0.8391
5 3 0.3838 0.1889 35 0 0 0.3838 8.050 0.3039 0.3390 0.1514
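The label-versus-score behaviour can be illustrated with a minimal sketch. Everything here is hypothetical: FrequencyModel is a tiny stand-in that only mimics the fit / predict / predict_proba interface of an sklearn-style classifier, and append_prediction is my own illustration of the idea, not DataLiner's implementation.

```python
from collections import defaultdict

class FrequencyModel:
    """Tiny stand-in classifier: P(y=1) per value of the first column."""

    def fit(self, X, y):
        counts = defaultdict(lambda: [0, 0])  # key -> [n_rows, n_positive]
        for row, label in zip(X, y):
            counts[row[0]][0] += 1
            counts[row[0]][1] += label
        self.rate_ = {k: pos / n for k, (n, pos) in counts.items()}
        return self

    def predict_proba(self, X):
        return [[1 - self.rate_[row[0]], self.rate_[row[0]]] for row in X]

    def predict(self, X):
        return [int(p1 >= 0.5) for _, p1 in self.predict_proba(X)]

def append_prediction(X, y, model, probability=False):
    """Fit the model, then append its prediction as an extra column."""
    model.fit(X, y)
    if probability and hasattr(model, "predict_proba"):
        new_col = [p[1] for p in model.predict_proba(X)]  # score for class 1
    else:
        new_col = model.predict(X)
    return [list(row) + [value] for row, value in zip(X, new_col)]

# Toy data (hypothetical)
X = [["male"], ["female"], ["female"], ["male"], ["male"]]
y = [0, 1, 1, 0, 1]
with_label = append_prediction(X, y, FrequencyModel())
with_score = append_prediction(X, y, FrequencyModel(), probability=True)
print(with_label)
print(with_score)
```

Note that this sketch predicts on the same rows it was trained on, so the appended column is optimistic; it is meant only to show how the two modes differ.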

AppendEncoder

The encoders included in DataLiner replace the category columns directly with the encoded values. In some cases, however, you may want to keep the original column and add the encoded values as a new feature (with TargetMeanEncoding, for example). Wrapping the encoder in this class appends the encoded columns as new features instead.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.AppendEncoder(encoder=dl.TargetMeanEncoding())
)
process.fit_transform(X, y)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Name_TargetMeanEncoding Sex_TargetMeanEncoding Ticket_TargetMeanEncoding Cabin_TargetMeanEncoding Embarked_TargetMeanEncoding
1 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.250 B96 B98 S 0.3838 0.1889 0.3838 0.3039 0.3390
2 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.283 C85 C 0.3838 0.7420 0.3838 0.3838 0.5536
3 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 B96 B98 S 0.3838 0.7420 0.3838 0.3039 0.3390
4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.100 C123 S 0.3838 0.7420 0.4862 0.4862 0.3390
5 3 Allen, Mr. William Henry male 35 0 0 373450 8.050 B96 B98 S 0.3838 0.1889 0.3838 0.3039 0.3390
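The difference between replacing and appending can be sketched as follows. This is a hypothetical, dependency-free illustration that uses a plain, unsmoothed target mean; DataLiner's own TargetMeanEncoding may compute its values differently, and the function names are my own.

```python
def target_mean_map(values, target):
    """Mean of the target for each category (no smoothing)."""
    totals = {}
    for v, t in zip(values, target):
        s, n = totals.get(v, (0, 0))
        totals[v] = (s + t, n + 1)
    return {v: s / n for v, (s, n) in totals.items()}

def append_encoded(columns, name, target, suffix="TargetMeanEncoding"):
    """Return a copy of `columns` with the encoded column added.

    The original category column is kept as it is.
    """
    mapping = target_mean_map(columns[name], target)
    out = dict(columns)
    out[f"{name}_{suffix}"] = [mapping[v] for v in columns[name]]
    return out

# Toy data (hypothetical)
columns = {"Sex": ["male", "female", "female", "male"]}
survived = [0, 1, 1, 1]
result = append_encoded(columns, "Sex", survived)
print(result)
```

Keeping both columns lets a later step, such as another encoder, still see the raw category.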

AppendClusterTargetMean

Clusters the data and assigns a cluster number to each row (up to this point it is the same as AppendCluster). Each cluster number is then replaced with the mean of the target variable within that cluster, and the result is appended as a new feature. Missing-value imputation and categorical-variable encoding are required beforehand.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.AppendClusterTargetMean()
)
process.fit_transform(X, y)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked cluster_mean
1 3 0.3838 0.1889 22 1 0 0.3838 7.250 0.3039 0.3390 0.3586
2 1 0.3838 0.7420 38 1 0 0.3838 71.283 0.3838 0.5536 0.3586
3 3 0.3838 0.7420 26 0 0 0.3838 7.925 0.3039 0.3390 0.3586
4 1 0.3838 0.7420 35 1 0 0.4862 53.100 0.4862 0.3390 0.3586
5 3 0.3838 0.1889 35 0 0 0.3838 8.050 0.3039 0.3390 0.3586
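The two steps, cluster assignment followed by a per-cluster target mean, can be sketched like this. It is a simplified illustration under stated assumptions: the data is one-dimensional, and the centroids are given by hand as if a clustering algorithm had already been fitted. None of the names come from DataLiner.

```python
def assign_cluster(value, centroids):
    """Index of the nearest centroid (1-D for brevity)."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))

def cluster_target_mean(values, target, centroids):
    """Replace each row's cluster number with the target mean of that cluster."""
    clusters = [assign_cluster(v, centroids) for v in values]
    sums, counts = {}, {}
    for c, t in zip(clusters, target):
        sums[c] = sums.get(c, 0) + t
        counts[c] = counts.get(c, 0) + 1
    return [sums[c] / counts[c] for c in clusters]

# Toy values (hypothetical)
fares = [7.25, 71.28, 7.93, 53.10, 8.05]
survived = [0, 1, 1, 1, 0]
centroids = [8.0, 60.0]  # assumed output of an earlier clustering step
cluster_mean = cluster_target_mean(fares, survived, centroids)
print(cluster_mean)
```

Rows that fall into the same cluster receive the same value, which matches the identical cluster_mean values seen for the first five rows above.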

PermutationImportanceTest

This is a feature selection method. Features are selected according to how much the model's evaluation metric deteriorates when the values of a feature are randomly shuffled, compared with when they are not. If shuffling a feature has little effect on the metric, the feature is considered ineffective and is removed.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.PermutationImportanceTest()
)
process.fit_transform(X, y)
Pclass Sex Age SibSp Ticket Fare Cabin Embarked
3 0.1889 22 1 0.3838 7.250 0.3039 0.3390
1 0.7420 38 1 0.3838 71.283 0.3838 0.5536
3 0.7420 26 0 0.3838 7.925 0.3039 0.3390
1 0.7420 35 1 0.4862 53.100 0.4862 0.3390
3 0.1889 35 0 0.3838 8.050 0.3039 0.3390

Name, PassengerId and Parch have been removed. You can also check the deleted features as follows.

process['permutationimportancetest'].drop_columns_

['PassengerId', 'Name', 'Parch']

You can also adjust the sensitivity by changing the threshold parameter. See the documentation for details.
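The shuffling test itself is easy to sketch. The code below is a generic illustration of permutation importance, not DataLiner's implementation: the model is a hand-written rule standing in for a fitted classifier, the metric is accuracy, and the threshold value is an assumption chosen for this example.

```python
import random

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_drop(model, rows, labels, col, n_repeats=20, seed=0):
    """Average drop in accuracy after randomly shuffling one column."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        drops.append(base - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats

def model(row):
    """Hypothetical fitted model: uses column 0 only and ignores column 1."""
    return int(row[0] > 0.5)

# Toy data (hypothetical)
rows = [[0.1, 5], [0.9, 3], [0.2, 8], [0.8, 1], [0.3, 7], [0.7, 2]]
labels = [0, 1, 0, 1, 0, 1]

threshold = 0.0001  # assumed cut-off for this sketch
drops = {col: permutation_drop(model, rows, labels, col) for col in (0, 1)}
kept = [col for col, drop in drops.items() if drop > threshold]
print(drops, kept)
```

Shuffling column 1 never changes the predictions, so its drop is zero and it would be removed, while column 0 is kept. Raising the threshold makes the test stricter.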

In conclusion

That covers the newly added preprocessing. RankedEvaluationMetricEncoding is sometimes more accurate than TargetMeanEncoding, so I often try it. PermutationImportanceTest runs faster than Boruta or stepwise methods, yet the results are surprisingly similar. I think it is useful when you want to select features more rigorously than with DropLowAUC.

Release article: [Updated Ver1.1.9] I made a data preprocessing library DataLiner for machine learning

The preprocessing available before 1.2 is introduced in the following articles.

Try processing Titanic data with the preprocessing library DataLiner (Drop)
Try processing Titanic data with the preprocessing library DataLiner (Encoding)
Try processing Titanic data with the preprocessing library DataLiner (conversion)
Try processing Titanic data with the preprocessing library DataLiner (Append)
