Introduction

I have released DataLiner 1.2.0, a preprocessing library for machine learning. This release adds six new preprocessing classes, which I introduce below.

GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html

Installation

Install using pip.

! pip install -U dataliner

Data preparation

As usual, I use the Titanic dataset.

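import pandas as pd
import dataliner as dl
# Used by the pipeline examples below
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier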

df = pd.read_csv('train.csv')
target_col = 'Survived'

X = df.drop(target_col, axis=1)
y = df[target_col]

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.250	NaN	S
2	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.283	C85	C
3	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	NaN	S
4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.100	C123	S
5	3	Allen, Mr. William Henry	male	35	0	373450	8.050	NaN	S

Let's take a look now.

AppendArithmeticFeatures

Applies an arithmetic operation to pairs of features in the data and appends the result as a new feature whenever it scores higher on the evaluation metric than the features it was computed from. The evaluation uses logistic regression. By default the operation is multiplication and the metric is AUC; addition, subtraction, and division are also available, as is accuracy as the metric. Missing values must be imputed before use.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.AppendArithmeticFeatures(metric='roc_auc', operation='multiply')
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	PassengerId_multiply_Age	PassengerId_multiply_SibSp	Pclass_multiply_Age
1	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.250	B96 B98	S	22	1	66
2	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.283	C85	C	76	2	38
3	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	B96 B98	S	78	0	78
4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.100	C123	S	140	4	35
5	3	Allen, Mr. William Henry	male	35	0	373450	8.050	B96 B98	S	175	0	105

In this way, new features are added.
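To make the selection rule concrete, here is a minimal sketch in plain Python. It is not DataLiner's implementation: it assumes a rank-based AUC computed directly on each feature (taking the better of the two directions) as a stand-in for the logistic-regression evaluation, and the function names and toy data are made up for illustration.

```python
from itertools import combinations


def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def directionless_auc(scores, labels):
    # Assumption: a one-feature classifier may flip its sign, so score both directions.
    a = auc(scores, labels)
    return max(a, 1 - a)


def append_products(columns, labels):
    """Append a_multiply_b whenever it beats both a and b on the metric."""
    result = dict(columns)
    base = {name: directionless_auc(vals, labels) for name, vals in columns.items()}
    for a, b in combinations(columns, 2):
        product = [x * y for x, y in zip(columns[a], columns[b])]
        if directionless_auc(product, labels) > max(base[a], base[b]):
            result[f"{a}_multiply_{b}"] = product
    return result


# Toy data (made up): the label is 1 exactly when x1 and x2 share a sign.
columns = {
    "x1": [1, -1, 2, -2, 1, -1],
    "x2": [1, -1, -1, 1, 2, 2],
}
labels = [1, 1, 0, 0, 1, 0]

augmented = append_products(columns, labels)
print(list(augmented))  # ['x1', 'x2', 'x1_multiply_x2']
```

On this toy data neither input separates the classes well on its own, while their product does so perfectly, so the product column is the one that gets appended.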

RankedEvaluationMetricEncoding

Each category is first converted to a dummy variable, and a logistic regression is then fitted between each dummy column and the target variable. The categories are ranked by the resulting metric (AUC by default), and the original category values are encoded with that rank. Because a 5-fold logistic regression is fitted for every category, the computation becomes very expensive for high-cardinality features, so it is recommended to reduce the cardinality beforehand with DropHighCardinality or GroupRareCategory.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedEvaluationMetricEncoding()
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	3	640	2	22	1	288	7.250	1	1
2	1	554	1	38	1	284	71.283	77	2
3	3	717	1	26	0	256	7.925	1	1
4	1	803	1	35	1	495	53.100	112	1
5	3	602	2	35	0	94	8.050	1	1

You can also check how important each category in the categorical variable is by outputting the ranking.

process['rankedevaluationmetricencoding'].dic_corr_['Embarked']

Category	Rank	Evaluation_Metric
S	1	0.5688
C	2	0.5678
Q	3	0.4729
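The dummy-then-rank idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the AUC of each dummy column is computed directly rather than through a 5-fold logistic regression, and the function names and toy data are hypothetical.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ranked_metric_encoding(categories, labels):
    """Encode each category by the rank of its one-vs-rest AUC (1 = highest)."""
    metric = {}
    for cat in set(categories):
        dummy = [1 if c == cat else 0 for c in categories]  # dummy variable
        metric[cat] = auc(dummy, labels)
    # Sort by metric descending; break ties by name so the result is deterministic.
    ordered = sorted(metric, key=lambda c: (-metric[c], c))
    ranking = {cat: rank for rank, cat in enumerate(ordered, start=1)}
    return [ranking[c] for c in categories], ranking, metric


# Toy data (made up for illustration)
categories = ["a", "a", "a", "b", "b", "b", "c", "c"]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

encoded, ranking, metric = ranked_metric_encoding(categories, labels)
print(ranking)  # {'a': 1, 'b': 2, 'c': 3}
print(encoded)  # [1, 1, 1, 2, 2, 2, 3, 3]
```

Category "a" contains most of the positives, so its dummy column scores highest and it receives rank 1.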

AppendClassificationModel

Trains a classifier on the input data and appends its predictions as a new feature. Any scikit-learn-compatible model can be used. If the model implements the predict_proba method, passing probability=True appends a score instead of a label. Because a model is trained, missing-value imputation and categorical-variable encoding are generally required beforehand.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.AppendClassificationModel(model=RandomForestClassifier(n_estimators=300, max_depth=5),
                                 probability=False)
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Predicted_RandomForestClassifier
1	3	0.3838	0.1889	22	1	0.3838	7.250	0.3039	0.3390	0
2	1	0.3838	0.7420	38	1	0.3838	71.283	0.3838	0.5536	1
3	3	0.3838	0.7420	26	0	0.3838	7.925	0.3039	0.3390	1
4	1	0.3838	0.7420	35	1	0.4862	53.100	0.4862	0.3390	1
5	3	0.3838	0.1889	35	0	0.3838	8.050	0.3039	0.3390	0

With probability=True, the predicted score for class 1 is appended instead.

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Predicted_RandomForestClassifier
1	3	0.3838	0.1889	22	1	0.3838	7.250	0.3039	0.3390	0.1497
2	1	0.3838	0.7420	38	1	0.3838	71.283	0.3838	0.5536	0.8477
3	3	0.3838	0.7420	26	0	0.3838	7.925	0.3039	0.3390	0.5401
4	1	0.3838	0.7420	35	1	0.4862	53.100	0.4862	0.3390	0.8391
5	3	0.3838	0.1889	35	0	0.3838	8.050	0.3039	0.3390	0.1514

AppendEncoder

The encoders included in DataLiner replace the categorical columns directly with encoded values. In some cases, however, you may want to add the encoded values as new features without replacing the originals (for example, with TargetMeanEncoding). Wrapping an encoder in this class does exactly that.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.AppendEncoder(encoder=dl.TargetMeanEncoding())
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Name_TargetMeanEncoding	Sex_TargetMeanEncoding	Ticket_TargetMeanEncoding	Cabin_TargetMeanEncoding	Embarked_TargetMeanEncoding
1	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.250	B96 B98	S	0.3838	0.1889	0.3838	0.3039	0.3390
2	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.283	C85	C	0.3838	0.7420	0.3838	0.3838	0.5536
3	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	B96 B98	S	0.3838	0.7420	0.3838	0.3039	0.3390
4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.100	C123	S	0.3838	0.7420	0.4862	0.4862	0.3390
5	3	Allen, Mr. William Henry	male	35	0	373450	8.050	B96 B98	S	0.3838	0.1889	0.3838	0.3039	0.3390
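The difference between replacing a column and appending one can be sketched in plain Python as follows. The sketch uses a plain, unsmoothed target mean, whereas DataLiner's own encoder appears to pull rare categories toward the overall mean (unique values such as Name all map to 0.3838 in the table above); the function names and toy data are hypothetical.

```python
from collections import defaultdict


def target_mean_map(categories, target):
    """Map each category to the mean of the target over its rows (no smoothing)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, target):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


def append_encoded(rows, column, target):
    """Return copies of rows with '<column>_TargetMeanEncoding' added.

    The original column is kept, which is the point of appending
    instead of replacing.
    """
    mapping = target_mean_map([r[column] for r in rows], target)
    new_name = f"{column}_TargetMeanEncoding"
    return [{**r, new_name: mapping[r[column]]} for r in rows]


# Toy data (made up for illustration)
rows = [{"Sex": "male"}, {"Sex": "female"}, {"Sex": "female"}, {"Sex": "male"}]
y = [0, 1, 1, 1]

appended = append_encoded(rows, "Sex", y)
print(appended[0])  # {'Sex': 'male', 'Sex_TargetMeanEncoding': 0.5}
```

Keeping both columns lets a downstream model use the raw category (after another encoding) alongside its target mean.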

AppendClusterTargetMean

Clusters the data and assigns a cluster number to each row (up to this point it is the same as AppendCluster). Each cluster number is then replaced with the mean of the target variable within that cluster, and the result is appended as a new feature. Missing-value imputation and categorical-variable encoding are required beforehand.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.AppendClusterTargetMean()
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	cluster_mean
1	3	0.3838	0.1889	22	1	0.3838	7.250	0.3039	0.3390	0.3586
2	1	0.3838	0.7420	38	1	0.3838	71.283	0.3838	0.5536	0.3586
3	3	0.3838	0.7420	26	0	0.3838	7.925	0.3039	0.3390	0.3586
4	1	0.3838	0.7420	35	1	0.4862	53.100	0.4862	0.3390	0.3586
5	3	0.3838	0.1889	35	0	0.3838	8.050	0.3039	0.3390	0.3586
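The two steps, assigning a cluster and then replacing it with the cluster's target mean, can be sketched in plain Python. As a loud assumption, hand-picked centroids with nearest-centroid assignment stand in for a fitted clustering model, so this shows only the mechanics and not DataLiner's implementation; the function names, centroids, and toy values are all illustrative.

```python
from collections import defaultdict
from math import dist


def assign_clusters(points, centroids):
    """Give each point the index of its nearest centroid (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        for p in points
    ]


def cluster_target_mean(cluster_ids, target):
    """Replace each cluster id with the mean of the target inside that cluster."""
    totals = defaultdict(lambda: [0.0, 0])  # cluster id -> [sum, count]
    for cid, y in zip(cluster_ids, target):
        totals[cid][0] += y
        totals[cid][1] += 1
    means = {cid: s / n for cid, (s, n) in totals.items()}
    return [means[cid] for cid in cluster_ids]


# Toy (Age, Fare) points and targets, for illustration only
points = [(22, 7.25), (38, 71.28), (26, 7.93), (35, 53.10), (35, 8.05)]
y = [0, 1, 1, 1, 0]
centroids = [(28, 8.0), (36, 60.0)]  # hypothetical, not fitted

cluster_ids = assign_clusters(points, centroids)
cluster_mean = cluster_target_mean(cluster_ids, y)
print(cluster_ids)  # [0, 1, 0, 1, 0]
```

Every row in the same cluster receives the same value, which is why all five rows in the table above share one cluster_mean.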

PermutationImportanceTest

This is a feature selection method. Features are selected according to how much the model's evaluation metric degrades when the values of a feature are randomly shuffled, compared with when they are left intact. If shuffling a feature has little effect on the metric, the feature is considered ineffective and is removed.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.PermutationImportanceTest()
)
process.fit_transform(X, y)

Pclass	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
3	0.1889	22	1	0.3838	7.250	0.3039	0.3390
1	0.7420	38	1	0.3838	71.283	0.3838	0.5536
3	0.7420	26	0	0.3838	7.925	0.3039	0.3390
1	0.7420	35	1	0.4862	53.100	0.4862	0.3390
3	0.1889	35	0	0.3838	8.050	0.3039	0.3390

Name, PassengerId and Parch have been removed. You can also check the deleted features as follows.

process['permutationimportancetest'].drop_columns_

['PassengerId', 'Name', 'Parch']

You can also adjust the sensitivity through the threshold argument. See the documentation for details.
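The shuffle-and-compare idea behind permutation importance can be sketched in plain Python. This is a minimal illustration, not DataLiner's implementation: the already-fitted model, the accuracy metric, the number of repeats, and the cut-off value are all assumptions chosen for the example.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_drop(model, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy when the values of `column` are shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, column: v} for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


# Hypothetical already-fitted model: predicts survival from Sex alone.
def model(row):
    return 1 if row["Sex"] == "female" else 0


sexes = ["male", "female", "female", "female", "male", "male", "female", "male"]
rows = [{"PassengerId": i, "Sex": s} for i, s in enumerate(sexes, start=1)]
labels = [model(r) for r in rows]  # toy labels that the model fits perfectly

threshold = 0.0  # hypothetical cut-off: drop features whose shuffle costs nothing
drops = {c: permutation_drop(model, rows, labels, c) for c in ["Sex", "PassengerId"]}
dropped = [c for c, d in drops.items() if d <= threshold]
print(dropped)  # ['PassengerId']
```

Shuffling PassengerId never changes a prediction, so its drop is exactly zero and it is removed, while shuffling Sex degrades accuracy and the feature is kept.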

Conclusion

Those are the newly added preprocessing classes. RankedEvaluationMetricEncoding is sometimes more accurate than TargetMeanEncoding, so I often try it. PermutationImportanceTest runs faster than Boruta or step-wise methods, and, somewhat unexpectedly, its results do not differ much from theirs. I think it is useful when you want to select features more rigorously than with DropLowAUC.

Release article: [Updated Ver1.1.9] I made a data preprocessing library DataLiner for machine learning

The preprocessing available before 1.2 is introduced in the following articles:
Try processing Titanic data with the preprocessing library DataLiner (Drop)
Try processing Titanic data with the preprocessing library DataLiner (Encoding)
Try processing Titanic data with the preprocessing library DataLiner (conversion)
Try processing Titanic data with the preprocessing library DataLiner (Append)
