[Python] Introducing DataLiner ver. 1.3 and how to use UnionAppend

Introduction

With the release of DataLiner 1.3.1, all of the main functions originally planned have been implemented. Going forward, development will focus on fixing bugs and adding new preprocessing steps as they come up.

Release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
GitHub: https://github.com/shallowdf20/dataliner
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html

Installation

! pip install -U dataliner

Changes in ver1.3

There are four changes:

- Implemented UnionAppend
- Removed StandardizeData (renamed to StandardScaling)
- Removed ArithmeticFeatureGenerator (renamed to AppendArithmeticFeatures)
- Implemented load_titanic

Let's walk through how to use them.

How to use

First, import the packages used in this article.

import dataliner as dl
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

Starting with this version, the Titanic dataset ships with the package, which makes it easy to try things out. Use the load_titanic function to load the sample Titanic data.

X, X_test, y = dl.load_titanic()

Now X contains the data from train.csv excluding the 'Survived' column, X_test contains test.csv, and y is the 'Survived' column from train.csv.
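Conceptually, load_titanic just splits the familiar Kaggle Titanic files into features and target. Here is a minimal pure-Python sketch of that split, using toy rows as a stand-in for the real DataFrames (this is an illustration, not the dataliner implementation):

```python
# Toy stand-ins for train.csv and test.csv; real data has many more columns.
train = [
    {"PassengerId": 1, "Pclass": 3, "Survived": 0},
    {"PassengerId": 2, "Pclass": 1, "Survived": 1},
]
test = [
    {"PassengerId": 892, "Pclass": 3},
]

# X: training features without the target column.
X = [{k: v for k, v in row.items() if k != "Survived"} for row in train]
# y: the 'Survived' target column.
y = [row["Survived"] for row in train]
# X_test: the test features (no target available).
X_test = test

print(y)     # [0, 1]
print(X[0])  # {'PassengerId': 1, 'Pclass': 3}
```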

Using UnionAppend

This time, I will introduce **UnionAppend**.

In DataLiner, essentially every preprocessing step that derives new features from existing ones is given a class name of the form AppendXxx.

That is why ArithmeticFeatureGenerator was renamed to AppendArithmeticFeatures in this version. The exceptions are BinarizeNaN and CountRowNaN; since these are, as a rule, applied before missing-value imputation and category encoding, they keep their own names.

Now, suppose you want to add a number of features and build a pipeline like this:

process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.StandardScaling(),
    dl.AppendCluster(),
    dl.AppendAnomalyScore(),
    dl.AppendPrincipalComponent(),
    dl.AppendClusterTargetMean(),
    dl.AppendClassificationModel(model=RandomForestClassifier(n_estimators=100, max_depth=5)),
    dl.AppendClusterDistance(),
    dl.AppendArithmeticFeatures(),
)
process.fit_transform(X, y)

With this layout, the features added by AppendCluster become part of the input to the next step, AppendAnomalyScore (and the feature count keeps growing through every subsequent AppendXxx step).

Sometimes you want these steps to run in parallel rather than in series, so that every AppendXxx step works from the same base features. That is what UnionAppend is for.
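The difference between the two layouts can be sketched in plain Python with mock transformers that only track column names (an illustration of the idea, not the dataliner API):

```python
# Mock "Append" steps: each returns the names of the columns it would add,
# given the columns it can see. Real dataliner transformers return DataFrames.
def append_cluster(cols):
    return ["Cluster_Number"]

def append_anomaly(cols):
    # The name records how many input features this step saw.
    return ["Anomaly_Score_from_%d_cols" % len(cols)]

steps = (append_cluster, append_anomaly)
base = ["Age", "Fare"]

# Serial pipeline: each step's input includes what earlier steps added.
serial = list(base)
for step in steps:
    serial = serial + step(serial)
print(serial)  # ['Age', 'Fare', 'Cluster_Number', 'Anomaly_Score_from_3_cols']

# UnionAppend-style: every step sees only the shared base columns,
# and the new columns are concatenated onto the base.
union = list(base)
for step in steps:
    union = union + step(base)
print(union)   # ['Age', 'Fare', 'Cluster_Number', 'Anomaly_Score_from_2_cols']
```

In the serial case the anomaly step saw three input columns (the base plus the cluster feature); in the union case it saw only the two base columns.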

process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.StandardScaling(),
    dl.UnionAppend([
        dl.AppendCluster(),
        dl.AppendAnomalyScore(),
        dl.AppendPrincipalComponent(),
        dl.AppendClusterTargetMean(),
        dl.AppendClassificationModel(model=RandomForestClassifier(n_estimators=100, max_depth=5)),
        dl.AppendClusterDistance(),
        dl.AppendArithmeticFeatures(),
    ]),
)
process.fit_transform(X, y)

Pass the AppendXxx transformers you want to apply to UnionAppend as a list; every step inside UnionAppend then works from the same base features, and their results are concatenated and returned. The output looks like this:

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Cluster_Number | Anomaly_Score | Principal_Component_0 | Principal_Component_1 | Principal_Component_2 | Principal_Component_3 | Principal_Component_4 | cluster_mean | Predicted_RandomForestClassifier | Cluster_Distance_0 | Cluster_Distance_1 | Cluster_Distance_2 | Cluster_Distance_3 | Cluster_Distance_4 | Cluster_Distance_5 | Cluster_Distance_6 | Cluster_Distance_7 | Age_multiply_SibSp | PassengerId_multiply_SibSp | SibSp_multiply_Parch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| -1.729 | 0.8269 | -0.9994 | -0.7373 | -0.5921 | 0.4326 | -0.4734 | -0.1954 | -0.5022 | -0.3479 | -0.5397 | 1 | 0.094260 | -1.4177 | 0.1906 | -0.35640 | -1.398 | -0.5801 | 0.1677 | 0 | 2.861 | 1.265 | 4.352 | 3.466 | 5.616 | 3.461 | 2.782 | 5.667 | -0.2561 | -0.7479 | -0.2048 |
| -1.725 | -1.5652 | -0.9994 | 1.3548 | 0.6384 | 0.4326 | -0.4734 | -0.1954 | 0.7864 | 0.1665 | 2.0434 | 5 | -0.047463 | 1.9956 | 0.1777 | -0.14888 | -2.449 | 0.6941 | 0.4874 | 1 | 3.768 | 4.335 | 5.799 | 3.681 | 3.946 | 3.028 | 4.993 | 4.830 | 0.2762 | -0.7463 | -0.2048 |
| -1.721 | 0.8269 | -0.9994 | 1.3548 | -0.2845 | -0.4743 | -0.4734 | -0.1954 | -0.4886 | -0.3479 | -0.5397 | 0 | 0.076929 | -0.8234 | 0.2181 | -1.24773 | -1.380 | -1.2529 | 0.7321 | 1 | 1.870 | 2.311 | 4.937 | 3.759 | 5.490 | 3.548 | 3.376 | 5.467 | 0.1349 | 0.8164 | 0.2245 |
| -1.717 | -1.5652 | -0.9994 | 1.3548 | 0.4077 | 0.4326 | -0.4734 | 0.2317 | 0.4205 | 0.8250 | -0.5397 | 0 | -0.000208 | 1.5823 | 0.2699 | 0.10503 | -1.536 | -1.6788 | 0.7321 | 1 | 2.835 | 3.547 | 5.352 | 3.058 | 4.090 | 3.970 | 4.338 | 3.846 | 0.1763 | -0.7429 | -0.2048 |
| -1.714 | 0.8269 | -0.9994 | -0.7373 | 0.4077 | -0.4743 | -0.4734 | -0.1954 | -0.4861 | -0.3479 | -0.5397 | 1 | 0.106421 | -1.2160 | -0.7344 | -0.09900 | -1.500 | -0.7327 | 0.1677 | 0 | 2.866 | 1.148 | 5.064 | 2.921 | 5.567 | 3.463 | 2.689 | 5.594 | -0.1934 | 0.8127 | 0.2245 |

Transforming the test data works the same as usual.

process.transform(X_test)

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Cluster_Number | Anomaly_Score | Principal_Component_0 | Principal_Component_1 | Principal_Component_2 | Principal_Component_3 | Principal_Component_4 | cluster_mean | Predicted_RandomForestClassifier | Cluster_Distance_0 | Cluster_Distance_1 | Cluster_Distance_2 | Cluster_Distance_3 | Cluster_Distance_4 | Cluster_Distance_5 | Cluster_Distance_6 | Cluster_Distance_7 | Age_multiply_SibSp | PassengerId_multiply_SibSp | SibSp_multiply_Parch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.733 | 0.8269 | -0.9994 | -0.7373 | 0.3692 | -0.4743 | -0.4734 | -0.1954 | -0.4905 | -0.3479 | 0.06949 | 6 | 0.08314 | -0.92627 | -1.0572 | 0.17814 | 1.514 | 0.78456 | 0.1087 | 0 | 3.095 | 2.724 | 5.232 | 2.986 | 5.397 | 3.045 | 1.1646 | 5.405 | -0.17512 | -0.8219 | 0.2245 |
| 1.737 | 0.8269 | -0.9994 | 1.3548 | 1.3306 | 0.4326 | -0.4734 | -0.1954 | -0.5072 | -0.3479 | -0.53969 | 0 | 0.01921 | -0.45407 | -0.2239 | 0.40615 | 1.531 | -0.36302 | 0.7321 | 0 | 2.677 | 3.744 | 5.022 | 3.451 | 5.503 | 3.924 | 2.7926 | 5.414 | 0.57556 | 0.7513 | -0.2048 |
| 1.741 | -0.3692 | -0.9994 | -0.7373 | 2.4843 | -0.4743 | -0.4734 | -0.1954 | -0.4531 | -0.3479 | 0.06949 | 3 | 0.02651 | 0.04527 | -2.0548 | 1.70715 | 1.119 | 0.49872 | 0.2277 | 0 | 4.047 | 3.880 | 6.207 | 2.345 | 5.441 | 3.955 | 2.9554 | 5.527 | -1.17825 | -0.8256 | 0.2245 |
| 1.745 | 0.8269 | -0.9994 | -0.7373 | -0.2076 | -0.4743 | -0.4734 | -0.1954 | -0.4737 | -0.3479 | -0.53969 | 6 | 0.11329 | -1.17022 | -0.7993 | 0.02809 | 1.770 | 0.37658 | 0.1087 | 0 | 3.011 | 2.615 | 5.099 | 3.238 | 5.522 | 3.420 | 0.9194 | 5.466 | 0.09846 | -0.8275 | 0.2245 |
| 1.749 | 0.8269 | -0.9994 | 1.3548 | -0.5921 | 0.4326 | 0.7672 | -0.1954 | -0.4008 | -0.3479 | -0.53969 | 0 | 0.02122 | -0.63799 | 1.2879 | -0.38498 | 1.920 | -0.06859 | 0.7321 | 1 | 2.269 | 3.601 | 3.888 | 4.139 | 5.238 | 3.679 | 2.6813 | 5.425 | -0.25613 | 0.7563 | 0.3319 |
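This works because dataliner transformers follow scikit-learn's fit/transform contract: statistics are learned once from the training data in fit and then reused in transform, including on unseen test data. A toy illustration with a hypothetical scaler (not dataliner code):

```python
class SimpleScaler:
    """Toy standard scaler following the scikit-learn fit/transform contract."""

    def fit(self, X, y=None):
        # Learn statistics from the training data only.
        n = len(X)
        self.mean_ = sum(X) / n
        var = sum((v - self.mean_) ** 2 for v in X) / n
        self.std_ = var ** 0.5
        return self

    def transform(self, X):
        # Reuse the training statistics, even on unseen (test) data.
        return [(v - self.mean_) / self.std_ for v in X]

scaler = SimpleScaler().fit([10.0, 20.0, 30.0])  # mean 20, std ~8.165
print(scaler.transform([20.0]))  # [0.0] - scaled with the *training* mean/std
print(scaler.transform([40.0]))  # a value never seen during training
```

Calling transform on X_test therefore never refits anything; it applies exactly what was learned from X.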

Conclusion

This completes the functions and preprocessing steps originally planned. Going forward, I will fix bugs as they are found and add new preprocessing steps as I think of them.
