[PYTHON] Processing Titanic data with the preprocessing library DataLiner (Append)

Introduction

This is the fourth article in a series introducing each group of preprocessing steps in DataLiner, a preprocessing library for Python. This time I introduce the Append family, which adds new features to the data. With this article, every preprocessing step currently implemented has been covered.
I plan to release Ver1.2 with some additional preprocessing steps after GW (Golden Week), and I will write another introductory article at that time.

Release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html

Installation

! pip install -U dataliner

Data preparation

Prepare Titanic data as usual.

import pandas as pd
import dataliner as dl
from sklearn.pipeline import make_pipeline  # used to chain the preprocessing steps below

df = pd.read_csv('train.csv')
target_col = 'Survived'

X = df.drop(target_col, axis=1)
y = df[target_col]

PassengerId  Pclass  Name                                                 Sex     Age  SibSp  Parch  Ticket            Fare    Cabin  Embarked
1            3       Braund, Mr. Owen Harris                              male    22   1      0      A/5 21171         7.250   NaN    S
2            1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38   1      0      PC 17599          71.283  C85    C
3            3       Heikkinen, Miss. Laina                               female  26   0      0      STON/O2. 3101282  7.925   NaN    S
4            1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35   1      0      113803            53.100  C123   S
5            3       Allen, Mr. William Henry                             male    35   0      0      373450            8.050   NaN    S

AppendAnomalyScore

An Isolation Forest is trained on the data, and each row's anomaly score is appended as a new feature. Missing values must be imputed and categorical variables encoded before use.

trans = dl.AppendAnomalyScore()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    trans
)
process.fit_transform(X, y)

PassengerId  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare    Cabin  Embarked  Anomaly_Score
1            3       640   2    22   1      0      141     7.250   144    3         0.04805
2            1       554   1    38   1      0      351     71.283  101    1         -0.06340
3            3       717   1    26   0      0      278     7.925   144    3         0.04050
4            1       803   1    35   1      0      92      53.100  33     3         -0.04854
5            3       602   2    35   0      0      113     8.050   144    3         0.06903
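
To make the idea concrete, here is a minimal sketch of what appending an anomaly score amounts to, written directly against scikit-learn rather than DataLiner. It assumes the score corresponds to IsolationForest's decision_function (the mix of positive and negative values in the output above is consistent with that, but this article does not confirm it). The synthetic data and the column names a, b, c are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic numeric data standing in for an imputed, encoded feature table
# (hypothetical; not the Titanic data).
rng = np.random.default_rng(0)
X_num = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
X_num.iloc[0] = [8.0, 8.0, 8.0]  # plant one obvious outlier in row 0

# Fit the forest, then append decision_function as a new column.
# Lower (more negative) values mean the row is more anomalous.
forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(X_num)

scored = X_num.copy()
scored["Anomaly_Score"] = forest.decision_function(X_num)

print(scored.sort_values("Anomaly_Score").head())
```

Isolation Forest cannot handle NaN or string columns, which is why imputation and encoding have to come first in the pipeline.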

AppendCluster

The data is clustered with k-means++, and the number of the cluster each row belongs to is appended as a new feature. Missing values must be imputed and categorical variables encoded before use. Scaling is also recommended.

trans = dl.AppendCluster()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    dl.StandardScaling(),
    trans
)
process.fit_transform(X, y)

PassengerId  Pclass   Name     Sex      Age      SibSp    Parch    Ticket   Fare     Cabin    Embarked  Cluster_Number
-1.729       0.8269   0.7538   0.7373   -0.5921  0.4326   -0.4734  -0.8129  -0.5022  0.4561   0.5856    5
-1.725       -1.5652  0.4197   -1.3548  0.6384   0.4326   -0.4734  0.1102   0.7864   -0.6156  -1.9412   2
-1.721       0.8269   1.0530   -1.3548  -0.2845  -0.4743  -0.4734  -0.2107  -0.4886  0.4561   0.5856    4
-1.717       -1.5652  1.3872   -1.3548  0.4077   0.4326   -0.4734  -1.0282  0.4205   -2.3103  0.5856    0
-1.714       0.8269   0.6062   0.7373   0.4077   -0.4743  -0.4734  -0.9359  -0.4861  0.4561   0.5856    5
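
As a rough illustration of the same idea outside DataLiner, the sketch below clusters a small synthetic table with scikit-learn's KMeans (using k-means++ initialization) and appends the cluster label as a column. The two-blob data, the feature names f0 and f1, and the choice of two clusters are assumptions made for the example; they are not DataLiner's settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (hypothetical data for illustration).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=-5.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X_num = pd.DataFrame(np.vstack([blob_a, blob_b]), columns=["f0", "f1"])

# Cluster with k-means++ initialization and append each row's cluster label.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
clustered = X_num.copy()
clustered["Cluster_Number"] = kmeans.fit_predict(X_num)

print(clustered["Cluster_Number"].value_counts())
```

Because k-means relies on Euclidean distance, a feature with a large numeric range would dominate the clustering, which is the reason scaling is recommended beforehand.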

AppendClusterDistance

The data is clustered with k-means++, and the distance from each row to each cluster center is appended as a set of new features, one per cluster. Missing values must be imputed and categorical variables encoded before use. Scaling is also recommended.

trans = dl.AppendClusterDistance()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    dl.StandardScaling(),
    trans
)
process.fit_transform(X, y)

PassengerId  Pclass   Name     Sex      Age      SibSp    Parch    Ticket   Fare     Cabin    Embarked  Cluster_Distance_0  Cluster_Distance_1  Cluster_Distance_2  Cluster_Distance_3  Cluster_Distance_4  Cluster_Distance_5  Cluster_Distance_6  Cluster_Distance_7
-1.729       0.8269   0.7538   0.7373   -0.5921  0.4326   -0.4734  -0.8129  -0.5022  0.4561   0.5856    4.580               2.794               3.633               4.188               3.072               2.363               4.852               5.636
-1.725       -1.5652  0.4197   -1.3548  0.6384   0.4326   -0.4734  0.1102   0.7864   -0.6156  -1.9412   3.434               4.637               3.374               4.852               3.675               4.619               6.044               3.965
-1.721       0.8269   1.0530   -1.3548  -0.2845  -0.4743  -0.4734  -0.2107  -0.4886  0.4561   0.5856    4.510               3.410               3.859               3.906               2.207               2.929               5.459               5.608
-1.717       -1.5652  1.3872   -1.3548  0.4077   0.4326   -0.4734  -1.0282  0.4205   -2.3103  0.5856    2.604               5.312               4.063               5.250               4.322               4.842               6.495               4.479
-1.714       0.8269   0.6062   0.7373   0.4077   -0.4743  -0.4734  -0.9359  -0.4861  0.4561   0.5856    4.482               2.632               3.168               4.262               3.097               2.382               5.724               5.593
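
The distance features can be sketched with plain scikit-learn as well: KMeans.transform returns, for each row, the distance to every cluster center. The example below is only an approximation of the idea under stated assumptions (random synthetic data, hypothetical feature names f0 to f3, three clusters); it does not reproduce DataLiner's own implementation.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic numeric table (hypothetical data for illustration).
rng = np.random.default_rng(0)
X_num = pd.DataFrame(rng.normal(size=(150, 4)), columns=["f0", "f1", "f2", "f3"])

# Scale first so that every feature contributes comparably to the distances.
scaled = StandardScaler().fit_transform(X_num)

kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(scaled)

# transform() gives an (n_samples, n_clusters) array of distances to the centers.
distances = kmeans.transform(scaled)

with_dist = X_num.copy()
for i in range(distances.shape[1]):
    with_dist[f"Cluster_Distance_{i}"] = distances[:, i]

print(with_dist.head())
```

Compared with a single cluster label, the distances keep information about how close a row is to the clusters it was not assigned to.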

AppendPrincipalComponent

Principal component analysis is performed on the data, and the principal components are appended as new features. Missing values must be imputed and categorical variables encoded before use. Scaling is also recommended.

trans = dl.AppendPrincipalComponent()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    dl.StandardScaling(),
    trans
)
process.fit_transform(X, y)

PassengerId  Pclass   Name     Sex      Age      SibSp    Parch    Ticket   Fare     Cabin    Embarked  Principal_Component_0  Principal_Component_1  Principal_Component_2  Principal_Component_3  Principal_Component_4
-1.729       0.8269   0.7538   0.7373   -0.5921  0.4326   -0.4734  -0.8129  -0.5022  0.4561   0.5856    -1.0239                0.1683                 0.2723                 -0.7951                -1.839
-1.725       -1.5652  0.4197   -1.3548  0.6384   0.4326   -0.4734  0.1102   0.7864   -0.6156  -1.9412   2.2205                 0.1572                 1.3115                 -0.9589                -1.246
-1.721       0.8269   1.0530   -1.3548  -0.2845  -0.4743  -0.4734  -0.2107  -0.4886  0.4561   0.5856    -0.6973                0.2542                 0.6843                 -0.5943                -1.782
-1.717       -1.5652  1.3872   -1.3548  0.4077   0.4326   -0.4734  -1.0282  0.4205   -2.3103  0.5856    2.7334                 0.2536                 -0.2722                -1.5439                -1.530
-1.714       0.8269   0.6062   0.7373   0.4077   -0.4743  -0.4734  -0.9359  -0.4861  0.4561   0.5856    -0.7770                -0.7732                0.2852                 -0.9750                -1.641
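
For readers who want to see the mechanics, the following sketch appends principal components using scikit-learn's PCA directly. The synthetic data, the feature names f0 to f2, and the choice of two components are assumptions for illustration and are not taken from DataLiner.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic table in which f1 is almost a multiple of f0 (hypothetical data).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_num = pd.DataFrame({
    "f0": base,
    "f1": 2.0 * base + rng.normal(scale=0.1, size=200),
    "f2": rng.normal(size=200),
})

# Standardize so that no feature dominates the components through its scale alone.
scaled = StandardScaler().fit_transform(X_num)

# Project onto the first two principal components and append them as columns.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

with_pcs = X_num.copy()
for i in range(components.shape[1]):
    with_pcs[f"Principal_Component_{i}"] = components[:, i]

print(pca.explained_variance_ratio_)
```

PCA is sensitive to feature variance, so without scaling a column such as Fare would tend to dominate the first component.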

Conclusion

This article introduced DataLiner's Append preprocessing steps. I plan to write further introductory articles as new features are added in future DataLiner updates.

DataLiner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/
