[PYTHON] Try to process Titanic data with preprocessing library DataLiner (conversion)

Introduction

This is the third article in a series that introduces each preprocessing step of DataLiner, a preprocessing library for Python. This time I would like to introduce the conversion processes.

Release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37

Installation

! pip install -U dataliner

Data preparation

Prepare Titanic data as usual.

import pandas as pd
import dataliner as dl

df = pd.read_csv('train.csv')
target_col = 'Survived'

X = df.drop(target_col, axis=1)
y = df[target_col]

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | NaN | S |
| 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C |
| 3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S |
| 5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | NaN | S |

StandardScaling / StandardizeData (deprecated)

Standardizes the data to mean 0 and variance 1. Unlike libraries such as scikit-learn, it automatically detects the numeric columns and scales only those, even when categorical columns are present, and it returns a pandas DataFrame, which makes subsequent processing easy. StandardizeData has been renamed to StandardScaling; the old name still works but issues a deprecation warning, and it will be removed in ver. 1.3.0.

trans = dl.StandardScaling() 
Xt = trans.fit_transform(X)

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| -1.729 | 0.8269 | Braund, Mr. Owen Harris | male | -0.5300 | 0.4326 | -0.4734 | A/5 21171 | -0.5022 | NaN | S |
| -1.725 | -1.5652 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 0.5714 | 0.4326 | -0.4734 | PC 17599 | 0.7864 | C85 | C |
| -1.721 | 0.8269 | Heikkinen, Miss. Laina | female | -0.2546 | -0.4743 | -0.4734 | STON/O2. 3101282 | -0.4886 | NaN | S |
| -1.717 | -1.5652 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 0.3649 | 0.4326 | -0.4734 | 113803 | 0.4205 | C123 | S |
| -1.714 | 0.8269 | Allen, Mr. William Henry | male | 0.3649 | -0.4743 | -0.4734 | 373450 | -0.4861 | NaN | S |
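
As a rough check of the arithmetic, the standardization can be reproduced with the standard library alone. This is only a sketch, not DataLiner's implementation: it assumes that PassengerId runs from 1 to 891 (as in the usual Kaggle train.csv) and that the sample standard deviation (ddof=1, the pandas default) is used. Under those assumptions it gives the same PassengerId values as the table above.

```python
from statistics import mean, stdev


def standard_scale(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample std (ddof=1); an assumption, not confirmed by the library docs
    return [(x - mu) / sigma for x in values]


# Assumption: PassengerId is simply 1..891 in train.csv.
passenger_ids = list(range(1, 892))
scaled = standard_scale(passenger_ids)

print([round(z, 3) for z in scaled[:5]])
# → [-1.729, -1.725, -1.721, -1.717, -1.714]
```

Using the population standard deviation instead would give -1.730 for the first row, so the sample version appears to be the one that matches the output shown here.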

MinMaxScaling

Scales the data so that each numeric column falls between 0 and 1.

trans = dl.MinMaxScaling() 
Xt = trans.fit_transform(X)

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.000000 | 1 | Braund, Mr. Owen Harris | male | 0.2712 | 0.125 | 0 | A/5 21171 | 0.01415 | NaN | S |
| 0.001124 | 0 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 0.4722 | 0.125 | 0 | PC 17599 | 0.13914 | C85 | C |
| 0.002247 | 1 | Heikkinen, Miss. Laina | female | 0.3214 | 0.000 | 0 | STON/O2. 3101282 | 0.01547 | NaN | S |
| 0.003371 | 0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 0.4345 | 0.125 | 0 | 113803 | 0.10364 | C123 | S |
| 0.004494 | 1 | Allen, Mr. William Henry | male | 0.4345 | 0.000 | 0 | 373450 | 0.01571 | NaN | S |
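
The min-max formula itself is (x - min) / (max - min). The following sketch applies it with plain Python, again assuming that PassengerId runs from 1 to 891; it is meant to illustrate the calculation rather than to mirror DataLiner's code. The helper name min_max_scale is made up for this example.

```python
def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    # Note: a constant column (hi == lo) would divide by zero; not handled in this sketch.
    return [(x - lo) / (hi - lo) for x in values]


# Assumption: PassengerId is simply 1..891 in train.csv.
passenger_ids = list(range(1, 892))
scaled = min_max_scale(passenger_ids)

print([round(v, 6) for v in scaled[:5]])
# → [0.0, 0.001124, 0.002247, 0.003371, 0.004494]
```

The same formula explains the Pclass column: with classes 1 to 3, class 3 maps to (3 - 1) / (3 - 1) = 1 and class 1 maps to 0.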

BinarizeNaN

Finds the columns that contain missing values and, for each of them, adds a new binary column that indicates whether the value was missing in that row.

trans = dl.BinarizeNaN() 
Xt = trans.fit_transform(X)

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_NaNFlag | Cabin_NaNFlag | Embarked_NaNFlag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | NaN | S | 0 | 1 | 0 |
| 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C | 0 | 0 | 0 |
| 3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 | 1 | 0 |
| 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S | 0 | 0 | 0 |
| 5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | NaN | S | 0 | 1 | 0 |
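
To make the idea concrete, here is a minimal standard-library sketch of the same kind of flagging on three rows taken from the table above. It is not DataLiner's code; the helper names are made up, and missing values are represented as None for simplicity. Because only Cabin has a missing value in this small subset, only Cabin_NaNFlag is created, whereas the full dataset also yields flags for Age and Embarked.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def binarize_nan(rows):
    """Add a <column>_NaNFlag entry (1 = missing) for every column that has a missing value."""
    columns = list(rows[0])
    nan_columns = [c for c in columns if any(is_missing(r[c]) for r in rows)]
    return [
        {**r, **{f"{c}_NaNFlag": int(is_missing(r[c])) for c in nan_columns}}
        for r in rows
    ]


rows = [
    {"Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"Age": 38.0, "Cabin": "C85", "Embarked": "C"},
    {"Age": 26.0, "Cabin": None, "Embarked": "S"},
]
flagged = binarize_nan(rows)

print(flagged[0])
# → {'Age': 22.0, 'Cabin': None, 'Embarked': 'S', 'Cabin_NaNFlag': 1}
```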

CountRowNaN

Counts the missing values in each row (data point) and adds the count as a new feature.

trans = dl.CountRowNaN() 
Xt = trans.fit_transform(X)

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | NaN_Totals |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | NaN | S | 1 |
| 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C | 0 |
| 3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 |
| 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S | 0 |
| 5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | NaN | S | 1 |
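
The per-row count can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (missing values stored as None, hypothetical helper names), not the library's implementation; the NaN_Totals key follows the column name in the output above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_row_nan(rows):
    """Add a NaN_Totals entry holding the number of missing values in each row."""
    return [
        {**r, "NaN_Totals": sum(is_missing(v) for v in r.values())}
        for r in rows
    ]


rows = [
    {"Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"Age": 38.0, "Cabin": "C85", "Embarked": "C"},
]
counted = count_row_nan(rows)

print([r["NaN_Totals"] for r in counted])
# → [1, 0]
```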

ImputeNaN

Imputes missing values. By default, numeric columns are filled with the mean and categorical columns with the mode. The strategies can be changed with the num_strategy and cat_strategy arguments.

trans = dl.ImputeNaN() 
Xt = trans.fit_transform(X)

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | B96 B98 | S |
| 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C |
| 3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | B96 B98 | S |
| 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S |
| 5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | B96 B98 | S |
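
The default behaviour (mean for numeric columns, mode for categorical columns) can be illustrated on toy data with the standard library. The impute helper and its strategy argument are hypothetical names for this sketch and are not DataLiner's API; the values are made up, and missing entries are represented as None.

```python
from statistics import mean, mode


def impute(values, strategy):
    """Fill None entries with the mean or the mode of the observed values."""
    present = [v for v in values if v is not None]
    fill = mean(present) if strategy == "mean" else mode(present)
    return [fill if v is None else v for v in values]


# Toy data for illustration only (not taken from train.csv).
ages = [22.0, None, 26.0, 36.0]
cabins = ["B96 B98", None, "C85", "B96 B98", None]

print(impute(ages, "mean"))    # → [22.0, 28.0, 26.0, 36.0]
print(impute(cabins, "mode"))  # → ['B96 B98', 'B96 B98', 'C85', 'B96 B98', 'B96 B98']
```

In the table above, the missing Cabin values become "B96 B98", which is what filling with the most frequent category would produce.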

ClipData

Clips the data at quantiles: values below the lower quantile are replaced with the lower bound, and values above the upper quantile are replaced with the upper bound. You can adjust how much is clipped with the threshold argument, which defaults to the 1% and 99% quantiles.

trans = dl.ClipData() 
Xt = trans.fit_transform(X)

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 9.9 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | NaN | S |
| 9.9 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C |
| 9.9 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 9.9 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S |
| 9.9 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | NaN | S |
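
The value 9.9 in the PassengerId column is the 1% quantile of the IDs. The sketch below reproduces it with statistics.quantiles, assuming that PassengerId runs from 1 to 891 and that quantiles are computed by linear interpolation between order statistics (the "inclusive" method, which corresponds to the pandas default). The clip helper and its lower_pct and upper_pct parameters are hypothetical names, not DataLiner's interface.

```python
from statistics import quantiles


def clip(values, lower_pct=1, upper_pct=99):
    """Clamp values to the [lower_pct, upper_pct] percentile range."""
    # cuts[k - 1] is the k-th percentile (99 cut points for n=100).
    cuts = quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(x, lo), hi) for x in values]


# Assumption: PassengerId is simply 1..891 in train.csv.
passenger_ids = list(range(1, 892))
clipped = clip(passenger_ids)

print(clipped[:5], clipped[-1])
# → [9.9, 9.9, 9.9, 9.9, 9.9] 882.1
```

Values that already lie inside the range, such as the Age and Fare values of the first five passengers, are left unchanged, which is why only PassengerId differs from the original data in the table.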

GroupRareCategory

Replaces the infrequent categories of categorical variables with the single string "RareCategory", which helps reduce cardinality. It is effective to apply it before OneHotEncoding. The threshold argument sets the share of the data at or below which a category is replaced; the default is 1%.

trans = dl.GroupRareCategory() 
Xt = trans.fit_transform(X)
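
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | RareCategory | male | 22 | 1 | 0 | RareCategory | 7.250 | NaN | S |
| 2 | 1 | RareCategory | female | 38 | 1 | 0 | RareCategory | 71.283 | RareCategory | C |
| 3 | 3 | RareCategory | female | 26 | 0 | 0 | RareCategory | 7.925 | NaN | S |
| 4 | 1 | RareCategory | female | 35 | 1 | 0 | RareCategory | 53.100 | RareCategory | S |
| 5 | 3 | RareCategory | male | 35 | 0 | 0 | RareCategory | 8.050 | NaN | S |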
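
A minimal sketch of this kind of grouping, using collections.Counter on made-up data, might look as follows. The group_rare helper is hypothetical; treating the 1% boundary as inclusive and leaving missing values (None) untouched are assumptions based on the description and the output above, not on DataLiner's source.

```python
from collections import Counter


def group_rare(values, threshold=0.01, label="RareCategory"):
    """Replace categories whose share of the rows is at or below threshold."""
    counts = Counter(v for v in values if v is not None)
    cutoff = threshold * len(values)
    return [
        v if v is None or counts[v] > cutoff else label
        for v in values
    ]


# Toy column with 200 rows: "X" appears once (0.5%), so it counts as rare.
column = ["S"] * 150 + ["C"] * 49 + ["X"]
grouped = group_rare(column)

print(Counter(grouped))
# → Counter({'S': 150, 'C': 49, 'RareCategory': 1})
```

This also shows why every Name and Ticket in the table becomes "RareCategory": nearly all of those values occur only once or a handful of times, far below 1% of the rows.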

Conclusion

This time I introduced the conversion processes of DataLiner. Next time I will introduce the Append processes, which will be the last of the preprocessing steps implemented at the moment (ver. 1.1.6).

DataLiner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/
