[Python] [Updated to Ver1.3.1] I made DataLiner, a data preprocessing library for machine learning.

Introduction

I made DataLiner, a data preprocessing library for machine learning.

When building machine learning models, I kept using the same steps in the data processing / feature engineering stage, so I collected them into a library of preprocessing steps. Each class is compliant with scikit-learn's transformer API, so it can be used on its own with fit_transform or dropped into a Pipeline. There are still functions and preprocessing steps that have not made it in yet, so I plan to update it regularly; bug reports, fixes, and pull requests for new functions and new preprocessing steps are all very much appreciated.

GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html

Installation

Install using pip. If you built your Python environment with Anaconda, run the following command in Anaconda Prompt.

pip install -U dataliner

Data preparation

Let's use everyone's favorite Titanic dataset as an example. Note that X must be a pandas.DataFrame and y a pandas.Series; passing any other type raises an error. Now, prepare the data to be processed.

import pandas as pd
import dataliner as dl

df = pd.read_csv('train.csv')
target_col = 'Survived'
X = df.drop(target_col, axis=1)
y = df[target_col]

The familiar Titanic data is now stored in X.
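A quick peek confirms what was loaded (the shape assumes the standard Kaggle train.csv):

print(X.shape)             # (891, 11)
print(X.columns.tolist())  # PassengerId, Pclass, Name, Sex, Age, ...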

How to use

Now let's try it out right away. First, DropHighCardinality, which automatically drops features with too many categories.

dhc = dl.DropHighCardinality()
dhc.fit_transform(X)

You can see that high-cardinality features such as Name and Ticket have been dropped. As an aside, for Titanic you would normally squeeze extra information out of these columns to improve accuracy; a quick sketch follows.
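For example, the honorific in Name ("Mr", "Mrs", "Miss", ...) is a classic engineered feature. A minimal plain-pandas sketch (ordinary feature engineering, not a DataLiner class):

# Names look like "Braund, Mr. Owen Harris"; grab the text
# between the comma and the first period as the title.
X_fe = X.copy()
X_fe['Title'] = X_fe['Name'].str.extract(r',\s*([^.]+)\.', expand=False)
print(X_fe['Title'].value_counts().head())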

Next, let's try the familiar target encoding. DataLiner's version applies smoothing, using the overall mean of y as the prior probability.

tme = dl.TargetMeanEncoding()
tme.fit_transform(X, y)

The categorical columns were recognized automatically, and each category was encoded using the target variable.
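To make the smoothing concrete, here is a minimal plain-pandas sketch of one common form of smoothed target mean encoding; the weight k is a hypothetical parameter for illustration, and the exact formula DataLiner uses may differ.

def smoothed_target_mean(X, y, col, k=10):
    # Blend each category's mean of y with the global mean; categories
    # with few rows are pulled toward the global mean, and k controls
    # how many rows a category needs before its own mean dominates.
    global_mean = y.mean()
    stats = y.groupby(X[col]).agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + k * global_mean) / (stats['count'] + k)
    return X[col].map(smoothed)

encoded_embarked = smoothed_target_mean(X, y, 'Embarked')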

Many data scientists also use Pipeline for efficiency. Of course, every DataLiner class can be used there in the same way.

from sklearn.pipeline import make_pipeline

process = make_pipeline(
    dl.DropNoVariance(),
    dl.DropHighCardinality(),
    dl.BinarizeNaN(),
    dl.ImputeNaN(),
    dl.OneHotEncoding(),
    dl.DropHighCorrelation(),
    dl.StandardScaling(),
    dl.DropLowAUC(),
)

process.fit_transform(X, y)

After all these processing steps, the result looks like this.
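Because each step is a scikit-learn transformer, you can also append an estimator as the final step and fit everything at once. A minimal sketch, with a plain LogisticRegression as the model (my choice for illustration, not something the article prescribes):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A few of the preprocessing steps from above, with a classifier at the end.
model = make_pipeline(
    dl.ImputeNaN(),
    dl.OneHotEncoding(),
    dl.StandardScaling(),
    LogisticRegression(),
)
model.fit(X, y)

After fitting, calling model.predict on held-out data runs the same preprocessing before predicting, as with any scikit-learn pipeline.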

For Titanic, there is a test.csv held out in advance for evaluation, so let's read it in and apply the same processing.

X_test = pd.read_csv('test.csv')
process.transform(X_test)


That's it.

What is included

At the moment the list is as follows. I would like to expand the functions and preprocessing steps in future updates.

**5/3 postscript:** I wrote an introductory article for each class:

- Try processing Titanic data with the preprocessing library DataLiner (Drop)
- Try processing Titanic data with the preprocessing library DataLiner (Encoding)
- Try processing Titanic data with the preprocessing library DataLiner (Conversion)
- Try processing Titanic data with the preprocessing library DataLiner (Append)

- **BinarizeNaN** -- Finds columns that contain missing values and creates a new feature indicating whether each row was missing.
- **ClipData** -- Clips numerical data at the q-th percentile, replacing values above and below the limits with the limits themselves.
- **CountRowNaN** -- Creates a new feature that is the row-wise count of missing values for each record.
- **DropColumns** -- Drops the specified columns.
- **DropHighCardinality** -- Drops columns with a large number of categories.
- **DropHighCorrelation** -- Removes features whose Pearson correlation coefficient exceeds a threshold; of each correlated pair, the feature more correlated with the target variable is kept.
- **DropLowAUC** -- Fits a univariate logistic regression of y on each feature and drops features whose AUC falls below a threshold.
- **DropNoVariance** -- Drops features that contain only a single value.
- **GroupRareCategory** -- Groups the rarely occurring categories in a categorical column.
- **ImputeNaN** -- Imputes missing values. By default, numerical data is imputed with the mean and categorical variables with the mode.
- **OneHotEncoding** -- Turns categorical variables into dummy variables.
- **TargetMeanEncoding** -- Replaces each category of a categorical variable with a smoothed mean of the target variable.
- **StandardScaling** -- Scales numerical data to mean 0 and variance 1.
- **MinMaxScaling** -- Scales numerical data to the range 0 to 1.
- **CountEncoding** -- Replaces category values with their number of occurrences.
- **RankedCountEncoding** -- Ranks categories by their number of occurrences and replaces each category with its rank. This is effective when multiple categories appear the same number of times.
- **FrequencyEncoding** -- Replaces category values with their frequency of occurrence. A ranked version would be identical to RankedCountEncoding, so it is not provided.
- **RankedTargetMeanEncoding** -- Ranks categories by the mean of the target variable within each category and replaces each category with its rank.
- **AppendAnomalyScore** -- Adds an anomaly score from Isolation Forest as a feature.
- **AppendCluster** -- Runs KMeans and adds the resulting cluster as a feature. Scaling the data first is recommended.
- **AppendClusterDistance** -- Runs KMeans and adds the distance to each cluster as features. Scaling the data first is recommended.
- **AppendPrincipalComponent** -- Runs principal component analysis and adds the principal components as features. Scaling the data first is recommended.
- **AppendArithmeticFeatures** -- Applies the four basic arithmetic operations to the features in the data and adds a new feature when it scores higher on the evaluation metric than the features used to build it. (ver1.2.0)
- **RankedEvaluationMetricEncoding** -- Dummy-encodes each category, fits a logistic regression between each category column and the target, ranks the categories by the resulting metric (AUC by default), and encodes the original categories with that rank. (ver1.2.0)
- **AppendClassificationModel** -- Trains a classifier on the input and adds the predicted label or score as a feature. (ver1.2.0)
- **AppendEncoder** -- Adds the output of each DataLiner encoder as a new feature instead of replacing the original column. (ver1.2.0)
- **AppendClusterTargetMean** -- After clustering, adds the mean of the target variable within each cluster as a feature. (ver1.2.0)
- **PermutationImportanceTest** -- Performs feature selection based on how much the evaluation metric of the model's predictions deteriorates when the values of a given feature are randomly shuffled. (ver1.2.0)
- **UnionAppend** -- Runs DataLiner's Append classes in parallel instead of in series and joins the resulting features onto the original features. The classes must be given as a list (see the sketch after this list). (ver1.3.1)
- **load_titanic** -- Loads the Titanic data. (ver1.3.1)
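As a rough usage sketch for UnionAppend, based only on the description above (the exact signature is an assumption, so check the documentation):

# Run two Append transformers in parallel and join both outputs
# onto the original features (assumed usage, per the description).
union = dl.UnionAppend([
    dl.AppendCluster(),
    dl.AppendAnomalyScore(),
])
X_augmented = union.fit_transform(X)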

In closing

I put this library together from the processing steps I found myself repeating over and over, figuring that other people may have similar needs, and released it. Out of the countless preprocessing techniques out there, I hope the main ones will come together into one cohesive library.

Once again, bug reports, fixes, and pull requests for new functions and new preprocessing steps are very welcome!
