I would like to share what I have learned, adding corrections and supplementary notes where appropriate.
This article focuses on PyCaret's **preprocessing** and basically does not cover modeling or tuning.
I wrote it while actually running PyCaret and reading the original source code. https://github.com/pycaret/pycaret
It is assumed that various libraries are imported as follows.
import pandas as pd
import numpy as np
PyCaret is a low-code library that automates data preprocessing and machine learning model training. https://pycaret.org/
Installation is a single pip command, which is very easy.
pip install pycaret
You can refer to this article for an overview and how to implement a series of pipelines. https://qiita.com/tani_AI_Academy/items/62151d7e151024733919
In PyCaret, the preprocessing you want to run is specified with parameters. In addition, PyCaret asks the user to confirm some of the processing before it runs. The overall flow is as follows.
Calling `setup()` from the package provided for each task (classification, regression, and so on) executes the preprocessing described below.
**The preprocessing you want PyCaret to perform is specified by passing arguments to `setup()`**.
The only required argument is "target" (the target variable).
In the following explanation, I use the sample datasets bundled with PyCaret. You can check the bundled datasets on the official page. https://pycaret.org/get-data/
The code to load the data and run preprocessing is as follows. Here, only the argument "target" is specified; the other options are left at their defaults.
from pycaret.datasets import get_data
dataset = get_data("diamond")
from pycaret.regression import *
setup(dataset, target="Price")
When you run `setup()`, **PyCaret first infers the data type of each variable and prompts the user to confirm the inference result before continuing**.
If the inferred types are correct, press the Enter key in the edit box (blue frame in the figure) to continue.
If an inferred type is incorrect, you can abort the process by typing "quit".
Variables whose types were inferred incorrectly can be handled by explicitly specifying their types in `setup()`.
(For details, see the "Numeric Features, Categorical Features" section below.)
When `setup()` finishes, a summary of the processing is output in data frame format.
 | Description | Value |
---|---|---|
0 | session_id | 3104 |
1 | Transform Target | False |
2 | Transform Target Method | None |
3 | Original Data | (6000, 8) |
4 | Missing Values | False |
5 | Numeric Features | 1 |
6 | Categorical Features | 6 |
7 | Ordinal Features | False |
8 | High Cardinality Features | False |
9 | High Cardinality Method | None |
10 | Sampled Data | (6000, 8) |
11 | Transformed Train Set | (4199, 28) |
12 | Transformed Test Set | (1801, 28) |
13 | Numeric Imputer | mean |
14 | Categorical Imputer | constant |
15 | Normalize | False |
16 | Normalize Method | None |
17 | Transformation | False |
18 | Transformation Method | None |
19 | PCA | False |
20 | PCA Method | None |
21 | PCA Components | None |
22 | Ignore Low Variance | False |
23 | Combine Rare Levels | False |
24 | Rare Level Threshold | None |
25 | Numeric Binning | False |
26 | Remove Outliers | False |
27 | Outliers Threshold | None |
28 | Remove Multicollinearity | False |
29 | Multicollinearity Threshold | None |
30 | Clustering | False |
31 | Clustering Iteration | None |
32 | Polynomial Features | False |
33 | Polynomial Degree | None |
34 | Trignometry Features | False |
35 | Polynomial Threshold | None |
36 | Group Features | False |
37 | Feature Selection | False |
38 | Features Selection Threshold | None |
39 | Feature Interaction | False |
40 | Feature Ratio | False |
41 | Interaction Threshold | None |
From this table, you can **check the data size, the number of features, and which preprocessing steps are enabled**. By default, most options are disabled (False or None).
If you enable an option via the arguments of `setup()`, the corresponding item becomes "True" and is highlighted in color.
In the following sections, we will explain the contents of various items.
session_id
 | Description | Value |
---|---|---|
0 | session_id | 3104 |
This is an identifier for the PyCaret run, and it appears to be used internally as the random seed. If not specified, it is chosen randomly.
It can be specified with the "session_id" argument of `setup()`.
Specify this value to keep results reproducible across repeated runs.
(It is roughly analogous to "random_state" in scikit-learn.)
setup(dataset, target="Price", session_id=123)
Original Data
 | Description | Value |
---|---|---|
3 | Original Data | (6000, 8) |
The size (shape) of the input data is output.
Checking it directly, it is indeed the same size.
dataset.shape
#Execution result
# (6000, 8)
Missing Values
 | Description | Value |
---|---|---|
4 | Missing Values | False |
Whether the input data contains missing values is output. Since this dataset has no missing values, "False" is output.
If there are missing values, this item becomes "True".
If there are missing values, they are imputed inside **setup()**.
How to specify the imputation method is described later.
 | Description | Value |
---|---|---|
5 | Numeric Features | 1 |
6 | Categorical Features | 6 |
The estimated numbers of numeric (continuous) features and categorical features are output, respectively.
They can be specified explicitly with the "numeric_features" and "categorical_features" arguments of `setup()`.
setup(dataset, target="Price",
categorical_features=["Cut", "Color", "Clarity", "Polish", "Symmetry", "Report"],
numeric_features=["Carat Weight"])
**If the type of a variable is inferred incorrectly in the type-inference confirmation dialog mentioned above, specify it explicitly with these arguments.**
Transformed Train Set, Transformed Test Set
 | Description | Value |
---|---|---|
11 | Transformed Train Set | (4199, 28) |
12 | Transformed Test Set | (1801, 28) |
The sizes of the train and test sets after splitting are output.
The train/test split ratio can be specified with the "train_size" argument of `setup()`.
The default is 0.7.
The number of columns differs from the input data because the number of features after preprocessing is shown. (In this run, preprocessing increased the number of features from 7 to 28.)
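For example, a minimal sketch of changing the split ratio (same diamond setup as above, using the "train_size" argument just described):

# Use 80% of the rows for training instead of the default 70%
setup(dataset, target="Price", train_size=0.8)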
Sampled Data
 | Description | Value |
---|---|---|
10 | Sampled Data | (6000, 8) |
When the data is sampled in `setup()`, the number of rows after sampling is output. **PyCaret encourages you to sample the data and run the rest of the workflow on the sample when the data has more than 25,000 rows.**
If you run `setup()` on data with more than 25,000 rows, a sampling confirmation dialog is shown after the type-inference confirmation dialog.
To sample, enter the percentage of data to sample in the edit box (blue frame).
To use all rows without sampling, leave the box blank and press the Enter key.
(For regression tasks)
(For classification tasks)
The plot shown here gives an indication of how accuracy degrades with sampling.
The model used for this plot can be specified with the "sample_estimator" argument of `setup()`.
For example, the code to specify RandomForestRegressor is below.
from sklearn.ensemble import RandomForestRegressor
traffic = get_data("traffic")
setup(traffic, target="traffic_volume", sample_estimator=RandomForestRegressor())
This behavior itself can be turned off with the "sampling" argument of `setup()`.
(In that case, no sampling confirmation is shown and processing continues with all the data.)
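A minimal sketch of disabling sampling for the traffic data used above, assuming the "sampling" argument described in the text:

# Skip the sampling dialog and always use all rows
setup(traffic, target="traffic_volume", sampling=False)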
The remaining items describe whether various data cleaning and feature transformation steps are enabled and which methods are used. The following sections explain each of these steps, what they do, and how to specify them.
`setup()` returns the preprocessed data and the preprocessing pipeline. The exact return values appear to depend on the type of task.
regression
X, y, X_train, X_test, y_train, y_test, seed, prep_pipe, target_inverse_transformer, experiment__ \
= setup(dataset, target="Price")
classification
from pycaret.classification import *
dataset = get_data('credit')
X, y, X_train, X_test, y_train, y_test, seed, prep_pipe, experiment__ \
= setup(dataset, target = 'default')
The return values differ slightly between regression and classification. **The preprocessed data is returned as X and y**, so you can inspect the concrete results of preprocessing.
Whether data preprocessed by PyCaret can be modified further and then fed back into PyCaret is currently unclear to me.
You can specify features to be excluded from preprocessing and subsequent modeling.
This can be done via an argument to `setup()` (see the sketch below).
**ID and date (datetime) columns appear to be excluded from modeling by default.** If a date column is not recognized as a date, you can apparently specify it explicitly with the "date_features" argument.
Also, although I am still confirming the exact specification, if two columns contain exactly the same data, one of them is automatically excluded.
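As far as I can tell from the source, the argument for this is `ignore_features`; a minimal sketch (the excluded column is just an example):

# Exclude the "Report" column from preprocessing and modeling
setup(dataset, target="Price", ignore_features=["Report"])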
Missing values are imputed using the specified method.
This can be done via arguments to `setup()` (see the sketch below).
At the moment the method cannot be specified per column; all columns appear to be handled with a single, uniform method.
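In the version I checked, the relevant arguments appear to be `numeric_imputation` and `categorical_imputation`; a minimal sketch:

# Fill numeric columns with the median and categorical columns with the mode
setup(dataset, target="Price", numeric_imputation="median", categorical_imputation="mode")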
Label encoding is performed by specifying the columns you want to treat as ordinal data.
This can be done via the "ordinal_features" argument of `setup()`.
It is specified as in the following example.
ordinal_features = { 'column_name' : ['low', 'medium', 'high'] }
In the value part of the dictionary, list the category values in ascending order.
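A minimal sketch of passing it to `setup()` (the column name and levels here are hypothetical):

# "Quality" and its levels are placeholders for illustration
setup(dataset, target="Price", ordinal_features={"Quality": ["low", "medium", "high"]})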
Each feature is normalized (scaled).
This can be done via arguments to `setup()` (see the sketch below).
You can refer to this article for 'robust' scaling. https://qiita.com/unhurried/items/7a79d2f3574fb1d0cc27
If the dataset contains outliers, 'robust' scaling is said to hold up well.
For other scaling methods, this article is helpful. https://qiita.com/Umaremin/items/fbbbc6df11f78532932d
In general, linear algorithms tend to be more accurate when features are normalized, but this is not always the case and may require several experiments.
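A minimal sketch using the `normalize` and `normalize_method` arguments:

# Apply robust scaling (median / IQR based), which is less sensitive to outliers
setup(dataset, target="Price", normalize=True, normalize_method="robust")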
For categorical variables, categories whose frequency is below the specified threshold are merged into a single category.
This can be done via arguments to `setup()` (see the sketch below).
In general, this technique avoids the very sparse matrices that arise when dummy-encoding a categorical variable that has a large number of categories.
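The arguments appear to be `combine_rare_levels` and `rare_level_threshold`; a minimal sketch:

# Merge categories that cover less than 10% of the rows into a single "rare" level
setup(dataset, target="Price", combine_rare_levels=True, rare_level_threshold=0.1)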
Numeric features are binned (discretized).
This can be done via an argument to `setup()` (see the sketch below).
Internally, this essentially runs sklearn.preprocessing.KBinsDiscretizer. (An algorithm based on one-dimensional k-means appears to be used.)
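The argument appears to be `bin_numeric_features`, which takes a list of column names; a minimal sketch:

# Discretize "Carat Weight" into bins (internally via KBinsDiscretizer)
setup(dataset, target="Price", bin_numeric_features=["Carat Weight"])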
Outliers are removed from the training data.
This can be done via arguments to `setup()` (see the sketch below).
Internally, singular value decomposition and PCA appear to be used.
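The arguments appear to be `remove_outliers` and `outliers_threshold`; a minimal sketch:

# Remove roughly the most extreme 5% of the training rows as outliers
setup(dataset, target="Price", remove_outliers=True, outliers_threshold=0.05)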
Features that can cause multicollinearity are removed.
This can be done via arguments to `setup()` (see the sketch below).
For multicollinearity, this article is helpful. https://qiita.com/ynakayama/items/36b7c1640e6a02ce2e00
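The arguments appear to be `remove_multicollinearity` and `multicollinearity_threshold`; a minimal sketch:

# Drop one feature from each pair whose correlation exceeds 0.9
setup(dataset, target="Price", remove_multicollinearity=True, multicollinearity_threshold=0.9)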
Clustering is performed on the features, and the cluster label of each record is added as a new feature.
This can be done via arguments to `setup()` (see the sketch below).
The number of clusters appears to be determined using a combination of the Calinski-Harabasz and silhouette criteria.
For more information on the Calinski-Harabasz and silhouette criteria, this article is helpful. https://qiita.com/yasaigirai/items/ec3c3aaaab5bc9b930a2
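The arguments appear to be `create_clusters` and `cluster_iter`; a minimal sketch:

# Append a cluster-label feature; iterate the cluster-count search 20 times
setup(dataset, target="Price", create_clusters=True, cluster_iter=20)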
Features whose variance is not statistically significant are removed.
This can be done via an argument to `setup()` (see the sketch below).
The variance here appears to be computed from the ratio of unique values across all samples; the intuition seems to be that the more often the same value appears in a variable, the lower its variance is considered, making it a candidate for removal.
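The argument appears to be `ignore_low_variance`; a minimal sketch:

# Drop categorical features whose values are almost all identical
setup(dataset, target="Price", ignore_low_variance=True)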
Polynomial and interaction features are generated using the specified parameters.
This can be enabled via arguments to `setup()` (see the sketch below).
For example, if the input is two variables [a, b] and polynomial_degree=2 is specified, the features [1, a, b, a^2, ab, b^2] are generated.
In addition to the above, you can also generate interaction features: first-order interactions are created for all numeric features, including the dummy-variable features derived from categorical variables and the features generated by polynomial_features and trigonometry_features.
Regarding polynomial_threshold and interaction_threshold: the metric compared against these thresholds appears to be an importance score based on a combination of methods such as Random Forest, AdaBoost, and linear correlation.
Regarding trigonometry_features, it presumably creates features using trigonometric functions (sin, cos, tan), as the name suggests.
Note that this function can be inefficient for datasets with a large feature space.
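A minimal sketch combining these arguments:

# Degree-2 polynomial features plus trigonometric, interaction, and ratio features
setup(dataset, target="Price",
      polynomial_features=True, polynomial_degree=2,
      trigonometry_features=True,
      feature_interaction=True, feature_ratio=True)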
By specifying groups of related features in the dataset, statistical features based on them are extracted: aggregate values computed across the specified features are added as new features.
This can be done via the "group_features" (and optionally "group_names") arguments of `setup()`.
An example looks like the following.
setup(dataset, target="Price", group_features=[["cal1", "cal2", "cal3"], ["cal4", "cal5"]], group_names=["gr1", "gr2"])
Features are selected using several evaluation metrics.
This can be done via arguments to `setup()` (see the sketch below).
Regarding feature_selection_threshold: the metric compared against the threshold appears to be an importance score based on a combination of methods such as Random Forest, AdaBoost, and linear correlation.
According to comments in the source, this parameter should be set to a low value when using polynomial_features and feature_interaction; the idea seems to be that the features created by interactions should be narrowed down to some extent in this step.
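The arguments appear to be `feature_selection` and `feature_selection_threshold`; a minimal sketch:

# Keep only the features whose combined importance score clears the threshold
setup(dataset, target="Price", feature_selection=True, feature_selection_threshold=0.8)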
For columns specified as having high cardinality, the number of distinct values in the column is reduced, lowering the cardinality.
This can be done via arguments to `setup()` (see the sketch below).
A quick look at the original source suggests that the 'clustering' method uses k-means.
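The arguments appear to be `high_cardinality_features` and `high_cardinality_method`; a minimal sketch (the column named here is only an example):

# Compress the levels of a high-cardinality column based on frequency
setup(dataset, target="Price", high_cardinality_features=["Clarity"], high_cardinality_method="frequency")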
Features are transformed according to the specified method.
This can be done via arguments to `setup()` (see the sketch below).
Both 'yeo-johnson' and 'quantile' appear to transform the data so that it more closely follows a normal distribution.
Checking the original code, 'yeo-johnson' uses sklearn.preprocessing.PowerTransformer and 'quantile' uses sklearn.preprocessing.QuantileTransformer.
In general, bringing features closer to a normal distribution can help during modeling. According to comments in the source, 'quantile' is non-linear and may distort linear correlations between variables measured on the same scale.
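The arguments appear to be `transformation` and `transformation_method`; a minimal sketch:

# Apply a Yeo-Johnson power transform to make features more Gaussian-like
setup(dataset, target="Price", transformation=True, transformation_method="yeo-johnson")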
The target variable is transformed by the specified method.
This can be done via arguments to `setup()` (see the sketch below).
Bringing the target variable closer to a normal distribution can help during modeling.
The Box-Cox transform requires all values to be positive, so if the data contains negative values PyCaret appears to switch to the Yeo-Johnson transform automatically.
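The arguments appear to be `transform_target` and `transform_target_method`; a minimal sketch:

# Box-Cox transform of the target (switches to Yeo-Johnson if non-positive values exist)
setup(dataset, target="Price", transform_target=True, transform_target_method="box-cox")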
Dimensionality reduction is applied to the features.
This can be done via arguments to `setup()` (see the sketch below).
It is generally done to drop unimportant features and save memory and CPU resources.
This step (dimensionality reduction) appears to run at the end of the preprocessing pipeline, i.e. it is applied to the data after all other preprocessing has finished.
These articles are helpful for principal component analysis. https://qiita.com/shuva/items/9625bc326e2998f1fa27 https://qiita.com/NoriakiOshita/items/460247bb57c22973a5f0
For 'incremental', a method called Incremental PCA appears to be used. According to scikit-learn's documentation, if the dataset is too large to fit in memory it is better to use Incremental PCA (IPCA) instead of ordinary PCA; IPCA builds a low-dimensional approximation of the input using an amount of memory that does not depend on the number of input samples. https://scikit-learn.org/stable/auto_examples/decomposition/plot_incremental_pca.html
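The arguments appear to be `pca`, `pca_method`, and `pca_components`; a minimal sketch:

# Reduce the preprocessed features to 10 components with linear PCA
setup(dataset, target="Price", pca=True, pca_method="linear", pca_components=10)

Finally, the example below combines several of the options described above on the diamond dataset.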
from pycaret.regression import *
X, y, X_train, X_test, y_train, y_test, seed, prep_pipe, target_inverse_transformer, experiment__ \
= setup(dataset, target="Price", session_id=123,
bin_numeric_features = ["Carat Weight"],
create_clusters = True,
polynomial_features = True, feature_interaction = True, feature_ratio = True)
The output produced by `setup()` (excerpt) is shown in the figure below.
Checking the returned preprocessed data, 72 features were generated as shown below.
print(X.info())
#Output result
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 6000 entries, 0 to 5999
# Data columns (total 72 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 Carat Weight_Power2 6000 non-null float64
# 1 Cut_Fair 6000 non-null float64
# 2 Cut_Good 6000 non-null float64
# 3 Cut_Ideal 6000 non-null float64
# 4 Cut_Signature-Ideal 6000 non-null float64
# 5 Cut_Very Good 6000 non-null float64
# 6 Color_D 6000 non-null float64
# 7 Color_E 6000 non-null float64
# 8 Color_F 6000 non-null float64
# 9 Color_G 6000 non-null float64
# 10 Color_H 6000 non-null float64
# 11 Color_I 6000 non-null float64
# 12 Clarity_FL 6000 non-null float64
# 13 Clarity_IF 6000 non-null float64
# 14 Clarity_SI1 6000 non-null float64
# 15 Clarity_VS1 6000 non-null float64
# 16 Clarity_VS2 6000 non-null float64
# 17 Clarity_VVS1 6000 non-null float64
# 18 Clarity_VVS2 6000 non-null float64
# 19 Polish_EX 6000 non-null float64
# 20 Polish_G 6000 non-null float64
# 21 Polish_ID 6000 non-null float64
# 22 Polish_VG 6000 non-null float64
# 23 Symmetry_EX 6000 non-null float64
# 24 Symmetry_G 6000 non-null float64
# 25 Symmetry_ID 6000 non-null float64
# 26 Symmetry_VG 6000 non-null float64
# 27 Report_GIA 6000 non-null float64
# 28 Carat Weight_0.0 6000 non-null float64
# 29 Carat Weight_1.0 6000 non-null float64
# 30 Carat Weight_10.0 6000 non-null float64
# 31 Carat Weight_11.0 6000 non-null float64
# 32 Carat Weight_12.0 6000 non-null float64
# 33 Carat Weight_13.0 6000 non-null float64
# 34 Carat Weight_2.0 6000 non-null float64
# 35 Carat Weight_3.0 6000 non-null float64
# 36 Carat Weight_4.0 6000 non-null float64
# 37 Carat Weight_5.0 6000 non-null float64
# 38 Carat Weight_6.0 6000 non-null float64
# 39 Carat Weight_7.0 6000 non-null float64
# 40 Carat Weight_8.0 6000 non-null float64
# 41 Carat Weight_9.0 6000 non-null float64
# 42 data_cluster_0 6000 non-null float64
# 43 Polish_EX_multiply_Carat Weight_Power2 6000 non-null float64
# 44 Symmetry_EX_multiply_Carat Weight_Power2 6000 non-null float64
# 45 Report_GIA_multiply_Carat Weight_Power2 6000 non-null float64
# 46 Clarity_VVS2_multiply_Carat Weight_Power2 6000 non-null float64
# 47 Clarity_IF_multiply_Carat Weight_Power2 6000 non-null float64
# 48 Clarity_SI1_multiply_Carat Weight_Power2 6000 non-null float64
# 49 Carat Weight_Power2_multiply_data_cluster_0 6000 non-null float64
# 50 Symmetry_EX_multiply_data_cluster_0 6000 non-null float64
# 51 Report_GIA_multiply_data_cluster_0 6000 non-null float64
# 52 Symmetry_VG_multiply_Carat Weight_Power2 6000 non-null float64
# 53 Carat Weight_8.0_multiply_Carat Weight_Power2 6000 non-null float64
# 54 Cut_Signature-Ideal_multiply_Carat Weight_Power2 6000 non-null float64
# 55 data_cluster_0_multiply_Symmetry_EX 6000 non-null float64
# 56 Color_E_multiply_Carat Weight_Power2 6000 non-null float64
# 57 data_cluster_0_multiply_Cut_Ideal 6000 non-null float64
# 58 Carat Weight_Power2_multiply_Polish_EX 6000 non-null float64
# 59 data_cluster_0_multiply_Report_GIA 6000 non-null float64
# 60 Color_F_multiply_Carat Weight_Power2 6000 non-null float64
# 61 Carat Weight_Power2_multiply_Carat Weight_8.0 6000 non-null float64
# 62 Cut_Ideal_multiply_Carat Weight_Power2 6000 non-null float64
# 63 Color_D_multiply_Carat Weight_Power2 6000 non-null float64
# 64 data_cluster_0_multiply_Carat Weight_Power2 6000 non-null float64
# 65 data_cluster_0_multiply_Polish_EX 6000 non-null float64
# 66 Color_I_multiply_Carat Weight_Power2 6000 non-null float64
# 67 Polish_EX_multiply_data_cluster_0 6000 non-null float64
# 68 Color_H_multiply_Carat Weight_Power2 6000 non-null float64
# 69 Carat Weight_Power2_multiply_Report_GIA 6000 non-null float64
# 70 Clarity_VS2_multiply_Carat Weight_Power2 6000 non-null float64
# 71 Carat Weight_Power2_multiply_Symmetry_VG 6000 non-null float64
# dtypes: float64(72)
# memory usage: 3.3 MB
The returned preprocessing pipeline looks like the following.
print(prep_pipe)
#Execution result
# Pipeline(memory=None,
# steps=[('dtypes',
# DataTypes_Auto_infer(categorical_features=[],
# display_types=True, features_todrop=[],
# ml_usecase='regression',
# numerical_features=[], target='Price',
# time_features=[])),
# ('imputer',
# Simple_Imputer(categorical_strategy='not_available',
# numeric_strategy='mean',
# target_variable=None)),
# ('new_levels1',
# New_Catagorical_Levels_i...
# ('dummy', Dummify(target='Price')),
# ('fix_perfect', Remove_100(target='Price')),
# ('clean_names', Clean_Colum_Names()),
# ('feature_select', Empty()), ('fix_multi', Empty()),
# ('dfs',
# DFS_Classic(interactions=['multiply', 'divide'],
# ml_usecase='regression', random_state=123,
# subclass='binary', target='Price',
# top_features_to_pick_percentage=None)),
# ('pca', Empty())],
# verbose=False)
**PyCaret can perform a wide range of data cleaning and feature transformation with very little code.** Being able to describe various preprocessing steps just by setting parameters should lead to significant time savings. I also felt that the resulting code is cleaner and more uniform, which improves readability and helps both the team and myself think more efficiently.
**Understanding the preprocessing that PyCaret can do is also a good way to study the techniques themselves.** PyCaret is relatively approachable even for people who are not comfortable with coding. I think it is a good tool for beginners who have been stuck on coding to focus on learning the theory while actually running things. (I myself learned many techniques I did not know before while doing this research.)
**On the other hand, (at the moment) PyCaret is just a tool for efficiency.** PyCaret only performs cleaning and feature transformation on the data the user provides; I was reminded that forming hypotheses, collecting data, and designing features still have to be done by hand.