I would like to share what I have learned, adding corrections and supplementary notes where appropriate.
This article focuses on PyCaret's **preprocessing** and basically does not cover modeling or tuning.
I wrote it while actually running PyCaret and reading the original source code. https://github.com/pycaret/pycaret
It is assumed that various libraries are imported as follows.
import pandas as pd
import numpy as np
PyCaret is a low-code library that automates data preprocessing and machine learning model training. https://pycaret.org/
Installation is a single pip command, which is very easy.
pip install pycaret
You can refer to this article for an overview and how to implement a series of pipelines. https://qiita.com/tani_AI_Academy/items/62151d7e151024733919
In PyCaret, the preprocessing you want to run is specified with parameters. In addition, PyCaret asks the user to confirm some of the processing before it runs. The overall flow is as follows.
Calling `setup()` from the package provided for each task (classification, regression, and so on) executes the preprocessing described below.
**The preprocessing you want PyCaret to perform is specified by passing arguments to `setup()`**.
The only required argument is "target" (the target variable).
In the following explanation, I use the sample datasets bundled with PyCaret. You can check the bundled datasets on the official page. https://pycaret.org/get-data/
The code to load the data and run preprocessing is as follows. Here, only the argument "target" is specified; the other options are left at their defaults.
from pycaret.datasets import get_data
dataset = get_data("diamond")
from pycaret.regression import *
setup(dataset, target="Price")
When you run `setup()`, **PyCaret first infers the data type of each variable and prompts the user to confirm the inference result before continuing**.
If the inferred types are correct, press the Enter key in the edit box (blue frame in the figure) to continue.
If an inferred type is incorrect, you can abort the process by typing "quit".
Variables whose types were inferred incorrectly can be handled by explicitly specifying their types in `setup()`.
(For details, see the "Numeric Features, Categorical Features" section below.)
When `setup()` finishes, a summary of the processing is output in data frame format.
 | Description | Value |
---|---|---|
0 | session_id | 3104 |
1 | Transform Target | False |
2 | Transform Target Method | None |
3 | Original Data | (6000, 8) |
4 | Missing Values | False |
5 | Numeric Features | 1 |
6 | Categorical Features | 6 |
7 | Ordinal Features | False |
8 | High Cardinality Features | False |
9 | High Cardinality Method | None |
10 | Sampled Data | (6000, 8) |
11 | Transformed Train Set | (4199, 28) |
12 | Transformed Test Set | (1801, 28) |
13 | Numeric Imputer | mean |
14 | Categorical Imputer | constant |
15 | Normalize | False |
16 | Normalize Method | None |
17 | Transformation | False |
18 | Transformation Method | None |
19 | PCA | False |
20 | PCA Method | None |
21 | PCA Components | None |
22 | Ignore Low Variance | False |
23 | Combine Rare Levels | False |
24 | Rare Level Threshold | None |
25 | Numeric Binning | False |
26 | Remove Outliers | False |
27 | Outliers Threshold | None |
28 | Remove Multicollinearity | False |
29 | Multicollinearity Threshold | None |
30 | Clustering | False |
31 | Clustering Iteration | None |
32 | Polynomial Features | False |
33 | Polynomial Degree | None |
34 | Trignometry Features | False |
35 | Polynomial Threshold | None |
36 | Group Features | False |
37 | Feature Selection | False |
38 | Features Selection Threshold | None |
39 | Feature Interaction | False |
40 | Feature Ratio | False |
41 | Interaction Threshold | None |
From this table, you can **check the data size, the number of features, and which preprocessing steps are enabled**. By default, most options are disabled (False or None).
If you enable an option via the arguments of `setup()`, the corresponding item becomes "True" and is highlighted in color.
In the following sections, we will explain the contents of various items.
session_id
 | Description | Value |
---|---|---|
0 | session_id | 3104 |
This is an identifier for the PyCaret run, and it appears to be used internally as the random seed. If not specified, it is chosen randomly.
It can be specified with the "session_id" argument of `setup()`.
Specify this value to keep results reproducible across repeated runs.
(It is roughly analogous to "random_state" in scikit-learn.)
setup(dataset, target="Price", session_id=123)
Original Data
 | Description | Value |
---|---|---|
3 | Original Data | (6000, 8) |
The size (shape) of the input data is output.
Checking it directly, it is indeed the same size.
dataset.shape
#Execution result
# (6000, 8)
Missing Values
 | Description | Value |
---|---|---|
4 | Missing Values | False |
Whether the input data contains missing values is output. Since this dataset has no missing values, "False" is output.
If there are missing values, this item becomes "True".
If there are missing values, they are imputed inside **setup()**.
How to specify the imputation method is described later.
 | Description | Value |
---|---|---|
5 | Numeric Features | 1 |
6 | Categorical Features | 6 |
The estimated numbers of numeric (continuous) features and categorical features are output, respectively.
They can be specified explicitly with the "numeric_features" and "categorical_features" arguments of `setup()`.
setup(dataset, target="Price",
categorical_features=["Cut", "Color", "Clarity", "Polish", "Symmetry", "Report"],
numeric_features=["Carat Weight"])
**If the type of a variable is inferred incorrectly in the type-inference confirmation dialog mentioned above, specify it explicitly with these arguments.**
Transformed Train Set, Transformed Test Set
 | Description | Value |
---|---|---|
11 | Transformed Train Set | (4199, 28) |
12 | Transformed Test Set | (1801, 28) |
The sizes of the train and test sets after splitting are output.
The train/test split ratio can be specified with the "train_size" argument of `setup()`.
The default is 0.7.
The number of columns differs from the input data because the number of features after preprocessing is shown. (In this run, preprocessing increased the number of features from 7 to 28.)
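For example, a minimal sketch of changing the split ratio (same diamond setup as above, using the "train_size" argument just described):

# Use 80% of the rows for training instead of the default 70%
setup(dataset, target="Price", train_size=0.8)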
Sampled Data
 | Description | Value |
---|---|---|
10 | Sampled Data | (6000, 8) |
When the data is sampled in `setup()`, the number of rows after sampling is output. **PyCaret encourages you to sample the data and run the rest of the workflow on the sample when the data has more than 25,000 rows.**
If you run `setup()` on data with more than 25,000 rows, a sampling confirmation dialog is shown after the type-inference confirmation dialog.
To sample, enter the percentage of data to sample in the edit box (blue frame).
To use all rows without sampling, leave the box blank and press the Enter key.
(For regression tasks)
(For classification tasks)
The plot shown here gives an indication of how accuracy degrades with sampling.
The model used for this plot can be specified with the "sample_estimator" argument of `setup()`.
For example, the code to specify RandomForestRegressor is below.
from sklearn.ensemble import RandomForestRegressor
traffic = get_data("traffic")
setup(traffic, target="traffic_volume", sample_estimator=RandomForestRegressor())
This behavior itself can be turned off with the "sampling" argument of `setup()`.
(In that case, no sampling confirmation is shown and processing continues with all the data.)
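A minimal sketch of disabling sampling for the traffic data used above, assuming the "sampling" argument described in the text:

# Skip the sampling dialog and always use all rows
setup(traffic, target="traffic_volume", sampling=False)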
The remaining items describe whether various data cleaning and feature transformation steps are enabled and which methods are used. The following sections explain each of these steps, what they do, and how to specify them.
`setup()` returns the preprocessed data and the preprocessing pipeline. The exact return values appear to depend on the type of task.
regression
X, y, X_train, X_test, y_train, y_test, seed, prep_pipe, target_inverse_transformer, experiment__ \
= setup(dataset, target="Price")
classification
from pycaret.classification import *
dataset = get_data('credit')
X, y, X_train, X_test, y_train, y_test, seed, prep_pipe, experiment__ \
= setup(dataset, target = 'default')
The return values differ slightly between regression and classification. **The preprocessed data is returned as X and y**, so you can inspect the concrete results of preprocessing.
Whether data preprocessed by PyCaret can be modified further and then fed back into PyCaret is currently unclear to me.
You can specify features to be excluded from preprocessing and subsequent modeling.
This can be done via an argument to `setup()` (see the sketch below).
**ID and date (datetime) columns appear to be excluded from modeling by default.** If a date column is not recognized as a date, you can apparently specify it explicitly with the "date_features" argument.
Also, although I am still confirming the exact specification, if two columns contain exactly the same data, one of them is automatically excluded.
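As far as I can tell from the source, the argument for this is `ignore_features`; a minimal sketch (the excluded column is just an example):

# Exclude the "Report" column from preprocessing and modeling
setup(dataset, target="Price", ignore_features=["Report"])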
Missing values are imputed using the specified method.
This can be done via arguments to `setup()` (see the sketch below).
At the moment the method cannot be specified per column; all columns appear to be handled with a single, uniform method.
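In the version I checked, the relevant arguments appear to be `numeric_imputation` and `categorical_imputation`; a minimal sketch:

# Fill numeric columns with the median and categorical columns with the mode
setup(dataset, target="Price", numeric_imputation="median", categorical_imputation="mode")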
Label encoding is performed by specifying the columns you want to treat as ordinal data.
This can be done via the "ordinal_features" argument of `setup()`.
It is specified as in the following example.
ordinal_features = { 'column_name' : ['low', 'medium', 'high'] }
In the value part of the dictionary, list the category values in ascending order.
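A minimal sketch of passing it to `setup()` (the column name and levels here are hypothetical):

# "Quality" and its levels are placeholders for illustration
setup(dataset, target="Price", ordinal_features={"Quality": ["low", "medium", "high"]})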
Each feature is normalized (scaled).
This can be done via arguments to `setup()` (see the sketch below).
You can refer to this article for 'robust' scaling. https://qiita.com/unhurried/items/7a79d2f3574fb1d0cc27
If the dataset contains outliers, 'robust' scaling is said to hold up well.
For other scaling methods, this article is helpful. https://qiita.com/Umaremin/items/fbbbc6df11f78532932d
In general, linear algorithms tend to be more accurate when features are normalized, but this is not always the case and may require several experiments.
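A minimal sketch using the `normalize` and `normalize_method` arguments:

# Apply robust scaling (median / IQR based), which is less sensitive to outliers
setup(dataset, target="Price", normalize=True, normalize_method="robust")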
For categorical variables, categories whose frequency is below the specified threshold are merged into a single category.
This can be done via arguments to `setup()` (see the sketch below).
In general, this technique avoids the very sparse matrices that arise when dummy-encoding a categorical variable that has a large number of categories.
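The arguments appear to be `combine_rare_levels` and `rare_level_threshold`; a minimal sketch:

# Merge categories that cover less than 10% of the rows into a single "rare" level
setup(dataset, target="Price", combine_rare_levels=True, rare_level_threshold=0.1)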
Numeric features are binned (discretized).
This can be done via an argument to `setup()` (see the sketch below).
Internally, this essentially runs sklearn.preprocessing.KBinsDiscretizer. (An algorithm based on one-dimensional k-means appears to be used.)
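The argument appears to be `bin_numeric_features`, which takes a list of column names; a minimal sketch:

# Discretize "Carat Weight" into bins (internally via KBinsDiscretizer)
setup(dataset, target="Price", bin_numeric_features=["Carat Weight"])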
Outliers are removed from the training data.
This can be done via arguments to `setup()` (see the sketch below).
Internally, singular value decomposition and PCA appear to be used.
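The arguments appear to be `remove_outliers` and `outliers_threshold`; a minimal sketch:

# Remove roughly the most extreme 5% of the training rows as outliers
setup(dataset, target="Price", remove_outliers=True, outliers_threshold=0.05)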
Features that can cause multicollinearity are removed.
This can be done via arguments to `setup()` (see the sketch below).
For multicollinearity, this article is helpful. https://qiita.com/ynakayama/items/36b7c1640e6a02ce2e00
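The arguments appear to be `remove_multicollinearity` and `multicollinearity_threshold`; a minimal sketch:

# Drop one feature from each pair whose correlation exceeds 0.9
setup(dataset, target="Price", remove_multicollinearity=True, multicollinearity_threshold=0.9)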
Clustering is performed on the features, and the cluster label of each record is added as a new feature.
This can be done via arguments to `setup()` (see the sketch below).
The number of clusters appears to be determined using a combination of the Calinski-Harabasz and silhouette criteria.
For more information on the Calinski-Harabasz and silhouette criteria, this article is helpful. https://qiita.com/yasaigirai/items/ec3c3aaaab5bc9b930a2
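The arguments appear to be `create_clusters` and `cluster_iter`; a minimal sketch:

# Append a cluster-label feature; iterate the cluster-count search 20 times
setup(dataset, target="Price", create_clusters=True, cluster_iter=20)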
Features whose variance is not statistically significant are removed.
This can be done via an argument to `setup()` (see the sketch below).
The variance here appears to be computed from the ratio of unique values across all samples; the intuition seems to be that the more often the same value appears in a variable, the lower its variance is considered, making it a candidate for removal.
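The argument appears to be `ignore_low_variance`; a minimal sketch:

# Drop categorical features whose values are almost all identical
setup(dataset, target="Price", ignore_low_variance=True)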
Polynomial and interaction features are generated using the specified parameters.
This can be enabled via arguments to `setup()` (see the sketch below).
For example, if the input is two variables [a, b] and polynomial_degree=2 is specified, the features [1, a, b, a^2, ab, b^2] are generated.
In addition to the above, you can also generate interaction features: first-order interactions are created for all numeric features, including the dummy-variable features derived from categorical variables and the features generated by polynomial_features and trigonometry_features.
Regarding polynomial_threshold and interaction_threshold: the metric compared against these thresholds appears to be an importance score based on a combination of methods such as Random Forest, AdaBoost, and linear correlation.
Regarding trigonometry_features, it presumably creates features using trigonometric functions (sin, cos, tan), as the name suggests.
Note that this function can be inefficient for datasets with a large feature space.
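A minimal sketch combining these arguments:

# Degree-2 polynomial features plus trigonometric, interaction, and ratio features
setup(dataset, target="Price",
      polynomial_features=True, polynomial_degree=2,
      trigonometry_features=True,
      feature_interaction=True, feature_ratio=True)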
By specifying groups of related features in the dataset, statistical features based on them are extracted: aggregate values computed across the specified features are added as new features.
This can be done via the "group_features" (and optionally "group_names") arguments of `setup()`.
An example looks like the following.
setup(dataset, target="Price", group_features=[["cal1", "cal2", "cal3"], ["cal4", "cal5"]], group_names=["gr1", "gr2"])
Features are selected using several evaluation metrics.
This can be done via arguments to `setup()` (see the sketch below).
Regarding feature_selection_threshold: the metric compared against the threshold appears to be an importance score based on a combination of methods such as Random Forest, AdaBoost, and linear correlation.
According to comments in the source, this parameter should be set to a low value when using polynomial_features and feature_interaction; the idea seems to be that the features created by interactions should be narrowed down to some extent in this step.
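The arguments appear to be `feature_selection` and `feature_selection_threshold`; a minimal sketch:

# Keep only the features whose combined importance score clears the threshold
setup(dataset, target="Price", feature_selection=True, feature_selection_threshold=0.8)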
For columns specified as having high cardinality, the number of distinct values in the column is reduced, lowering the cardinality.
This can be done via arguments to `setup()` (see the sketch below).
A quick look at the original source suggests that the 'clustering' method uses k-means.
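The arguments appear to be `high_cardinality_features` and `high_cardinality_method`; a minimal sketch (the column named here is only an example):

# Compress the levels of a high-cardinality column based on frequency
setup(dataset, target="Price", high_cardinality_features=["Clarity"], high_cardinality_method="frequency")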
Features are transformed according to the specified method.
This can be done via arguments to `setup()` (see the sketch below).
Both 'yeo-johnson' and 'quantile' appear to transform the data so that it more closely follows a normal distribution.
Checking the original code, 'yeo-johnson' uses sklearn.preprocessing.PowerTransformer and 'quantile' uses sklearn.preprocessing.QuantileTransformer.
In general, bringing features closer to a normal distribution can help during modeling. According to comments in the source, 'quantile' is non-linear and may distort linear correlations between variables measured on the same scale.
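The arguments appear to be `transformation` and `transformation_method`; a minimal sketch:

# Apply a Yeo-Johnson power transform to make features more Gaussian-like
setup(dataset, target="Price", transformation=True, transformation_method="yeo-johnson")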
The target variable is transformed by the specified method.
This can be done via arguments to `setup()` (see the sketch below).
Bringing the target variable closer to a normal distribution can help during modeling.
The Box-Cox transform requires all values to be positive, so if the data contains negative values PyCaret appears to switch to the Yeo-Johnson transform automatically.
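The arguments appear to be `transform_target` and `transform_target_method`; a minimal sketch:

# Box-Cox transform of the target (switches to Yeo-Johnson if non-positive values exist)
setup(dataset, target="Price", transform_target=True, transform_target_method="box-cox")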
Dimensionality reduction is applied to the features.
This can be done via arguments to `setup()` (see the sketch below).
It is generally done to drop unimportant features and save memory and CPU resources.
This step (dimensionality reduction) appears to run at the end of the preprocessing pipeline, i.e. it is applied to the data after all other preprocessing has finished.
These articles are helpful for principal component analysis. https://qiita.com/shuva/items/9625bc326e2998f1fa27 https://qiita.com/NoriakiOshita/items/460247bb57c22973a5f0
For 'incremental', a method called Incremental PCA appears to be used. According to scikit-learn's documentation, if the dataset is too large to fit in memory it is better to use Incremental PCA (IPCA) instead of ordinary PCA; IPCA builds a low-dimensional approximation of the input using an amount of memory that does not depend on the number of input samples. https://scikit-learn.org/stable/auto_examples/decomposition/plot_incremental_pca.html
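The arguments appear to be `pca`, `pca_method`, and `pca_components`; a minimal sketch:

# Reduce the preprocessed features to 10 components with linear PCA
setup(dataset, target="Price", pca=True, pca_method="linear", pca_components=10)

Finally, the example below combines several of the options described above on the diamond dataset.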
from pycaret.regression import *
X, y, X_train, X_test, y_train, y_test, seed, prep_pipe, target_inverse_transformer, experiment__ \
= setup(dataset, target="Price", session_id=123,
bin_numeric_features = ["Carat Weight"],
create_clusters = True,
polynomial_features = True, feature_interaction = True, feature_ratio = True)
The output produced by `setup()` (excerpt) is shown in the figure below.
Checking the returned preprocessed data, 72 features were generated as shown below.
print(X.info())
#Output result
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 6000 entries, 0 to 5999
# Data columns (total 72 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 Carat Weight_Power2 6000 non-null float64
# 1 Cut_Fair 6000 non-null float64
# 2 Cut_Good 6000 non-null float64
# 3 Cut_Ideal 6000 non-null float64
# 4 Cut_Signature-Ideal 6000 non-null float64
# 5 Cut_Very Good 6000 non-null float64
# 6 Color_D 6000 non-null float64
# 7 Color_E 6000 non-null float64
# 8 Color_F 6000 non-null float64
# 9 Color_G 6000 non-null float64
# 10 Color_H 6000 non-null float64
# 11 Color_I 6000 non-null float64
# 12 Clarity_FL 6000 non-null float64
# 13 Clarity_IF 6000 non-null float64
# 14 Clarity_SI1 6000 non-null float64
# 15 Clarity_VS1 6000 non-null float64
# 16 Clarity_VS2 6000 non-null float64
# 17 Clarity_VVS1 6000 non-null float64
# 18 Clarity_VVS2 6000 non-null float64
# 19 Polish_EX 6000 non-null float64
# 20 Polish_G 6000 non-null float64
# 21 Polish_ID 6000 non-null float64
# 22 Polish_VG 6000 non-null float64
# 23 Symmetry_EX 6000 non-null float64
# 24 Symmetry_G 6000 non-null float64
# 25 Symmetry_ID 6000 non-null float64
# 26 Symmetry_VG 6000 non-null float64
# 27 Report_GIA 6000 non-null float64
# 28 Carat Weight_0.0 6000 non-null float64
# 29 Carat Weight_1.0 6000 non-null float64
# 30 Carat Weight_10.0 6000 non-null float64
# 31 Carat Weight_11.0 6000 non-null float64
# 32 Carat Weight_12.0 6000 non-null float64
# 33 Carat Weight_13.0 6000 non-null float64
# 34 Carat Weight_2.0 6000 non-null float64
# 35 Carat Weight_3.0 6000 non-null float64
# 36 Carat Weight_4.0 6000 non-null float64
# 37 Carat Weight_5.0 6000 non-null float64
# 38 Carat Weight_6.0 6000 non-null float64
# 39 Carat Weight_7.0 6000 non-null float64
# 40 Carat Weight_8.0 6000 non-null float64
# 41 Carat Weight_9.0 6000 non-null float64
# 42 data_cluster_0 6000 non-null float64
# 43 Polish_EX_multiply_Carat Weight_Power2 6000 non-null float64
# 44 Symmetry_EX_multiply_Carat Weight_Power2 6000 non-null float64
# 45 Report_GIA_multiply_Carat Weight_Power2 6000 non-null float64
# 46 Clarity_VVS2_multiply_Carat Weight_Power2 6000 non-null float64
# 47 Clarity_IF_multiply_Carat Weight_Power2 6000 non-null float64
# 48 Clarity_SI1_multiply_Carat Weight_Power2 6000 non-null float64
# 49 Carat Weight_Power2_multiply_data_cluster_0 6000 non-null float64
# 50 Symmetry_EX_multiply_data_cluster_0 6000 non-null float64
# 51 Report_GIA_multiply_data_cluster_0 6000 non-null float64
# 52 Symmetry_VG_multiply_Carat Weight_Power2 6000 non-null float64
# 53 Carat Weight_8.0_multiply_Carat Weight_Power2 6000 non-null float64
# 54 Cut_Signature-Ideal_multiply_Carat Weight_Power2 6000 non-null float64
# 55 data_cluster_0_multiply_Symmetry_EX 6000 non-null float64
# 56 Color_E_multiply_Carat Weight_Power2 6000 non-null float64
# 57 data_cluster_0_multiply_Cut_Ideal 6000 non-null float64
# 58 Carat Weight_Power2_multiply_Polish_EX 6000 non-null float64
# 59 data_cluster_0_multiply_Report_GIA 6000 non-null float64
# 60 Color_F_multiply_Carat Weight_Power2 6000 non-null float64
# 61 Carat Weight_Power2_multiply_Carat Weight_8.0 6000 non-null float64
# 62 Cut_Ideal_multiply_Carat Weight_Power2 6000 non-null float64
# 63 Color_D_multiply_Carat Weight_Power2 6000 non-null float64
# 64 data_cluster_0_multiply_Carat Weight_Power2 6000 non-null float64
# 65 data_cluster_0_multiply_Polish_EX 6000 non-null float64
# 66 Color_I_multiply_Carat Weight_Power2 6000 non-null float64
# 67 Polish_EX_multiply_data_cluster_0 6000 non-null float64
# 68 Color_H_multiply_Carat Weight_Power2 6000 non-null float64
# 69 Carat Weight_Power2_multiply_Report_GIA 6000 non-null float64
# 70 Clarity_VS2_multiply_Carat Weight_Power2 6000 non-null float64
# 71 Carat Weight_Power2_multiply_Symmetry_VG 6000 non-null float64
# dtypes: float64(72)
# memory usage: 3.3 MB
The returned preprocessing pipeline looks like the following.
print(prep_pipe)
#Execution result
# Pipeline(memory=None,
# steps=[('dtypes',
# DataTypes_Auto_infer(categorical_features=[],
# display_types=True, features_todrop=[],
# ml_usecase='regression',
# numerical_features=[], target='Price',
# time_features=[])),
# ('imputer',
# Simple_Imputer(categorical_strategy='not_available',
# numeric_strategy='mean',
# target_variable=None)),
# ('new_levels1',
# New_Catagorical_Levels_i...
# ('dummy', Dummify(target='Price')),
# ('fix_perfect', Remove_100(target='Price')),
# ('clean_names', Clean_Colum_Names()),
# ('feature_select', Empty()), ('fix_multi', Empty()),
# ('dfs',
# DFS_Classic(interactions=['multiply', 'divide'],
# ml_usecase='regression', random_state=123,
# subclass='binary', target='Price',
# top_features_to_pick_percentage=None)),
# ('pca', Empty())],
# verbose=False)
**PyCaret can perform a wide range of data cleaning and feature transformation with very little code.** Being able to describe various preprocessing steps just by setting parameters should lead to significant time savings. I also felt that the resulting code is cleaner and more uniform, which improves readability and helps both the team and myself think more efficiently.
**Understanding the preprocessing that PyCaret can do is also a good way to study the techniques themselves.** PyCaret is relatively approachable even for people who are not comfortable with coding. I think it is a good tool for beginners who have been stuck on coding to focus on learning the theory while actually running things. (I myself learned many techniques I did not know before while doing this research.)
**On the other hand, (at the moment) PyCaret is just a tool for efficiency.** PyCaret only performs cleaning and feature transformation on the data the user provides; I was reminded that forming hypotheses, collecting data, and designing features still have to be done by hand.