[PYTHON] Machine learning in just a few lines (Part 1). PyCaret explained in detail, from dataset preparation to comparing the accuracy of multiple models.

Regarding unseen data

When studying PyCaret, it is easy to mistake unseen data for the test data. Unseen data is a kind of test data, but to be precise, the flow is as follows:

1. Create a predictive model with the training data.
2. Create the final prediction model by combining the training data with the test data.
3. Finally, feed the unseen data into the model to check its accuracy.
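
The three stages can be sketched in plain Python. This is only an illustration of the flow, not PyCaret code: the "model" is a toy that always predicts the most common label, and the labels and helper names (fit_majority, accuracy) are invented for this example.

```python
from collections import Counter


def fit_majority(labels):
    """Toy 'model': remember the most common label seen during fitting."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(model, labels):
    """Share of labels that match the model's constant prediction."""
    return sum(1 for y in labels if y == model) / len(labels)


# Hypothetical labels (1 = default, 0 = no default), invented for illustration.
train_labels = [0, 0, 1, 0, 1, 0, 0]
test_labels = [0, 1, 0]
unseen_labels = [0, 0, 1, 0]

# 1. Fit on the training data and check the model on the test data.
model = fit_majority(train_labels)
test_accuracy = accuracy(model, test_labels)

# 2. Refit the final model on the training and test data combined.
final_model = fit_majority(train_labels + test_labels)

# 3. Score the final model on unseen data that was never used for fitting.
unseen_accuracy = accuracy(final_model, unseen_labels)

print(f"test accuracy:   {test_accuracy:.2f}")
print(f"unseen accuracy: {unseen_accuracy:.2f}")
```

The point of the third stage is that the unseen rows play no part in either fit, so their score is the closest thing to a real-world check.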

Introduction

The open source Python machine learning library "PyCaret 1.0.0" was released on April 16, 2020, so let's install it with pip.

PyCaret automatically imputes missing values and tunes hyperparameters. Its main feature is that the machine learning workflow can be carried out in just a few lines. Even if you are not familiar with the details of machine learning, you can easily create and compare models.

Using the Binary Classification Tutorial (CLF101) - Level Beginner as a reference, I will implement it on Google Colab.

Install PyCaret

The version at the time of writing is 1.0.0. It seems that from 1.0.1, compare_models will return the trained models. The planned change is described as follows:

"Return models from compare_models. Currently compare_models() does not return any trained model object."

For Google Colab or Azure Notebooks, install with the following code.

code.py


! pip install pycaret

When using Google Colab, run the following code so that the interactive output is displayed.

code.py


from pycaret.utils import enable_colab
enable_colab()

Data set preparation

PyCaret provides several datasets that you can load with get_data() (an internet connection is required).

The available datasets are stored in pycaret/datasets/ and include, for example, cancer and heart disease data.

The tutorial uses a dataset of credit card payment information from April to September 2005 in Taiwan. It includes gender, final education, marital status, past payment status, past payment history, and billing details.

The target column is default payment (1 = yes, 0 = no), so this is a binary classification problem. Binary classification sorts each record into one of two classes, such as pass or fail, or positive or negative.

code.py


from pycaret.datasets import get_data
dataset = get_data('credit')

Let's check the size of the dataset.

code.py


#check the shape of data
dataset.shape

result

(24000, 24)

Next, set aside 5% of the data as unseen data. These 1,200 records are not used to create the predictive model. The tutorial notes that this should not be confused with a train/test split. (The training data and the test data are separated by the setup() function.)

This should not be confused with a train/test split as this particular split is performed to simulate a real life scenario.

code.py


# Randomly sample 95% of the rows for modeling
data = dataset.sample(frac=0.95, random_state=786)
# The remaining 5% is held back as unseen data
data_unseen = dataset.drop(data.index).reset_index(drop=True)
data.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

result

Data for Modeling: (22800, 24)
Unseen Data For Predictions: (1200, 24)
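
The same 95% / 5% hold-out can be sketched with row indices only, using the standard library. This is a rough illustration, not what pandas does internally, and the seed here will not reproduce the rows chosen by dataset.sample. The variable names are made up for this example.

```python
import random

n_rows = 24000            # rows in the credit dataset, per dataset.shape above
rng = random.Random(786)  # fixed seed so the split is repeatable

all_idx = range(n_rows)
# Draw 95% of the row indices without replacement for modeling.
modeling_idx = set(rng.sample(all_idx, k=n_rows * 95 // 100))
# Every index that was not drawn becomes part of the unseen hold-out set.
unseen_idx = [i for i in all_idx if i not in modeling_idx]

print(len(modeling_idx), len(unseen_idx))
```

Because the two index sets are disjoint, no unseen row can leak into model building.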

PyCaret environment settings

Initialize the PyCaret environment with setup(). setup() has two required parameters: the pandas dataframe and the name of the target column. All other parameters are optional. This time, session_id is also specified. It is used for reproducibility. If it is not specified, a pseudo-random number is generated instead.
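
Why a fixed seed helps reproducibility can be seen in a small standard-library sketch. It only illustrates the general idea of seeding a random number generator and is not how PyCaret uses session_id internally. The helper shuffled_rows is hypothetical.

```python
import random


def shuffled_rows(seed, n_rows=10):
    """Return row indices shuffled with a generator seeded by `seed`."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    return rows


first_run = shuffled_rows(seed=123)
second_run = shuffled_rows(seed=123)
other_seed = shuffled_rows(seed=456)

# The same seed gives the same order on every run.
print(first_run == second_run)
print(first_run, other_seed)
```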

When setup() is run, it automatically infers the data type of each feature. Since the inference is not always correct, the features and their inferred data types are displayed for review. Once you have verified that all data types have been identified correctly, press Enter to continue, or type quit to exit.

code.py


from pycaret.classification import *
exp_clf101 = setup(data = data, target = 'default', session_id=123)

In the output of setup(), the missing values item is displayed as True if the original data has missing values. In this experiment, there were no missing values in the dataset.

Pay attention to the following lines of the output:

Sampled Data (22800, 24)
Transformed Train Set (15959, 90)

The training dataset has more features than the original dataset. This is because the categorical variables were encoded automatically, with each category becoming its own column.

Categorical Features 9

These 9 features were treated as categorical variables. It is really impressive that this is done automatically.
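
How encoding a categorical feature increases the number of columns can be shown with a small one-hot encoding sketch. This is a simplified illustration under my own assumptions, not PyCaret's actual preprocessing. The helper one_hot and the sample rows are invented for this example.

```python
def one_hot(rows, column):
    """Replace one categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column as it is.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per category value.
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows; the values are invented for illustration.
rows = [
    {"LIMIT_BAL": 20000, "EDUCATION": 2},
    {"LIMIT_BAL": 90000, "EDUCATION": 1},
    {"LIMIT_BAL": 50000, "EDUCATION": 3},
]
encoded = one_hot(rows, "EDUCATION")

print(len(rows[0]), "columns before,", len(encoded[0]), "columns after")
print(encoded[0])
```

One column with three categories turns into three indicator columns, which is why the transformed dataset ends up much wider than the original.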

Since the training data is 70% and the test data is 30%, the data is divided as follows:

Sampled Data (22800, 24)
Transformed Train Set (15959, 90)
Transformed Test Set (6841, 90)

Model comparison

You can train all the models in the library and use 10-fold cross-validation to calculate and compare accuracy, recall, and F1 score. For example, if you place importance on the F1 score, you would probably want to use LightGBM.
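
As a rough picture of what 10-fold cross-validation does with the rows, the sketch below splits row indices into 10 folds and uses each fold once for validation. It is a minimal version with contiguous folds and no shuffling or stratification, which a real library would typically add. The function k_fold_indices is hypothetical.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size


# 15959 is the size of the transformed training set reported by setup().
folds = list(k_fold_indices(n_rows=15959, k=10))

print(len(folds))
print([len(valid_idx) for _, valid_idx in folds])
```

Each model is fitted 10 times, once per fold, and the scores in the comparison table are the averages over those folds.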

code.py


compare_models()

By default, the results are sorted by accuracy.
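
Accuracy, precision, recall, and F1 score are easy to mix up, so here is a small sketch that computes them from the standard definitions. The labels and predictions are invented for illustration and are not results from this experiment. The helper binary_metrics is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1": f1}


# Hypothetical labels (1 = default, 0 = no default).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy counts every correct prediction, while recall only asks how many of the actual defaults were caught, so the two can rank models differently.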

For example, if you want to sort by recall or use 5-fold cross-validation, run the following code.

code.py


compare_models(sort = 'Recall', fold = 5)
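
The effect of the sort option can be pictured as re-ranking the same score table by a different column. The sketch below uses invented model names and scores, not results from this experiment, and rank_models is a hypothetical helper rather than part of PyCaret.

```python
# Hypothetical scores, invented purely to show how the ranking changes.
results = [
    {"model": "model_a", "Accuracy": 0.82, "Recall": 0.36},
    {"model": "model_b", "Accuracy": 0.80, "Recall": 0.41},
    {"model": "model_c", "Accuracy": 0.78, "Recall": 0.62},
]


def rank_models(results, sort="Accuracy"):
    """Return the rows ordered from best to worst on the chosen metric."""
    return sorted(results, key=lambda row: row[sort], reverse=True)


by_accuracy = [row["model"] for row in rank_models(results)]
by_recall = [row["model"] for row in rank_models(results, sort="Recall")]

print("by accuracy:", by_accuracy)
print("by recall:  ", by_recall)
```

The model at the top of the table depends on the metric you sort by, so choose the metric that matches the cost of mistakes in your problem.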

Summary

In this article, we got as far as comparing the models. Next time, I would like to create a model and evaluate it.
