I've implemented regression models before, but I wanted an easy way to compare multiple models. There is a tool for this called DataRobot, but it is expensive and I could not buy it, so I looked for a similar library in Python. I found one that can do preprocessing, visualization, and model building in just a few lines of code.
You can install the library easily with the command below.
pip install pycaret
Load the required libraries and the data, then launch PyCaret. This time we will use the Boston housing price data published in the UCI Machine Learning Repository, but PyCaret's Data Repository provides about 50 datasets.
The arguments to PyCaret's setup() are as follows.
#Import required libraries
import pandas as pd
from pycaret.regression import *
#Data set reading
from pycaret.datasets import get_data
boston_data = get_data('boston')
#Launch PyCaret
exp1 = setup(boston_data, target = 'medv', ignore_features = None, session_id=1498)
When setup() runs, it displays the data types it has inferred for each column of the input data. Check them yourself, and if there is no problem, place the cursor in the white box below and press Enter.
Setup is complete. You can fix the random seed with session_id.
Next, build the model.
#Model building
compare_models()
It automatically builds the major regression models, such as CatBoost Regressor, Gradient Boosting Regressor, and Ridge Regression, and automatically computes evaluation metrics for each.
At this stage, however, the hyperparameters have not yet been tuned.
Model building performs 10-fold cross-validation by default, so let's check the results. This time, select CatBoost, which had the best accuracy.
#Check the catboost model
catboost = create_model('catboost')
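For reference, the 10-fold cross-validation that PyCaret performs here corresponds roughly to the following scikit-learn sketch. Note that the synthetic data and GradientBoostingRegressor are stand-ins chosen only for illustration (CatBoost is a separate package), so the scores themselves are not meaningful:

```python
# Illustrative sketch of 10-fold cross-validation, as PyCaret runs internally.
# Stand-ins: synthetic data instead of Boston, GradientBoostingRegressor instead of CatBoost.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1498)

model = GradientBoostingRegressor(random_state=1498)
# 10-fold CV scored by negative MAE (scikit-learn maximizes scores)
scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_absolute_error")
print("MAE per fold:", (-scores).round(2))
print("Mean MAE:", round(-scores.mean(), 2))
```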
Next, tune the hyperparameters. Tuning is performed by random grid search.
#Optimized catboost model
catboost_tuned = tune_model(catboost, optimize = 'MAE')
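Conceptually, a random grid search samples candidate hyperparameter combinations from a grid and cross-validates each one. A rough scikit-learn equivalent is sketched below; the parameter grid and the stand-in model are hypothetical and differ from what PyCaret uses internally:

```python
# Illustrative random grid search, similar in spirit to tune_model(optimize='MAE').
# The model and parameter grid here are hypothetical stand-ins.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1498)

param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=1498),
    param_distributions=param_dist,
    n_iter=5,                              # sample 5 random combinations
    cv=10,
    scoring="neg_mean_absolute_error",     # optimize MAE
    random_state=1498,
)
search.fit(X, y)
print(search.best_params_)
```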
Next, check the analysis result.
#Confirmation of analysis results
evaluate_model(catboost_tuned)
You can check the values of hyperparameters after optimization in "Hyperparameters". You can check the result of residual analysis in "Residuals".
You can check the prediction accuracy with "Prediction Error".
You can check Cook's distance in "Cooks Distance". Cook's distance is a "measure of the distance between a coefficient calculated using the i-th observation and a coefficient calculated without that observation".
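To make that definition concrete, here is a small self-contained NumPy sketch on synthetic data. It computes Cook's distance for an ordinary least squares fit using the standard leverage formula, which is equivalent to refitting with each observation deleted:

```python
# Cook's distance for OLS on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(1498)
n, p = 50, 3                              # observations, parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# OLS fit and residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - p)              # residual variance estimate

# Leverages: diagonal of the hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Cook's distance: D_i = e_i^2 * h_ii / (p * s2 * (1 - h_ii)^2)
cooks_d = resid**2 / (p * s2) * h / (1 - h) ** 2
print(cooks_d.round(4))
```

Observations with a large D_i move the fitted coefficients noticeably when removed, which is what the "Cooks Distance" plot highlights.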
In addition, you can check "Feature Importance" (variable importance), "Learning Curve", "Validation Curve" (how prediction accuracy changes with the depth of the tree), and more. What you can analyze depends on the model you choose.
Finally, let's also try ensemble learning.
#Ensemble learning
lgbm = create_model('lightgbm')
xgboost = create_model('xgboost')
ensemble = blend_models([lgbm, xgboost])
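Blending combines the member models by averaging their predictions. A rough scikit-learn equivalent is VotingRegressor, sketched below with stand-in models (lightgbm and xgboost are separate packages):

```python
# Illustrative blending: average the predictions of two regressors.
# GradientBoostingRegressor / RandomForestRegressor stand in for LightGBM / XGBoost.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1498)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1498)

blend = VotingRegressor([
    ("gbr", GradientBoostingRegressor(random_state=1498)),
    ("rf", RandomForestRegressor(random_state=1498)),
])
blend.fit(X_train, y_train)
print("Blend MAE:", round(mean_absolute_error(y_test, blend.predict(X_test)), 2))
```

Averaging often smooths out the individual models' errors, which is why a blend can beat its members.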
Thank you for reading to the end. I didn't know such a useful library existed. You can do a satisfactory analysis without DataRobot.
If you notice anything that needs correcting, I would appreciate it if you could let me know.