[PYTHON] I tried using PyCaret

Introduction

I have implemented regression models in the past, but I thought it would be nice to be able to easily compare multiple models. There is a commercial tool for this called DataRobot, but it is expensive and I could not buy it, so I wondered whether there was a similar library in Python. I found one that can do preprocessing, visualization, and model building in just a few lines of code: PyCaret.

I tried using PyCaret

You can install the library easily with the command below.

pip install pycaret

Load the required libraries and data, then launch PyCaret. This time we will use the Boston house price data published in the UCI Machine Learning Repository, but PyCaret's Data Repository provides about 50 data sets.

PyCaret is launched with the setup function, whose arguments are as follows.

#Import required libraries
import pandas as pd
from pycaret.regression import *

#Read the data set
from pycaret.datasets import get_data 
boston_data = get_data('boston')

#Launch PyCaret
exp1 = setup(boston_data, target = 'medv', ignore_features = None, session_id=1498)

When setup starts, it displays the data types it has inferred for the input data. You need to check them yourself; if there is no problem, place the cursor in the white input box below and press Enter.
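
If you are running this in a script and want to skip the confirmation step, the setup function also takes a silent argument (at least in the PyCaret 2.x API; this is an assumption about the version used here, so please check the documentation for your version).

#Sketch assuming PyCaret 2.x: silent=True skips the interactive
#confirmation of the inferred data types.
exp1 = setup(boston_data, target = 'medv', session_id = 1498, silent = True)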

Setup is complete. You can fix the random seed with session_id.

Next, build the model.

#Model building
compare_models()

It automatically builds the major regression models, such as CatBoost Regressor, Gradient Boosting Regressor, and Ridge Regression, and automatically calculates their evaluation metrics.

Note, however, that hyperparameter tuning is not performed at this stage.
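
By the way, compare_models can also sort the comparison table by a specific metric and return the top models as objects. The sort and n_select arguments below are from the PyCaret 2.x API as I understand it, so treat this as a sketch.

#Sketch assuming PyCaret 2.x: sort the comparison by MAE and keep
#the three best models as a list of trained model objects.
top3 = compare_models(sort = 'MAE', n_select = 3)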

When a model is built, 10-fold cross-validation is performed by default, so let's check the result. This time, select CatBoost, which had the best accuracy.

#Check the catboost model
catboost = create_model('catboost')

Next, tune the hyperparameters. The tuning method is a random grid search.

#Optimized catboost model
catboost_tuned = tune_model(catboost, optimize = 'MAE')
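
The random grid search only tries a limited number of parameter combinations by default. If I remember the PyCaret 2.x API correctly, the number of iterations can be increased with n_iter (an assumption, so check the documentation), at the cost of longer training time.

#Sketch assuming PyCaret 2.x: raise the number of random grid search
#iterations from the default (10, if I recall) to 50.
catboost_tuned = tune_model(catboost, optimize = 'MAE', n_iter = 50)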

Next, check the analysis result.

#Check the analysis results
evaluate_model(catboost_tuned)

You can check the values of the hyperparameters after optimization in "Hyperparameters". You can check the result of the residual analysis in "Residuals".

You can check the prediction accuracy with "Prediction Error".

You can check Cook's distance in "Cooks Distance". Cook's distance is a measure of how far the regression coefficients move when they are recalculated without the i-th observation, in other words how influential that observation is.

In addition, you can check "Feature Importance" (variable importance), "Learning Curve" (learning curve), "Validation Curve" (how prediction accuracy changes with the depth of the tree), and so on. What you can analyze depends on the model you have chosen.
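
Each of these plots can also be drawn on its own with plot_model instead of the interactive evaluate_model widget. The plot names below are the ones I believe the regression module uses, so treat them as assumptions and check the documentation.

#Sketch: draw the individual diagnostic plots directly.
plot_model(catboost_tuned, plot = 'residuals')   #Residuals
plot_model(catboost_tuned, plot = 'error')       #Prediction Error
plot_model(catboost_tuned, plot = 'cooks')       #Cooks Distance
plot_model(catboost_tuned, plot = 'feature')     #Feature Importance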

Finally, you can also try ensemble learning.

#Ensemble learning
lgbm = create_model('lightgbm')
xgboost = create_model('xgboost')

ensemble = blend_models([lgbm, xgboost])
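
Once the blended model is ready, the usual PyCaret workflow (as I understand it; treat this as a sketch) is to predict on the hold-out set with predict_model, retrain on the full data set with finalize_model, and save the result with save_model.

#Sketch of the standard PyCaret workflow
predictions = predict_model(ensemble)        #Predict on the hold-out set
final_model = finalize_model(ensemble)       #Retrain on the entire data set
save_model(final_model, 'ensemble_model')    #Save the pipeline to disk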

In closing

Thank you for reading to the end. I didn't know there was such a useful library. You can do a satisfactory analysis without DataRobot.

If you notice anything that needs correcting, I would appreciate it if you could let me know.
