I've implemented regression models before, but I wanted an easy way to compare multiple models. There is a tool for this called DataRobot, but it is expensive and I could not buy it, so I looked for a similar library in Python. I found one that can do preprocessing, visualization, and model building in just a few lines of code.
You can install the library easily with the command below.
pip install pycaret
Load the required libraries and the data, then launch PyCaret. This time we will use the Boston housing price data published in the UCI Machine Learning Repository, but PyCaret's Data Repository provides about 50 datasets.
The arguments to PyCaret's setup() are as follows.
#Import required libraries
import pandas as pd
from pycaret.regression import *
#Data set reading
from pycaret.datasets import get_data
boston_data = get_data('boston')
#Launch PyCaret
exp1 = setup(boston_data, target = 'medv', ignore_features = None, session_id=1498)
When setup() runs, it displays the data types it has inferred for each column of the input data. Check them yourself, and if there is no problem, place the cursor in the white box below and press Enter.
Setup is complete. You can fix the random seed with session_id.
Next, build the model.
#Model building
compare_models()
It automatically builds the major regression models, such as CatBoost Regressor, Gradient Boosting Regressor, and Ridge Regression, and automatically computes evaluation metrics for each.
At this stage, however, the hyperparameters have not yet been tuned.
Model building performs 10-fold cross-validation by default, so let's check the results. This time, select CatBoost, which had the best accuracy.
#Check the catboost model
catboost = create_model('catboost')
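For reference, the 10-fold cross-validation that PyCaret performs here corresponds roughly to the following scikit-learn sketch. Note that the synthetic data and GradientBoostingRegressor are stand-ins chosen only for illustration (CatBoost is a separate package), so the scores themselves are not meaningful:

```python
# Illustrative sketch of 10-fold cross-validation, as PyCaret runs internally.
# Stand-ins: synthetic data instead of Boston, GradientBoostingRegressor instead of CatBoost.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1498)

model = GradientBoostingRegressor(random_state=1498)
# 10-fold CV scored by negative MAE (scikit-learn maximizes scores)
scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_absolute_error")
print("MAE per fold:", (-scores).round(2))
print("Mean MAE:", round(-scores.mean(), 2))
```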
Next, tune the hyperparameters. Tuning is performed by random grid search.
#Optimized catboost model
catboost_tuned = tune_model(catboost, optimize = 'MAE')
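Conceptually, a random grid search samples candidate hyperparameter combinations from a grid and cross-validates each one. A rough scikit-learn equivalent is sketched below; the parameter grid and the stand-in model are hypothetical and differ from what PyCaret uses internally:

```python
# Illustrative random grid search, similar in spirit to tune_model(optimize='MAE').
# The model and parameter grid here are hypothetical stand-ins.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1498)

param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=1498),
    param_distributions=param_dist,
    n_iter=5,                              # sample 5 random combinations
    cv=10,
    scoring="neg_mean_absolute_error",     # optimize MAE
    random_state=1498,
)
search.fit(X, y)
print(search.best_params_)
```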
Next, check the analysis result.
#Confirmation of analysis results
evaluate_model(catboost_tuned)
You can check the values of hyperparameters after optimization in "Hyperparameters". You can check the result of residual analysis in "Residuals".
You can check the prediction accuracy with "Prediction Error".
You can check Cook's distance in "Cooks Distance". Cook's distance is a "measure of the distance between a coefficient calculated using the i-th observation and a coefficient calculated without that observation".
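To make that definition concrete, here is a small self-contained NumPy sketch on synthetic data. It computes Cook's distance for an ordinary least squares fit using the standard leverage formula, which is equivalent to refitting with each observation deleted:

```python
# Cook's distance for OLS on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(1498)
n, p = 50, 3                              # observations, parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# OLS fit and residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - p)              # residual variance estimate

# Leverages: diagonal of the hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Cook's distance: D_i = e_i^2 * h_ii / (p * s2 * (1 - h_ii)^2)
cooks_d = resid**2 / (p * s2) * h / (1 - h) ** 2
print(cooks_d.round(4))
```

Observations with a large D_i move the fitted coefficients noticeably when removed, which is what the "Cooks Distance" plot highlights.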
In addition, you can check "Feature Importance" (variable importance), "Learning Curve", "Validation Curve" (how prediction accuracy changes with the depth of the tree), and more. What you can analyze depends on the model you choose.
Finally, let's also try ensemble learning.
#Ensemble learning
lgbm = create_model('lightgbm')
xgboost = create_model('xgboost')
ensemble = blend_models([lgbm, xgboost])
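Blending combines the member models by averaging their predictions. A rough scikit-learn equivalent is VotingRegressor, sketched below with stand-in models (lightgbm and xgboost are separate packages):

```python
# Illustrative blending: average the predictions of two regressors.
# GradientBoostingRegressor / RandomForestRegressor stand in for LightGBM / XGBoost.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1498)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1498)

blend = VotingRegressor([
    ("gbr", GradientBoostingRegressor(random_state=1498)),
    ("rf", RandomForestRegressor(random_state=1498)),
])
blend.fit(X_train, y_train)
print("Blend MAE:", round(mean_absolute_error(y_test, blend.predict(X_test)), 2))
```

Averaging often smooths out the individual models' errors, which is why a blend can beat its members.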
Thank you for reading to the end. I didn't know such a useful library existed. You can do a satisfactory analysis without DataRobot.
If you notice anything that needs correcting, I would appreciate it if you could let me know.