[PYTHON] I tried using PyCaret at the fastest speed

Introduction

I immediately tried out PyCaret, the machine learning library released the other day. I found that anyone can build a model easily. It really was easy! You can go from preprocessing through tuning and prediction in less than 10 lines of code! There are still many parts I haven't figured out, such as the arguments, but I decided to write this PyCaret article first. If you notice anything, please comment.

0. Environment and version

1. First from the installation

Execute the code below to install it. It's just my impression, but it took only a few minutes. (When I tried installing it locally I got an error, so I gave up on that for now.)

! pip install pycaret
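
To check which version was actually installed, pip can show it (this is plain pip, not a PyCaret feature):

! pip show pycaret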

2. Data acquisition

This time we will use the Boston housing dataset. You can get the data with the following code.

from pycaret.datasets import get_data
boston_data = get_data('boston')
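
As a side note, get_data returns an ordinary pandas DataFrame, so you can inspect it with the usual pandas methods:

# get_data returns a pandas DataFrame
print(boston_data.shape)
print(boston_data.columns)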

3. Preprocessing

Perform the preprocessing. The data and target variable are defined and initialized in setup(). Since we are solving a regression problem this time, I imported pycaret.regression; for classification problems, use pycaret.classification instead. Tasks such as natural language processing and clustering are also available.

setup() handles missing-value imputation, encoding of categorical data, the train-test split, and so on. For more information, see the official documentation.

from pycaret.regression import *
exp1 = setup(boston_data, target = 'medv')

Run it to complete the setup.
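
For reference, a classification task is set up almost identically. Here is a minimal sketch using PyCaret's built-in diabetes dataset (the target column name 'Class variable' is taken from the official tutorial):

# note: this import shadows the regression functions imported above
from pycaret.classification import *
exp_clf = setup(get_data('diabetes'), target = 'Class variable')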

4. Model comparison

Let's compare models and select one. The single line below compares them all; it took a few minutes. Being able to check the evaluation metrics in one table is convenient! By default, 10-fold cross-validation is used; you can specify the number of folds and the metric to sort by via arguments. (For regression, the results are sorted by R2 by default.)

compare_models()
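
For example, to cross-validate with 5 folds and sort the table by MAE instead (argument names as of the version used here):

compare_models(fold = 5, sort = 'MAE')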


5. Modeling

Select a model and train it. This time I'm using Random Forest (chosen purely on a whim). This function returns a table of k-fold cross-validation scores along with the trained model object. Being able to check the standard deviation as well is very convenient!

rf = create_model('rf')


Typing a period after the trained object lets you inspect its attributes and methods.
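
Since create_model returns the underlying scikit-learn estimator (a RandomForestRegressor in this case), its standard attributes are available. A small sketch:

# the trained object is a fitted scikit-learn RandomForestRegressor
print(rf.n_estimators)
print(rf.feature_importances_)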

6. Tuning

Tuning can also be done in one line.

tuned_rf = tune_model('rf')
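
Note that in the version used here, tune_model takes the model abbreviation as a string; newer PyCaret versions take the trained model object instead. You can also control the random search, for example (argument names as of the version used here):

tuned_rf = tune_model('rf', n_iter = 50, optimize = 'mae')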


You can get the parameters below.

tuned_rf.get_params()


7. Model visualization

Let's visualize the model's performance. The regression plot is shown below; for classification problems you can choose the output according to the metric. There are many more visualization options for classification problems, so I somewhat regret not picking a classification problem here...

plot_model(tuned_rf)
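
plot_model also takes a plot argument to switch visualizations; the names below come from the PyCaret regression module:

plot_model(tuned_rf, plot = 'error')    # prediction error plot
plot_model(tuned_rf, plot = 'feature')  # feature importance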


8. Interpretation of the model

The model is interpreted using SHAP. See the SHAP GitHub repository for how to read the graphs and how to interpret the model.

interpret_model(tuned_rf)

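interpret_model also accepts a plot argument; besides the default SHAP summary plot, a correlation plot is available (to the best of my knowledge, this function only works with tree-based models):

interpret_model(tuned_rf, plot = 'correlation')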

9. Prediction

Prediction on the test data is written as follows. Running it returns predictions for the 30% hold-out set created by the train-test split in setup().

rf_holdout_pred = predict_model(rf)


When making predictions on new data, pass the dataset to the data argument.

predictions = predict_model(rf, data=boston_data)

The prediction results are appended in the rightmost column.
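
In the version used here, the predictions are appended as a column named Label, so you can compare them with the actual target like this (the column name may differ in other versions):

# compare actual target values with predictions
print(predictions[['medv', 'Label']].head())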

Finally

Thank you for reading to the end. If you have any questions, please leave a comment.
