Introduction

I tried out PyCaret, a machine learning library that was released recently. It automates data feature analysis and performance comparison across multiple models, so I expect it to significantly reduce the time data scientists spend on this work.

This time, I apply PyCaret to the Titanic survival prediction problem, submit the predictions to Kaggle, and look at the resulting score.

This is a follow-up to my previous article, "I tried to classify the quality of wine with PyCaret."

1. Install PyCaret

Run the command below to install it. I use Anaconda, and I created and activated a new virtual environment dedicated to PyCaret before installing. Installing into an existing virtual environment managed by Conda may cause an error, probably because of a conflict between pip and conda.

pip install pycaret
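To confirm that the installation landed in the environment you are actually running, a quick check like the following may help. This is only a minimal sketch: the helper name is_installed is my own and is not part of PyCaret.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "pycaret" is the import name of the package installed with pip above.
    print("pycaret installed:", is_installed("pycaret"))
```

If this prints False inside the notebook, the notebook kernel is probably running in a different environment from the one where pip was executed.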

2. Data acquisition

train.csv and test.csv can be downloaded from Kaggle's Titanic competition page: https://www.kaggle.com/c/titanic/data

import pandas as pd
train_data = pd.read_csv("train.csv")
train_data.head()

Results

Let's take a look at the contents of the data using profile_report() from pandas-profiling.

import pandas_profiling
train_data.profile_report()

Results
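One of the things the profile report highlights is how many values are missing in each column. The idea can be sketched without any dependencies as below. The sample rows are made up for illustration and only mimic a subset of the train.csv columns, and missing_counts is a hypothetical helper name.

```python
import csv
import io

# Made-up rows that mimic a subset of the train.csv columns (illustration only).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,Cabin,Embarked
1,0,3,male,22,,S
2,1,1,female,38,C85,C
3,1,3,female,,,S
4,0,2,male,35,,
"""


def missing_counts(csv_text: str) -> dict:
    """Count empty cells per column in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


if __name__ == "__main__":
    print(missing_counts(SAMPLE_CSV))
```

With the DataFrame loaded above, train_data.isnull().sum() should give the same kind of per-column count directly in pandas.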

3. Data preprocessing

Use setup() to preprocess the data. Pass the target variable, Survived, as the target argument.

from pycaret.classification import *
exp_titanic = setup(data = train_data, target = 'Survived')

Result (up to 10 items)

4. Model comparison

Use compare_models() to fit multiple classification models to the dataset and summarize the results in a table. This is very useful when deciding which classification model to use.

PyCaret provides more than 10 classification models, which are listed at the link below.

https://pycaret.org/classification/

compare_models()

The accuracy of the CatBoost Classifier was 83.63%. For the rest of this article, I evaluate PyCaret using the Random Forest Classifier, which ranked 9th.

result
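The scores in the comparison table are averages over cross-validation folds. The sketch below shows the general idea of k-fold splitting and averaging in plain Python. It is an assumption-laden illustration, not PyCaret's implementation: the function names are hypothetical, no shuffling or stratification is done, and PyCaret's own splitting may differ.

```python
def kfold_indices(n_samples: int, n_folds: int):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


def mean_score(fold_scores):
    """Average the per-fold scores into one summary number."""
    return sum(fold_scores) / len(fold_scores)


if __name__ == "__main__":
    for train, validation in kfold_indices(10, 3):
        print("train:", train, "validation:", validation)
```

Each sample is used for validation exactly once, so the averaged score is less sensitive to one lucky or unlucky split than a single hold-out evaluation.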

5. Generation of analytical model

Select a classification model and train it with create_model(). This time, I use the Random Forest Classifier.

rf = create_model('rf', round=2)

result

6. Tuning the analytical model

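Tune the model's hyperparameters with tune_model(). In the PyCaret version used here, tune_model() takes the model abbreviation ('rf') rather than a trained model object; newer releases expect the trained model instead.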

tuned_rf = tune_model('rf',round=2)

result

The average accuracy before tuning was 0.80, and the average accuracy after tuning was 0.81.

7. Visualization of analytical model

Visualize the analysis results using plot_model().

First, plot the ROC curve together with its AUC.

plot_model(tuned_rf, plot = 'auc')

result
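To make the AUC value in the plot more concrete: it can be read as the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. The following is a minimal sketch of that definition in plain Python. The function name roc_auc and the toy inputs are my own, and this is not how PyCaret computes the plot internally.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy labels and scores (not Titanic data): 3 of the 4 positive/negative pairs are ordered correctly.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for a dataset of Titanic's size but would be replaced by a rank-based calculation for large data.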


Then plot the confusion matrix.

plot_model(tuned_rf, plot = 'confusion_matrix')

result
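The four cells of a confusion matrix, and the accuracy derived from them, can be computed by hand as in the sketch below. The helper names and the toy labels are assumptions for illustration; 1 stands for "survived" and 0 for "did not survive", as in the Survived column.

```python
def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for binary labels where 1 means 'survived'."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1  # survivor predicted as survivor
        elif a == 1 and p == 0:
            fn += 1  # survivor predicted as non-survivor
        elif a == 0 and p == 1:
            fp += 1  # non-survivor predicted as survivor
        else:
            tn += 1  # non-survivor predicted as non-survivor
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that were correct."""
    return (tn + tp) / (tn + fp + fn + tp)


if __name__ == "__main__":
    counts = confusion_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
    print(counts, accuracy(*counts))
```

Accuracy alone hides which kind of error the model makes, which is why looking at the individual cells is worthwhile.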


8. Evaluation of analytical model

evaluate_model() lets you run several evaluations at once from a single interface.

evaluate_model(tuned_rf)

Clicking a button in the yellow frame displays the corresponding evaluation result.

result


9. Prediction

After finalizing the model with finalize_model(), make predictions with predict_model(). The test data (test.csv in this case) is used for prediction.

final_rf = finalize_model(tuned_rf)
data_unseen = pd.read_csv('test.csv')
result = predict_model(final_rf, data = data_unseen)

The Label column represents the result of the prediction.

result

I uploaded this result to Kaggle. The score was 0.76076.
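Kaggle's Titanic competition expects a CSV file with the two columns PassengerId and Survived, so the Label column has to be renamed before uploading. The sketch below shows one way to write such a file with the standard library. The function name write_submission and the output file name are assumptions, and the rows are assumed to be dicts such as those returned by result.to_dict("records").

```python
import csv


def write_submission(rows, path):
    """Write a two-column Kaggle submission file from prediction rows.

    `rows` is assumed to be an iterable of dicts that each contain
    'PassengerId' and 'Label' (the prediction column mentioned above).
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["PassengerId", "Survived"])
        for row in rows:
            writer.writerow([int(row["PassengerId"]), int(row["Label"])])


if __name__ == "__main__":
    # Two made-up prediction rows, only to show the expected shape of the input.
    write_submission(
        [{"PassengerId": 892, "Label": 0}, {"PassengerId": 893, "Label": 1}],
        "submission.csv",
    )
```

In pandas, selecting the PassengerId and Label columns, renaming Label to Survived, and calling to_csv with index=False should produce an equivalent file.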

10. Summary

I analyzed the Titanic survival prediction dataset with PyCaret.
It was very easy to use. I think its analysis capabilities are comparable to those of commercial analysis tools such as Alteryx and DataRobot.

10.1 List of PyCaret functions used for analysis

Data preprocessing: setup()
Compare models: compare_models()
Generate analytical model: create_model()
Tuning: tune_model()
Visualization: plot_model()
Evaluation: evaluate_model()
Prediction: finalize_model(), predict_model()

11. References

1. PyCaret Home Page, http://www.pycaret.org/
2. PyCaret Classification, https://pycaret.org/classification/
3. I tried using PyCaret at the fastest speed, https://qiita.com/s_fukuzawa/items/5dd40a008dac76595eea
4. I tried to classify the quality of wine with PyCaret, https://qiita.com/kotai2003/items/c8fa7e55230d0fa0cc8e

[PYTHON] I tried to predict Titanic survival with PyCaret
