[PYTHON] I tried to predict Titanic survival with PyCaret

Introduction

I tried out PyCaret, a machine learning library that was released recently. It automates feature analysis of the data and performance comparison across multiple models, so I expect it to significantly reduce the time data scientists spend on this work.

This time, I will apply PyCaret to the Titanic survival prediction problem, submit the predictions to Kaggle, and see what score they get.

This is a follow-up to my previous article, "I tried to classify wine quality with PyCaret."

1. Install PyCaret

Run the command below to install it. I use Anaconda, so I created a new virtual environment dedicated to PyCaret and installed it there. Installing into an existing Conda-managed environment may produce an error, probably because of a conflict between pip and conda.

pip install pycaret
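To confirm that the install landed in the environment you are actually running, a small check like the one below can help. This is only a sketch using the standard library; the helper name installed_version is my own and not part of PyCaret.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether pycaret is visible from the current interpreter.
version = installed_version("pycaret")
print("pycaret version:", version if version else "not installed")
```

If this prints "not installed" right after a successful pip install, pip most likely installed into a different environment than the one running the script.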

2. Data acquisition

train.csv and test.csv can be downloaded from Kaggle's Titanic competition page. https://www.kaggle.com/c/titanic/data

import pandas as pd
train_data = pd.read_csv("train.csv")
train_data.head()

Result: [image: first rows of train_data]

Let's take a look at the contents of the data using profile_report() from pandas-profiling.

import pandas_profiling
train_data.profile_report()

Result: [image: pandas-profiling report]
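One thing worth checking at this stage is how many values are missing in each column. The sketch below does that with only the standard library on a tiny made-up sample shaped like a few columns of train.csv; the function name count_missing and the sample rows are illustrative assumptions, not real Titanic data.

```python
import csv
import io


def count_missing(csv_text):
    """Count empty cells per column in CSV text (empty string = missing)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            # Short rows yield None for absent fields, so treat None as empty.
            if (row[name] or "").strip() == "":
                counts[name] += 1
    return counts


# Made-up sample for illustration only.
sample = """PassengerId,Survived,Age,Cabin
1,0,22,
2,1,38,C85
3,1,,
"""
missing = count_missing(sample)
print(missing)
```

With the DataFrame loaded above, train_data.isnull().sum() gives the same kind of per-column count in one line.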

3. Data preprocessing

Use setup() to preprocess the data, passing Survived as the target variable.

from pycaret.classification import *
exp_titanic = setup(data = train_data, target = 'Survived')

Result (first 10 items): [image: setup() summary table]
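As far as I understand, setup() also holds back part of the data for validation, with its train_size parameter defaulting to 0.7. The sketch below only illustrates that idea of a shuffled hold-out split in plain Python; it is not PyCaret's implementation, and the function name and seed are my own choices.

```python
import random


def holdout_split(rows, train_size=0.7, seed=123):
    """Shuffle rows and split them into (train, holdout) by train_size."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_size))
    return shuffled[:cut], shuffled[cut:]


# train.csv has 891 rows; row indices stand in for the real records here.
train, holdout = holdout_split(range(891))
print(len(train), len(holdout))
```

Fixing the seed makes the split reproducible, which matters when comparing models across runs.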

4. Model comparison

Use compare_models() to train multiple classification models on the dataset and summarize the results in a table. This is very useful when deciding which classification model to use.

PyCaret provides more than 10 classification models; the full list is available at the link below.

https://pycaret.org/classification/

compare_models()

The CatBoost Classifier reached an accuracy of 83.63%. For the rest of this article, I evaluate PyCaret using the Random Forest Classifier, which ranked 9th.

Result: [image: compare_models() comparison table]
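The scores in the comparison table are averages over k-fold cross-validation (10 folds by default, as far as I know). To show what such an average means, here is a plain-Python sketch that cross-validates a trivial "always predict the most common class" baseline. The function names and the toy labels are assumptions for illustration, not PyCaret internals.

```python
def kfold_indices(n_samples, n_folds=10):
    """Yield (train_idx, valid_idx) pairs for contiguous k-fold splits."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def majority_class_cv_accuracy(labels, n_folds=10):
    """Mean validation accuracy of a most-common-class baseline."""
    scores = []
    for train_idx, valid_idx in kfold_indices(len(labels), n_folds):
        train_labels = [labels[i] for i in train_idx]
        majority = max(set(train_labels), key=train_labels.count)
        correct = sum(1 for i in valid_idx if labels[i] == majority)
        scores.append(correct / len(valid_idx))
    return sum(scores) / len(scores)


# Toy labels (0 = did not survive, 1 = survived), not real data.
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0] * 2
baseline = majority_class_cv_accuracy(labels, n_folds=5)
print(round(baseline, 2))
```

A baseline like this is a useful reference point: any model in the comparison table should beat it by a clear margin.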

5. Generation of analytical model

Select a classification model and train it with create_model(). This time, we will use the Random Forest Classifier.

rf = create_model('rf', round=2)

Result: [image: create_model() cross-validation scores]

6. Tuning the analytical model

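Next, tune the model's hyperparameters with tune_model(). Note that this call passes the model ID 'rf', which is what the PyCaret release used here expects; as far as I know, PyCaret 2.0 and later take the trained model object returned by create_model() instead.

tuned_rf = tune_model('rf', round=2)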

Result: [image: tune_model() cross-validation scores]

The mean accuracy was 0.80 before tuning and 0.81 after tuning.
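As far as I know, tune_model() searches hyperparameters by random search with a fixed number of iterations (its n_iter parameter). The sketch below shows the general idea in plain Python. The grid, the toy scoring function, and the name random_search are all hypothetical and stand in for real cross-validated accuracy.

```python
import random


def random_search(score_fn, param_grid, n_iter=10, seed=0):
    """Sample n_iter random parameter combinations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid for a random forest, for illustration only.
grid = {"n_estimators": [10, 50, 100, 200], "max_depth": [2, 4, 6, 8, 10]}


def toy_score(params):
    """Toy stand-in for CV accuracy; peaks at max_depth=6, n_estimators=100."""
    depth_penalty = abs(params["max_depth"] - 6) * 0.02
    trees_penalty = abs(params["n_estimators"] - 100) * 0.0005
    return 1.0 - depth_penalty - trees_penalty


best_params, best_score = random_search(toy_score, grid, n_iter=10)
print(best_params, round(best_score, 3))
```

Because only a few combinations are sampled, a small gain such as 0.80 to 0.81 is plausible; more iterations explore more of the grid at the cost of run time.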

7. Visualization of analytical model

Visualize the analysis results using plot_model().

First, plot the ROC curve, which also reports the AUC.

plot_model(tuned_rf, plot = 'auc')

Result: [image: ROC curve with AUC]
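To make the AUC value in the plot more concrete: it can be read as the probability that a randomly chosen survivor receives a higher score than a randomly chosen non-survivor. The sketch below computes it that way in plain Python on four made-up scores; it illustrates the definition and is not how PyCaret computes the plot.

```python
def roc_auc(y_true, y_score):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.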

Then plot the confusion matrix.

plot_model(tuned_rf, plot = 'confusion_matrix')

Result: [image: confusion matrix]
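For reference, the four cells of a binary confusion matrix, and the metrics derived from them, can be computed by hand as in the sketch below. The labels are made up and the function names are my own.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1
        elif actual == 0 and predicted == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def summary_metrics(counts):
    """Derive accuracy, precision, and recall from the confusion counts."""
    total = sum(counts.values())
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "precision": counts["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": counts["tp"] / actual_pos if actual_pos else 0.0,
    }


# Made-up labels (1 = survived), for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = summary_metrics(counts)
print(counts, metrics)
```

Accuracy alone hides which kind of error dominates, which is why the confusion matrix is worth reading alongside it.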

8. Evaluation of analytical model

evaluate_model() shows several evaluation plots in a single interactive view.

evaluate_model(tuned_rf)

If you press the button in the yellow frame, each evaluation result will be displayed.

Result: [image: evaluate_model() interactive view]

9. Prediction

After finalizing the model with finalize_model(), make predictions with predict_model(). The predictions are made on the test data (test.csv in this case).

final_rf = finalize_model(tuned_rf)
data_unseen = pd.read_csv('test.csv')
result = predict_model(final_rf, data = data_unseen)

The Label column contains the predicted class.

Result: [image: prediction output with the Label column]

I uploaded this result to Kaggle. The score was 0.76076.

[image: Kaggle submission score]
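Kaggle's Titanic competition expects a CSV file with two columns, PassengerId and Survived. The sketch below shows one way to write that file from prediction rows using only the standard library. The function name write_submission and the two sample rows are assumptions for illustration; the Label column name follows the output described above.

```python
import csv
import io


def write_submission(rows, out_file, id_col="PassengerId", label_col="Label"):
    """Write rows as a Kaggle Titanic submission (PassengerId, Survived)."""
    writer = csv.writer(out_file)
    writer.writerow(["PassengerId", "Survived"])
    for row in rows:
        # Cast to int so labels such as 1.0 or "1" are written as 1.
        writer.writerow([row[id_col], int(row[label_col])])


# Made-up prediction rows; real ones would come from the prediction result.
rows = [{"PassengerId": 892, "Label": 0}, {"PassengerId": 893, "Label": 1}]
buffer = io.StringIO()
write_submission(rows, buffer)
print(buffer.getvalue())
```

With a pandas DataFrame, result.to_dict("records") yields rows in this shape. To write a real file, pass open("submission.csv", "w", newline="") instead of the in-memory buffer.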

10. Summary

  1. We used the Titanic survival prediction dataset and analyzed it with PyCaret.
  2. PyCaret is very easy to use. I think its analysis capabilities are comparable to those of commercial tools such as Alteryx and DataRobot.

10.1 List of PyCaret functions used for analysis

  1. Data preprocessing: setup()
  2. Compare models: compare_models()
  3. Generate analytical model: create_model()
  4. Tuning: tune_model()
  5. Visualization: plot_model()
  6. Evaluation: evaluate_model()
  7. Prediction: finalize_model(), predict_model()

11. References

  1. PyCaret Home Page, http://www.pycaret.org/
  2. PyCaret Classification, https://pycaret.org/classification/
  3. I tried using PyCaret at the fastest speed, https://qiita.com/s_fukuzawa/items/5dd40a008dac76595eea
  4. I tried to classify the quality of wine with PyCaret, https://qiita.com/kotai2003/items/c8fa7e55230d0fa0cc8e
