[PYTHON] I tried to predict Titanic survival with PyCaret


I tried PyCaret, a machine learning library that was released recently. It automates feature analysis of the data and performance comparison across multiple models, and I think it will significantly reduce the time data scientists spend on this work.

This time, I apply PyCaret to the Titanic survival prediction problem, submit the predictions to Kaggle, and look at the score.

This is a follow-up to my previous article, "I tried to classify wine quality with PyCaret".

1. Install PyCaret

Run the command below to install PyCaret. I use Anaconda, and I created a new virtual environment dedicated to PyCaret before installing. Installing into an existing Conda-managed environment may cause errors, probably because of conflicts between packages installed by pip and by conda.

pip install pycaret
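
After installing, it can help to confirm which version ended up in the active environment, since PyCaret's API has changed between releases. The following is a minimal sketch using only the standard library (Python 3.8 or later); the helper name installed_version is my own and is not part of PyCaret.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if PyCaret is not installed in this environment.
print("pycaret:", installed_version("pycaret"))
```

If this prints None, the install most likely went into a different environment than the one the notebook is running in.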

2. Data acquisition

train.csv and test.csv can be downloaded from the Titanic competition page on Kaggle: https://www.kaggle.com/c/titanic/data

import pandas as pd
train_data = pd.read_csv("train.csv")
train_data.head()  # display the first five rows


Let's take a look at the contents of the data using profile_report() from pandas_profiling.

import pandas_profiling  # importing it adds profile_report() to DataFrame
train_data.profile_report()


3. Data preprocessing

Use setup() to preprocess the data, passing Survived as the target variable.

from pycaret.classification import *
exp_titanic = setup(data = train_data, target = 'Survived')

Result (up to 10 items)


4. Model comparison

Use compare_models() to train multiple classification models on the dataset and summarize the results in a table. This is very useful when deciding which classification model to use.

PyCaret provides more than 10 classification models; they are listed on the PyCaret Classification page (see References).



The accuracy of the CatBoost Classifier was 83.63%. For the rest of this article, I evaluate PyCaret using the Random Forest Classifier, which ranked 9th.



5. Generation of analytical model

Select a classification model and train it with create_model(). This time, I use the Random Forest Classifier.

rf = create_model('rf', round=2)



6. Tuning the analytical model

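Next, tune the model with tune_model(). In the PyCaret version used here, tune_model() takes the model ID string; later releases (2.0 onward) expect the trained model object returned by create_model() instead.

tuned_rf = tune_model('rf', round=2)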


The average accuracy before tuning was 0.80, and the average accuracy after tuning was 0.81.


7. Visualization of analytical model

Visualize the analysis results using plot_model().

First, plot the ROC curve and its AUC.

plot_model(tuned_rf, plot = 'auc')
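
To make the number on that plot concrete: the AUC can be read as the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. Below is a minimal pure-Python sketch of that definition. The function name roc_auc and the toy labels and scores are my own illustration, not PyCaret output.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = survived, 0 = did not survive; scores are made-up probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 3 of 4 pairs are ordered correctly -> 0.75
```

This pairwise version is quadratic in the number of samples, so it is only meant for understanding the metric, not for large datasets.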



Then plot the confusion matrix.

plot_model(tuned_rf, plot = 'confusion_matrix')
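
For readers new to the confusion matrix, here is a small sketch of how its four cells are counted for a binary problem like Survived, and how accuracy follows from them. The function names and the toy labels are hypothetical and only for illustration; PyCaret computes the plot internally.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1
        elif actual == 1 and predicted == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts):
    """Accuracy = correctly classified samples / all samples."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


# Toy data: 1 = survived, 0 = did not survive.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
toy_counts = confusion_counts(toy_true, toy_pred)
print(toy_counts)            # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
print(accuracy(toy_counts))  # 0.75
```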



8. Evaluation of analytical model

evaluate_model() lets you run multiple evaluations at once.


Clicking a button in the yellow frame displays the corresponding evaluation result.



9. Prediction

After finalizing the model with finalize_model(), make predictions with predict_model(). The predictions are made on the test data (test.csv in this case).

final_rf = finalize_model(tuned_rf)
data_unseen = pd.read_csv('test.csv')
result = predict_model(final_rf, data = data_unseen)

The Label column represents the result of the prediction.



I uploaded this result to Kaggle. The score was 0.76076.
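
The Titanic competition expects a CSV file with two columns, PassengerId and Survived, so the Label column has to be renamed before uploading. The following is a minimal sketch of that conversion using only the standard library. The helper name write_submission, the toy rows, and the idea of feeding it result.to_dict("records") are my own assumptions, not steps taken from the original run.

```python
import csv
import io


def write_submission(rows, out_file):
    """Write rows to out_file in the PassengerId,Survived format Kaggle expects.

    Each row is a mapping with a "PassengerId" key and a "Label" key
    (the prediction column described above).
    """
    writer = csv.writer(out_file)
    writer.writerow(["PassengerId", "Survived"])
    for row in rows:
        writer.writerow([int(row["PassengerId"]), int(row["Label"])])


# Toy rows standing in for the real predictions, e.g. result.to_dict("records").
toy_rows = [
    {"PassengerId": 892, "Label": 0},
    {"PassengerId": 893, "Label": 1},
]

buffer = io.StringIO()
write_submission(toy_rows, buffer)
print(buffer.getvalue())
```

To write an actual file, pass a handle such as open("submission.csv", "w", newline="") instead of the in-memory buffer; the file name here is only an example.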

10. Summary

  1. I analyzed the Titanic survival prediction dataset with PyCaret.
  2. PyCaret is very easy to use. I think its analysis capabilities are comparable to those of commercial tools such as Alteryx and DataRobot.

10.1 List of PyCaret functions used for analysis

  1. Data preprocessing: setup()
  2. Compare models: compare_models()
  3. Generate analytical model: create_model()
  4. Tuning: tune_model()
  5. Visualization: plot_model()
  6. Evaluation: evaluate_model()
  7. Prediction: finalize_model(), predict_model()

11. References

  1. PyCaret Home Page, http://www.pycaret.org/
  2. PyCaret Classification, https://pycaret.org/classification/
  3. I tried using PyCaret at the fastest speed, https://qiita.com/s_fukuzawa/items/5dd40a008dac76595eea
  4. I tried to classify the quality of wine with PyCaret, https://qiita.com/kotai2003/items/c8fa7e55230d0fa0cc8e
