[PYTHON] Exercises on the Titanic Competition for Kaggle Beginners

The other day, we held an in-house training session on Kaggle's Titanic competition. This post shares the explanatory materials and the exercises the participants worked on. Both are published as Kaggle Notebooks, so please take a look if you are interested.

By the way, this is my first post on Qiita.

Introduction

Why I chose Kaggle

A training session alone cannot teach (or let anyone learn) everything, so each participant needs to keep working on their own afterwards. When I have tried to learn something new, I have sometimes stumbled at the environment setup stage, so I chose Kaggle, which requires no setup.

Assumed participants

--Interested in machine learning
--Have never used Kaggle
--Inexperienced in Python

About training

Training goals

--Experience the flow of machine learning
--Come away feeling that they can write a program by themselves

Explanatory materials, exercises

Since we were using Kaggle anyway, I also wrote the explanatory material as a Kaggle Notebook.

--Explanatory material: Let's try the Kaggle tutorial "Titanic Survivor Prediction"! https://www.kaggle.com/plasticgrammer/kaggle-titanic

--Practice: Titanic: Predict survivors (ΦωΦ) https://www.kaggle.com/plasticgrammer/titanic-predict-survivors

How to proceed

I wanted a good balance between explanation and exercises, so the session proceeded as follows.

  1. Explain data analysis using the materials
     --Python basics
     --How to use Kaggle, explanation of terms
     --The flow of machine learning (loading data, analyzing data)

  2. Data analysis exercises

  3. Explain the steps up to prediction using the materials

  4. Exercises for improving prediction accuracy

Prepared exercises

The following is also written in the exercise notebook, but I list it here as well.

Data analysis

Step 1) Get an overview of the data

--Let's check the number of rows and columns in the training data and the test data
--Let's display the first 5 rows of the training data
--Let's display the first 5 rows of the test data
--What is the difference between the training data and the test data? What exactly is machine learning predicting when it predicts survivors?
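The Step 1 checks can be sketched roughly as follows. This is a minimal sketch rather than the notebook's actual answer: the rows are made-up stand-ins (not the real competition data) so the snippet runs anywhere, and the `/kaggle/input/titanic/...` paths in the comments are an assumption about where the files are mounted in a Kaggle notebook.

```python
import pandas as pd

# Toy stand-in rows, made up for illustration (NOT the real Kaggle data).
# In a Kaggle notebook you would load the competition files instead, e.g.
#   train = pd.read_csv("/kaggle/input/titanic/train.csv")  # assumed path
#   test = pd.read_csv("/kaggle/input/titanic/test.csv")    # assumed path
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
})
test = pd.DataFrame({
    "PassengerId": [7, 8, 9],
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "male"],
    "Age": [54.0, None, 27.0],
})

# Number of rows and columns as (rows, columns)
print(train.shape, test.shape)

# First 5 rows of each data set
print(train.head())
print(test.head())

# Columns that exist in the training data but not in the test data
only_in_train = [c for c in train.columns if c not in test.columns]
print(only_in_train)
```

The column that appears only in the training data is the one the model is asked to predict for the test data.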

Step 2) Check the details of the data

--Let's display information about the training data with the info method
--Let's check the missing values in the training data
--Let's check the missing values in the test data
--Let's check the number of rows for each value of the target variable Survived
--Let's check which values the variable Pclass takes
--Let's check the distribution of the variable Age with a histogram
--Let's check the maximum, mean, and median of the variable Age
--Let's check the distribution of the variable Sex with value_counts and a bar graph
--Using pd.crosstab, let's check the counts of the variable Sex for each value of Survived
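One possible way to work through the Step 2 checks with pandas is sketched below. The data frame is again a made-up stand-in for the real training data, and the plotting calls are left as comments because they need matplotlib and a notebook to display anything.

```python
import pandas as pd

# Toy stand-in for the training data (made up, NOT the real Kaggle data).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 2.0],
})

# Column types and non-null counts
train.info()

# Number of missing values per column (the same call works for the test data)
missing = train.isnull().sum()
print(missing)

# Number of rows for each value of the target variable
survived_counts = train["Survived"].value_counts()
print(survived_counts)

# Distinct values taken by Pclass
pclass_values = sorted(train["Pclass"].unique())
print(pclass_values)

# Maximum, mean, and median of Age (missing values are skipped)
age_stats = train["Age"].agg(["max", "mean", "median"])
print(age_stats)

# Number of rows per value of Sex
sex_counts = train["Sex"].value_counts()
print(sex_counts)

# Counts of Sex (rows) for each value of Survived (columns)
sex_by_survived = pd.crosstab(train["Sex"], train["Survived"])
print(sex_by_survived)

# In a notebook, the plots the exercise asks for could be drawn with:
#   train["Age"].plot.hist(bins=20)
#   sex_counts.plot.bar()
```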

Step 3) Visualize whether there is a correlation

--Let's check the counts of the variable Sex for each value of Survived with a bar graph. Is there a correlation? If so, what is the trend?
--Let's check the counts of the variable Pclass for each value of Survived with a bar graph. Is there a correlation? If so, what is the trend?
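For Step 3, a sketch along these lines may help. The helper names (`counts_by_survived`, `plot_counts_by_survived`, `survival_rate`) are hypothetical and the rows are made up for illustration. The plotting helper is only defined here, not called, since drawing requires matplotlib; the survival rate per group is printed as a numeric companion to the bar graph.

```python
import pandas as pd

# Toy stand-in for the training data (made up, NOT the real Kaggle data).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 3, 3, 1],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "female", "male"],
})


def counts_by_survived(df, column):
    """Count the values of `column` (columns) for each value of Survived (rows)."""
    return pd.crosstab(df["Survived"], df[column])


def plot_counts_by_survived(df, column):
    """Draw the counts as a grouped bar graph (requires matplotlib)."""
    return counts_by_survived(df, column).plot.bar(title=f"{column} by Survived")


def survival_rate(df, column):
    """Share of survivors within each value of `column`."""
    return df.groupby(column)["Survived"].mean()


sex_counts = counts_by_survived(train, "Sex")
sex_rate = survival_rate(train, "Sex")
pclass_rate = survival_rate(train, "Pclass")
print(sex_counts)
print(sex_rate)
print(pclass_rate)
```

In a notebook, calling `plot_counts_by_survived(train, "Sex")` and `plot_counts_by_survived(train, "Pclass")` would draw the two bar graphs from the exercise.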

Feature creation and prediction

Assumption) A flow that predicts with Random Forest using Age (missing values filled with 0) and Sex has already been created.
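A starting flow of that kind might look roughly like the sketch below. It is not the notebook's actual code: the `prepare` helper, the 0/1 encoding of Sex, and the model parameters are assumptions chosen for illustration, and the rows are made up so that the snippet is self-contained. It uses scikit-learn's `RandomForestClassifier`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the training and test data (made up, NOT the real data).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 27.0],
})
test = pd.DataFrame({
    "PassengerId": [9, 10, 11],
    "Sex": ["male", "female", "male"],
    "Age": [34.0, None, 62.0],
})


def prepare(df):
    """Build the feature table: Age with missing values set to 0, Sex as 0/1."""
    features = pd.DataFrame(index=df.index)
    features["Age"] = df["Age"].fillna(0)
    features["Sex"] = df["Sex"].map({"male": 0, "female": 1})  # assumed encoding
    return features


# Fit on the training data, then predict Survived for the test data
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(prepare(train), train["Survived"])
predictions = model.predict(prepare(test))

# One row per test passenger, with the predicted 0/1 label
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})
print(submission)
# submission.to_csv("submission.csv", index=False)  # write the file to submit
```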

--Let's fill the missing values of Age with the median
--Let's use Fare for prediction
--Let's use Embarked for prediction
--Let's add SibSp + Parch + 1 as FamilySize
--Let's add FamilySize <= 1 as IsAlone
--Let's add the first character of Cabin as a feature
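The feature tasks above could be sketched as a single function like the one below. Several details are assumptions not taken from the exercise notebook: the function name `add_features`, the column name `CabinDeck`, the `"U"` placeholder for a missing Cabin, the integer codes for Embarked, and filling missing Fare and Embarked values before use. The rows are made up for illustration.

```python
import pandas as pd

# Toy rows with the relevant columns (made up, NOT the real Kaggle data).
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 2],
    "Cabin": [None, "C85", None, "E46"],
})


def add_features(df):
    """Return a copy of df with the features from the task list added."""
    out = df.copy()
    # Fill missing Age with the median of the known ages
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Fare: guard against missing values so the model can use the column
    out["Fare"] = out["Fare"].fillna(out["Fare"].median())
    # Embarked: fill with the most frequent port, then encode as integers
    most_common_port = out["Embarked"].mode()[0]
    out["Embarked"] = (
        out["Embarked"].fillna(most_common_port).map({"S": 0, "C": 1, "Q": 2})
    )
    # Family size counts the passenger plus siblings/spouses and parents/children
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # 1 if the passenger travels alone, otherwise 0
    out["IsAlone"] = (out["FamilySize"] <= 1).astype(int)
    # First character of Cabin; "U" (unknown) where Cabin is missing
    out["CabinDeck"] = out["Cabin"].str[0].fillna("U")
    return out


features = add_features(df)
print(features)
```

Here the medians and the most frequent port are computed from the frame passed in. When training and test data are processed separately, computing them once from the training data and reusing them is one option. `CabinDeck` is still a string column, so it would need to be encoded as numbers (for example with `pd.get_dummies`) before being passed to Random Forest.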

Looking back

The training took 5 hours. The final exercise, improving prediction accuracy, took more time than I expected, and as a result it came across as difficult. We later ran it again as an additional session, but I felt it would be better to proceed one small step at a time with plenty of exercises.

In conclusion

Many articles have been written about the Titanic competition, and I referred to a number of them. Since I had gone to the trouble of putting this together as training material for Python beginners, I am sharing it here in the hope that it helps.
