[PYTHON] Titanic Competition Exercises for Kaggle Beginners

The other day we held an in-house training session on Kaggle's Titanic competition. This article shares the explanatory material and the exercises the participants worked on. Both are published as Kaggle Notebooks, so please take a look at those as well if you are interested.

By the way, this is my first post on Qiita.

Introduction

Why I chose Kaggle

A training session alone cannot teach (or let anyone learn) everything, so I felt each participant needed a way to keep working on their own afterward. When I have tried to learn something new, I have sometimes stumbled at the environment setup stage, so I chose Kaggle, which makes that setup unnecessary.

Assumed participants

--Interested in machine learning
--Has never used Kaggle
--Inexperienced in Python

About training

Training goals

--Experience the flow of machine learning
--Come away feeling able to write a program on their own

Explanatory materials, exercises

Since the training uses Kaggle, I also wrote the explanatory material as a Kaggle Notebook.

--Explanatory material: Let's try the Kaggle tutorial "Titanic Survivor Prediction"! https://www.kaggle.com/plasticgrammer/kaggle-titanic

--Exercises: Titanic: Predict survivors (ΦωΦ) https://www.kaggle.com/plasticgrammer/titanic-predict-survivors

How to proceed

I wanted a good balance between explanation and exercises, so the session followed this flow.

1. Explain data analysis using the material
   --Python basics
   --How to use Kaggle, explanation of terms
   --Check the flow of machine learning (data loading, data analysis)
2. Data analysis exercises
3. Explain the steps up to prediction using the material
   --Preprocessing
   --Modeling, training, prediction
4. Exercises for improving prediction accuracy

Prepared exercises

The following is also written in the exercise notebook, but I list it in this article as well for reference.

Data analysis

Step 1) Check the outline of the data

--Let's check the number of rows and columns in the training data and the test data
--Let's display the first 5 rows of the training data
--Let's display the first 5 rows of the test data
--What is the difference between the training data and the test data? What exactly does machine learning predict in survivor prediction?
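The checks in Step 1 could be sketched with pandas as follows. The two small DataFrames are made-up stand-ins so that the snippet runs anywhere; in a Kaggle Notebook you would load the competition files with `pd.read_csv` instead (the path in the comment is an assumption about the notebook setup).

```python
import pandas as pd

# Hypothetical stand-in rows. In a Kaggle Notebook you would load the real files,
# e.g. train = pd.read_csv("/kaggle/input/titanic/train.csv")  (path assumed).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
})
test = pd.DataFrame({
    "PassengerId": [7, 8, 9],
    "Pclass": [3, 3, 2],
    "Sex": ["male", "female", "male"],
    "Age": [34.5, 47.0, None],
})

# Number of rows and columns as (rows, columns)
print(train.shape, test.shape)

# First 5 rows of each data set
print(train.head())
print(test.head())

# Columns that exist in the training data but not in the test data
only_in_train = [c for c in train.columns if c not in test.columns]
print(only_in_train)
```

The column that appears only in the training data is the one the model has to predict for the test data.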

Step 2) Check the details of the data

--Let's display the training data information with the info method
--Let's check the missing values in the training data
--Let's check the missing values in the test data
--Let's check the number of rows for each value of the target variable Survived
--Let's check which values the variable Pclass takes
--Let's check the distribution of the variable Age with a histogram
--Let's check the maximum, mean, and median of the variable Age
--Let's check the distribution of the variable Sex with value_counts and a bar graph
--Using pd.crosstab, let's check the counts of the variable Sex for each value of Survived
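One possible way to work through the Step 2 checks is sketched below. The DataFrame is again a made-up sample, so the numbers say nothing about the real Titanic data; the plotting calls are left as comments because they need matplotlib, which a Kaggle Notebook provides.

```python
import pandas as pd

# Hypothetical sample standing in for the real training data
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 14.0],
})

# Column types and non-null counts
train.info()

# Number of missing values per column
missing = train.isnull().sum()
print(missing)

# Number of rows for each value of the target variable
survived_counts = train["Survived"].value_counts()
print(survived_counts)

# Distinct values of Pclass
pclass_values = sorted(train["Pclass"].unique())
print(pclass_values)

# Maximum, mean, and median of Age (missing values are skipped)
age_stats = train["Age"].agg(["max", "mean", "median"])
print(age_stats)

# Distribution of Sex
sex_counts = train["Sex"].value_counts()
print(sex_counts)

# Counts of Sex for each value of Survived
sex_by_survived = pd.crosstab(train["Survived"], train["Sex"])
print(sex_by_survived)

# In a notebook, the graphs can be drawn with:
#   train["Age"].hist()
#   sex_counts.plot.bar()
```

The same `isnull().sum()` call applies unchanged to the test data.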

Step 3) Visualize whether there is a correlation

--Let's check the counts of the variable Sex for each value of Survived with a bar graph. Is there a correlation? If so, what is the trend?
--Let's check the counts of the variable Pclass for each value of Survived with a bar graph. Is there a correlation? If so, what is the trend?
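As a rough sketch of Step 3, the tables behind the bar graphs can be built with `pd.crosstab`, and a per-group survival rate often makes the trend easier to read than raw counts. The survival-rate step is my own addition rather than part of the exercise, and the data is a made-up sample.

```python
import pandas as pd

# Hypothetical sample standing in for the real training data
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
})

# Counts of Sex and Pclass for each value of Survived
sex_counts = pd.crosstab(train["Survived"], train["Sex"])
pclass_counts = pd.crosstab(train["Survived"], train["Pclass"])
print(sex_counts)
print(pclass_counts)

# Share of survivors within each group (Survived is 0/1, so the mean is a rate)
rate_by_sex = train.groupby("Sex")["Survived"].mean()
rate_by_pclass = train.groupby("Pclass")["Survived"].mean()
print(rate_by_sex)
print(rate_by_pclass)

# In a notebook, the bar graphs can be drawn with (requires matplotlib):
#   sex_counts.plot.bar()
#   pclass_counts.plot.bar()
```

If the rates differ clearly between groups, that variable is a candidate feature for prediction.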

Feature creation and prediction

Assumption) A prediction flow using Random Forest with Age (missing values filled with 0) and Sex has already been created
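For readers without the notebook at hand, a starting flow of this kind might look roughly like the sketch below. It is not the notebook's actual code: the helper `prepare`, the 0/1 encoding of Sex, the hyperparameters, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in rows for the real train/test files
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
})
test = pd.DataFrame({
    "PassengerId": [7, 8, 9],
    "Sex": ["male", "female", "male"],
    "Age": [34.5, 47.0, None],
})

def prepare(df):
    """Build the baseline features: Age with gaps filled by 0, Sex as 0/1."""
    features = pd.DataFrame(index=df.index)
    features["Age"] = df["Age"].fillna(0)
    features["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    return features

X_train = prepare(train)
y_train = train["Survived"]
X_test = prepare(test)

# Fit a Random Forest and predict Survived for the test rows
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Kaggle's Titanic submission file has the columns PassengerId and Survived
submission = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": pred})
print(submission)
```

Keeping the preprocessing in one function means the same steps are applied to both the training data and the test data, which matters once more features are added.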

--Let's fill the missing values of Age with the median
--Let's use Fare for prediction
--Let's use Embarked for prediction
--Let's add SibSp + Parch + 1 as FamilySize
--Let's add FamilySize <= 1 as IsAlone
--Let's add the first character of Cabin as a feature
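The feature tasks above could be sketched as follows. The exercise does not say how to encode Embarked or how to treat a missing Cabin, so the integer mapping, the fill with the most frequent port, the placeholder letter "U", and the column name CabinInitial are my assumptions; the rows are a made-up sample.

```python
import pandas as pd

# Hypothetical sample rows with the columns used in the exercise
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.1],
    "Embarked": ["S", "C", None, "S"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 2],
    "Cabin": [None, "C85", None, "C123"],
})

# Fill missing Age with the median instead of 0
df["Age"] = df["Age"].fillna(df["Age"].median())

# Fare is numeric already; fill any gap with the median before using it
# (this sample has none, but real data may)
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# Embarked: fill gaps with the most frequent port, then map to integers (assumed encoding)
most_common_port = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common_port).map({"S": 0, "C": 1, "Q": 2})

# FamilySize = siblings/spouses + parents/children + the passenger
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone = 1 when the passenger travels without family
df["IsAlone"] = (df["FamilySize"] <= 1).astype(int)

# First character of Cabin; "U" (unknown) is an assumed placeholder for missing values
df["CabinInitial"] = df["Cabin"].str[0].fillna("U")

print(df)
```

CabinInitial is still a letter here, so it would need a numeric encoding (for example with `pd.get_dummies`) before being passed to a Random Forest.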

Looking back

This training took 5 hours. The final exercise, improving the prediction accuracy, took more time than I had expected, and as a result the session seems to have left the impression that the material was difficult. We later held an additional session, but I felt it would have been better to proceed one step at a time with plenty of exercises.

In conclusion

Many articles have been written about the Titanic competition, and I drew on a number of them. Turning the competition into training exercises for Python beginners left me with a reasonably complete set of material, so I am sharing it here in the hope that it helps.