[Python] Approach commentary for beginners aiming at the top 1.5% (0.83732) of Kaggle Titanic (Part 1)

1. What is Kaggle?

A site where you can test your skills by competing to solve various problems with data analysis. It is also a good place to study data analysis, because you can download datasets and read other people's solutions (kernels).

2. What is Titanic?

One of Kaggle's competitions, used by many beginners as a tutorial. The task is to predict which passengers on the Titanic survived: from data on 891 passengers, you predict the survival of the remaining 418 passengers.

3. What to do this time

Using Random Forest, I will explain techniques for beginners step by step, up to a submission score of 0.83732 (equivalent to the top 1.5%). In this article, I explain the steps up to a submission score of 0.78468. The next article raises the score to 0.81339, and the one after that reaches the top 1.5% with a submission score of 0.83732. All the code used is published on GitHub; the code used this time is titanic (0.83732) _1.

4. Code details

Import required libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

Read CSV and check the contents


#Read CSV
train= pd.read_csv("train.csv")
test= pd.read_csv("test.csv")

#Data integration
dataset = pd.concat([train, test], ignore_index = True)

#For submission
PassengerId = test['PassengerId']

#Check the first three rows of train
train.head(3)

(Output: the first three rows of the train DataFrame)

A brief description of each column is as follows.

・ PassengerId – unique passenger ID
・ Survived – survival flag (0 = died, 1 = survived)
・ Pclass – ticket class
・ Name – passenger's name
・ Sex – gender (male / female)
・ Age – age
・ SibSp – number of siblings / spouses aboard the Titanic
・ Parch – number of parents / children aboard the Titanic
・ Ticket – ticket number
・ Fare – fare
・ Cabin – cabin number
・ Embarked – port of embarkation

I will also briefly describe the values of some variables.

Pclass (ticket class):
・ 1 = Upper class (wealthy)
・ 2 = Middle class (general)
・ 3 = Lower class (working class)

Embarked (port of embarkation):
・ C = Cherbourg
・ Q = Queenstown
・ S = Southampton

NaN represents missing data. (In the table above, you can see two NaNs in Cabin.) Let's check the total number of missing values.


#Check the total number of missing data
dataset_null = dataset.fillna(np.nan)
dataset_null.isnull().sum()

Age            263
Cabin         1014
Embarked         2
Fare             1
Name             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
Ticket           0
dtype: int64

You can see that Cabin has as many as 1,014 missing values. Next, let's check the overall statistics.


#Check statistical data
dataset.describe()

(Output: summary statistics from dataset.describe())

First, let's fill the missing values with simple substitutes (the mean, the most common port, etc.) and check the accuracy.


#Cabin is temporarily excluded
del dataset["Cabin"]

#Fill Age and Fare with their respective means, and Embarked with S (Southampton)
dataset["Age"].fillna(dataset.Age.mean(), inplace=True) 
dataset["Fare"].fillna(dataset.Fare.mean(), inplace=True) 
dataset["Embarked"].fillna("S", inplace=True)

#Check the total number of missing data
dataset_null = dataset.fillna(np.nan)
dataset_null.isnull().sum()

Age              0
Embarked         0
Fare             0
Name             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
Ticket           0
dtype: int64

Now there are no missing values (the 418 remaining in Survived simply correspond to the 418 rows of test data, so this is not a problem). Next, organize the data for prediction. To start with, use Pclass, Sex, Age, Fare and Embarked, and convert the categorical columns into dummy variables so the machine can handle them. (For example, Sex currently has the two values male and female; get_dummies turns it into the columns Sex_male and Sex_female, where Sex_male is 1 for a male passenger and 0 otherwise.)
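Here is a minimal sketch of what pd.get_dummies does, using a small hypothetical DataFrame (not the actual Titanic data):


#Toy example: one-hot encoding a small, hypothetical DataFrame with get_dummies
toy = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                    'Embarked': ['S', 'C', 'Q']})
print(pd.get_dummies(toy))
#Each category becomes its own indicator column:
#Sex_female, Sex_male, Embarked_C, Embarked_Q, Embarked_S

Now apply this to the actual data.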

#Extract only variables to use
dataset1 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked']]
#Create a dummy variable
dataset_dummies=pd.get_dummies(dataset1)
dataset_dummies.head(3)

(Output: the first three rows of dataset_dummies after one-hot encoding)

Now let the machine learn. We search for the best predictive model by varying the n_estimators and max_depth of the RandomForestClassifier with a grid search.

#Decompose data into train and test
#('Survived' exists in train but not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()]
del test_set["Survived"]

#Separate train data into variables and correct answers
X = train_set.values[:, 1:] #Explanatory variables (Pclass onward)
y = train_set.values[:, 0] #Target (Survived)

#Creating a predictive model
clf = RandomForestClassifier(random_state = 10, max_features='sqrt')
pipe = Pipeline([('classify', clf)])
param_test = {'classify__n_estimators':list(range(20, 30, 1)), #Try 20-29 in steps of 1
              'classify__max_depth':list(range(3, 10, 1))} #Try 3-9 in steps of 1
grid = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_, sep="\n")

{'classify__max_depth': 8, 'classify__n_estimators': 23}
0.8316498316498316

With max_depth = 8 and n_estimators = 23, we get the best model, with a cross-validation accuracy of about 83% on the training data. Let's predict the test data with this model and create a submission file (submission1.csv).

#Prediction of test data
pred = grid.predict(test_set)

#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred.astype(np.int32)})
submission.to_csv("submission1.csv", index=False)

When I actually submitted it, the score was 0.78468, which is quite a high score right from the start.

Next, let's add Parch (number of parents / children aboard) and SibSp (number of siblings / spouses aboard) to the prediction.

#Extract variables to use
dataset2 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked', 'Parch', 'SibSp']]

#Create a dummy variable
dataset_dummies = pd.get_dummies(dataset2)
dataset_dummies.head(3)

(Output: the first three rows of dataset_dummies with Parch and SibSp added)

#Decompose data into train and test
#('Survived' exists in train but not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()]
del test_set["Survived"]

#Separate train data into variables and correct answers
X = train_set.values[:, 1:] #Explanatory variables (Pclass onward)
y = train_set.values[:, 0] #Target (Survived)

#Creating a predictive model
clf = RandomForestClassifier(random_state = 10, max_features='sqrt')
pipe = Pipeline([('classify', clf)])
param_test = {'classify__n_estimators':list(range(20, 30, 1)), #Try 20-29 in steps of 1
              'classify__max_depth':list(range(3, 10, 1))} #Try 3-9 in steps of 1
grid = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_, sep="\n")

#Prediction of test data
pred = grid.predict(test_set)

#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred.astype(np.int32)})
submission.to_csv("submission2.csv", index=False)

{'classify__max_depth': 7, 'classify__n_estimators': 25}
0.8417508417508418

With max_depth = 7 and n_estimators = 25, we get the best model, with a cross-validation accuracy of about 84% on the training data. Although this is higher than before, when I submitted this model's test-data predictions (submission2.csv), the score dropped to 0.76076. The model seems to have overfitted, so it looks better not to use Parch (number of parents / children aboard) and SibSp (number of siblings / spouses aboard).
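One way to investigate the overfitting is to look at which variables the model relies on. The best estimator found by the grid search is a Pipeline, so you can pull the Random Forest out of it and check its feature importances (a rough sketch, assuming the second grid object and train_set defined above):


#Sketch: feature importances of the best Random Forest found by the grid search
best_rf = grid.best_estimator_.named_steps['classify']
feature_names = train_set.columns[1:] #Columns after Survived, same order as X
importances = pd.Series(best_rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))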

5. Summary

I made predictions for Kaggle's tutorial competition, Titanic. The best submission score this time was 0.78468. In the next article, I will visualize the data and explain the process of raising the submission score to 0.83732.
