As a tutorial for learning machine learning, this post is a record of how I predicted the Titanic survivors, the classic exercise that everyone tries at some point.
About the version used
The data used was downloaded from the following Kaggle competition page after registering: https://www.kaggle.com/c/titanic
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Read the training data
df = pd.read_csv('train.csv')
The data is loaded into a DataFrame called df. Now let's look at the first five rows of the data we read in.
df.head()
Survived is 1 for survived and 0 for died. Please see the official competition page for what the other columns represent.
Next, let's look at the histogram.
df.hist(figsize=(12,12))
plt.show()
You can see that most passengers are in their 20s and 30s, and that most have a Pclass of 3 (the cheapest class).
#Correlation heatmap of the numeric columns
plt.figure(figsize=(15, 15))
sns.heatmap(df.corr(), annot=True)
The indicators with a relatively high correlation coefficient with Survived are Fare (0.26) and Pclass (-0.34). You might think that once the Pclass grade is decided the fare follows naturally, but the two are treated as separate indicators here.
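If you want to read the correlations with Survived off directly rather than from the heatmap, the following one-liner is a minimal sketch of my own; the numeric_only=True argument is for newer pandas versions (drop it on older ones) so that string columns are skipped.

#List correlation coefficients with Survived, sorted (my own addition, not part of the original analysis)
df.corr(numeric_only=True)['Survived'].sort_values()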
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
You can see that Age and Cabin (the cabin number) have many missing values. For ways to handle missing values, refer to the following. https://qiita.com/0NE_shoT_/items/8db6d909e8b48adcb203
This time I decided to fill Age with the median. Embarked (the port of embarkation) is filled with the most frequent value, S. The remaining column with many missing values, Cabin, is dropped.
from sklearn.model_selection import train_test_split
#Missing value processing
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna('S')
#Conversion of categorical variables
df['Sex'] = df['Sex'].apply(lambda x: 1 if x == 'male' else 0)
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
#Delete unnecessary columns
df = df.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis=1)
#Split into features and labels, then hold out test data
train_X = df.drop('Survived', axis=1)
train_y = df.Survived
(train_X, test_X, train_y, test_y) = train_test_split(train_X, train_y, test_size=0.3, random_state=0)
This time, 30% of the data is held out as test data (test_size=0.3).
from sklearn.tree import DecisionTreeClassifier

#Train a decision tree classifier and predict on the test data
clf = DecisionTreeClassifier(criterion='gini', random_state=0)
clf = clf.fit(train_X, train_y)
pred = clf.predict(test_X)
#Calculate the accuracy
from sklearn.metrics import roc_curve, auc, accuracy_score
fpr, tpr, thresholds = roc_curve(test_y, pred, pos_label=1)
auc(fpr, tpr)
accuracy_score(test_y, pred)
I referred to the following for how to handle the decision tree parameters. http://data-analysis-stats.jp/2019/01/14/%E6%B1%BA%E5%AE%9A%E6%9C%A8%E5%88%86%E6%9E%90%E3%81%AE%E3%83%91%E3%83%A9%E3%83%A1%E3%83%BC%E3%82%BF%E8%A7%A3%E8%AA%AC/
criterion='gini'   : 0.7798507462686567
criterion='entropy': 0.7910447761194029
The accuracy was slightly higher with entropy. According to the reference URL, as a rough guide the Gini impurity is said to be better suited to continuous features and entropy to categorical features: the Gini impurity aims to minimize misclassification, while entropy selects splits by information gain. Since this dataset contains a lot of categorical data such as sex and the port of embarkation, entropy may be the more appropriate choice.
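As a quick check, here is a minimal sketch of my own that reuses the train_X/test_X split from above to train one tree per criterion and print its accuracy; the exact scores depend on the split and random_state.

#Compare the two split criteria on the same held-out data (my own addition)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

for criterion in ['gini', 'entropy']:
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    tree = tree.fit(train_X, train_y)
    print(criterion, accuracy_score(test_y, tree.predict(test_X)))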
This article is also helpful, as it covers decision trees in detail. https://qiita.com/3000manJPY/items/ef7495960f472ec14377
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0) #Create a random forest instance
clf = clf.fit(train_X, train_y) #Train the model with the fit method using the training data and labels
pred = clf.predict(test_X)
fpr, tpr, thresholds = roc_curve(test_y, pred, pos_label=1)
auc(fpr, tpr)
accuracy_score(test_y, pred)
0.8283582089552238
Random forests come in a classification version (Classifier) and a regression version (Regressor). Since the goal this time is to classify survival versus death, we use the classifier.
I experimented with the learning parameters and arguments of the RandomForestClassifier class while referring to this page. https://data-science.gr.jp/implementation/iml_sklearn_random_forest.html
As I increased n_estimators, the accuracy improved. That said, n_estimators (like the number of epochs in a neural network) is said to often cause trouble when set to very large values, spending a lot of computation for little gain. Please see the following for details and countermeasures, and the sketch below for a concrete comparison. https://amalog.hateblo.jp/entry/hyper-parameter-search
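To see the effect of n_estimators concretely, here is a minimal sketch of my own that sweeps a few values on the same held-out split; it uses a separate variable name (rf) so the clf used for the final prediction is not overwritten, and the exact scores will vary.

#Sweep n_estimators and print the accuracy for each value (my own addition)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

for n in [10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n, max_depth=5, random_state=0)
    rf = rf.fit(train_X, train_y)
    print(n, accuracy_score(test_y, rf.predict(test_X)))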
From the results so far, the random forest achieved a higher accuracy than the decision tree, so I decided to use the random forest for the final prediction.
fin = pd.read_csv('test.csv')
fin.head()
passenger_id = fin['PassengerId'] #Keep the PassengerId column for the submission file
fin.isnull().sum()
#Missing value processing
fin['Fare'] = fin['Fare'].fillna(fin['Fare'].median())
fin['Age'] = fin['Age'].fillna(fin['Age'].median())
fin['Embarked'] = fin['Embarked'].fillna('S')
#Conversion of categorical variables
fin['Sex'] = fin['Sex'].apply(lambda x: 1 if x == 'male' else 0)
fin['Embarked'] = fin['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
#Delete unnecessary columns
fin = fin.drop(['Cabin', 'Name', 'Ticket', 'PassengerId'], axis=1)
#Predict with the random forest and create the submission file
predictions = clf.predict(fin)
submission = pd.DataFrame({'PassengerId': passenger_id, 'Survived': predictions})
submission.to_csv('submission.csv', index=False)
The result was around 4900th out of roughly 16000. I would like to keep working to improve it.