[Python] Kaggle Titanic survival prediction using a neural network [80.8%]

Last time, I used the decision-tree-based XGBoost to predict survival: Kaggle Titanic survival prediction using XGBoost [80.1%]

This time, I will try to predict survival on the Titanic using a **neural network**, an approach often used on Kaggle.

1. Loading the data and checking for missing values

import pandas as pd
import numpy as np

train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')
# Combine the train and test data into one DataFrame
data = pd.concat([train, test]).reset_index(drop=True)
# Check the number of missing values per column
train.isnull().sum()
test.isnull().sum()
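For reference, one convenient way to view both counts at once is to put them side by side in a single data frame:

# Missing-value counts of train and test, side by side
pd.DataFrame({'train': train.isnull().sum(), 'test': test.isnull().sum()})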

The number of missing values in each column is as follows.

| Column | train data | test data |
|---|---|---|
| PassengerId | 0 | 0 |
| Survived | 0 | - |
| Pclass | 0 | 0 |
| Name | 0 | 0 |
| Sex | 0 | 0 |
| Age | 177 | 86 |
| SibSp | 0 | 0 |
| Parch | 0 | 0 |
| Ticket | 0 | 0 |
| Fare | 0 | 1 |
| Cabin | 687 | 327 |
| Embarked | 2 | 0 |

2. Imputing missing values and creating features

2.1 Imputing Fare

The row with the missing Fare has a **Pclass of 3** and **Embarked of S**, so we fill it with the **median Fare among passengers who meet both of these conditions**.
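For reference, the row with the missing fare can be confirmed with a quick query like this:

# Show the Pclass and Embarked of the row whose Fare is missing
data.loc[data['Fare'].isnull(), ['Pclass', 'Embarked']]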

data['Fare'] = data['Fare'].fillna(data.query('Pclass==3 & Embarked=="S"')['Fare'].median())

2.2 Creating 'Family_survival': life-and-death outcomes shared within groups

Here I create the feature 'Family_survival', which was introduced in the kernel Titanic [0.82] - [0.83].

**Family and friends were likely to act together on board**, so **whether they survived tends to be the same within such a group**.

Therefore, passengers are grouped first by the surname extracted from Name together with Fare, and then by ticket number, and the value is set according to whether other members of the group survived.

**Creating this feature improved the prediction accuracy by about 2%**, so this grouping is quite effective.

# Extract the surname from Name into 'Last_name'
data['Last_name'] = data['Name'].apply(lambda x: x.split(",")[0])

data['Family_survival'] = 0.5  # default value
# Group by Last_name and Fare
for grp, grp_df in data.groupby(['Last_name', 'Fare']):
    if (len(grp_df) != 1):
        # Two or more people share the same surname and the same Fare
        for index, row in grp_df.iterrows():
            smax = grp_df.drop(index)['Survived'].max()
            smin = grp_df.drop(index)['Survived'].min()
            passID = row['PassengerId']
            # Looking at the members of the group other than oneself:
            #   at least one known survivor              -> 1
            #   no known survivor, at least one death    -> 0
            #   all Survived values are NaN              -> stays 0.5
            if (smax == 1.0):
                data.loc[data['PassengerId'] == passID, 'Family_survival'] = 1
            elif (smin == 0.0):
                data.loc[data['PassengerId'] == passID, 'Family_survival'] = 0

# Group by ticket number
for grp, grp_df in data.groupby('Ticket'):
    if (len(grp_df) != 1):
        # Two or more people share the same ticket number:
        # if the group contains at least one survivor, set 'Family_survival' to 1
        for ind, row in grp_df.iterrows():
            if (row['Family_survival'] == 0) | (row['Family_survival'] == 0.5):
                smax = grp_df.drop(ind)['Survived'].max()
                smin = grp_df.drop(ind)['Survived'].min()
                passID = row['PassengerId']
                if (smax == 1.0):
                    data.loc[data['PassengerId'] == passID, 'Family_survival'] = 1
                elif (smin == 0.0):
                    data.loc[data['PassengerId'] == passID, 'Family_survival'] = 0
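As a quick sanity check, you can look at how many passengers ended up with each value:

# Distribution of Family_survival values
print(data['Family_survival'].value_counts())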

2.3 Creating and binning 'Family_size', the number of family members aboard

Using the values of SibSp and Parch, we create a feature 'Family_size' that indicates how many family members boarded the Titanic together, and bin it by group size.

# Create Family_size
data['Family_size'] = data['SibSp'] + data['Parch'] + 1
# Bin into four groups: 1, 2-4, 5-7, and 8 or more
data['Family_size_bin'] = 0
data.loc[(data['Family_size']>=2) & (data['Family_size']<=4),'Family_size_bin'] = 1
data.loc[(data['Family_size']>=5) & (data['Family_size']<=7),'Family_size_bin'] = 2
data.loc[(data['Family_size']>=8),'Family_size_bin'] = 3

2.4 Creating 'Title' from the name

Extract titles such as 'Mr' and 'Miss' from the Name column, and merge the rare titles ('Mme', 'Mlle', etc.) into titles with the same meaning.

# Extract the title from Name into 'Title'
data['Title'] = data['Name'].map(lambda x: x.split(', ')[1].split('. ')[0])
# Merge rare titles into broader categories
data['Title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
data['Title'].replace(['Don', 'Sir',  'the Countess', 'Lady', 'Dona'], 'Royalty', inplace=True)
data['Title'].replace(['Mme', 'Ms'], 'Mrs', inplace=True)
data['Title'].replace(['Mlle'], 'Miss', inplace=True)
data['Title'].replace(['Jonkheer'], 'Master', inplace=True)

2.5 Imputing and binning Age

The missing Age values are filled with **the average age computed for each title**. Age is then divided into three categories: **children (0-18), adults (18-60), and the elderly (over 60)**.

# Fill missing Age values with the mean age for each title
title_list = data['Title'].unique().tolist()
for t in title_list:
    index = data[data['Title'] == t].index
    age = np.round(data.loc[index, 'Age'].mean(), 1)
    data.loc[index, 'Age'] = data.loc[index, 'Age'].fillna(age)
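The same imputation can be written more compactly with groupby and transform; this sketch should be equivalent as long as the per-title mean is rounded the same way:

# Fill missing ages with the rounded per-title mean age in one pass
data['Age'] = data.groupby('Title')['Age'] \
                  .transform(lambda s: s.fillna(np.round(s.mean(), 1)))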

# Bin Age into three categories (0: 0-18, 1: 18-60, 2: over 60)
data['Age_bin'] = 0
data.loc[(data['Age']>18) & (data['Age']<=60),'Age_bin'] = 1
data.loc[(data['Age']>60),'Age_bin'] = 2

2.6 Standardizing Fare and creating dummy variables

Fare has a much larger scale than the other features, so it is **standardized** (mean 0, standard deviation 1) to make the neural network easier to train.

Next, the string features are converted to dummy variables with get_dummies. **Pclass is numeric**, but **the magnitude of its value has no meaning in itself**, so it is converted to dummy variables as well.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Put the standardized Fare into 'Fare_std'
data['Fare_std'] = sc.fit_transform(data[['Fare']])
# Convert categorical features to dummy variables
data['Sex'] = data['Sex'].map({'male':0, 'female':1})
data = pd.get_dummies(data=data, columns=['Title','Pclass','Family_survival'])
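Note that the scaler above is fit on the combined train and test data, so the test fares influence the scaling statistics. If you want to avoid that leakage, one possible variant (assuming, as in the split below, that the first 891 rows are the train data) is:

# Fit the scaler on the train rows only, then apply it to all rows
sc = StandardScaler()
sc.fit(data.loc[:890, ['Fare']])
data['Fare_std'] = sc.transform(data[['Fare']])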

Finally, remove unnecessary features.

data = data.drop(['PassengerId','Name','Age','SibSp','Parch','Ticket',
                     'Fare','Cabin','Embarked','Family_size','Last_name'], axis=1)

The data frame looks like this.

Survived Sex Family_size_bin Age_bin Fare_std Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty Pclass_1 Pclass_2 Pclass_3 Family_survival_0.0 Family_survival_0.5 Family_survival_1.0
0 0.0 0 1 1 -0.503176 0 0 1 0 0 0 0 0 1 0 1 0
1 1.0 1 1 1 0.734809 0 0 0 1 0 0 1 0 0 0 1 0
2 1.0 1 0 1 -0.490126 0 1 0 0 0 0 0 0 1 0 1 0
3 1.0 1 1 1 0.383263 0 0 0 1 0 0 1 0 0 1 0 0
4 0.0 0 0 1 -0.487709 0 0 1 0 0 0 0 0 1 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1304 NaN 0 0 1 -0.487709 0 0 1 0 0 0 0 0 1 0 1 0
1305 NaN 1 0 1 1.462069 0 0 0 0 0 1 1 0 0 0 0 1
1306 NaN 0 0 1 -0.503176 0 0 1 0 0 0 0 0 1 0 1 0
1307 NaN 0 0 1 -0.487709 0 0 1 0 0 0 0 0 1 0 1 0
1308 NaN 0 1 0 -0.211081 1 0 0 0 0 0 0 0 1 0 0 1

1309 rows × 17 columns

The combined data is split back into train data and test data, and the feature processing is complete.

# The first 891 rows are the original train data
model_train = data[:891]
model_test = data[891:]

x_train = model_train.drop('Survived', axis=1)
y_train = pd.DataFrame(model_train['Survived'])
x_test = model_test.drop('Survived', axis=1)
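A quick shape check confirms the split (17 columns in total, of which 16 are features):

print(x_train.shape, y_train.shape, x_test.shape)
# (891, 16) (891, 1) (418, 16)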

3. Building the model and making predictions

Now that the data frame is complete, let's build a neural network model and make predictions.

from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.callbacks import EarlyStopping
# Initialize the model
model = Sequential()
# Build the layers
model.add(Dense(12, activation='relu', input_dim=16))
model.add(Dropout(0.2))
model.add(Dense(8, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# Show the model structure
model.summary()

Train the model by passing in the train data. Setting validation_split automatically splits validation data off from the train data, which is convenient.

log = model.fit(x_train, y_train, epochs=5000, batch_size=32,verbose=1,
                callbacks=[EarlyStopping(monitor='val_loss',min_delta=0,patience=100,verbose=1)],
                validation_split=0.3)
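One caveat: by default, EarlyStopping keeps the weights from the epoch at which training stopped, not from the best epoch. Recent versions of Keras accept restore_best_weights=True to revert to the weights with the lowest val_loss; a variant sketch:

# Variant: restore the weights of the epoch with the best val_loss
es = EarlyStopping(monitor='val_loss', patience=100, verbose=1,
                   restore_best_weights=True)
log = model.fit(x_train, y_train, epochs=5000, batch_size=32, verbose=1,
                callbacks=[es], validation_split=0.3)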


Plotting the training progress gives a graph like the following.

import matplotlib.pyplot as plt
plt.plot(log.history['loss'],label='loss')
plt.plot(log.history['val_loss'],label='val_loss')
plt.legend(frameon=False)
plt.xlabel('epochs')
plt.ylabel('crossentropy')
plt.show()

(Figure: training and validation loss per epoch.)
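The accuracy curves can be plotted the same way; this assumes the history keys are 'acc' and 'val_acc', matching the metric name passed to compile above:

plt.plot(log.history['acc'], label='acc')
plt.plot(log.history['val_acc'], label='val_acc')
plt.legend(frameon=False)
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.show()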

Finally, predict_classes is used to output the predicted value.

# Predict whether each passenger is classified as 0 or 1
y_pred_cls = model.predict_classes(x_test)
# Create the submission data frame for Kaggle
y_pred_cls = y_pred_cls.reshape(-1)
submission = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':y_pred_cls})
submission.to_csv('titanic_nn.csv', index=False)
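Note that predict_classes was removed in TensorFlow 2.6 and later. If it is not available in your environment, an equivalent for a single sigmoid output is to threshold model.predict at 0.5:

# Equivalent to predict_classes for a sigmoid output
y_pred_cls = (model.predict(x_test) > 0.5).astype(int).reshape(-1)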

The accuracy of this prediction model was **80.8%**. A neural network leaves the parameters and the number of layers up to you, so I don't know whether this model is optimal, but anything above 80% seems reasonable.

If you have any opinions or suggestions, I would appreciate a comment or an edit request.

Sites and books I referred to

- Titanic - Neural Networks [KERAS] - 81.8%
- Titanic [0.82] - [0.83]
- [Data Analysis Techniques to Win Kaggle](https://www.amazon.co.jp/Kaggle%E3%81%A7%E5%8B%9D%E3%81%A4%E3%83%87%E3%83%BC%E3%82%BF%E5%88%86%E6%9E%90%E3%81%AE%E6%8A%80%E8%A1%93-%E9%96%80%E8%84%87-%E5%A4%A7%E8%BC%94-ebook/dp/B07YTDBC3Z)
- [Deep Learning from Scratch: Theory and Implementation in Python](https://www.amazon.co.jp/%E3%82%BC%E3%83%AD%E3%81%8B%E3%82%89%E4%BD%9C%E3%82%8BDeep-Learning-%E2%80%95Python%E3%81%A7%E5%AD%A6%E3%81%B6%E3%83%87%E3%82%A3%E3%83%BC%E3%83%97%E3%83%A9%E3%83%BC%E3%83%8B%E3%83%B3%E3%82%B0%E3%81%AE%E7%90%86%E8%AB%96%E3%81%A8%E5%AE%9F%E8%A3%85-%E6%96%8E%E8%97%A4-%E5%BA%B7%E6%AF%85/dp/4873117585)
