[Python] Survivor prediction on Kaggle's Titanic with XGBoost [80.1%]

Introduction

Last time, I explored the Titanic train data to see which features are related to the survival rate: Data analysis before Kaggle's Titanic feature generation (I hope you will read that article as well).

Based on those results, this time I will **build a data frame that fills in the missing values and adds the features the prediction model needs**.

As the prediction model I will use **XGBoost**, a GBDT (gradient boosted decision tree) library that is often used in Kaggle competitions, so the data will be prepared with it in mind.

1. Data acquisition and checking for missing values

import pandas as pd
import numpy as np

train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')
# Combine the train data and test data into one frame
data = pd.concat([train, test]).reset_index(drop=True)
# Check the number of missing values in each column
train.isnull().sum()
test.isnull().sum()

The missing-value counts are as follows.

| Column | train data | test data |
| --- | --- | --- |
| PassengerId | 0 | 0 |
| Survived | 0 | — |
| Pclass | 0 | 0 |
| Name | 0 | 0 |
| Sex | 0 | 0 |
| Age | 177 | 86 |
| SibSp | 0 | 0 |
| Parch | 0 | 0 |
| Ticket | 0 | 0 |
| Fare | 0 | 1 |
| Cabin | 687 | 327 |
| Embarked | 2 | 0 |

2. Filling in missing values and creating features

2.1 Filling in and binning Fare

The row with the missing Fare has a **Pclass of 3** and **Embarked of S**, so it is filled with the median fare among the passengers who satisfy those two conditions. After that, Fare is binned, taking into account how the survival rate differs with the fare value and Pclass.

# Fill the missing value
data['Fare'] = data['Fare'].fillna(data.query('Pclass==3 & Embarked=="S"')['Fare'].median())
# Put the bin label in 'Fare_bin'
data['Fare_bin'] = 0
data.loc[(data['Fare']>=10) & (data['Fare']<50), 'Fare_bin'] = 1
data.loc[(data['Fare']>=50) & (data['Fare']<100), 'Fare_bin'] = 2
data.loc[(data['Fare']>=100), 'Fare_bin'] = 3
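
As a quick sanity check (my own addition, not in the original post), the mean survival rate per fare bin can be inspected on the rows whose label is known; higher bins should show noticeably higher rates.

# Mean survival rate per Fare_bin, using only the rows with a known Survived value
print(data.loc[data['Survived'].notnull()].groupby('Fare_bin')['Survived'].mean())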

2.2 Creating 'Family_survival': shared fate within a group

Here I create the feature 'Family_survival' that is introduced in the kernel Titanic [0.82] - [0.83].

**Family and friends were likely to act together on board**, so whether or not they survived **tends to be the same within a group**.

Therefore, passengers are grouped first by surname and fare, and then by ticket number, and the value is set according to whether the other members of the group survived.

# Extract the surname from Name and put it in 'Last_name'
data['Last_name'] = data['Name'].apply(lambda x: x.split(",")[0])


data['Family_survival'] = 0.5 # Default value
# Group by Last_name and Fare
for grp, grp_df in data.groupby(['Last_name', 'Fare']):

    if (len(grp_df) != 1):
        # Two or more people share the same surname and the same fare
        for index, row in grp_df.iterrows():
            smax = grp_df.drop(index)['Survived'].max()
            smin = grp_df.drop(index)['Survived'].min()
            passID = row['PassengerId']

            if (smax == 1.0):
                data.loc[data['PassengerId'] == passID, 'Family_survival'] = 1
            elif (smin == 0.0):
                data.loc[data['PassengerId'] == passID, 'Family_survival'] = 0
            # Looking at the other members of the group:
            #   at least one known survivor        -> 1
            #   no survivors among known outcomes  -> 0
            #   all outcomes unknown (NaN)         -> stays 0.5

# Group by ticket number
for grp, grp_df in data.groupby('Ticket'):
    if (len(grp_df) != 1):
        # Two or more people share the same ticket number
        # If anyone else in the group survived, set 'Family_survival' to 1
        for ind, row in grp_df.iterrows():
            if (row['Family_survival'] == 0) | (row['Family_survival'] == 0.5):
                smax = grp_df.drop(ind)['Survived'].max()
                smin = grp_df.drop(ind)['Survived'].min()
                passID = row['PassengerId']
                if (smax == 1.0):
                    data.loc[data['PassengerId'] == passID, 'Family_survival'] = 1
                elif (smin == 0.0):
                    data.loc[data['PassengerId'] == passID, 'Family_survival'] = 0
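
To see how the assignment turned out (a check I added; it is not in the referenced kernel), the distribution of the new feature can be printed:

# 0.5 is the default for passengers with no informative group;
# 1 / 0 mean a known survivor / no known survivor among the other members
print(data['Family_survival'].value_counts())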

2.3 Creating and binning 'Family_size', the number of family members on board

Using SibSp and Parch, we create the feature 'Family_size', which indicates how many family members (including the passenger) were on board the Titanic, and bin it according to the differences in survival rate.

# Create Family_size
data['Family_size'] = data['SibSp']+data['Parch']+1
# Bin into 1 / 2-4 / 5-7 / 8 or more
data['Family_size_bin'] = 0
data.loc[(data['Family_size']>=2) & (data['Family_size']<=4),'Family_size_bin'] = 1
data.loc[(data['Family_size']>=5) & (data['Family_size']<=7),'Family_size_bin'] = 2
data.loc[(data['Family_size']>=8),'Family_size_bin'] = 3
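
The cut points above follow the survival-rate pattern by family size; a quick way to see it (my own check, not in the original post) is:

# Mean survival rate per raw Family_size, on the rows with a known label;
# mid-sized families (2-4) tend to survive more often than singletons and large families
print(data.loc[data['Survived'].notnull()].groupby('Family_size')['Survived'].mean())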

2.4 Creating the title feature 'Title'

Extract titles such as 'Mr' and 'Miss' from the Name column, and merge the rare titles ('Mme', 'Mlle', etc.) into titles with the same meaning.

# Extract the title from Name and put it in 'Title'
data['Title'] = data['Name'].map(lambda x: x.split(', ')[1].split('. ')[0])
# Merge the rare titles
data['Title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
data['Title'].replace(['Don', 'Sir',  'the Countess', 'Lady', 'Dona'], 'Royalty', inplace=True)
data['Title'].replace(['Mme', 'Ms'], 'Mrs', inplace=True)
data['Title'].replace(['Mlle'], 'Miss', inplace=True)
data['Title'].replace(['Jonkheer'], 'Master', inplace=True)
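
A quick look at the resulting counts (my own check) confirms that only the merged categories remain:

# Only Mr / Miss / Mrs / Master / Officer / Royalty should be left
print(data['Title'].value_counts())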

2.5 Labeling the ticket numbers

Some ticket numbers consist of digits only, while others contain letters as well. After separating the two, each type of ticket number is labeled.

# Split into digit-only tickets and tickets that contain letters
# Rows whose ticket number is digits only
num_ticket = data[data['Ticket'].str.match('[0-9]+')].copy()
num_ticket_index = num_ticket.index.values.tolist()
# Dropping the digit-only tickets from the original data leaves the tickets that contain letters
num_alpha_ticket = data.drop(num_ticket_index).copy()

# Bin the digit-only tickets
# The ticket numbers are strings, so convert them to integers first
num_ticket['Ticket'] = num_ticket['Ticket'].apply(lambda x:int(x))

num_ticket['Ticket_bin'] = 0
num_ticket.loc[(num_ticket['Ticket']>=100000) & (num_ticket['Ticket']<200000),
               'Ticket_bin'] = 1
num_ticket.loc[(num_ticket['Ticket']>=200000) & (num_ticket['Ticket']<300000),
               'Ticket_bin'] = 2
num_ticket.loc[(num_ticket['Ticket']>=300000),'Ticket_bin'] = 3

# Bin the tickets that contain letters, by their prefix
num_alpha_ticket['Ticket_bin'] = 4
num_alpha_ticket.loc[num_alpha_ticket['Ticket'].str.match('A.+'),'Ticket_bin'] = 5
num_alpha_ticket.loc[num_alpha_ticket['Ticket'].str.match('C.+'),'Ticket_bin'] = 6
num_alpha_ticket.loc[num_alpha_ticket['Ticket'].str.match('C\.*A\.*.+'),'Ticket_bin'] = 7
num_alpha_ticket.loc[num_alpha_ticket['Ticket'].str.match('F\.C.+'),'Ticket_bin'] = 8
num_alpha_ticket.loc[num_alpha_ticket['Ticket'].str.match('PC.+'),'Ticket_bin'] = 9
num_alpha_ticket.loc[num_alpha_ticket['Ticket'].str.match('S\.+.+'),'Ticket_bin'] = 10
num_alpha_ticket.loc[num_alpha_ticket['Ticket'].str.match('SC.+'),'Ticket_bin'] = 11
num_alpha_ticket.loc[num_alpha_ticket['Ticket'].str.match('SOTON.+'),'Ticket_bin'] = 12
num_alpha_ticket.loc[num_alpha_ticket['Ticket'].str.match('STON.+'),'Ticket_bin'] = 13
num_alpha_ticket.loc[num_alpha_ticket['Ticket'].str.match('W\.*/C.+'),'Ticket_bin'] = 14

# Put the two groups back together, restoring the original row order
data = pd.concat([num_ticket,num_alpha_ticket]).sort_values('PassengerId')
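
As a sanity check (added here; not in the original), we can confirm that every row received a label and see how the tickets spread over the bins:

# Every passenger should fall into exactly one of the bins 0-14
print(data['Ticket_bin'].value_counts().sort_index())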

2.6 Filling in and binning Age

For the missing Age values, common approaches are to fill with the **median** or with **the average age for each title**, but while looking around I found a method that **predicts the missing ages with a random forest**, so this time I use that.

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
# Label-encode the features that are strings
le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex']) # Sex is also used by the survival model, so encode it here as well
data['Title'] = le.fit_transform(data['Title'])
# Put the features used to predict Age into 'age_data'
age_data = data[['Age','Pclass','Family_size',
                 'Fare_bin','Title']].copy()
# Split into rows where Age is known and rows where it is missing
known_age = age_data[age_data['Age'].notnull()].values
unknown_age = age_data[age_data['Age'].isnull()].values

x = known_age[:, 1:]
y = known_age[:, 0]
# Train a random forest regressor
rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
rfr.fit(x, y)
# Write the predicted ages back into the original data frame
age_predict = rfr.predict(unknown_age[:, 1:])
data.loc[(data['Age'].isnull()), 'Age'] = np.round(age_predict,1)
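
A quick check (in place of the screenshot in the original post) confirms that Age no longer has any gaps:

# Should print 0 now that the random forest predictions have been filled in
print(data['Age'].isnull().sum())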

The missing Age values are now filled in.

Age is then binned as well.

data['Age_bin'] = 0
data.loc[(data['Age']>18) & (data['Age']<=60),'Age_bin'] = 1
data.loc[(data['Age']>60),'Age_bin'] = 2

Finally, drop the features that are no longer needed.

data = data.drop(['PassengerId','Name','Age','SibSp','Parch','Ticket',
                  'Fare','Cabin','Embarked','Last_name','Family_size'], axis=1)

In the end, the data frame looks like this.

|  | Survived | Pclass | Sex | Fare_bin | Family_survival | Family_size_bin | Title | Ticket_bin | Age_bin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 3 | 1 | 0 | 0.5 | 1 | 2 | 5 | 1 |
| 1 | 1.0 | 1 | 0 | 2 | 0.5 | 1 | 3 | 9 | 1 |
| 2 | 1.0 | 3 | 0 | 0 | 0.5 | 0 | 1 | 13 | 1 |
| 3 | 1.0 | 1 | 0 | 2 | 0.0 | 1 | 3 | 1 | 1 |
| 4 | 0.0 | 3 | 1 | 0 | 0.5 | 0 | 2 | 3 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | NaN | 3 | 1 | 0 | 0.5 | 0 | 2 | 5 | 1 |
| 1305 | NaN | 1 | 0 | 3 | 1.0 | 0 | 5 | 9 | 1 |
| 1306 | NaN | 3 | 1 | 0 | 0.5 | 0 | 2 | 12 | 1 |
| 1307 | NaN | 3 | 1 | 0 | 0.5 | 0 | 2 | 3 | 1 |
| 1308 | NaN | 3 | 1 | 1 | 1.0 | 1 | 0 | 0 | 0 |

1309 rows × 9 columns

The combined data is split back into train data and test data, and the feature engineering is complete.

# The first 891 rows are the original train data, the rest is the test data
model_train = data[:891]
model_test = data[891:]

X = model_train.drop('Survived', axis=1)
Y = pd.DataFrame(model_train['Survived'])
x_test = model_test.drop('Survived', axis=1)
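
As a final check on the split (my addition), the shapes should match the original 891/418 row counts, with 8 feature columns each:

# Expected: (891, 8) (891, 1) (418, 8)
print(X.shape, Y.shape, x_test.shape)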

3. Prediction with XGBoost

Let's evaluate the model by computing two metrics, **logloss** and **accuracy**.

from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import xgboost as xgb
# Set the parameters
params = {'objective':'binary:logistic',
          'max_depth':5,
          'eta': 0.1,
          'min_child_weight':1.0,
          'gamma':0.0,
          'colsample_bytree':0.8,
          'subsample':0.8}

num_round = 1000

logloss = []
accuracy = []

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for train_index, valid_index in kf.split(X):
    x_train, x_valid = X.iloc[train_index], X.iloc[valid_index]
    y_train, y_valid = Y.iloc[train_index], Y.iloc[valid_index]
    # Convert the data frames to XGBoost's DMatrix format
    dtrain = xgb.DMatrix(x_train, label=y_train)
    dvalid = xgb.DMatrix(x_valid, label=y_valid)
    dtest = xgb.DMatrix(x_test)
    # Train with xgboost
    model = xgb.train(params, dtrain, num_round, evals=[(dtrain,'train'),(dvalid,'eval')],
                      early_stopping_rounds=50)

    valid_pred_proba = model.predict(dvalid)
    # Compute the logloss
    score = log_loss(y_valid, valid_pred_proba)
    logloss.append(score)
    # Compute the accuracy
    # valid_pred_proba contains probabilities, so convert them to 0 / 1
    valid_pred = np.where(valid_pred_proba > 0.5, 1, 0)
    acc = accuracy_score(y_valid, valid_pred)
    accuracy.append(acc)

print(f'log_loss:{np.mean(logloss)}')
print(f'accuracy:{np.mean(accuracy)}')

Running this code gave log_loss: 0.39114 and accuracy: 0.8338.

Now that we have a model, let's create the prediction file to submit to Kaggle.

# Predict on the test data (dtest and model come from the last CV fold)
y_pred_proba = model.predict(dtest)
y_pred = np.where(y_pred_proba > 0.5, 1, 0)
# Create the submission data frame
submission = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':y_pred})
submission.to_csv('titanic_xgboost.csv', index=False)

The accuracy of the submission was **80.1%**, just making it over the 80% mark.

Summary

This time I actually submitted the predicted values to Kaggle.

At first, Cabin was also label-encoded and used as a feature, but **the prediction accuracy improved when Cabin was left out**. With so many missing values, it does not seem to have been a very useful feature for XGBoost.

Also, when predicting without binning features such as Fare and Age, **the validation logloss and accuracy improved**, but **the accuracy of the submitted predictions did not**. Binning the features according to differences in survival rate, rather than using the raw values, seems to help build a reasonable model that does not overfit.

If you have any opinions or suggestions, I would appreciate a comment or an edit request.

Sites and books I referred to

Kaggle Tutorial: Titanic know-how to be in the top 2%
pyhaya’s diary
Titanic [0.82] - [0.83]
[Data analysis technology that wins with Kaggle](https://www.amazon.co.jp/Kaggle%E3%81%A7%E5%8B%9D%E3%81%A4%E3%83%87%E3%83%BC%E3%82%BF%E5%88%86%E6%9E%90%E3%81%AE%E6%8A%80%E8%A1%93-%E9%96%80%E8%84%87-%E5%A4%A7%E8%BC%94-ebook/dp/B07YTDBC3Z)
