[Python] Approach commentary for beginners: reaching the top 1.5% (0.83732) of Kaggle Titanic, Part 3

Following on from last time, I will explain the rest of the approach for reaching the top 1.5% (0.83732) of Kaggle Titanic. The code used is titanic(0.83732)_3 on GitHub. This article explains how to improve on the score submitted last time and reach 0.83732.

1. Import the required libraries and load the CSV files

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
# Read the CSVs
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

#Data integration
dataset = pd.concat([train, test], ignore_index = True)

# For submission
PassengerId = test['PassengerId']

# Survival rate comparison by cabin deck
dataset['Cabin'] = dataset['Cabin'].fillna('Unknown')  # substitute 'Unknown' where cabin data is missing
dataset['Deck'] = dataset['Cabin'].str.get(0)  # get the first character of Cabin (the deck letter)
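# (Added plot, not in the original code) Compare survival rates by deck,
# in the same style as the other barplots in this article
sns.barplot(x="Deck", y="Survived", data=dataset, palette='Set3')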

# Survival rate comparison by the number of passengers sharing a ticket
Ticket_Count = dict(dataset['Ticket'].value_counts())  # count how many passengers share each ticket number
dataset['TicketGroup'] = dataset['Ticket'].apply(lambda x: Ticket_Count[x])  # assign each passenger their ticket-group size

# Group by survival rate based on ticket-group size:
# substitute 2 for sizes with a high survival rate, 1 for low, 0 for very large groups
def Ticket_Label(s):
    if 2 <= s <= 4:  # group sizes with a high survival rate
        return 2
    elif 4 < s <= 8 or s == 1:  # group sizes with a low survival rate
        return 1
    elif s > 8:
        return 0

dataset['TicketGroup'] = dataset['TicketGroup'].apply(Ticket_Label)
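As a quick visual check (this plot is an addition, not in the original code, but matches the barplots used elsewhere in this article):

sns.barplot(x="TicketGroup", y="Survived", data=dataset, palette='Set3')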

2. Use honorific titles

Looking at Kaggle's top-scoring code, we can see that using the titles in passengers' names is a key to high scores. Honorific titles such as Mr, Mrs, and Miss appear in the middle of Name. Occupations such as Dr (doctor) and Rev (priest or minister) may appear in place of Mr. Extract and group this information.

# Create 'Honorifics' (honorific title) as a feature
dataset['Honorifics'] = dataset['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())  # extract the title (the word between ',' and '.')
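# For example, "Braund, Mr. Owen Harris".split(',')[1] is " Mr. Owen Harris",
# and .split('.')[0].strip() on that yields "Mr"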

#Group titles
# Example: map 'Capt', 'Col', 'Major', 'Dr', 'Rev' to 'Officer'
Honorifics_Dict = {}
Honorifics_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
Honorifics_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
Honorifics_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
Honorifics_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
Honorifics_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
Honorifics_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))
dataset['Honorifics'] = dataset['Honorifics'].map(Honorifics_Dict)
sns.barplot(x="Honorifics", y="Survived", data=dataset, palette='Set3')

"""List of titles
Mr: man,Master: Boy,Jonkheer: Dutch aristocrat(Man),
Mlle: Mademoiselle(France unmarried woman),Miss: Unmarried women, girls,Mme: Madam(French married woman),Ms: Female(Unmarried or married),Mrs: married woman, 
Don: Man(Spain),Sir: Man(England),the Countess: Countess,Dona: Married woman(Spain),Lady: Married woman(England),
Capt: Captain,Col: Colonel,Major: Military personnel,Dr: Doctor,Rev: priests and ministers
"""

(Figure: survival rate by honorific title.) As expected, adult men have a low survival rate, while women and children have a high one. What we can newly discover here is that the 'Royalty' group, such as aristocrats, has an even higher survival rate than children. It seems that the aristocrats of this era were given priority in the rescue. Whether or not a passenger is an aristocrat looks like a powerful feature for predicting survival.

3. Review the assignment of missing values

In the previous two articles, missing values were simply filled with the median for the time being. Reviewing these fills improves the accuracy of the predictions.

3.1 Review of missing values for 'Age'

Fill the missing ages with values predicted by machine learning. The title (occupation) data extracted earlier can also be used for this prediction (for example, it prevents a passenger titled Dr from being predicted to be 5 years old).

## Predict and fill in the missing age values
# Extract the columns used for age prediction and create dummy variables
age = dataset[['Age','Pclass','Sex','Honorifics']]
age_dummies = pd.get_dummies(age)
age_dummies.head(3)


# Split into rows with known age and rows with missing age
known_age = age_dummies[age_dummies.Age.notnull()].to_numpy()
null_age = age_dummies[age_dummies.Age.isnull()].to_numpy()

# Split into features and target (column 0 is Age)
age_X = known_age[:, 1:]
age_y = known_age[:, 0]

#Create an age prediction model and substitute the predicted value
rf = RandomForestRegressor()
rf.fit(age_X, age_y)
pred_Age = rf.predict(null_age[:, 1:])
dataset.loc[(dataset.Age.isnull()),'Age'] = pred_Age
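As an added check (not in the original code), you can confirm that no ages are missing anymore:

print(dataset['Age'].isnull().sum())  # expect 0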

3.2 Review of missing values for 'Embarked' (port of departure)

Next, look at the rows with missing data in order to fill in the missing values for 'Embarked'.

# Show the rows where 'Embarked' (port of departure) is missing
dataset[dataset['Embarked'].isnull()]

(Output: the two rows with missing 'Embarked'.) In both cases, 'Pclass' (ticket class) is 1 and 'Fare' (fare) is 80. Comparing the median 'Fare' for each 'Embarked' where 'Pclass' is 1, C is the closest. Substitute C for the two missing values.

# Show the median 'Fare' (fare) for each 'Embarked' (port of departure) where 'Pclass' (ticket class) is 1
C = dataset[(dataset['Embarked']=='C') & (dataset['Pclass'] == 1)]['Fare'].median()
print("Median of C", C)
S = dataset[(dataset['Embarked']=='S') & (dataset['Pclass'] == 1)]['Fare'].median()
print("Median of S", S)
Q = dataset[(dataset['Embarked']=='Q') & (dataset['Pclass'] == 1)]['Fare'].median()
print("Median of Q", Q)

# Substitute C for the missing values of 'Embarked'
dataset['Embarked'] = dataset['Embarked'].fillna('C')

Median of C 76.7292
Median of S 52.0
Median of Q 90.0

3.3 Review of missing values for 'Fare'

Looking at the row with the missing fare, you can see that its 'Pclass' (ticket class) is 3 and its 'Embarked' (port of departure) is 'S'. Therefore, substitute the median fare of passengers with 'Pclass' 3 and 'Embarked' 'S' for this missing value. Once the missing values for Age, Embarked, and Fare are filled, check them.

# Show the row where 'Fare' (fare) is missing
dataset[dataset['Fare'].isnull()]

# Substitute the median fare of passengers whose 'Pclass' (ticket class) is 3 and 'Embarked' (port of departure) is 'S'
fare_median = dataset[(dataset['Embarked'] == "S") & (dataset['Pclass'] == 3)].Fare.median()
dataset['Fare'] = dataset['Fare'].fillna(fare_median)

#Check the total number of missing data
dataset_null = dataset.fillna(np.nan)
dataset_null.isnull().sum()

Age               0
Cabin             0
Embarked          0
Fare              0
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
Deck              0
TicketGroup       0
Honorifics        0
dtype: int64

There are no missing values left; the 418 missing 'Survived' values simply correspond to the test data to be predicted.

4. Family size

Now turn 'SibSp' (number of siblings/spouses aboard) and 'Parch' (number of parents/children aboard), which could not be handled well two articles ago, into usable data. Combine them into the number of family members aboard and group the family sizes by survival rate.

# Survival rate comparison by number of siblings/spouses aboard
sns.barplot(x="SibSp", y="Survived", data=train, palette='Set3')

# Survival rate comparison by number of parents/children aboard
sns.barplot(x="Parch", y="Survived", data=train, palette='Set3')

# Number of family members aboard (the passenger plus SibSp and Parch)
dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
sns.barplot(x="FamilySize", y="Survived", data=dataset, palette='Set3')

# Group family sizes by survival rate
def Family_label(s):
    if 2 <= s <= 4:
        return 2
    elif 4 < s <= 7 or s == 1:
        return 1
    elif s > 7:
        return 0

dataset['FamilyLabel'] = dataset['FamilySize'].apply(Family_label)
sns.barplot(x="FamilyLabel", y="Survived", data=dataset, palette='Set3')

(Figures: survival rate by SibSp, Parch, FamilySize, and FamilyLabel.) The labels separate the survival rates cleanly.

5. Survival rate adjustment by surname

'SibSp' and 'Parch' above do not capture family relationships beyond the third degree, so we investigate survival rates by surname instead of by family. The surnames turn out to show large differences in survival rate.

# Examine the characteristics of surnames
dataset['Surname'] = dataset['Name'].apply(lambda x: x.split(',')[0].strip())  # extract the surname (the word before ',' in Name)
Surname_Count = dict(dataset['Surname'].value_counts())  # count how many passengers share each surname
dataset['Surname_Count'] = dataset['Surname'].apply(lambda x: Surname_Count[x])  # assign each passenger their surname count

# Among passengers whose surname is shared by two or more people, split into a women/children group and an adult-male group
Female_Child_Group=dataset.loc[(dataset['Surname_Count']>=2) & ((dataset['Age']<=12) | (dataset['Sex']=='female'))]
Male_Adult_Group=dataset.loc[(dataset['Surname_Count']>=2) & (dataset['Age']>12) & (dataset['Sex']=='male')]

# Distribution of surname-level average survival rates in the women/children group
Female_Child_mean = Female_Child_Group.groupby('Surname')['Survived'].mean()  # average survival rate per surname
Female_Child_mean_count = pd.DataFrame(Female_Child_mean.value_counts())  # how many surnames have each average
Female_Child_mean_count.columns=['GroupCount']
Female_Child_mean_count


# Distribution of surname-level average survival rates in the adult-male group
Male_Adult_mean = Male_Adult_Group.groupby('Surname')['Survived'].mean()  # average survival rate per surname
Male_Adult_mean_count = pd.DataFrame(Male_Adult_mean.value_counts())  # how many surnames have each average
Male_Adult_mean_count.columns=['GroupCount']
Male_Adult_mean_count

(Output: the distribution of surname-level averages.) In both groups, most surname averages are exactly 1 or 0: within a family, the women and children usually all survive or all die, and the same holds for the adult men. This clear pattern is valuable. By treating the cases that go against it as outliers, we can expect to improve the score. What we do is rewrite the data: test passengers whose surname belongs to a women/children family in which everyone died, or to an adult-male family in which everyone survived, are given profile data that follows the opposite rule.

# Handle the exceptions in each group
# Extract the surnames that are exceptions
# Dead_List: surnames where everyone in the women/children group died
# Survived_List: surnames where everyone in the adult-male group survived
Dead_List = set(Female_Child_mean[Female_Child_mean.apply(lambda x:x==0)].index)
print("Dead_List", Dead_List, sep="\n")
Survived_List = set(Male_Adult_mean[Male_Adult_mean.apply(lambda x:x==1)].index)
print("Survived_List", Survived_List, sep="\n")

Dead_List
{'Danbom', 'Turpin', 'Zabour', 'Bourke', 'Olsson', 'Goodwin', 'Cacic', 'Robins', 'Canavan', 'Lobb', 'Palsson', 'Ilmakangas', 'Oreskovic', 'Lefebre', 'Sage', 'Johnston', 'Arnold-Franchi', 'Skoog', 'Attalah', 'Lahtinen', 'Jussila', 'Ford', 'Vander Planke', 'Rosblom', 'Boulos', 'Rice', 'Caram', 'Strom', 'Panula', 'Barbara', 'Van Impe'}
Survived_List
{'Chambers', 'Beane', 'Jonsson', 'Cardeza', 'Dick', 'Bradley', 'Duff Gordon', 'Greenfield', 'Daly', 'Nakid', 'Taylor', 'Frolicher-Stehli', 'Beckwith', 'Kimball', 'Jussila', 'Frauenthal', 'Harder', 'Bishop', 'Goldenberg', 'McCoy'}

# Rewrite the test data
# Split the data back into train and test
train = dataset.loc[dataset['Survived'].notnull()].copy()
test = dataset.loc[dataset['Survived'].isnull()].copy()

# Passengers whose surname is in Dead_List (a women/children family where everyone died)
# → rewrite as a 60-year-old man with the title Mr
# Passengers whose surname is in Survived_List (an adult-male family where everyone survived)
# → rewrite as a 5-year-old woman with the title Miss
test.loc[test['Surname'].apply(lambda x: x in Dead_List), 'Sex'] = 'male'
test.loc[test['Surname'].apply(lambda x: x in Dead_List), 'Age'] = 60
test.loc[test['Surname'].apply(lambda x: x in Dead_List), 'Honorifics'] = 'Mr'
test.loc[test['Surname'].apply(lambda x: x in Survived_List), 'Sex'] = 'female'
test.loc[test['Surname'].apply(lambda x: x in Survived_List), 'Age'] = 5
test.loc[test['Surname'].apply(lambda x: x in Survived_List), 'Honorifics'] = 'Miss'
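As an added sanity check (not in the original code), you can count how many test rows each rule rewrote:

print(test['Surname'].isin(Dead_List).sum())      # rewritten as 60-year-old men
print(test['Surname'].isin(Survived_List).sum())  # rewritten as 5-year-old girls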

#Combine data again
dataset = pd.concat([train, test])

6. Make a prediction again

#Extract variables to use
dataset6 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked','Honorifics','FamilyLabel','Deck','TicketGroup']]
#Create a dummy variable
dataset_dummies = pd.get_dummies(dataset6)
dataset_dummies.head(3)


# Split the data into train and test
# ('Survived' exists in train but not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()].copy()
del test_set["Survived"]

# Split the train data into features and target
X = train_set.to_numpy()[:, 1:]  # features: every column after 'Survived'
y = train_set.to_numpy()[:, 0]   # target: 'Survived'
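To see how many columns the dummy encoding produced (an added check, not in the original code; train_set has one extra column for 'Survived'):

print(train_set.shape, test_set.shape)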

#Creating a predictive model
pipe = Pipeline([('classify', RandomForestClassifier(random_state = 10, max_features = 'sqrt'))])

param_test = {'classify__n_estimators':list(range(20, 30, 1)), 
              'classify__max_depth':list(range(3, 10, 1))}
gsearch = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
gsearch.fit(X, y)
print(gsearch.best_params_, gsearch.best_score_)

#Prediction of test data
predictions = gsearch.predict(test_set)

#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
submission.to_csv("submission6.csv", index=False)

{'classify__max_depth': 5, 'classify__n_estimators': 28} 0.8451178451178452

The submitted score was 0.81818.

7. Prediction by reducing features

Since the number of features has grown to 26, noticeably more than last time, we exclude the less important ones.

pipe = Pipeline([('select',SelectKBest(k=20)),  #Create a model using 20 features that are useful for prediction
               ('classify', RandomForestClassifier(random_state = 10, max_features = 'sqrt'))])

param_test = {'classify__n_estimators':list(range(20, 30, 1)), 
              'classify__max_depth':list(range(3, 10, 1))}
gsearch = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
gsearch.fit(X, y)
print(gsearch.best_params_, gsearch.best_score_)

{'classify__max_depth': 6, 'classify__n_estimators': 26} 0.8451178451178452

Compared with the model in section 6, the best max_depth and n_estimators have changed. Using these values, narrow the features down to 20 again, build the prediction model, and make predictions.

# Using the best max_depth and n_estimators, narrow the features down to 20 and build the final prediction model
select = SelectKBest(k = 20)
clf = RandomForestClassifier(random_state = 10,
                             warm_start = True, 
                             n_estimators = 26,
                             max_depth = 6, 
                             max_features = 'sqrt')
pipeline = make_pipeline(select, clf)
pipeline.fit(X, y)

cv_score = model_selection.cross_val_score(pipeline, X, y, cv= 10)
print("CV Score : Mean - %.7g | Std - %.7g " % (np.mean(cv_score), np.std(cv_score)))

#Prediction of test data
predictions = pipeline.predict(test_set)

#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
submission.to_csv("submission7.csv", index=False)

CV Score : Mean - 0.8451402 | Std - 0.03276752

The submitted score should now be 0.83732. As of 2019, this ranked 217th, which corresponds to the top 1.5%.

8. Summary

By filling in the missing values logically, generating new features such as titles, and rewriting the test data, we reached a score of 0.83732, equivalent to the top 1.5% of Kaggle Titanic. A wide variety of data processing appears along the way, and you can see why Titanic is treated as a tutorial for building data-analysis skills.

This concludes the Titanic series. I hope it helps those who have read this far.
