[PYTHON] A beginner's walkthrough of the approach to reach the top 1.5% (0.83732) of Kaggle Titanic, part 2

Following on from last time, I will explain the approach for reaching the top 1.5% (0.83732) of Kaggle Titanic. The code used is titanic(0.83732)_2 on GitHub. This time, we raise the submitted score to 0.81339 in preparation for the 0.83732 of the next article. Before making predictions, we also visualize and analyze the data that was used last time.

1. Import the required libraries and load the CSV files

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
#Read CSV
train= pd.read_csv("train.csv")
test= pd.read_csv("test.csv")

#Data integration
dataset = pd.concat([train, test], ignore_index = True)

#For submission
PassengerId = test['PassengerId']

Let's look at how each feature relates to the survival rate.

2. Confirm the relationship between gender and survival rate

#Bar graph of gender and survival rate
sns.barplot(x="Sex", y="Survived", data=train, palette='Set3')

#Survival rate by gender
print("females: %.2f" %(train['Survived'][train['Sex'] == 'female'].value_counts(normalize = True)[1]))
print("males: %.2f" %(train['Survived'][train['Sex'] == 'male'].value_counts(normalize = True)[1]))

(Figure: bar graph of survival rate by gender)

females: 0.74
males: 0.19

You can see that women were rescued at a much higher rate than men. What about the survival rate for each ticket class?
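The same kind of rate can also be read off with a single groupby, because the mean of a 0/1 column is the share of ones. A minimal sketch on a made-up six-row frame (the data and the names `toy` and `rate_by_sex` are illustrative assumptions, not taken from train.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of survivors in each group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex.round(2))
```

On the real data, replacing `toy` with `train` should give the same numbers as the two print statements above, and the same pattern works for 'Pclass'.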

3. Confirm the relationship of survival rate for each ticket class

#Bar graph of ticket class and survival rate
sns.barplot(x='Pclass', y='Survived', data=train, palette='Set3')

#Survival rate by ticket class
print("Pclass = 1 : %.2f" %(train['Survived'][train['Pclass']==1].value_counts(normalize = True)[1]))
print("Pclass = 2 : %.2f" %(train['Survived'][train['Pclass']==2].value_counts(normalize = True)[1]))
print("Pclass = 3 : %.2f" %(train['Survived'][train['Pclass']==3].value_counts(normalize = True)[1]))

(Figure: bar graph of survival rate by ticket class)

Pclass = 1 : 0.63
Pclass = 2 : 0.47
Pclass = 3 : 0.24

The higher the ticket class, the higher the survival rate. What about the fare?

4. Confirm the relationship between fare and survival rate

#Survival rate comparison by fare
fare = sns.FacetGrid(train, hue="Survived",aspect=2)
fare.map(sns.kdeplot,'Fare',fill=True) #fill replaces the deprecated shade argument (seaborn 0.11 or later)
fare.set(xlim=(0, 200))
fare.add_legend()

(Figure: density plot of fare, split by survival)

As expected, you can see that the survival rate is low for passengers who paid low fares.

5. Confirm the relationship between age and survival rate

#Survival rate comparison by age
age = sns.FacetGrid(train, hue="Survived",aspect=2)
age.map(sns.kdeplot,'Age',fill=True) #fill replaces the deprecated shade argument (seaborn 0.11 or later)
age.set(xlim=(0, train['Age'].max()))
age.add_legend()

(Figure: density plot of age, split by survival)

Were children rescued first? You can see that the survival rate for passengers under 10 years old is high.

6. Confirm the relationship between cabin and survival rate

From here on, we check the data that was not used last time. First is the cabin information. Cabin (room number) seems to indicate a different room level depending on its first letter.

(Figure: room levels by cabin letter)

#Survival rate comparison by room level
dataset['Cabin'] = dataset['Cabin'].fillna('Unknown') #Substitute Unknown if room data is missing
dataset['Deck'] = dataset['Cabin'].str.get(0) #Get the first letter (0th letter) of Cabin (room number)
sns.barplot(x="Deck", y="Survived", data=dataset, palette='Set3')

(Figure: bar graph of survival rate by deck)

The survival rate varies somewhat by deck. As last time, we fill in the missing values and confirm that none remain, then add the 'Deck' (room level) information created here and make predictions.
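To see what the 'Deck' column contains, here is a minimal sketch of the same two steps on a handful of made-up Cabin values (the values and the names `cabin` and `deck` are illustrative assumptions). Missing cabins end up as the letter U, the first character of 'Unknown'.

```python
import numpy as np
import pandas as pd

# Hypothetical Cabin values; NaN marks a missing cabin
cabin = pd.Series(["C85", np.nan, "E46", "B57 B59", np.nan])

# Fill missing cabins with 'Unknown', then keep only the first character
deck = cabin.fillna("Unknown").str.get(0)
print(deck.tolist())
```

Note that a passenger with several cabins, such as "B57 B59", is assigned the deck of the first one listed.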

6.1 Add the room information and make predictions in the same way as last time

#Fill missing Age and Fare with the mean of each column, and missing Embarked (port of embarkation) with S (Southampton)
dataset["Age"] = dataset["Age"].fillna(dataset["Age"].mean())
dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].mean())
dataset["Embarked"] = dataset["Embarked"].fillna("S")

#Check the total number of missing data
dataset_null = dataset.fillna(np.nan)
dataset_null.isnull().sum()
#Extract variables to use
dataset3 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked', 'Deck']]

#Create a dummy variable
dataset_dummies = pd.get_dummies(dataset3)
dataset_dummies.head(3)
#Decompose data into train and test
#('Survived' exists in train, not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()].drop(columns="Survived")

#Separate train data into variables and correct answers
#(as_matrix() was removed in pandas 1.0, so to_numpy() is used instead)
X = train_set.iloc[:, 1:].to_numpy(dtype=float) #Variables from Pclass onward
y = train_set['Survived'].to_numpy() #Correct answer data

#Creating a predictive model
clf = RandomForestClassifier(random_state = 10, max_features='sqrt')
pipe = Pipeline([('classify', clf)])
param_test = {'classify__n_estimators':list(range(20, 30, 1)), #Try 20 to 29 in steps of 1
              'classify__max_depth':list(range(3, 10, 1))} #Try 3 to 9 in steps of 1
grid = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

#Prediction of test data
pred = grid.predict(test_set)

#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred.astype(np.int32)})
submission.to_csv("submission3.csv", index=False)

{'classify__max_depth': 8, 'classify__n_estimators': 22} 0.8327721661054994

The score submitted was 0.78947. By including the room level information, it has increased from last time's 0.78468.
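The preprocessing above relies on two pandas behaviours: pd.get_dummies only expands string columns, and the rows that came from test.csv have a missing 'Survived' after pd.concat. A minimal sketch on a made-up five-row frame (the data and the names `toy`, `train_part` and `test_part` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical combined frame: the last two rows play the role of test.csv,
# so their Survived value is missing after pd.concat
toy = pd.DataFrame({
    "Survived": [1.0, 0.0, 1.0, np.nan, np.nan],
    "Pclass": [1, 3, 2, 3, 1],
    "Sex": ["female", "male", "female", "male", "female"],
    "Deck": ["C", "U", "E", "U", "B"],
})

# String columns become one indicator column per category;
# numeric columns (Survived, Pclass) are left unchanged and stay first
dummies = pd.get_dummies(toy)
print(list(dummies.columns))

# Rows with a label are training data, rows without are test data
train_part = dummies[dummies["Survived"].notnull()]
test_part = dummies[dummies["Survived"].isnull()].drop(columns="Survived")
print(train_part.shape, test_part.shape)
```

Creating the dummy variables on the combined frame, before splitting, keeps the train and test columns identical even when a category appears in only one of the two files.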

7. Confirm the relationship between ticket and survival rate

Next, try the ticket information. But how should the tickets be grouped? They could be split by the number of characters, by the leading character, or by whether they contain letters, but creating too many categories reduces accuracy. Here we group each passenger by how many passengers share the same ticket number, which is what value_counts() computes in the code below (it counts occurrences of each ticket number, not the characters in it).

#Survival rate comparison by the number of passengers sharing the same ticket number
Ticket_Count = dict(dataset['Ticket'].value_counts()) #Count how many times each ticket number appears
dataset['TicketGroup'] = dataset['Ticket'].apply(lambda x:Ticket_Count[x]) #Assign that count to each passenger
sns.barplot(x='TicketGroup', y='Survived', data=dataset, palette='Set3')

(Figure: bar graph of survival rate by TicketGroup)

The differences between groups are clearer than with the earlier Cabin (room level) split.
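To make clear what 'TicketGroup' holds, here is a minimal sketch on made-up ticket numbers (the values and the names `ticket` and `ticket_group` are illustrative assumptions). It uses Series.map with the value counts, which is an alternative spelling of the dict lookup in the code above.

```python
import pandas as pd

# Hypothetical ticket numbers; "113803" is shared by two passengers
ticket = pd.Series(["113803", "A/5 21171", "113803", "PC 17599", "347082"])

# value_counts() gives the occurrences of each ticket number;
# map() writes that count back onto every row
ticket_group = ticket.map(ticket.value_counts())
print(ticket_group.tolist())
```

A value of 1 therefore means a passenger travelling on a ticket of their own, and larger values mean a group sharing one ticket.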

7.1 Add the ticket information and make predictions

#Extract variables to use
dataset4 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked', 'Deck', 'TicketGroup']]

#Create a dummy variable
dataset_dummies = pd.get_dummies(dataset4)
dataset_dummies.head(4)
#Decompose data into train and test
#('Survived' exists in train, not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()].drop(columns="Survived")

#Separate train data into variables and correct answers
#(as_matrix() was removed in pandas 1.0, so to_numpy() is used instead)
X = train_set.iloc[:, 1:].to_numpy(dtype=float) #Variables from Pclass onward
y = train_set['Survived'].to_numpy() #Correct answer data

#Creating a predictive model
clf = RandomForestClassifier(random_state = 10, max_features='sqrt')
pipe = Pipeline([('classify', clf)])
param_test = {'classify__n_estimators':list(range(20, 30, 1)), #Try 20 to 29 in steps of 1
              'classify__max_depth':list(range(3, 10, 1))} #Try 3 to 9 in steps of 1
grid = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_, sep="\n")

#Prediction of test data
pred = grid.predict(test_set)

#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred.astype(np.int32)})
submission.to_csv("submission4.csv", index=False)

{'classify__max_depth': 8, 'classify__n_estimators': 23}
0.8406285072951739

The training score went up, but the score submitted to Kaggle fell to 0.77990. Realistically, the correlation between the raw ticket count and the survival rate seems weak in the first place. However, since we went to the trouble of creating this feature, let's try consolidating its values into groups with a high and a low survival rate.
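One thing to keep in mind when reading best_score_ is how many candidates the grid search compares. The reported value is the best cross-validation score among all of them, so it can be somewhat optimistic compared with the Kaggle score. A minimal sketch that counts the candidates for the same parameter ranges (the names `n_candidates` and `n_fits` are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same ranges as the param_test dictionary used above
param_test = {
    "classify__n_estimators": list(range(20, 30, 1)),  # 20, 21, ..., 29
    "classify__max_depth": list(range(3, 10, 1)),      # 3, 4, ..., 9
}

# Number of parameter combinations the grid search evaluates
n_candidates = len(ParameterGrid(param_test))

# cv=10 folds per candidate, not counting the final refit on all data
n_fits = n_candidates * 10
print(n_candidates, n_fits)
```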

7.2 Group the ticket information and make predictions

#Divide the ticket counts into a group with a high survival rate and groups with a low survival rate
#Substitute 2 if high, 1 if low, and 0 if the count is above 8
def Ticket_Label(s):
    if 2 <= s <= 4: #Group with a high survival rate
        return 2
    elif 4 < s <= 8 or s == 1: #Group with a low survival rate
        return 1
    elif s > 8:
        return 0

dataset['TicketGroup'] = dataset['TicketGroup'].apply(Ticket_Label)
sns.barplot(x='TicketGroup', y='Survived', data=dataset, palette='Set3')

(Figure: bar graph of survival rate by grouped TicketGroup)

The groups now look cleanly separated.
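To check which ticket counts fall into which label, the mapping can be listed explicitly. A minimal sketch, assuming positive integer counts; `ticket_label` is a hypothetical standalone copy of the thresholds above, not the function used in the article:

```python
def ticket_label(count):
    """Map a ticket count to a label (hypothetical copy of the thresholds above)."""
    if 2 <= count <= 4:
        return 2  # high survival rate
    if count == 1 or 4 < count <= 8:
        return 1  # low survival rate
    return 0  # counts above 8

# List the label for each count from 1 to 11
labels = {count: ticket_label(count) for count in range(1, 12)}
print(labels)
```

Passengers travelling alone (count 1) land in the same label as groups of 5 to 8, which is the grouping suggested by the earlier bar graph.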

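#Extract variables to use and recreate the dummy variables
#(without this step, dataset_dummies would still hold the TicketGroup values from 7.1)
dataset5 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked', 'Deck', 'TicketGroup']]
dataset_dummies = pd.get_dummies(dataset5)

#Decompose data into train and test
#('Survived' exists in train, not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()].drop(columns="Survived")

#Separate train data into variables and correct answers
#(as_matrix() was removed in pandas 1.0, so to_numpy() is used instead)
X = train_set.iloc[:, 1:].to_numpy(dtype=float) #Variables from Pclass onward
y = train_set['Survived'].to_numpy() #Correct answer data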

#Creating a predictive model
clf = RandomForestClassifier(random_state = 10, max_features='sqrt')
pipe = Pipeline([('classify', clf)])
param_test = {'classify__n_estimators':list(range(20, 30, 1)), #Try 20 to 29 in steps of 1
              'classify__max_depth':list(range(3, 10, 1))} #Try 3 to 9 in steps of 1
grid = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_, sep="\n")

#Prediction of test data
pred = grid.predict(test_set)

#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred.astype(np.int32)})
submission.to_csv("submission5.csv", index=False)

{'classify__max_depth': 7, 'classify__n_estimators': 23}
0.8417508417508418

The score submitted to Kaggle has improved significantly to 0.81339.

8. Summary

This time, by adding the room level information and the ticket information consolidated into groups with a high and a low survival rate, the submitted score improved from the previous 0.78468 to 0.81339. Next time, I will finally explain the approach to the submission score of 0.83732, which corresponds to the top 1.5%.
