jupyter notebook https://github.com/spica831/kaggle_titanic/blob/master/titanic.ipynb
I participated in a Kaggle hackathon to estimate house prices, but I couldn't solve it in time because I lacked knowledge of Python and of how to analyze the data. So, as a rematch, I predicted survival on the Titanic. https://www.kaggle.com/c/titanic
House Prices: Advanced Regression Techniques https://www.kaggle.com/c/house-prices-advanced-regression-techniques
To state the conclusion first, the accuracy of the Titanic prediction was 0.7512.
#Import required packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
#Load the training data
df = pd.read_csv("./input/train.csv")
df
Display the data.
Apparently, strings are used for names, genders, and so on. They cannot be used for analysis as-is, but columns such as gender (Sex) and port of embarkation (Embarked) have only a few distinct values, so each value is replaced with a number such as 0, 1, or 2.
In addition, age (Age) has missing values (NaN), so these were all replaced with 0.
df.Embarked = df.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
df.Embarked = df.Embarked.fillna(0)  #Embarked also has a few missing values
df.Sex = df.Sex.replace(['male', 'female'], [0, 1])
df.Age = df.Age.fillna(0)  #Missing values are float NaN, so replace('NaN', 0) would not match them
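As an aside, the reason a plain `replace('NaN', 0)` is not enough is that pandas stores missing values as the float NaN, not as the string 'NaN'. A minimal demonstration:

```python
import numpy as np
import pandas as pd

s = pd.Series([22.0, np.nan, 30.0])

# replace() matches values exactly, so the string 'NaN' never matches a float NaN
print(s.replace('NaN', 0).isna().sum())  # 1 -- the missing value is still there

# fillna() targets actual missing values
print(s.fillna(0).isna().sum())  # 0
```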
Columns that are difficult to handle, such as Name, Ticket, and Cabin, were simply dropped. (painful)
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
All could be replaced with numerical values.
df
First, calculate the correlation coefficients.
For background on the correlation coefficient, see the Wikipedia article: https://ja.wikipedia.org/wiki/%E7%9B%B8%E9%96%A2%E4%BF%82%E6%95%B0
Correlation coefficient value
#Calculate the correlation coefficient
corrmat = df.corr()
corrmat
Correlation coefficient heat map
f, ax = plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=.8, square=True)
The heat map shows that some features are clearly correlated with survival (Survived).
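To see which features matter most, the features can also be ranked by the strength of their correlation with Survived. A small sketch, using a made-up frame in place of the preprocessed training data (the column names mirror the dataset, but the values are illustrative):

```python
import pandas as pd

# Stand-in for the preprocessed training data; values are made up
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Sex':      [0, 1, 1, 0, 1],
    'Pclass':   [3, 1, 2, 3, 1],
})

# Correlation of every feature with the target, strongest first
target_corr = df.corr()['Survived'].drop('Survived')
print(target_corr.abs().sort_values(ascending=False))
```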
Split the data into labels (train_labels: the Survived column) and features (train_features: everything except Survived).
train_labels = df['Survived'].values
train_features = df
train_features.drop('Survived', axis=1, inplace=True)
train_features = train_features.values.astype(np.int64)
Finally, I built a two-class classifier with a linear SVM in scikit-learn. (No parameters were tuned in particular, although adjusting the L1/L2 regularization would likely have helped.)
from sklearn import svm
#Example with explicit parameters: svm.LinearSVC(C=1.0, penalty='l2', loss='squared_hinge', dual=True)
svm = svm.LinearSVC()
svm.fit(train_features, train_labels)
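On the regularization point, one way to compare settings is cross-validation over a few values of C. A sketch with synthetic data standing in for train_features/train_labels (the generated dataset and the C grid are illustrative assumptions, not the values used for the submission):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the 7-column Titanic feature matrix
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# Compare regularization strengths by 5-fold cross-validated accuracy
for C in (0.01, 0.1, 1.0):
    clf = LinearSVC(C=C, penalty='l2', dual=False, max_iter=10000)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"C={C}: mean accuracy {scores.mean():.3f}")
```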
Next, load the test data to be predicted.
df_test = pd.read_csv("./input/test.csv")
#Delete unnecessary columns
df_test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
#Numerical replacement of strings
df_test.Embarked = df_test.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
df_test.Sex = df_test.Sex.replace(['male', 'female'], [0, 1])
df_test.Age = df_test.Age.fillna(0)  #Fill missing ages, as in training
df_test.Fare = df_test.Fare.fillna(0)  #The test set has one missing Fare value
#Convert to array value
test_features = df_test.values.astype(np.int64)
y_test_pred = svm.predict(test_features)
#Reload the test data and add the column predicted by the SVM
df_out = pd.read_csv("./input/test.csv")
df_out["Survived"] = y_test_pred
#Output to the output directory
df_out[["PassengerId","Survived"]].to_csv("./output/submission.csv",index=False)
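Before submitting, it can help to sanity-check the output format (two columns, no index). A small sketch using an in-memory buffer; the two example rows are made up:

```python
import io

import pandas as pd

# Stand-in rows with made-up values; the real file has one row per test passenger
df_out = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})

buf = io.StringIO()
df_out[["PassengerId", "Survived"]].to_csv(buf, index=False)

# Read it back and confirm the expected columns survived the round trip
check = pd.read_csv(io.StringIO(buf.getvalue()))
print(list(check.columns))  # ['PassengerId', 'Survived']
```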
As mentioned at the beginning, the accuracy of the Titanic prediction was 0.7512. Still, I was satisfied to have built and submitted something within just a few hours.
There were many points along the way that could be improved.
I was able to produce output quickly, so I achieved my goal. However, I realized keenly that I still lack the time and experience needed to apply what I have learned and arrive at an optimal approach in a short amount of time.