[PYTHON] I tried to predict and submit Titanic survivors with Kaggle

The code, its output, and the explanations are in the following notebook.

jupyter notebook https://github.com/spica831/kaggle_titanic/blob/master/titanic.ipynb

Background

I once joined a Kaggle hackathon on estimating house prices, but I couldn't finish in time because I didn't know enough about Python or about how to approach the analysis. As a rematch, I predicted the survivors of the Titanic. https://www.kaggle.com/c/titanic

Predict home selling prices with Kaggle

House Prices: Advanced Regression Techniques https://www.kaggle.com/c/house-prices-advanced-regression-techniques

To state the conclusion first: the accuracy of the Titanic prediction was 0.7512.

Method

#Import required packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
#Read value
df = pd.read_csv("./input/train.csv")
df

Display the data. [Screenshot: the loaded training data]

Preprocessing

String replacement

Several columns, such as Name and Sex, contain strings, which cannot be used for analysis as they are. Sex and the port of embarkation (Embarked) take only a few distinct values, so I replaced them with numbers such as 0, 1, and 2.

Age also has missing values (NaN), so I replaced all of them with 0.

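# Encode the port of embarkation and sex as integers
df.Embarked = df.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
#df.Cabin = df.Cabin.replace('NaN', 0)
df.Sex = df.Sex.replace(['male', 'female'], [0, 1])
# Missing ages are real NaN values, not the string 'NaN', so use fillna
df.Age = df.Age.fillna(0)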
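
One thing to watch for with missing values: pandas reads a missing Age as a floating-point NaN, not as the string 'NaN', so the two need different handling. The following is a minimal sketch on a made-up three-row frame (the values are invented for illustration, not taken from train.csv), and it uses `map` as one possible alternative for the string-to-number step.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a few rows of train.csv (values are invented).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, np.nan, 35.0],
})

# Map the small set of category strings to integer codes.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"C": 0, "S": 1, "Q": 2})

# replace('NaN', 0) looks for the literal string 'NaN' and leaves real NaN alone.
age_replaced = toy["Age"].replace("NaN", 0)
# fillna(0) is the call that actually fills missing values.
age_filled = toy["Age"].fillna(0)

print(age_replaced.isna().sum())  # number of missing values still present
print(age_filled.isna().sum())    # number of missing values after fillna
```

Comparing the two printed counts on the real data is a quick way to confirm that the missing ages were actually filled.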

Delete column

Columns that are hard to handle, such as Name, Ticket, and Cabin, were dropped entirely. (This hurt.)

df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

Result of preprocessing

All remaining columns could be converted to numerical values.

df

[Screenshot: DataFrame after preprocessing]



Analysis

Correlation coefficient

First, calculate the correlation coefficients.

For background on the correlation coefficient, see the following Wikipedia article: https://ja.wikipedia.org/wiki/%E7%9B%B8%E9%96%A2%E4%BF%82%E6%95%B0

Correlation coefficient value

#Calculate the correlation coefficient
corrmat = df.corr()
corrmat

[Screenshot: correlation matrix]
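
For reference, `DataFrame.corr()` computes the Pearson correlation coefficient by default, that is, the covariance of two columns divided by the product of their standard deviations. Here is a small sketch on invented numbers that checks a hand computation against pandas; the helper name `pearson` is hypothetical.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "Survived": [0, 1, 1, 1, 0],
})

def pearson(x, y):
    """Pearson r: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_manual = pearson(toy["Fare"], toy["Survived"])
r_pandas = toy.corr().loc["Fare", "Survived"]
print(r_manual, r_pandas)
```

The two printed values should agree up to floating-point rounding.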

Correlation coefficient heat map

# Draw the correlation matrix as a heat map on a 12x9 inch figure
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True, ax=ax)

[Figure: correlation coefficient heat map]

The heat map shows that some of the columns are correlated.
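
To make the heat map easier to act on, one option is to pull out just the `Survived` column of the correlation matrix and sort it by absolute value. A sketch, using a small invented frame in place of the preprocessed `df` (the numbers are made up, so the ranking here says nothing about the real data):

```python
import pandas as pd

# Invented rows; in this toy frame Sex happens to equal Survived exactly.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3],
    "Sex":      [0, 1, 1, 1, 0, 0],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})

corrmat = toy.corr()
# Correlation of every feature with the target, without the target itself.
target_corr = corrmat["Survived"].drop("Survived")
# Order by strength of the relationship, regardless of sign.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```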

Learning

Preparation before learning

Split the data into the answers (train_labels, the Survived column here) and the features (train_features, every column other than Survived).

train_labels = df['Survived'].values
# Drop the label column without mutating df, then convert to an integer array
train_features = df.drop('Survived', axis=1).values.astype(np.int64)

Learn with support vector machine

Finally, I built a two-class classifier with a linear SVM from scikit-learn. (I did not set any detailed parameters, but it would have been better to try L1 and L2 regularization.)

from sklearn.svm import LinearSVC
# Defaults include: LinearSVC(C=1.0, intercept_scaling=1, multi_class="ovr", loss="squared_hinge", penalty="l2")
# Importing the class directly avoids rebinding the name of the sklearn.svm module
svm = LinearSVC()
svm.fit(train_features, train_labels)
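
Before submitting, it can help to estimate accuracy locally, since Kaggle only reports the score after upload. One common way is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for `train_features` and `train_labels`; the fold count and `max_iter` are arbitrary choices, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for (train_features, train_labels); 891 rows mirrors train.csv.
X, y = make_classification(
    n_samples=891, n_features=7, n_informative=4, random_state=0
)

clf = LinearSVC(max_iter=10000)
# 5-fold cross-validation: train on 4/5 of the rows, score on the held-out 1/5.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

The mean gives a rough idea of the expected accuracy, and the standard deviation shows how much it moves between folds.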

Test

Load the test data to run the predictions on.

df_test = pd.read_csv("./input/test.csv")

Preparation

# Delete unnecessary columns
df_test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

# Replace strings with numerical values
df_test.Embarked = df_test.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
df_test.Sex = df_test.Sex.replace(['male', 'female'], [0, 1])
# Missing ages are real NaN values, not the string 'NaN', so use fillna
df_test.Age = df_test.Age.fillna(0)

# Convert to an integer array
test_features = df_test.values.astype(np.int64)
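
Since the training and test frames go through the same steps, one way to keep them in sync is a single helper applied to both. This is only a sketch: the function name `preprocess` is hypothetical and not from the notebook, and filling every remaining NaN with 0 is a simplifying assumption (it also covers missing values in columns other than Age). For training data, the Survived column would need to be separated out first.

```python
import numpy as np
import pandas as pd

def preprocess(frame):
    """Return an integer feature array from a raw Titanic-style frame."""
    out = frame.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
    out["Embarked"] = out["Embarked"].map({"C": 0, "S": 1, "Q": 2})
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out = out.fillna(0)  # assumption: treat every missing value as 0
    return out.values.astype(np.int64)

# Invented two-row example with the same columns as test.csv.
toy = pd.DataFrame({
    "PassengerId": [892, 893], "Pclass": [3, 3],
    "Name": ["A", "B"], "Sex": ["male", "female"],
    "Age": [34.5, np.nan], "SibSp": [0, 1], "Parch": [0, 0],
    "Ticket": ["x", "y"], "Fare": [7.83, 7.0],
    "Cabin": [np.nan, np.nan], "Embarked": ["Q", "S"],
})
features = preprocess(toy)
print(features)
```

Note that the cast to `np.int64` truncates fractional values such as ages and fares.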

Classify with SVM.

y_test_pred = svm.predict(test_features)

Finally

Convert the predictions to a format that can be submitted to Kaggle.

# Reload the test data and add a column with the SVM predictions
df_out = pd.read_csv("./input/test.csv")
df_out["Survived"] = y_test_pred

# Write only the two required columns to the output directory
df_out[["PassengerId", "Survived"]].to_csv("./output/submission.csv", index=False)
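
A quick sanity check on the output can save a wasted submission. The sketch below builds a tiny invented frame, writes it the same way, and verifies the header and the 0/1 labels; it writes to an in-memory buffer rather than `./output/` so that it runs anywhere.

```python
import io
import pandas as pd

# Invented stand-ins for df_out and y_test_pred.
df_out = pd.DataFrame({"PassengerId": [892, 893, 894], "Name": ["A", "B", "C"]})
df_out["Survived"] = [0, 1, 0]

buffer = io.StringIO()
df_out[["PassengerId", "Survived"]].to_csv(buffer, index=False)

# Read it back and check the layout: two columns, binary labels.
check = pd.read_csv(io.StringIO(buffer.getvalue()))
assert list(check.columns) == ["PassengerId", "Survived"]
assert set(check["Survived"]) <= {0, 1}
header = buffer.getvalue().splitlines()[0]
print(header)
```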

Result

As mentioned at the beginning, the accuracy of the Titanic prediction was 0.7512. Still, I was satisfied because I managed to build the model and submit within just a few hours.

Things to improve

Looking back, there were many points that could have been done better.

Preprocessing

  1. Age should have been split into two groups: rows where it is NaN and rows where it has an actual value.
  2. If a histogram shows a distribution piled up on the left side, taking the logarithm should bring it closer to a Gaussian distribution. (Dr. Andrew also said this on Coursera.)
  3. The values were not whitened.
  4. I should have tried harder to convert the many discarded string columns into numbers. In particular, I did not want to throw away Cabin and Ticket.
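
These preprocessing ideas could be sketched roughly as follows. All rows are invented. Choosing Fare for the log transform, using the median as the fill value, standardizing instead of fully whitening, and the column names `AgeMissing`, `LogFare`, `LogFareStd`, and `Deck` are assumptions made for illustration, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 512.33],
    "Cabin": [np.nan, "C85", np.nan, "B51 B53"],
})

# 1. Keep the "Age is missing" information as its own 0/1 column,
#    then fill the gap with the median instead of 0.
toy["AgeMissing"] = toy["Age"].isna().astype(int)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# 2. Fare piles up near zero with a long right tail; log1p compresses the tail.
toy["LogFare"] = np.log1p(toy["Fare"])

# 3. Standardize to zero mean and unit variance (a simpler relative of whitening).
mean = toy["LogFare"].mean()
std = toy["LogFare"].std(ddof=0)
toy["LogFareStd"] = (toy["LogFare"] - mean) / std

# 4. Rather than dropping Cabin, keep its first letter; "U" marks unknown.
toy["Deck"] = toy["Cabin"].fillna("U").str[0]

print(toy)
```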

Analysis

  1. I only looked at the correlation coefficients.

Classifier

  1. Regularization was not configured.
  2. Non-linear SVMs and other classifiers were not examined.
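
As a rough idea of what examining these points might look like, the sketch below tries an RBF-kernel SVM with a small grid over the regularization parameter `C`. It runs on synthetic data from `make_classification` rather than the Titanic features, and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training features and labels.
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)

# Scale first: RBF kernels are sensitive to feature ranges.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# C controls regularization strength (smaller C means stronger regularization).
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": ["scale", 0.1]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Swapping `SVC` for another scikit-learn classifier in the same pipeline is a simple way to compare alternatives under identical cross-validation.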

Summary

I was able to produce output in a short time, so I achieved my goal. However, I came away keenly aware that I lacked the time and experience to apply what I had learned so far and arrive at the best approach within a short time.
