This article is day 17 of the Akatsuki Advent Calendar 2016.
Hello, I'm @chosty, a server engineer at Akatsuki Co., Ltd. Lately my refrain at the office has been that "work is bad for my health."
I'm personally interested in machine learning and data analysis, and I tinker with them from time to time. Until now I had been doing this kind of work in R, but I had been wanting to try Python, so in this article I run quickly through a Kaggle tutorial.
Kaggle is a data-analysis competition site; the official site is here. Companies and researchers provide datasets and themes (objectives), and participants compete on score. Score well and you can win prize money or even get recruiting offers, which is a nice deal. Last year the first competition hosted by a Japanese company, run by Recruit, drew some attention; at the time roughly 340,000 data analysts were registered on Kaggle.
Besides the competitions posted by companies and researchers, Kaggle also provides competitions meant for learning. This time I'll take on one of those: the Titanic survivor prediction problem. https://www.kaggle.com/c/titanic
The Titanic survivor prediction problem is exactly what it sounds like: given data about the people aboard the Titanic, predict whether each of them survived.
Download train.csv and test.csv from the site above and look at which features are provided.
Feature | Meaning |
---|---|
PassengerId | An ID assigned by Kaggle |
Survived | Whether the passenger survived (0 = no, 1 = yes) |
Pclass | Cabin class |
Name | Name |
Sex | Sex |
Age | Age |
SibSp | Number of siblings and spouses aboard with the passenger |
Parch | Number of parents and children aboard with the passenger |
Ticket | Ticket number |
Fare | Passenger fare |
Cabin | Cabin number |
Embarked | Port of embarkation |
Given these features, the flow will be to look for ones that seem useful for prediction and build a model from them.
When building a model, it helps to form hypotheses and check them. For example, one might expect that passengers in first-class cabins had a high survival rate, that passengers travelling with many family members did, or that men had a low survival rate.
On top of that, it would make sense to drop the features that don't help and add new ones of your own that might. This time, though, the goal is to get through it quickly, so I'll set all that aside.
Note that, naturally, the Survived column exists only in train.csv.
From here on, we'll load the data in Python and see what values it holds.
##Preparation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
train = pd.read_csv("train.csv")
train.head() #Output the first 5 lines of data
train.info() #Data type confirmation
train.isnull().sum() #Confirmation of missing values
train.describe() #Summary
Results like these come back. Jupyter really is convenient; the best.
Since the mean of Survived is 0.38, we can see that roughly 60% of passengers died. So even the trivial prediction that nobody survived would already score about that much accuracy.
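That baseline can be checked directly from the Survived column. A minimal sketch using a toy stand-in for train.csv (the values below are made up; on the real data the mean is about 0.38):

```python
import pandas as pd

# Toy stand-in for train.csv; on the real data Survived.mean() is about 0.38
train = pd.DataFrame({"Survived": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]})

survival_rate = train["Survived"].mean()   # fraction of passengers who survived
baseline_accuracy = 1 - survival_rate      # accuracy of always predicting "died"
print(survival_rate, baseline_accuracy)
```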
Next I want to fill in the missing Age values. There are various imputation methods, and it would be better to decide based on similarity to other records or to estimate Age from the other features, but for now I'll just fill with the mean. I'll recode Sex and Embarked as numeric dummies, and drop Cabin because it has too many missing values. Name could be put to use, but that's a hassle, so I'll drop it for now; Ticket is also in the way, so it goes too. Finally, I delete any records that still have missing values. If accuracy turns out poor, the information dropped here would be a good place to mine for new features.
##Data shaping
train.Age = train.Age.fillna(train.Age.mean())
train = train.replace("male",0).replace("female",1).replace("C",0).replace("Q",1).replace("S",2)
train = train.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1)
train = train.dropna()
train_data = train.values
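If the plain mean fill above proves too crude, one alternative worth trying (not used in this article) is to impute Age with the median inside each Pclass and Sex group; a sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up rows: two (Pclass, Sex) groups, each with one missing Age
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Age":    [40.0, np.nan, 50.0, 20.0, np.nan, 30.0],
})

# Fill each gap with the median Age of its own (Pclass, Sex) group
df["Age"] = df.groupby(["Pclass", "Sex"])["Age"].transform(lambda s: s.fillna(s.median()))
print(df["Age"].tolist())
```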
Looking at the correlation coefficients at this point gives the following.
Sex has the highest correlation with Survived, followed by Fare, then Parch, then Pclass. Fare is positively correlated while Pclass is negatively correlated, but since cabin classes are coded 1 = 1st, 2 = 2nd, 3 = 3rd, a smaller class number means a better (and pricier) cabin, so the two signs actually agree. I was worried the number of features was too small, but some of them do correlate, so I'll build a model from the shaped data as it is.
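The correlations come from pandas' corr(); a minimal sketch on made-up numbers (run train.corr() on the shaped frame to get the real figures):

```python
import pandas as pd

# Made-up shaped data; Sex here matches Survived exactly, so its correlation is 1.0
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex":      [0, 1, 1, 0, 1, 0],
    "Fare":     [7.0, 70.0, 55.0, 8.0, 60.0, 9.0],
    "Pclass":   [3, 1, 1, 3, 2, 3],
})

# Pearson correlation of every column with Survived, strongest first
corr = df.corr()["Survived"].sort_values(ascending=False)
print(corr)
```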
This time I'll build a model with Random Forest and use it to predict life or death for the people in the test data. Hamada-san's slides are an easy way to get a feel for Random Forest: http://www.slideshare.net/hamadakoichi/randomforest-web Reading the original paper is worthwhile too, but it's long, so unless you want to implement it yourself or dig deeper, you can skip it. For the impatient: build a lot of decision trees and take a majority vote.
##Model building
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(train_data[:, 1:], train_data[:, 0])
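Before touching the test set, it's worth estimating accuracy with cross-validation, something this walkthrough skips. A sketch with scikit-learn's cross_val_score on synthetic data (in the real notebook you would pass the feature and label columns of train_data instead):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 100 rows, 4 features, label decided by the first feature
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)   # 5-fold accuracy estimates
print(scores.mean())
```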
##Forecast
test = pd.read_csv("test.csv")
ids = test["PassengerId"].values
test.Age = test.Age.fillna(test.Age.mean())
test.Fare = test.Fare.fillna(test.Fare.mean())
test = test.replace("male",0).replace("female",1).replace("C",0).replace("Q",1).replace("S",2)
test = test.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1)
test_data = test.values
output = forest.predict(test_data)
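A fitted forest also reports how much each column contributed, via its feature_importances_ attribute, which is handy for deciding what to drop or engineer next. A sketch on synthetic data (only the third column actually matters here):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where only column 2 determines the label
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 2] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_   # non-negative, sums to 1.0
print(importances)
```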
##Export
import csv
output_file = open("output.csv", "w")
file = csv.writer(output_file)
file.writerow(["PassengerId","Survived"])
file.writerows(zip(ids, output.astype(np.int64)))
output_file.close()
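The csv module works fine, but the same submission file can be written with a single pandas call; a sketch using a few made-up ids and predictions (writing to an in-memory buffer instead of a file):

```python
import io
import pandas as pd

# Made-up ids and predictions; in the article these come from test.csv and the forest
ids = [892, 893, 894]
output = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": ids, "Survived": output})
buf = io.StringIO()
submission.to_csv(buf, index=False)   # same header and row layout Kaggle expects
print(buf.getvalue())
```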
Submit the exported CSV to Kaggle and you get a score back.
This came out to about 75% accuracy. That beats predicting that everyone died, but it isn't very high; it feels like the level any reasonable attempt would reach. Looking at the leaderboard, the top 14 people have predicted every passenger correctly. How do they do that...
Submitting is easy, and of course you can see your own score and compare it against everyone else's, which is nice. The forums are full of posts from people showing off their predictions, and browsing them is fun and educational. My goal this time was just to build a model quickly, predict, and submit, but I wish I had taken a little more time; plotting each feature, for instance, would probably have helped. So my plan is to keep finding time and push this a bit further. Trying the other challenges also looks worthwhile.
If I were to redo the data analysis in this tutorial more carefully, I think accuracy would improve by considering the following. I'll try to find the time.

- How to impute Age (the missing values)
- Linking family information using Name
- Tuning the Random Forest parameters
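For the "linking family information using Name" idea, a natural first step is pulling the title (Mr, Mrs, Miss, ...) out of each name with a regex; a sketch on a few names in the dataset's format:

```python
import pandas as pd

# Names in the Titanic format: "Surname, Title. Given names"
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
])

# Capture the token between the comma and the first period, e.g. "Mr"
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```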
The end