[PYTHON] Kaggle for the first time

This article is the day 17 entry of the Akatsuki Advent Calendar 2016.

Nice to meet you. My name is @chosty, and I am a server engineer at Akatsuki Co., Ltd. Lately my catchphrase at work has been "work is bad for my health."

Personally, I'm interested in machine learning and data analysis, and I dabble in them from time to time. Until now I have done this kind of work in R, but I wanted to try Python, so this article is a quick run through a Kaggle tutorial.

What is Kaggle

Kaggle is a data analysis competition site. The official website is here. Companies and researchers post data sets and tasks (objectives), and participants compete for the best score. Apparently a good score can earn you prize money or a recruiting offer. That's a nice deal. Last year, Recruit made news by hosting the first competition sponsored by a Japanese company. Around 340,000 data analysts were reportedly registered on Kaggle at that time.

Besides the competitions offered by companies and researchers, Kaggle also provides competitions for learning. This time I will try one of them: the Titanic survivor prediction problem. https://www.kaggle.com/c/titanic

Task

The Titanic survivor prediction problem asks you to predict, from the given data, whether each person on board the Titanic survived. That's all there is to it. Download train.csv and test.csv from the site above and look at what features are given.

Feature | Meaning
PassengerId | Just an ID assigned by Kaggle
Survived | Whether the passenger survived (0 = No, 1 = Yes)
Pclass | Room class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name | Name
Sex | Sex
Age | Age
SibSp | Number of siblings and spouses traveling together
Parch | Number of parents and children traveling together
Ticket | Ticket number
Fare | Passenger fare
Cabin | Cabin
Embarked | Port of embarkation

From these features, the flow is to look for ones that seem useful for prediction and build a model from them. When building a model, it helps to form hypotheses and test them. For example, you might guess that passengers in first-class rooms had a higher survival rate, or that passengers traveling with many family members, and men, had a lower one. On top of that, it would make sense to remove features that don't help and add ones of your own that seem to. However, the goal this time is to get through quickly, so I will set that aside. Incidentally, Survived is of course only in train.csv.
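Hypotheses like these can be checked with a group-wise mean. The following is only a minimal sketch: the rows are made up for illustration (they are not taken from train.csv), and only the column names match the real file.

```python
import pandas as pd

# Tiny made-up sample using the same column names as train.csv (not real data).
sample = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1],
    "Pclass":   [1, 3, 1, 3, 2, 2, 3, 1],
    "Sex":      ["female", "male", "female", "male",
                 "male", "female", "male", "male"],
})

# The mean of a 0/1 column within each group is that group's survival rate.
rate_by_class = sample.groupby("Pclass")["Survived"].mean()
rate_by_sex = sample.groupby("Sex")["Survived"].mean()

print(rate_by_class)
print(rate_by_sex)
```

With the real data, replacing `sample` with the DataFrame loaded from train.csv should give the actual rates per class and per sex.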

Look at the data and just try it

From here, we will load the data in Python and see what values it contains.

##Preparation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

train = pd.read_csv("train.csv")
train.head() #Output the first 5 lines of data
train.info() #Data type confirmation
train.isnull().sum() #Confirmation of missing values
train.describe() #Summary

This returns results like the following.

[Figures: head.png, info.png, is_null.png, describe.png (output of the four calls above)]

Jupyter is convenient. It's the best.

Since the mean of Survived is 0.38, we can see that about 60% of the passengers died. So even predicting that nobody survived would give a passable accuracy.
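The arithmetic behind that baseline can be written out as follows. This is a sketch with made-up labels chosen to have roughly the same survival rate. The variable names are my own.

```python
import pandas as pd

# Made-up labels with a survival rate close to the one reported above (not real data).
survived = pd.Series([1, 0, 0, 1, 0, 0, 1, 0])

survival_rate = survived.mean()

# Predicting 0 (did not survive) for everyone is correct whenever the true label is 0,
# so the baseline accuracy is simply 1 - survival_rate.
baseline_accuracy = (survived == 0).mean()

print(survival_rate, baseline_accuracy)
```

Any model worth submitting should beat this number.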

I want to fill in the missing age values. There are various ways to do this, and it would be better to decide based on similarity to other records or to estimate from other features, but for now I'll simply replace them with the mean. I will also convert sex and boarding location to numeric codes, and drop the cabin because it has too many missing values. I'd like to make use of the name, but that is a hassle, so I'll drop it for now. The ticket number gets in the way too, so I'll drop it. Finally, I delete any records that still have missing values. If the accuracy turns out poor, the dropped information would be a good place to look for useful new features.

##Data shaping
train.Age = train.Age.fillna(train.Age.mean())  # fill missing ages with the mean age
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].map({"C": 0, "Q": 1, "S": 2})
train = train.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1)  # the CSV column is "PassengerId"
train = train.dropna()
train_data = train.values
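The integer codes above (C = 0, Q = 1, S = 2) imply an ordering of the boarding ports that doesn't really exist. One alternative, which is not what this article uses, is one-hot encoding with `pd.get_dummies`. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows with the same column names as train.csv (not real data).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, None, 30.0],
})

# Fill the missing age with the mean of the known ages, as in the shaping code above.
sample["Age"] = sample["Age"].fillna(sample["Age"].mean())

# One 0/1 column per category instead of a single integer code.
encoded = pd.get_dummies(sample, columns=["Sex", "Embarked"], dtype=int)

print(encoded)
```

Tree-based models such as Random Forest tend to cope with integer codes reasonably well, so whether this helps here is something to test rather than assume.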

Looking at the correlation coefficients in this state gives the following.

[Figure: corr.png (correlation coefficients between the features)]

Gender has the highest correlation with Survived, followed by fare, number of parents and children, and then room class. Fare is positively correlated while room class is negatively correlated, which makes sense because room classes are coded 1 = 1st, 2 = 2nd, 3 = 3rd, so a smaller number means a higher class. I did worry that there are too few features, but some of them are correlated with the target, so I will build a model using the shaped data as it is.
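The article doesn't show the code that produced the correlation figure, so the following is only a guess at one way to get such numbers with `DataFrame.corr`. The rows are made up, not taken from train.csv.

```python
import pandas as pd

# Made-up numeric rows with train.csv column names (not real data).
sample = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Pclass":   [1, 3, 1, 3, 2, 3],
    "Sex":      [1, 0, 1, 0, 1, 0],  # 0 = male, 1 = female
    "Fare":     [80.0, 7.5, 60.0, 8.0, 20.0, 7.0],
})

# Pearson correlation of every feature with the target column.
corr_with_target = sample.corr()["Survived"].drop("Survived")

print(corr_with_target)
```

Passing the full matrix to `sns.heatmap(sample.corr(), annot=True)` is one way to draw it as a figure in Jupyter.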

This time I will build the model with Random Forest and predict survival for the people in the test data. I think Mr. Hamada's slides are an easy way to understand Random Forest. http://www.slideshare.net/hamadakoichi/randomforest-web Going to the original paper is fine too, but it is quite long, so unless you want to implement it yourself or know the details, you don't need to read it. If that sounds like too much trouble, it's enough to think of it as building a lot of decision trees and taking a majority vote.
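To make the "many decision trees plus a majority vote" idea concrete, here is a rough hand-rolled sketch using scikit-learn's `DecisionTreeClassifier`. It runs on synthetic data, the helper `predict_majority` is a name I made up, and it is a simplification rather than how `RandomForestClassifier` is implemented (scikit-learn averages the trees' predicted probabilities instead of counting hard votes).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic two-feature data (not Titanic data): the label is 1 when x0 + x1 > 1.
X = rng.random((200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

trees = []
for seed in range(25):  # an odd number of trees avoids tied votes
    # Bootstrap sample: draw rows with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X[idx], y[idx]))


def predict_majority(trees, X):
    """Return, for each row, the class predicted by most of the trees."""
    votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_rows)
    return (votes.mean(axis=0) > 0.5).astype(int)


pred = predict_majority(trees, X)
print("training accuracy:", (pred == y).mean())
```

Each tree sees a slightly different sample and different candidate features, so their individual mistakes tend to differ and partly cancel out in the vote.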

##Model building
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
# Column 0 is Survived (the label); the remaining columns are the features
forest = forest.fit(train_data[:, 1:], train_data[:, 0])

##Forecast
test = pd.read_csv("test.csv")
ids = test["PassengerId"].values
test.Age = test.Age.fillna(test.Age.mean())
test.Fare = test.Fare.fillna(test.Fare.mean())
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})
test["Embarked"] = test["Embarked"].map({"C": 0, "Q": 1, "S": 2})
test = test.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1)
test_data = test.values
output = forest.predict(test_data)

##Export
import csv
# newline="" prevents blank lines between rows on Windows
with open("output.csv", "w", newline="") as output_file:
    writer = csv.writer(output_file)
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(zip(ids, output.astype(np.int64)))
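Since pandas is already loaded, the same file could probably be written with `DataFrame.to_csv` instead of the csv module. In this sketch the `ids` and `output` arrays are placeholder values standing in for the real ones, and an in-memory buffer is used in place of output.csv so the result can be printed.

```python
import io

import numpy as np
import pandas as pd

# Placeholder values standing in for `ids` and `output` from the code above.
ids = np.array([892, 893, 894])
output = np.array([0.0, 1.0, 0.0])

submission = pd.DataFrame({"PassengerId": ids, "Survived": output.astype(int)})

# index=False keeps the DataFrame's row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)

print(buffer.getvalue())
```

To write an actual file, pass a path such as "output.csv" to `to_csv` instead of the buffer.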

Submit the exported CSV to Kaggle and you'll get a score, like this.

[Figure: result.png (submission score)]

This came out at about 75% accuracy. That beats predicting that everyone died, but it isn't very high, and it feels like the kind of score you would get even from a careless attempt. Looking at the leaderboard, the top 14 people predicted every case correctly. How do they do that ...
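It can be useful to estimate accuracy locally before submitting. The article does not do this, but one common approach is k-fold cross-validation with scikit-learn's `cross_val_score`. The sketch below uses synthetic arrays as stand-ins for the features and labels taken from `train_data`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the feature columns and the Survived column (not Titanic data).
X = rng.random((300, 4))
y = (X[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: train on four fifths of the rows, score on the remaining fifth.
scores = cross_val_score(forest, X, y, cv=5, scoring="accuracy")

print(scores.mean(), scores.std())
```

The mean of the fold scores gives a rough idea of what to expect on the leaderboard, although the two numbers will not match exactly.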

Impressions and next steps

Submitting is fairly easy, you can of course see your own score, and it's nice to be able to compare it with other people's scores. The forums are full of posts along the lines of "Look at my predictions!!!", which are fun and educational to read. This time my goal was to quickly build a model, make predictions, and submit them, but I wish I had taken a little more time. I probably should have at least plotted each feature. So my impression is that I'd like to keep finding time to work on this a little more. Trying other challenges seems like a good idea too.

After analyzing the data for this tutorial in a little more detail, I think accuracy would improve if the following points were considered. I will try them when I find the time.

- How to fill in age (missing values)
- Linking family information using names
- Tuning the Random Forest parameters
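As a starting point for the name-based idea, here is a small sketch that pulls a surname and a title out of names written in the "Surname, Title. Given names" format that train.csv uses. The feature name `same_surname_count` is hypothetical, and sharing a surname does not guarantee that two passengers were actually related.

```python
import pandas as pd

# Sample names in the format used by the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Braund, Mr. Lewis Richard",
])

# Everything before the first comma is the surname.
surname = names.str.split(",").str[0].str.strip()

# The title sits between the comma and the following period.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical feature: how many passengers share each surname.
same_surname_count = surname.map(surname.value_counts())

print(pd.DataFrame({
    "surname": surname,
    "title": title,
    "same_surname_count": same_surname_count,
}))
```

The title might also help with the age problem, for example by filling a missing age with the mean age of passengers who share the same title.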

The end
