[PYTHON] I tried learning with Kaggle's Titanic (kaggle②)

Introduction

This is the story of my first time participating in a Kaggle competition. In the previous article, "First Kaggle", I mainly covered:

  • How to sign up for Kaggle
  • How to join a competition
  • What happens between joining and writing code
  • How to submit the results

This time, I would like to move on to actually training a model in the "Titanic competition". Can I beat the sample submission's accuracy of "76%"?

table of contents

  1. Prerequisite knowledge
  2. Learning flow
  3. Organize data
     3.1. Extract the required items
     3.2. Handle missing values
     3.3. Digitize labels
     3.4. Standardize numbers
  4. Build the model
  5. Learn with training data
  6. Predict results with test data
  7. Submitted result
  8. Summary
  History

1. Prerequisite knowledge

First, a note on how much machine learning the person writing this knows. About half a year ago (April 2019), I became interested in machine learning and studied mainly from the following books.

  • [Python Machine Learning Programming: Theory and Practice by Expert Data Scientists](https://www.amazon.co.jp/Python-%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%9F%E3%83%B3%E3%82%B0-%E9%81%94%E4%BA%BA%E3%83%87%E3%83%BC%E3%82%BF%E3%82%B5%E3%82%A4%E3%82%A8%E3%83%B3%E3%83%86%E3%82%A3%E3%82%B9%E3%83%88%E3%81%AB%E3%82%88%E3%82%8B%E7%90%86%E8%AB%96%E3%81%A8%E5%AE%9F%E8%B7%B5-impress-gear/dp/4295003379/ref=dp_ob_title_bk)
  • [Detailed Explanation: Deep Learning ~Time Series Data Processing with TensorFlow and Keras~](https://www.amazon.co.jp/%E8%A9%B3%E8%A7%A3-%E3%83%87%E3%82%A3%E3%83%BC%E3%83%97%E3%83%A9%E3%83%BC%E3%83%8B%E3%83%B3%E3%82%B0-TensorFlow%E3%83%BBKeras%E3%81%AB%E3%82%88%E3%82%8B%E6%99%82%E7%B3%BB%E5%88%97%E3%83%87%E3%83%BC%E3%82%BF%E5%87%A6%E7%90%86-%E5%B7%A3%E7%B1%A0-%E6%82%A0%E8%BC%94/dp/4839962510/ref=sr_1_2?__mk_ja_JP=%E3%82%AB%E3%82%BF%E3%82%AB%E3%83%8A&keywords=%E8%A9%B3%E8%A7%A3+%E3%83%87%E3%82%A3%E3%83%BC%E3%83%97%E3%83%A9%E3%83%BC%E3%83%8B%E3%83%B3%E3%82%B0+%7ETensorFlow%E3%83%BBKeras%E3%81%AB%E3%82%88%E3%82%8B%E6%99%82%E7%B3%BB%E5%88%97%E3%83%87%E3%83%BC%E3%82%BF%E5%87%A6%E7%90%86&qid=1575853564&s=books&sr=1-2)

At this point, I am still not entirely sure what "scikit-learn", "tensorflow", and "keras" each are.

My rough understanding is as follows.

  • scikit-learn has few parameters to set and makes training easy (it also runs fast)
  • keras is a machine learning library that runs on top of tensorflow. It can be configured in more detail than scikit-learn. (Keras apparently also works with backends other than tensorflow, but I don't know the details. Theano?)
  • tensorflow is also a machine learning library, but it is closer to a container of building blocks. It gives you "constants", "variables", and "placeholders" that are convenient for machine learning, but if you use only tensorflow, you have to write the "activation functions" and "evaluation functions" yourself.

At my level, I expect I can manage to write training code with scikit-learn or keras.

2. Learning flow

The flow of machine learning is as follows.

  1. Organize data
  2. Build a model
  3. Learn with training data
  4. Predict results with test data

3. Organize data

Check the data and clean it up.

20191209_01.png

First, since this is a new Notebook separate from the previous one, click "New Notebook" and, as before, select "Python" as the language and "Notebook" as the type.

20191209_02.png

Check train.csv. Since you can write code, you could print the data with the pandas head() method, but the file can also be downloaded, so let's download it. Click train.csv and the first 100 rows are shown on the screen. You can download the file with the download button.

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S
4 | 1 | 1 | Futrelle, | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 0 | 3 | Allen, | male | 35 | 0 | 0 | 373450 | 8.05 | | S
6 | 0 | 3 | Moran, | male | | 0 | 0 | 330877 | 8.4583 | | Q
7 | 0 | 1 | McCarthy, | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S
8 | 0 | 3 | Palsson, | male | 2 | 3 | 1 | 349909 | 21.075 | | S
9 | 1 | 3 | Johnson, | female | 27 | 0 | 2 | 347742 | 11.1333 | | S
10 | 1 | 2 | Nasser, | female | 14 | 1 | 0 | 237736 | 30.0708 | | C

Check the CSV in Excel or a similar tool. Some columns are not self-explanatory, but they are described on the competition's Data page. As an aside, as explained in the Overview, the sample "gender_submission.csv" assumes that "only women survived". Indeed, the "Sex" values in "test.csv" line up with the "Survived" values in "gender_submission.csv". Even so, it reaches an accuracy of "76%", which makes it a formidable baseline.
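The "only women survived" rule is simple enough to write in a few lines. The following is a minimal sketch, not the code Kaggle used to produce gender_submission.csv; the small DataFrame is a made-up stand-in for test.csv, and the name `baseline` is only for illustration.

```python
import pandas

# Made-up stand-in for test.csv (only the columns this rule needs).
df = pandas.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Sex': ['male', 'female', 'male', 'female'],
})

# Rule: predict Survived = 1 for women and 0 for men.
baseline = pandas.DataFrame({
    'PassengerId': df['PassengerId'],
    'Survived': (df['Sex'] == 'female').astype(int),
})
print(baseline)
```

Any model trained later has to beat this one-line rule to be worth the extra effort.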

20191209_03.png

Data Dictionary

Variable | Definition | Key
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
Age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

Consider which columns to use for training. "Survived" is what the competition asks us to predict, so it serves as the answer (label) during training. Women and children were likely given priority on the lifeboats, so I use "Sex" and "Age". Wealth may also have played a role, so let's use "Pclass" (ticket class) and "Fare" (passenger fare) as well. "Name", "Ticket" and "Embarked" do not seem to be related, so they are excluded. The tricky ones are "sibsp" and "parch". Aggregating them in Excel or a similar tool gives the tables below. They do seem to be related to survival, but this time I excluded them for the sake of simplicity.

sibsp (number of siblings / spouses on the Titanic)

value of sibsp | Survived | Total | Survival rate
0 | 210 | 608 | 35%
1 | 112 | 209 | 54%
2 | 13 | 28 | 46%
3 | 4 | 16 | 25%
4 | 3 | 18 | 17%
5 | 0 | 5 | 0%
8 | 0 | 7 | 0%

parch (number of parents / children on the Titanic)

value of parch | Survived | Total | Survival rate
0 | 233 | 678 | 34%
1 | 65 | 118 | 55%
2 | 40 | 80 | 50%
3 | 3 | 5 | 60%
4 | 0 | 4 | 0%
5 | 1 | 5 | 20%
6 | 0 | 1 | 0%
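Aggregations like these do not need Excel; pandas can produce them with a groupby. The sketch below uses a made-up DataFrame in place of train.csv, and the column names `survived`, `total`, and `survival_rate` are my own labels. The survival rate here is survivors divided by all passengers in the group.

```python
import pandas

# Made-up rows standing in for train.csv; only the two columns we need.
df = pandas.DataFrame({
    'SibSp':    [0, 0, 0, 0, 1, 1, 2],
    'Survived': [1, 0, 0, 0, 1, 0, 0],
})

# Per SibSp value: number of survivors, number of passengers, and their ratio.
summary = df.groupby('SibSp')['Survived'].agg(['sum', 'count', 'mean'])
summary.columns = ['survived', 'total', 'survival_rate']
print(summary)
```

Because Survived is 0 or 1, its mean within a group is exactly the group's survival rate.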

3.1. Extract the required items

Delete the sample code and write the following code. Load train.csv and extract only the required columns ('Survived', 'Pclass', 'Sex', 'Age', 'Fare').

import numpy
import pandas

##############################
# Data preprocessing 1
# Extract the required columns
##############################

# Load train.csv
df_train = pandas.read_csv('/kaggle/input/titanic/train.csv')

df_train = df_train.loc[:, ['Survived', 'Pclass', 'Sex', 'Age', 'Fare']]
df_train.head()
index Survived Pclass Sex Age Fare
0 0 3 male 22 7.25
1 1 1 female 38 71.2833
2 1 3 female 26 7.925
3 1 1 female 35 53.1
4 0 3 male 35 8.05

I was able to extract only the required items.

3.2. Handle missing values

Check for missing values.

##############################
# Data preprocessing 2
# Handle missing values
##############################

# Check for missing values
df_train.isnull().sum()
Column count
Survived 0
Pclass 0
Sex 0
Age 177
Fare 0

Many rows have no age. Ideally the missing values would be filled in, but this time I simply delete those rows.

# Delete rows with null age
df_train = df_train.dropna(subset=['Age']).reset_index(drop=True)
len(df_train)
count
714

Rows with a null Age have been removed.
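For comparison, here is a small sketch of both options: dropping rows, as done here, and filling the gaps instead. Filling with the median is my own assumption for illustration and is not used in this article; the ages are made up.

```python
import pandas

# Made-up ages; None becomes NaN, like the missing values in train.csv.
df = pandas.DataFrame({'Age': [22.0, 38.0, None, 35.0, None]})

# Option used in this article: drop the rows whose Age is missing.
dropped = df.dropna(subset=['Age']).reset_index(drop=True)

# Alternative (an assumption, not what the article does): fill with the median.
median_age = df['Age'].median()
filled = df.fillna({'Age': median_age})

print(len(dropped), len(filled), median_age)
```

Dropping is simpler but throws away the other columns of those rows; filling keeps every row at the cost of inventing an age.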

3.3. Digitize labels

The "Sex" values "male" and "female" are hard to handle as they are, so convert them to numbers. Since there are only two values, you could convert them yourself, but scikit-learn has a convenient class called LabelEncoder, so let's use it. LabelEncoder: when N distinct strings appear in the input, the fit method learns them and the transform (or fit_transform) method replaces each string with an integer from 0 to N-1.

##############################
# Data preprocessing 3
# Convert labels (names) to numbers
##############################
from sklearn.preprocessing import LabelEncoder

# Convert Sex to numbers using LabelEncoder
encoder = LabelEncoder()
df_train['Sex'] = encoder.fit_transform(df_train['Sex'].values)
df_train.head()
index Survived Pclass Sex Age Fare
0 0 3 1 22 7.25
1 1 1 0 38 71.2833
2 1 3 0 26 7.925
3 1 1 0 35 53.1
4 0 3 1 35 8.05

"Sex" has been converted to numbers. This encoder will be reused later to convert Sex in test.csv.
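To see which number each label was given, you can inspect the encoder's classes_ attribute. The following is a small sketch on a made-up list rather than the real column; the `mapping` dictionary is only for display.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# fit_transform learns the distinct labels and returns their integer codes.
codes = encoder.fit_transform(['male', 'female', 'female', 'male'])

# classes_ holds the labels in sorted order; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(codes.tolist(), mapping)

# transform reuses the learned mapping without refitting.
print(encoder.transform(['female']).tolist())
```

Labels are sorted before numbering, which is why "female" becomes 0 and "male" becomes 1 in the output above.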

3.4. Standardize numbers

In many cases, a model apparently learns better when the numbers are put on a common scale (standardization) than when the raw values are fed in as training data. For example, when analyzing test results, isn't it easier to compare standard scores than raw points (out of 100 on one test, out of 200 on another)? Let's standardize "Age" and "Fare". As with label encoding, scikit-learn has a useful class for standardization: StandardScaler.

##############################
# Data preprocessing 4
# Standardize numbers
##############################
from sklearn.preprocessing import StandardScaler

# Standardize numbers
standard = StandardScaler()
df_train_std = pandas.DataFrame(standard.fit_transform(df_train.loc[:, ['Age', 'Fare']]), columns=['Age', 'Fare'])

# Standardize Age
df_train['Age'] = df_train_std['Age']

# Standardize Fare
df_train['Fare'] = df_train_std['Fare']

df_train.head()
index Survived Pclass Sex Age Fare
0 0 3 1 -0.530376641 -0.518977865
1 1 1 0 0.571830994 0.69189675
2 1 3 0 -0.254824732 -0.506213563
3 1 1 0 0.365167062 0.348049152
4 0 3 1 0.365167062 -0.503849804

Age and Fare have been standardized. At this point, data preparation is complete.
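Standardization itself is a short formula: subtract the column mean and divide by the column's standard deviation. The sketch below checks, on made-up ages, that StandardScaler gives the same result as the hand-written calculation; the variable names are my own.

```python
import numpy
from sklearn.preprocessing import StandardScaler

# Made-up ages for illustration (one column, five rows).
ages = numpy.array([[22.0], [38.0], [26.0], [35.0], [35.0]])

# StandardScaler subtracts the column mean and divides by the
# population standard deviation (ddof=0).
scaled = StandardScaler().fit_transform(ages)

# The same calculation written out by hand.
by_hand = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(numpy.allclose(scaled, by_hand))
print(scaled.mean(), scaled.std())
```

After standardization the column has a mean of about 0 and a standard deviation of about 1, which is why values such as -0.53 and 0.57 appear in the table above.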

4. Build the model

Once the data is ready, it's time to build the model. For now, let's build it with scikit-learn. Below is the algorithm selection flowchart from the scikit-learn site.

ml_map.png

Let's select a model by following this flowchart. "Predicting a category": YES ⇒ "Do you have labeled data": YES, which leads to "classification" in the upper left. I believe this corresponds to supervised classification. Following the chart from there, I arrived at "Linear SVC".

During training, the data to learn from (= x_train) and the answers (= y_train) are passed to the model separately. The first column below (Survived) is y_train and the remaining columns are x_train.

y_train x_train
index Survived Pclass Sex Age Fare
0 0 3 1 -0.530376641 -0.518977865
1 1 1 0 0.571830994 0.69189675
2 1 3 0 -0.254824732 -0.506213563
3 1 1 0 0.365167062 0.348049152
4 0 3 1 0.365167062 -0.503849804

The code is below.

##############################
# Model building
##############################
from sklearn.svm import LinearSVC

# Prepare the training data
x_train = df_train.loc[:, ['Pclass', 'Sex', 'Age', 'Fare']].values
y_train = df_train.loc[:, ['Survived']].values

# Flatten y_train from shape (n, 1) to (n,)
y_train = numpy.reshape(y_train,(-1))

# Create the model
model = LinearSVC(random_state=1)

5. Learn with training data

Training simply means passing the training data to the model.

##############################
# Training
##############################
model.fit(x_train, y_train)
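Before submitting, it can help to estimate the accuracy locally by holding back part of the labelled data. This is not done in this article; the following is only a sketch of the idea on synthetic data, where `x`, `y`, and the labelling rule are invented stand-ins for x_train and y_train.

```python
import numpy
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for (x_train, y_train): 200 rows, 4 features.
rng = numpy.random.RandomState(1)
x = rng.normal(size=(200, 4))
y = (x[:, 1] + 0.5 * x[:, 0] > 0).astype(int)  # made-up labelling rule

# Hold back 20% of the labelled rows for validation.
x_fit, x_val, y_fit, y_val = train_test_split(
    x, y, test_size=0.2, random_state=1, stratify=y)

model = LinearSVC(random_state=1)
model.fit(x_fit, y_fit)

# Accuracy on rows the model did not see during fit.
val_accuracy = accuracy_score(y_val, model.predict(x_val))
print(val_accuracy)
```

With the real data, the validation accuracy gives a rough idea of the leaderboard score without spending a submission.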

6. Predict results with test data

Let's check the training result with the test data. test.csv has to be put into the same form as the training data (x_train). Some Age and Fare values are missing, but a prediction is required for every row even so. For the test data, missing values are therefore replaced with "0" instead of deleting the rows.

##############################
# Convert test.csv
##############################
# Load test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')

# Extract 'PassengerId' (to combine with the results later)
df_test_index = df_test.loc[:, ['PassengerId']]

# Extract 'Pclass', 'Sex', 'Age', 'Fare' (test.csv has no 'Survived')
df_test = df_test.loc[:, ['Pclass', 'Sex', 'Age', 'Fare']]

# Convert Sex to numbers using the LabelEncoder fitted on train.csv
df_test['Sex'] = encoder.transform(df_test['Sex'].values)

# Standardize with the StandardScaler fitted on train.csv
df_test_std = pandas.DataFrame(standard.transform(df_test.loc[:, ['Age', 'Fare']]), columns=['Age', 'Fare'])

# Standardize Age
df_test['Age'] = df_test_std['Age']

# Standardize Fare
df_test['Fare'] = df_test_std['Fare']

# Convert NaN in Age and Fare to 0
df_test = df_test.fillna({'Age':0, 'Fare':0})

df_test.head()
Index Pclass Sex Age Fare
0 3 1 0.298549339 -0.497810518
1 3 0 1.181327932 -0.512659955
2 2 1 2.240662243 -0.464531805
3 3 1 -0.231117817 -0.482887658
4 3 0 -0.584229254 -0.417970618

The test data was converted in the same way. Now predict the results: just pass the test data to predict.
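One detail worth noting: because Age and Fare were standardized with the training mean, 0 on the standardized scale corresponds to that mean. Filling NaN with 0 after the transform therefore amounts to filling with the training mean before it. The sketch below checks this on made-up numbers; the variable names are my own.

```python
import numpy
from sklearn.preprocessing import StandardScaler

# Made-up training ages and a test column with one missing value.
train_age = numpy.array([[20.0], [30.0], [40.0]])
test_age = numpy.array([[25.0], [numpy.nan]])

scaler = StandardScaler().fit(train_age)

# Route used in the article: transform first (NaN stays NaN), then fill with 0.
fill_after = numpy.nan_to_num(scaler.transform(test_age), nan=0.0)

# Equivalent route: fill with the training mean first, then transform.
fill_before = scaler.transform(
    numpy.where(numpy.isnan(test_age), scaler.mean_, test_age))

print(numpy.allclose(fill_after, fill_before))
```

So the "0" used for missing test values is effectively mean imputation, not an age or fare of zero.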

##############################
# Predict results
##############################
x_test = df_test.values
y_test = model.predict(x_test)

The result is in y_test. Save the result in the same format as gender_submission.csv.

# Combine the PassengerId DataFrame with the results
df_output = pandas.concat([df_test_index, pandas.DataFrame(y_test, columns=['Survived'])], axis=1)

# Write result.csv to the current directory
df_output.to_csv('result.csv', index=False)

With that, the results are ready. Run the notebook with "Commit" as before. After the run finishes, click "Open Version". You can see that result.csv has been created.

20191209_04.png

Click "Submit to Competition" to submit. What will happen ...

7. Submitted result

20191210_01.png

The score was "0.75119", i.e. 75%. That is worse than the sample submission ^^;

8. Summary

How was it? I didn't tune the training parameters at all, but I now understand the overall training flow. Next time, I will examine the data and look at the training parameters so that the score gets a little better.

History

  • 2019/12/11 First edition released
  • 2019/12/26 Added link to the next article
  • 2020/01/03 Corrected comments in the source code
  • 2020/03/26 Partially revised the source code of "6. Predict results with test data"
