[PYTHON] Select models with Kaggle's Titanic (kaggle ④)

Introduction

This is the story of my first time participating in a Kaggle competition. In the previous article, "Check the correlation with Kaggle's Titanic" (https://qiita.com/sudominoru/items/840e87cc77de29f10ca2), I checked the correlations and decided to use three input features: Pclass (ticket class), Sex (gender), and Fare. This time I would like to try several models.

Table of contents

  1. Result
  2. About the model to use
  3. How to evaluate the model
  4. Try the model
  5. Parameter tuning
  6. Submit to Kaggle
  7. Summary

History

1. Result

Starting with the result: the score improved slightly to "0.77511", which puts me in the top 58% (as of December 29, 2019). Below, I walk through the steps up to resubmission.

2. About the model to use

Last time, I used "LinearSVC", following the scikit-learn algorithm cheat sheet. [The book](https://www.amazon.co.jp/Python-%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%9F%E3%83%B3%E3%82%B0-%E9%81%94%E4%BA%BA%E3%83%87%E3%83%BC%E3%82%BF%E3%82%B5%E3%82%A4%E3%82%A8%E3%83%B3%E3%83%86%E3%82%A3%E3%82%B9%E3%83%88%E3%81%AB%E3%82%88%E3%82%8B%E7%90%86%E8%AB%96%E3%81%A8%E5%AE%9F%E8%B7%B5-impress-gear/dp/4295003379/ref=dp_ob_title_bk) I first learned machine learning from covers the following scikit-learn models for classification problems:

・sklearn.svm.LinearSVC
・sklearn.svm.SVC
・sklearn.ensemble.RandomForestClassifier
・sklearn.linear_model.LogisticRegression
・sklearn.linear_model.SGDClassifier

This time, I would like to try each of the above models.

3. How to evaluate the model

The procedure for evaluating the model is as follows.

  1. Learn using training data
  2. Predict using test data
  3. Check if the predicted result is correct

Kaggle's Titanic provides training data (train.csv, whose outcomes are known) and test data (test.csv, whose outcomes are unknown). If you used test.csv for step 2 (predict) and step 3 (check) every time, you would have to commit and submit the results to Kaggle just to see the score, which is inefficient. Since the outcomes in train.csv are known, you can evaluate efficiently by splitting it into your own training data and test data. scikit-learn provides a function, "train_test_split", that performs this split.

from sklearn.model_selection import train_test_split

######################################
# Split training data and test data
######################################
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1, shuffle=True)

An illustration is below. With test_size=0.3, the data is split into training data and test data at a ratio of 7:3.

〇 Data before division

     | y        | x
     | Survived | Pclass | Sex    | Fare
   1 | 0        | 3      | male   | 7.25
   2 | 1        | 1      | female | 71.2833
   3 | 1        | 3      | female | 7.925
   4 | 1        | 1      | female | 53.1
   5 | 0        | 3      | male   | 8.05
   6 | 0        | 3      | male   | 8.4583
   7 | 0        | 1      | male   | 51.8625
   8 | 0        | 3      | male   | 21.075
   9 | 1        | 3      | female | 11.1333
  10 | 1        | 2      | female | 30.0708

〇 Training data after division

     | y_train  | x_train
     | Survived | Pclass | Sex    | Fare
   1 | 0        | 3      | male   | 7.25
   2 | 1        | 1      | female | 71.2833
   4 | 1        | 1      | female | 53.1
   5 | 0        | 3      | male   | 8.05
   6 | 0        | 3      | male   | 8.4583
   8 | 0        | 3      | male   | 21.075
  10 | 1        | 2      | female | 30.0708

〇 Test data after division

     | y_test   | x_test
     | Survived | Pclass | Sex    | Fare
   3 | 1        | 3      | female | 7.925
   7 | 0        | 1      | male   | 51.8625
   9 | 1        | 3      | female | 11.1333
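Since train.csv has 891 rows, you can confirm the resulting sizes directly; a minimal check (my addition), run right after the split code above:

# 891 rows split 7:3 -> 623 training rows and 268 test rows (the test size is rounded up)
print(len(x_train), len(x_test))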

Next is learning and prediction.

scikit-learn models provide a "fit" method for training and a "score" method for evaluating predictions.

from sklearn.svm import LinearSVC
model = LinearSVC(random_state=1)

######################################
# Training
######################################
model.fit(x_train, y_train)

######################################
# Evaluate the predicted results
######################################
score = model.score(x_test, y_test)

"fit" trains the model. "score" makes predictions from "x_test", compares them against "y_test", and returns the accuracy. In this case the score is "0.753731343283582", i.e. about a 75% accuracy.
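For classifiers, "score" is plain accuracy, so the following manual computation is equivalent; a minimal sketch reusing the model and split from above:

from sklearn.metrics import accuracy_score

# Predict on the test split, then compare the predictions with the true labels
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))  # ~0.7537, same as model.score(x_test, y_test)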

4. Try the model

You can compare the performance of different models by training each one and comparing its score. Let's try the models listed in "2. About the model to use".

The overall code is below.

Preparation


import numpy
import pandas

# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')

##############################
# Data preprocessing:
# extract the required columns
##############################
# Extract 'Survived', 'Pclass', 'Sex', 'Fare'
df = df[['Survived', 'Pclass', 'Sex', 'Fare']]

##############################
# Data preprocessing:
# encode labels as numbers
##############################
from sklearn.preprocessing import LabelEncoder

# Encode Sex (male/female) as numbers using LabelEncoder
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)

##############################
# Data preprocessing:
# standardize numeric values
##############################
from sklearn.preprocessing import StandardScaler

# Standardize Pclass and Fare
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])

# Replace Pclass and Fare with their standardized values
df['Pclass'] = df_std['Pclass']
df['Fare'] = df_std['Fare']

from sklearn.model_selection import train_test_split

x = df.drop(columns='Survived')
y = df[['Survived']]

Training data, test data creation


#######################################
# Split training data and test data
#######################################
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1, shuffle=True)

# Flatten the (n, 1) label frames into 1-D arrays, as scikit-learn expects
y_train = numpy.ravel(y_train)
y_test = numpy.ravel(y_test)

Model evaluation


#######################################
# Evaluate the model
#######################################
from sklearn.svm import LinearSVC
model = LinearSVC(random_state=1)
model.fit(x_train, y_train)
score = model.score(x_test, y_test)
score

By replacing the model definition in "Model evaluation", you can evaluate different models (a loop that automates this is sketched after the results table). Trying each model from "2. About the model to use" gives the following results.

model                                     score
sklearn.svm.LinearSVC                     0.753
sklearn.svm.SVC                           0.783
sklearn.ensemble.RandomForestClassifier   0.805
sklearn.linear_model.LogisticRegression   0.753
sklearn.linear_model.SGDClassifier        0.753
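Instead of editing the model definition by hand each time, the comparison can be automated with a simple loop; a minimal sketch (my addition, not in the original code), reusing the split from above:

from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Train each candidate model and print its accuracy on the held-out split
for model in [LinearSVC(random_state=1),
              SVC(random_state=1),
              RandomForestClassifier(random_state=1),
              LogisticRegression(random_state=1),
              SGDClassifier(random_state=1)]:
    model.fit(x_train, y_train)
    print(type(model).__name__, model.score(x_test, y_test))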

Random Forest gives the best result. Next, let's tune the parameters of the Random Forest model.

5. Parameter tuning

Use scikit-learn's grid search (GridSearchCV) to tune the parameters. Grid search evaluates every combination of the specified parameters and finds the best one. Because every combination is evaluated, the more parameters you add, the longer the search takes. After checking the RandomForestClassifier documentation (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), I decided to tune the following parameters:

parameter          values
criterion          gini / entropy
n_estimators       25 / 100 / 500 / 1000 / 2000
min_samples_split  0.5 / 2 / 4 / 10
min_samples_leaf   1 / 2 / 4 / 10
bootstrap          True / False
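To get a feel for the cost, you can count the combinations; a quick back-of-the-envelope check (the fold count is an assumption: GridSearchCV defaults to 3-fold or 5-fold cross-validation depending on the scikit-learn version):

# criterion x n_estimators x min_samples_split x min_samples_leaf x bootstrap
n_combinations = 2 * 5 * 4 * 4 * 2
print(n_combinations)      # 320 parameter combinations
print(n_combinations * 5)  # 1600 fits if 5-fold cross-validation is used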

You can perform grid search by replacing "model evaluation" with the following "grid search".

Grid search


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

###############################################
# Tune RandomForestClassifier parameters with grid search
###############################################
param_grid = {'criterion': ['gini', 'entropy'],
              'n_estimators': [25, 100, 500, 1000, 2000],
              'min_samples_split': [0.5, 2, 4, 10],
              'min_samples_leaf': [1, 2, 4, 10],
              'bootstrap': [True, False]}

grid = GridSearchCV(estimator=RandomForestClassifier(random_state=1), param_grid=param_grid)
grid = grid.fit(x_train, y_train)

print(grid.best_score_)
print(grid.best_params_)

The results are below. In my environment, the grid search took about 10 minutes to run.

Grid search results


0.8105939004815409
{'bootstrap': False, 'criterion': 'entropy', 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 100}
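As a sanity check (my addition, not in the original flow), the best estimator found by the search can be scored against the held-out test split; by default, GridSearchCV refits it on the full training split:

# Score the refitted best estimator on the held-out test data
print(grid.best_estimator_.score(x_test, y_test))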

6. Submit to Kaggle

Let's train and predict using the parameters found by grid search. Replace the code in "Training data, test data creation" and "Grid search" with the following.

Training and prediction


##############################
# Model building
##############################
from sklearn.ensemble import RandomForestClassifier

# Generate a model with the tuned parameters
model = RandomForestClassifier(n_estimators=100,
                               criterion='entropy',
                               min_samples_split=2,
                               min_samples_leaf=10,
                               bootstrap=False,
                               random_state=1)

##############################
# Training
##############################
# This time, train on all of train.csv (no hold-out split)
y = numpy.ravel(y)
model.fit(x, y)

##############################
# Convert test.csv
##############################
# Load test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')

# Fill missing Fare values with 0
df_test = df_test.fillna({'Fare': 0})

# Extract 'PassengerId' (to combine with the result later)
df_test_index = df_test[['PassengerId']]

# Extract 'Pclass', 'Sex', 'Fare'
df_test = df_test[['Pclass', 'Sex', 'Fare']]

# Standardize with the scaler fitted on the training data
df_test_std = pandas.DataFrame(standard.transform(df_test[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df_test['Pclass'] = df_test_std['Pclass']
df_test['Fare'] = df_test_std['Fare']

# Label-encode Sex with the encoder fitted on the training data
df_test['Sex'] = encoder_sex.transform(df_test['Sex'].values)

##############################
# Predict results
##############################
x_test = df_test.values
y_test = model.predict(x_test)

# Combine PassengerId with the predicted results
df_output = pandas.concat([df_test_index, pandas.DataFrame(y_test, columns=['Survived'])], axis=1)

# Write result.csv to the current directory
df_output.to_csv('result.csv', index=False)

Enter the above code in the Kaggle environment, run "Run All", and verify that result.csv is created.
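Before submitting, it may be worth a quick look at the output (my own check; the competition expects exactly two columns, PassengerId and Survived):

# Peek at the submission file to confirm its format
print(pandas.read_csv('result.csv').head())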

Submit by selecting "Commit" ⇒ "Open Version" ⇒ "Submit to Competition".

The score is now "0.77511".

7. Summary

This time, I was able to raise the score a little by comparing five types of models and tuning the parameters. Next time, I would like to look for an even better model among scikit-learn's many models.

History

2019/12/29 First edition released
2020/01/01 Added link to the next article
2020/01/03 Fixed source code comments
