[PYTHON] Try all scikit-learn models on Kaggle's Titanic (kaggle ⑤)

Introduction

This is the story of my first time participating in a Kaggle competition. In the previous article, "Selecting Models with Kaggle's Titanic" (https://qiita.com/sudominoru/items/1c21cf4afaf67fda3fee), we evaluated several models and raised the score a bit. This time I would like to try all of scikit-learn's models.

Table of contents

  1. Result
  2. About all models of scikit-learn
  3. Cross-validation
  4. Evaluate all models by cross-validation
  5. Parameter tuning
  6. Submit to Kaggle
  7. Summary

History

1. Result

The result: the score went up slightly, to "0.78947". That puts it in the top 25% (as of December 30, 2019). Let's walk through the flow up to submission.

2. About all models of scikit-learn

All scikit-learn models can be obtained with "all_estimators". The "type_filter" parameter lets you narrow them down to one of the following four types: "classifier", "regressor", "cluster", or "transformer". Since this is a classification problem, we filter by "classifier".

from sklearn.utils.testing import all_estimators
all_estimators(type_filter="classifier")
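For reference, all_estimators returns a list of (name, class) tuples. Here is a minimal sketch of inspecting it (the exact number of classifiers depends on your scikit-learn version, and in newer versions the import path is sklearn.utils.all_estimators):

from sklearn.utils.testing import all_estimators

# all_estimators returns a list of (name, class) tuples
estimators = all_estimators(type_filter="classifier")
print(len(estimators))          # number of available classifiers (version dependent)
for name, Estimator in estimators[:5]:
    print(name, Estimator)      # e.g. "AdaBoostClassifier", <class '...'>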

3. Cross-validation

Let's validate the models obtained above with cross-validation, specifically K-fold cross-validation. K-fold cross-validation first divides the training data into K pieces. One piece is used as test data and the remaining K-1 pieces as training data; the model is trained on the K-1 pieces and evaluated on the held-out piece. This is repeated K times, and the K resulting scores are averaged to evaluate the model. scikit-learn provides classes for K-fold cross-validation: "KFold" and "cross_validate".

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
kf = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_validate(model, x_train, y_train, cv=kf, scoring=['accuracy'])

KFold's "n_splits" specifies the number of splits. cross_validate takes the model, the training data, the KFold object, and the scoring method. The metrics specified by scoring are returned in cross_validate's return value, as an array with one entry per split (n_splits entries).
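As a minimal sketch of what cross_validate returns (using the iris dataset and LogisticRegression purely as placeholders, not the Titanic data):

from sklearn.datasets import load_iris          # placeholder data for illustration
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

x, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_validate(model, x, y, cv=kf, scoring=['accuracy'])

# scores is a dict; 'test_accuracy' is an array with n_splits entries,
# one score per fold, so the overall estimate is the mean
print(scores['test_accuracy'])          # array of 3 fold scores
print(scores['test_accuracy'].mean())   # averaged cross-validation accuracy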

4. Evaluate all models by cross-validation

Now, let's evaluate all models by K-fold cross-validation. The code is below. The "Preparation" code is the same as last time.

Preparation


import numpy
import pandas

##############################
# Data preprocessing
# Extract the necessary items
##############################
# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')
df = df[['Survived', 'Pclass', 'Sex', 'Fare']]

Preparation


from sklearn.preprocessing import LabelEncoder
##############################
# Data preprocessing
# Convert labels (strings) to numbers
##############################
#df = pandas.get_dummies(df)
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)

Preparation


from sklearn.preprocessing import StandardScaler
##############################
# Data preprocessing
# Standardize the numeric columns
##############################

# Standardization
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])

df['Pclass'] = df_std['Pclass']
df['Fare'] = df_std['Fare']

K-fold cross-validation


import sys
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.utils.testing import all_estimators
##############################
# K-fold cross-validation with all estimators
##############################
x_train = df.drop(columns='Survived').values
y_train = df[['Survived']].values
y_train = numpy.ravel(y_train)

kf = KFold(n_splits=3, shuffle=True, random_state=1)

writer = open('./all_estimators_classifier.txt', 'w', encoding="utf-8")
writer.write('name\taccuracy\n')

for (name, Estimator) in all_estimators(type_filter="classifier"):

    try:
        model = Estimator()
        # Skip estimators that do not implement a score method
        if 'score' not in dir(model):
            continue
        scores = cross_validate(model, x_train, y_train, cv=kf, scoring=['accuracy'])
        accuracy = scores['test_accuracy'].mean()
        writer.write(name + "\t" + str(accuracy) + '\n')
    except Exception:
        # Some estimators require constructor arguments or fail on this data
        print(sys.exc_info())
        print(name)
        pass

writer.close()

We get the classification models with "all_estimators(type_filter="classifier")" and loop over them. Only models that have a score method, checked with "if 'score' not in dir(model):", are targeted. Each model is evaluated with "cross_validate", passing the "KFold" object defined above as the cv parameter. The model name and evaluation value are written to the file "all_estimators_classifier.txt".

Let's run it. When the process completes, "all_estimators_classifier.txt" is output. Looking at the contents, about 30 model names are written. The following are the top 10 models in descending order of "accuracy".

name accuracy
ExtraTreeClassifier 0.82155
GradientBoostingClassifier 0.82043
HistGradientBoostingClassifier 0.81706
DecisionTreeClassifier 0.81481
ExtraTreesClassifier 0.81481
RandomForestClassifier 0.80920
GaussianProcessClassifier 0.80471
MLPClassifier 0.80471
KNeighborsClassifier 0.80022
LabelPropagation 0.80022
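Since the output file is tab-separated, it can be sorted by accuracy with pandas, for example (a small sketch assuming the file written above):

import pandas

# Read the tab-separated results and sort by accuracy
results = pandas.read_csv('./all_estimators_classifier.txt', sep='\t')
results = results.sort_values('accuracy', ascending=False)
print(results.head(10))   # the 10 best-scoring classifiers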

Five models score a higher accuracy than the "RandomForestClassifier" used last time.

5. Parameter tuning

Let's tune the parameters of each of the top 5 models with a grid search. The results are as follows (a sketch of the grid search code follows the table).

model Parameters
ExtraTreeClassifier criterion='gini', min_samples_leaf=10, min_samples_split=2, splitter='random'
GradientBoostingClassifier learning_rate=0.2, loss='deviance', min_samples_leaf=10, min_samples_split=0.5, n_estimators=500
HistGradientBoostingClassifier learning_rate=0.05, max_iter=50, max_leaf_nodes=10, min_samples_leaf=2
DecisionTreeClassifier criterion='entropy', min_samples_split=2, min_samples_leaf=1
ExtraTreesClassifier n_estimators=25, criterion='gini', min_samples_split=10, min_samples_leaf=2, bootstrap=True
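The tuning code itself is not shown in this article; the following is a minimal GridSearchCV sketch for ExtraTreeClassifier. The parameter grid here is illustrative, not the exact grid I used, and x_train / y_train are the arrays prepared in section 4:

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import ExtraTreeClassifier

# Illustrative parameter grid; the grid actually used may differ
param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10],
}

kf = KFold(n_splits=3, shuffle=True, random_state=1)
grid = GridSearchCV(ExtraTreeClassifier(), param_grid, cv=kf, scoring='accuracy')
grid.fit(x_train, y_train)

print(grid.best_params_)   # e.g. criterion='gini', min_samples_leaf=10, splitter='random'
print(grid.best_score_)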

6. Submit to Kaggle

Let's submit each model to Kaggle, using the parameters found by the grid search. The scores are as follows.

model score
ExtraTreeClassifier 0.78947
GradientBoostingClassifier 0.75598
HistGradientBoostingClassifier 0.77990
DecisionTreeClassifier 0.77511
ExtraTreesClassifier 0.78468

ExtraTreeClassifier gave the best score with a result of "0.78947".
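For reference, here is a minimal sketch of how the submission file for the best model could be produced. It reuses pandas, encoder_sex and standard from the preparation code; the handling of the missing Fare value in test.csv is an assumption, since the article does not show that step:

from sklearn.tree import ExtraTreeClassifier

# Train on the full training data with the tuned parameters
model = ExtraTreeClassifier(criterion='gini', min_samples_leaf=10,
                            min_samples_split=2, splitter='random')
model.fit(x_train, y_train)

# Apply the same preprocessing to test.csv
# (the missing-value handling here is an assumption)
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')
df_test['Sex'] = encoder_sex.transform(df_test['Sex'].values)
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].median())
df_test[['Pclass', 'Fare']] = standard.transform(df_test[['Pclass', 'Fare']])

predictions = model.predict(df_test[['Pclass', 'Sex', 'Fare']].values)

# Write the submission file in the expected "PassengerId,Survived" format
submission = pandas.DataFrame({'PassengerId': df_test['PassengerId'],
                               'Survived': predictions})
submission.to_csv('submission.csv', index=False)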

7. Summary

All scikit-learn classification models were evaluated by cross-validation. For this input data, ExtraTreeClassifier had the best score, with a result of "0.78947". Next time, I would like to check the data visually. By looking at the raw data, I want to see whether the accuracy can be further improved by screening the input data.


History

2020/01/01 First edition released
2020/01/29 Link to the next article added
