This is the story of my first time participating in a Kaggle competition. In the previous article, "Selecting Models with Kaggle's Titanic" (https://qiita.com/sudominoru/items/1c21cf4afaf67fda3fee), we evaluated several models and raised the score a bit. This time I would like to try all of scikit-learn's models.
History
2020/01/01 First edition released
2020/01/29 Added link to the next article

As a result, the score went up slightly to "0.78947", which is in the top 25% (as of December 30, 2019). Let's look at the flow up to submission.
All of scikit-learn's models can be obtained with "all_estimators". You can narrow down the results with the "type_filter" parameter, which accepts one of four values: "classifier", "regressor", "cluster", or "transformer". Since this is a classification problem, we filter by "classifier".
from sklearn.utils.testing import all_estimators
all_estimators(type_filter="classifier")
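"all_estimators" returns a list of (name, class) tuples, so you can inspect the first few entries as a quick sanity check (a minimal sketch; the exact output depends on your scikit-learn version, and in newer versions all_estimators lives in sklearn.utils instead of sklearn.utils.testing):

# Each entry is a (name, estimator_class) tuple
for name, Estimator in all_estimators(type_filter="classifier")[:3]:
    print(name, Estimator)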
Let's evaluate the models obtained above with cross-validation. This time we will use "K-fold cross-validation". K-fold cross-validation first divides the training data into K pieces. One piece is used as test data and the remaining K-1 pieces as training data; the model is trained on the K-1 pieces and evaluated on the held-out piece. This is repeated K times, and the K resulting scores are averaged to evaluate the model. scikit-learn provides classes for K-fold cross-validation: "KFold" and "cross_validate".
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
kf = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_validate(model, x_train, y_train, cv=kf, scoring=['accuracy'])
Specify the number of splits with "n_splits" of KFold. Pass the model, the training data, the KFold object, and the scoring method to cross_validate. The evaluation specified by scoring is returned as the return value of cross_validate, as an array with one entry per split (n_splits entries).
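For example, with n_splits=3 the returned dictionary holds three accuracy values, one per fold (a minimal sketch of inspecting the result):

# scores is a dict; 'test_accuracy' holds one score per fold
print(scores['test_accuracy'])         # array of 3 values for n_splits=3
print(scores['test_accuracy'].mean())  # average used to compare models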
Now, let's evaluate all models with K-fold cross-validation. The code is below. The "Preparation" code is the same as last time.
Preparation
import numpy
import pandas
##############################
# Data preprocessing
# Extract the necessary columns
##############################
# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')
df = df[['Survived', 'Pclass', 'Sex', 'Fare']]
Preparation
from sklearn.preprocessing import LabelEncoder
##############################
# Data preprocessing
# Encode the label (Sex) as numbers
##############################
#df = pandas.get_dummies(df)
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)
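LabelEncoder assigns integers in alphabetical order of the classes, so here "female" becomes 0 and "male" becomes 1. You can confirm this as follows:

# The learned classes, in the order of their integer codes
print(encoder_sex.classes_)  # ['female' 'male'] -> 0, 1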
Preparation
from sklearn.preprocessing import StandardScaler
##############################
# Data preprocessing
# Standardize the numeric columns
##############################
# Standardization
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df['Pclass'] = df_std['Pclass']
df['Fare'] = df_std['Fare']
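After standardization, each column should have approximately zero mean and unit variance; a quick check (sketch):

# Both columns should now be centered around 0 with standard deviation ~1
print(df[['Pclass', 'Fare']].mean().round(3))
print(df[['Pclass', 'Fare']].std().round(3))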
K-fold cross-validation
import sys
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.utils.testing import all_estimators
##############################
# K-fold cross-validation with all estimators
##############################
x_train = df.drop(columns='Survived').values
y_train = df[['Survived']].values
y_train = numpy.ravel(y_train)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
writer = open('./all_estimators_classifier.txt', 'w', encoding="utf-8")
writer.write('name\taccuracy\n')
for (name, Estimator) in all_estimators(type_filter="classifier"):
    try:
        model = Estimator()
        # Skip estimators that do not implement score()
        if 'score' not in dir(model):
            continue
        scores = cross_validate(model, x_train, y_train, cv=kf, scoring=['accuracy'])
        accuracy = scores['test_accuracy'].mean()
        writer.write(name + "\t" + str(accuracy) + '\n')
    except:
        # Some estimators fail to instantiate or fit; log and move on
        print(sys.exc_info())
        print(name)
writer.close()
We obtain the classification models with "all_estimators(type_filter="classifier")" and loop over them. The check "if 'score' not in dir(model):" restricts the loop to models that have a score method. Each model is evaluated with "cross_validate", passing the "KFold" object defined above. The model name and evaluation value are written to the file "all_estimators_classifier.txt".
Let's run it. When the process completes, "all_estimators_classifier.txt" is written. Looking at the contents, about 30 model names are output. The following are the top 10 models in descending order of "accuracy".
name | accuracy |
---|---|
ExtraTreeClassifier | 0.82155 |
GradientBoostingClassifier | 0.82043 |
HistGradientBoostingClassifier | 0.81706 |
DecisionTreeClassifier | 0.81481 |
ExtraTreesClassifier | 0.81481 |
RandomForestClassifier | 0.80920 |
GaussianProcessClassifier | 0.80471 |
MLPClassifier | 0.80471 |
KNeighborsClassifier | 0.80022 |
LabelPropagation | 0.80022 |
Five models achieved a higher accuracy than the "RandomForestClassifier" from last time.
Let's tune the parameters of each of the top five models with a grid search (a sketch of the search code follows the table). The results were as follows.
model | Parameters |
---|---|
ExtraTreeClassifier | criterion='gini', min_samples_leaf=10, min_samples_split=2, splitter='random' |
GradientBoostingClassifier | learning_rate=0.2, loss='deviance', min_samples_leaf=10, min_samples_split=0.5, n_estimators=500 |
HistGradientBoostingClassifier | learning_rate=0.05, max_iter=50, max_leaf_nodes=10, min_samples_leaf=2 |
DecisionTreeClassifier | criterion='entropy', min_samples_split=2, min_samples_leaf=1 |
ExtraTreesClassifier | n_estimators=25, criterion='gini', min_samples_split=10, min_samples_leaf=2, bootstrap=True |
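The grid search code itself is not shown above, so here is a minimal sketch of how it might look with GridSearchCV for ExtraTreeClassifier. The parameter grid below is a hypothetical example, not the grid actually used; it reuses the "kf" and "x_train"/"y_train" defined earlier.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import ExtraTreeClassifier

# Hypothetical parameter grid (assumption, not from the original article)
param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
}
grid = GridSearchCV(ExtraTreeClassifier(), param_grid, cv=kf, scoring='accuracy')
grid.fit(x_train, y_train)
print(grid.best_params_)  # best parameter combination
print(grid.best_score_)   # its mean cross-validated accuracy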
I'll submit each model to Kaggle, using the parameters found by the grid search. The scores are as follows.
model | score |
---|---|
ExtraTreeClassifier | 0.78947 |
GradientBoostingClassifier | 0.75598 |
HistGradientBoostingClassifier | 0.77990 |
DecisionTreeClassifier | 0.77511 |
ExtraTreesClassifier | 0.78468 |
ExtraTreeClassifier gave the best score with a result of "0.78947".
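The submission code is not shown above, so here is a minimal sketch of the flow for the best model, assuming test.csv is preprocessed the same way as the training data (it reuses the fitted "encoder_sex" and "standard" from the preparation steps; filling the single missing Fare with the median is an assumption about handling missing values):

from sklearn.tree import ExtraTreeClassifier

# Train the best model with the parameters found by grid search
model = ExtraTreeClassifier(criterion='gini', min_samples_leaf=10,
                            min_samples_split=2, splitter='random')
model.fit(x_train, y_train)

# Apply the same preprocessing to test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')
x_test = df_test[['Pclass', 'Sex', 'Fare']].copy()
x_test['Fare'] = x_test['Fare'].fillna(x_test['Fare'].median())  # test.csv has a missing Fare
x_test['Sex'] = encoder_sex.transform(x_test['Sex'].values)
x_test[['Pclass', 'Fare']] = standard.transform(x_test[['Pclass', 'Fare']])

# Write the submission file
submission = pandas.DataFrame({
    'PassengerId': df_test['PassengerId'],
    'Survived': model.predict(x_test.values),
})
submission.to_csv('submission.csv', index=False)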
All scikit-learn models were evaluated by cross-validation. For this input data, ExtraTreeClassifier had the best score, with a result of "0.78947". Next time, I would like to check the data visually. By examining the raw data, I want to find out whether the accuracy can be improved further by screening the input features.