This is the story of my first time participating in a Kaggle competition. In the previous article, "Selecting Models with Kaggle's Titanic" (https://qiita.com/sudominoru/items/1c21cf4afaf67fda3fee), we evaluated several models and raised the score a bit. This time I would like to try all of scikit-learn's models.
History
2020/01/01 First edition released
2020/01/29 Added link to the next article

As a result, the score went up slightly to "0.78947", which is in the top 25% (as of December 30, 2019). Let's look at the flow up to submission.
All of scikit-learn's models can be obtained with "all_estimators". You can narrow down the results with the "type_filter" parameter, which accepts one of four values: "classifier", "regressor", "cluster", or "transformer". Since this is a classification problem, we filter by "classifier".
from sklearn.utils.testing import all_estimators
all_estimators(type_filter="classifier")
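"all_estimators" returns a list of (name, class) tuples, so you can inspect the first few entries as a quick sanity check (a minimal sketch; the exact output depends on your scikit-learn version, and in newer versions all_estimators lives in sklearn.utils instead of sklearn.utils.testing):

# Each entry is a (name, estimator_class) tuple
for name, Estimator in all_estimators(type_filter="classifier")[:3]:
    print(name, Estimator)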
Let's evaluate the models obtained above with cross-validation. This time we will use "K-fold cross-validation". K-fold cross-validation first divides the training data into K pieces. One piece is used as test data and the remaining K-1 pieces as training data; the model is trained on the K-1 pieces and evaluated on the held-out piece. This is repeated K times, and the K resulting scores are averaged to evaluate the model. scikit-learn provides classes for K-fold cross-validation: "KFold" and "cross_validate".
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
kf = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_validate(model, x_train, y_train, cv=kf, scoring=['accuracy'])
Specify the number of splits with "n_splits" of KFold. Pass the model, the training data, the KFold object, and the scoring method to cross_validate. The evaluation specified by scoring is returned as the return value of cross_validate, as an array with one entry per split (n_splits entries).
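For example, with n_splits=3 the returned dictionary holds three accuracy values, one per fold (a minimal sketch of inspecting the result):

# scores is a dict; 'test_accuracy' holds one score per fold
print(scores['test_accuracy'])         # array of 3 values for n_splits=3
print(scores['test_accuracy'].mean())  # average used to compare models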
Now, let's evaluate all models with K-fold cross-validation. The code is below. The "Preparation" code is the same as last time.
Preparation
import numpy
import pandas
##############################
# Data preprocessing
# Extract the necessary columns
##############################
# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')
df = df[['Survived', 'Pclass', 'Sex', 'Fare']]
Preparation
from sklearn.preprocessing import LabelEncoder
##############################
# Data preprocessing
# Encode the label (Sex) as numbers
##############################
#df = pandas.get_dummies(df)
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)
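LabelEncoder assigns integers in alphabetical order of the classes, so here "female" becomes 0 and "male" becomes 1. You can confirm this as follows:

# The learned classes, in the order of their integer codes
print(encoder_sex.classes_)  # ['female' 'male'] -> 0, 1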
Preparation
from sklearn.preprocessing import StandardScaler
##############################
# Data preprocessing
# Standardize the numeric columns
##############################
# Standardization
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df['Pclass'] = df_std['Pclass']
df['Fare'] = df_std['Fare']
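After standardization, each column should have approximately zero mean and unit variance; a quick check (sketch):

# Both columns should now be centered around 0 with standard deviation ~1
print(df[['Pclass', 'Fare']].mean().round(3))
print(df[['Pclass', 'Fare']].std().round(3))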
K-fold cross-validation
import sys
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.utils.testing import all_estimators
##############################
# K-fold cross-validation with all estimators
##############################
x_train = df.drop(columns='Survived').values
y_train = df[['Survived']].values
y_train = numpy.ravel(y_train)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
writer = open('./all_estimators_classifier.txt', 'w', encoding="utf-8")
writer.write('name\taccuracy\n')
for (name, Estimator) in all_estimators(type_filter="classifier"):
    try:
        model = Estimator()
        # Skip estimators that do not implement score()
        if 'score' not in dir(model):
            continue
        scores = cross_validate(model, x_train, y_train, cv=kf, scoring=['accuracy'])
        accuracy = scores['test_accuracy'].mean()
        writer.write(name + "\t" + str(accuracy) + '\n')
    except:
        # Some estimators fail to instantiate or fit; log and move on
        print(sys.exc_info())
        print(name)
writer.close()
We obtain the classification models with "all_estimators(type_filter="classifier")" and loop over them. The check "if 'score' not in dir(model):" restricts the loop to models that have a score method. Each model is evaluated with "cross_validate", passing the "KFold" object defined above. The model name and evaluation value are written to the file "all_estimators_classifier.txt".
Let's run it. When the process completes, "all_estimators_classifier.txt" is written. Looking at the contents, about 30 model names are output. The following are the top 10 models in descending order of "accuracy".
name | accuracy |
---|---|
ExtraTreeClassifier | 0.82155 |
GradientBoostingClassifier | 0.82043 |
HistGradientBoostingClassifier | 0.81706 |
DecisionTreeClassifier | 0.81481 |
ExtraTreesClassifier | 0.81481 |
RandomForestClassifier | 0.80920 |
GaussianProcessClassifier | 0.80471 |
MLPClassifier | 0.80471 |
KNeighborsClassifier | 0.80022 |
LabelPropagation | 0.80022 |
Five models achieved a higher accuracy than the "RandomForestClassifier" from last time.
Let's tune the parameters of each of the top five models with a grid search (a sketch of the search code follows the table). The results were as follows.
model | Parameters |
---|---|
ExtraTreeClassifier | criterion='gini', min_samples_leaf=10, min_samples_split=2, splitter='random' |
GradientBoostingClassifier | learning_rate=0.2, loss='deviance', min_samples_leaf=10, min_samples_split=0.5, n_estimators=500 |
HistGradientBoostingClassifier | learning_rate=0.05, max_iter=50, max_leaf_nodes=10, min_samples_leaf=2 |
DecisionTreeClassifier | criterion='entropy', min_samples_split=2, min_samples_leaf=1 |
ExtraTreesClassifier | n_estimators=25, criterion='gini', min_samples_split=10, min_samples_leaf=2, bootstrap=True |
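The grid search code itself is not shown above, so here is a minimal sketch of how it might look with GridSearchCV for ExtraTreeClassifier. The parameter grid below is a hypothetical example, not the grid actually used; it reuses the "kf" and "x_train"/"y_train" defined earlier.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import ExtraTreeClassifier

# Hypothetical parameter grid (assumption, not from the original article)
param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
}
grid = GridSearchCV(ExtraTreeClassifier(), param_grid, cv=kf, scoring='accuracy')
grid.fit(x_train, y_train)
print(grid.best_params_)  # best parameter combination
print(grid.best_score_)   # its mean cross-validated accuracy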
I'll submit each model to Kaggle, using the parameters found by the grid search. The scores are as follows.
model | score |
---|---|
ExtraTreeClassifier | 0.78947 |
GradientBoostingClassifier | 0.75598 |
HistGradientBoostingClassifier | 0.77990 |
DecisionTreeClassifier | 0.77511 |
ExtraTreesClassifier | 0.78468 |
ExtraTreeClassifier gave the best score with a result of "0.78947".
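The submission code is not shown above, so here is a minimal sketch of the flow for the best model, assuming test.csv is preprocessed the same way as the training data (it reuses the fitted "encoder_sex" and "standard" from the preparation steps; filling the single missing Fare with the median is an assumption about handling missing values):

from sklearn.tree import ExtraTreeClassifier

# Train the best model with the parameters found by grid search
model = ExtraTreeClassifier(criterion='gini', min_samples_leaf=10,
                            min_samples_split=2, splitter='random')
model.fit(x_train, y_train)

# Apply the same preprocessing to test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')
x_test = df_test[['Pclass', 'Sex', 'Fare']].copy()
x_test['Fare'] = x_test['Fare'].fillna(x_test['Fare'].median())  # test.csv has a missing Fare
x_test['Sex'] = encoder_sex.transform(x_test['Sex'].values)
x_test[['Pclass', 'Fare']] = standard.transform(x_test[['Pclass', 'Fare']])

# Write the submission file
submission = pandas.DataFrame({
    'PassengerId': df_test['PassengerId'],
    'Survived': model.predict(x_test.values),
})
submission.to_csv('submission.csv', index=False)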
All scikit-learn models were evaluated by cross-validation. For this input data, ExtraTreeClassifier had the best score, with a result of "0.78947". Next time, I would like to check the data visually. By examining the raw data, I want to find out whether the accuracy can be improved further by screening the input features.