[PYTHON] [Kaggle] Baseline model construction, Pipeline processing

0. Introduction

The general flow of a regression / classification task with machine learning on Kaggle and the like is: data loading → missing value handling → label encoding → EDA → one-hot encoding → building a baseline model → building more complex models → parameter tuning.

Of these steps, building the baseline model can be handled in a fairly fixed pattern, and it can be done even more quickly with Pipeline processing. So this time, I have summarized baseline model construction and pipeline processing.

1. Preparation

The dataset used this time is the Titanic dataset available on Kaggle.

The preprocessing up to one-hot encoding looks like this; the details of these steps are summarized in the previous article.

In[1]


%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import collections

df = pd.read_csv('./train.csv')
df = df.set_index('PassengerId') # Use the unique PassengerId column as the index
df = df.drop(['Name', 'Ticket'], axis=1) # Drop columns that are not needed for the analysis
df = df.drop(['Cabin'], axis=1) # Drop Cabin because it is hard to use for the analysis
df = df.dropna(subset=['Embarked']) # Embarked has only a few missing values, so drop those rows
df = df.fillna(method='ffill') # Fill the remaining missing values (e.g. Age) by forward fill

from sklearn.preprocessing import LabelEncoder
for column in ['Sex','Embarked']:
    le = LabelEncoder()
    le.fit(df[column])
    df[column] = le.transform(df[column])

df_continuous = df[['Age','SibSp','Parch','Fare']]

df = pd.get_dummies(df, columns = ['Pclass','Embarked'])
df.head()

Then split this into train and test data with train_test_split.

In[2]


from sklearn.model_selection import train_test_split

X = df.drop(['Survived'], axis=1) 
y = df['Survived']

validation_size = 0.20
seed = 42 
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=validation_size,random_state=seed)

2. Creating a Baseline Model

Thanks: https://www.kaggle.com/mdiqbalbajmi/titanic-survival-prediction-beginner

The required libraries are as follows.

In[3]


from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

This time, we will use the following six algorithms as baseline models.

- Logistic regression
- Linear discriminant analysis (LDA)
- KNN (k-nearest neighbors)
- CART (decision tree)
- Gaussian Naive Bayes
- Support vector machine

We also use cross-validation with KFold to check the variance of the scores for each model.

In[4]



# Spot-check Algorithms
models = []

# In LogisticRegression set: solver='lbfgs',multi_class ='auto', max_iter=10000 to overcome warning
models.append(('LR',LogisticRegression(solver='lbfgs',multi_class='auto',max_iter=10000)))
models.append(('LDA',LinearDiscriminantAnalysis()))
models.append(('KNN',KNeighborsClassifier()))
models.append(('CART',DecisionTreeClassifier()))
models.append(('NB',GaussianNB()))
models.append(('SVM',SVC(gamma='scale')))

# evaluate each model in turn
results = []
names = []

for name, model in models:
    # initializing kfold by n_splits=10(no.of K)
    kfold = KFold(n_splits = 10, random_state=seed, shuffle=True) #crossvalidation
    
    # cross validation score of given model using cross-validation=kfold
    cv_results = cross_val_score(model,X_train,y_train,cv=kfold, scoring="accuracy")
    
    # appending cross validation result to results list
    results.append(cv_results)
    
    # appending name of algorithm to names list
    names.append(name)
    
    # printing cross_validation_result's mean and standard_deviation
    print(name, cv_results.mean()*100.0, "(",cv_results.std()*100.0,")")

The output looks like this.

out[4]


LR 80.15845070422536 ( 5.042746503951439 )
LDA 79.31533646322379 ( 5.259067356458109 )
KNN 71.1658841940532 ( 3.9044128926316235 )
CART 76.50821596244131 ( 4.1506876372712815 )
NB 77.91471048513301 ( 4.426999157571688 )
SVM 66.9424882629108 ( 6.042153317290744 )

Visualize this with a box plot.

In[5]


figure = plt.figure()
figure.suptitle('Algorithm Comparison')
ax = figure.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names);

Here is the output graph.

(Box plot: cross-validation accuracy distribution for each algorithm)

3. Pipeline processing

When there are multiple processing steps, chaining them into a single pipeline has several benefits: the code becomes simpler, preprocessing is fit only on each training fold during cross-validation (avoiding data leakage), and hyperparameter search becomes easier when combined with GridSearchCV (a sketch is given at the end of this section).

This time, let's build pipelines that combine standard scaling with each classification model.

In[6]


# import pipeline to make machine learning pipeline to overcome data leakage problem
from sklearn.pipeline import Pipeline

# import StandardScaler to Column Standardize the data
# many algorithm assumes data to be Standardized
from sklearn.preprocessing import StandardScaler

# test options and evaluation matrix
num_folds=10
seed=42
scoring='accuracy'

# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

Without the filter, FutureWarnings would appear, so we suppress them here as well.

Build a pipeline for each baseline model and store them all in the pipelines list.

In[7]


# source of code: machinelearningmastery.com
# Standardize the dataset
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()),('LR',LogisticRegression(solver='lbfgs',multi_class='auto',max_iter=10000))])))
pipelines.append(('ScaledLDA', Pipeline([('Scaler', StandardScaler()),('LDA',LinearDiscriminantAnalysis())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()),('KNN',KNeighborsClassifier())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()),('CART',DecisionTreeClassifier())])))
pipelines.append(('ScaledNB', Pipeline([('Scaler', StandardScaler()),('NB',GaussianNB())])))
pipelines.append(('ScaledSVM', Pipeline([('Scaler', StandardScaler()),('SVM', SVC(gamma='scale'))])))
results = []
names = []
for name, model in pipelines:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True) # shuffle=True is required when random_state is set
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std())
    print(msg)

Here is the output.

out[7]


ScaledLR: 79.743740 (0.029184)
ScaledLDA: 79.045383 (0.042826)
ScaledKNN: 82.838419 (0.031490)
ScaledCART: 78.761737 (0.028512)
ScaledNB: 77.779734 (0.037019)
ScaledSVM: 82.703443 (0.029366)

Plot this on a box plot.

In[8]


# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

The output graph is as follows.

(Box plot: cross-validation accuracy distribution for each scaled pipeline)
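
As mentioned at the start of this section, pipelines also combine nicely with grid search. Below is a minimal sketch (not part of the original notebook) of tuning the scaled SVM pipeline with GridSearchCV; the parameter grid values are only illustrative, and parameters are addressed through the step name as SVM__<parameter>.


from sklearn.model_selection import GridSearchCV

# Reuse the scaling + SVM pipeline; the step names are used to address its parameters
pipe = Pipeline([('Scaler', StandardScaler()), ('SVM', SVC(gamma='scale'))])

# Illustrative search space: <step name>__<parameter name>
param_grid = {
    'SVM__C': [0.1, 1, 10],
    'SVM__kernel': ['rbf', 'linear'],
}

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
grid = GridSearchCV(pipe, param_grid, cv=kfold, scoring=scoring)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_ * 100)

The resulting grid.best_estimator_ could then be checked against the X_test / y_test split created earlier with train_test_split.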

4. Summary

I have summarized baseline model construction and pipeline processing. Pipeline processing seems to have a lot more depth worth exploring.

Comments and feedback on the article are always welcome.
