[PYTHON] Data prediction competition in 3 steps (titanic)

After reading various articles on Kaggle, I came to think that this kind of data analysis can be summarized in the following three steps.

Step 1: Feature engineering
Step 2: Select a learner and tune its hyperparameters
Step 3: Ensemble

Once these three steps are done, the model is fixed, so apply the test data (the data used for the submission) to that model, generate predictions, and submit. (Until then, do not submit; evaluate performance with the cross-validation score instead.) Since cross-validation is used at least once per step, it is performed at least three times in total.

Step 1: Feature engineering

Look for usable features through EDA (exploratory data analysis) and so on, and create new features where needed. Watch out for data leakage. Preprocessing such as handling missing values is also done here. Finally, set up a learner and remove unneeded features with RFE (recursive feature elimination), using cross-validation for the evaluation. With a decision-tree-based learner you can also inspect feature importances. For complex data such as images or text, consider extracting features with a neural network.
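As an illustration of checking feature importances with a tree-based learner, here is a minimal sketch. It uses scikit-learn's breast cancer toy dataset as a stand-in; in this article the features would be the engineered Titanic columns and the target would be Survived.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Toy data as a placeholder for the engineered features and target
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(random_state=0).fit(X, y)
# Sort the features by their importance in the fitted forest
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))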

Step 2: Select learner and adjust hyperparameters

For learners, it is a good idea to try everything you can think of (they can be ensembled later). Hyperparameters can be tuned by grid search, random search, or, more recently, Bayesian optimization with Optuna. The goal is to bring each learner to the best score it can reach. Learners whose evaluation scores are clearly too low are dropped.
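For comparison with the Optuna searches used later in this article, here is a minimal grid-search sketch with scikit-learn's GridSearchCV. The data and the parameter grid are placeholders, not the ones used in this article.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)  # placeholder data
param_grid = {'max_depth': [5, 10, 15], 'min_samples_leaf': [1, 3, 5]}  # placeholder grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, scoring='accuracy', cv=cv)
search.fit(X, y)
print(search.best_params_, search.best_score_)  # best combination and its CV score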

Step 3: Ensemble

This step combines the results of the individual learners. If there is only one learner, there is nothing to ensemble, so that learner becomes the final model. You can simply take a vote or an average of the predictions, or you can go further and stack the learners with a blender model on top. When stacking, the hyperparameters of the blender also need to be tuned. Sometimes the evaluation is actually better without an ensemble; in that case you have to decide which one becomes the final model. (It may be best to submit both.)
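As a minimal stacking sketch (the article itself uses hard voting below), scikit-learn's StackingClassifier can be used with a logistic regression as the blender. The base learners and the data here are placeholders.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)  # placeholder data
estimators = [
    ('rfc', RandomForestClassifier(random_state=0)),
    ('lr', LogisticRegression(max_iter=1000)),
]
# final_estimator is the blender; its hyperparameters can also be tuned
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(max_iter=1000))
scores = cross_validate(stack, X, y, cv=5)
print(scores['test_score'].mean(), scores['test_score'].std())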

About cross-validation

Cross-validation is a commonly used way to evaluate model performance. It is done with K-fold splitting, and in general a larger K is better (although a large K takes longer). In addition, use Stratified K-fold and shuffle the data so that class imbalance across folds does not occur. When evaluating performance, look at the standard deviation of the fold scores as well as their mean.
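A minimal sketch of this evaluation pattern, again with toy data as a stand-in: stratified, shuffled K-fold splits, reporting both the mean and the standard deviation of the fold scores.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = load_breast_cancer(return_X_y=True)  # placeholder data
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # stratified + shuffled splits
scores = cross_validate(RandomForestClassifier(random_state=0), X, y, cv=kf, scoring='accuracy')
print(scores['test_score'].mean(), scores['test_score'].std())  # look at both mean and std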

Implementation example for the Titanic competition

Referenced articles:
https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
http://maruo51.com/2019/04/07/titanic-3/
https://yolo-kiyoshi.com/2020/01/22/post-1588/

Feature engineering

EDA is omitted here. A random forest was used as the learner for RFE.

import pandas as pd 
import numpy as np

data_train = pd.read_csv('./input/train.csv')
data_test  = pd.read_csv('./input/test.csv')
data_train_raw = data_train.copy(deep=True)  # keep a copy of the original training data
data_test_raw = data_test.copy(deep=True)  # keep a copy of the original test data
data_cleaner = [data_train, data_test]  # keeping both frames in a list is convenient: a loop over the list also modifies the originals
# Incidentally, the data frames can be concatenated with data_all = pd.concat([data_train, data_test], sort=False)

# Drop columns that will not be used
drop_column = ['PassengerId','Cabin', 'Ticket','Name']
data_train.drop(drop_column, axis=1, inplace = True)
# Fill in missing values
for dataset in data_cleaner:    
    dataset['Age'].fillna(dataset['Age'].median(), inplace = True)
    dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace = True)
    dataset['Fare'].fillna(dataset['Fare'].median(), inplace = True)
# Create new features
for dataset in data_cleaner:    
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    dataset['IsAlone'] = 1
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0  # use .loc to avoid chained-assignment warnings
#Label encoding
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
for dataset in data_cleaner:    
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
drop_column = ['Sex','Embarked']
data_train.drop(drop_column, axis=1, inplace = True)
# Get the column labels for x and y
lavel_y = "Survived"
lavel_x = data_train.columns.values[1:]

# RFE (recursive feature elimination) with a random forest
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
NFOLD = 20  # number of CV folds
kf = StratifiedKFold(n_splits=NFOLD, shuffle=True, random_state=0)
rfe = RFECV(RandomForestClassifier(random_state = 0), scoring = 'accuracy', cv = kf)
rfe.fit(data_train[lavel_x], data_train[lavel_y])
f_importances = pd.DataFrame({"features":lavel_x,"select":rfe.get_support()})
print(f_importances)  # show which features RFE selected
# Keep only the features that RFE selected
X_train = rfe.transform(data_train[lavel_x])
X_test = rfe.transform(data_test[lavel_x])
Y_train = data_train[lavel_y]

Learner selection and hyperparameter adjustment

Using random forest, XGBoost, and LightGBM as learners, I ran hyperparameter searches with Optuna.

from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import optuna

#Random forest
def objective(trial):
    param_grid_rfc = {
        "max_depth": trial.suggest_int("max_depth", 5, 15),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 5),
        'min_samples_split': trial.suggest_int("min_samples_split", 7, 15),
        "criterion": trial.suggest_categorical("criterion", ["gini", "entropy"]),
        'max_features': trial.suggest_int("max_features", 3, 10),
        "random_state": 0
    }
    model = RandomForestClassifier(**param_grid_rfc)
    scores = cross_validate(model, X=X_train, y=Y_train, cv=kf)
    return scores['test_score'].mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
rfc_best_param = study.best_params

# XGBoost
def objective(trial):   
    param_grid_xgb = {
        'min_child_weight': trial.suggest_int("min_child_weight", 1, 5),
        'gamma': trial.suggest_discrete_uniform("gamma", 0.1, 1.0, 0.1),
        'subsample': trial.suggest_discrete_uniform("subsample", 0.5, 1.0, 0.1),
        'colsample_bytree': trial.suggest_discrete_uniform("colsample_bytree", 0.5, 1.0, 0.1),
        'max_depth': trial.suggest_int("max_depth", 3, 10),
        "random_state": 0
    }
    model = XGBClassifier(**param_grid_xgb)
    scores = cross_validate(model, X=X_train, y=Y_train, cv=kf)
    return scores['test_score'].mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
xgb_best_param = study.best_params

# LightGBM
def objective(trial):
    param_grid_lgb = {
        'num_leaves': trial.suggest_int("num_leaves", 3, 10),
        'learning_rate': trial.suggest_loguniform("learning_rate", 1e-8, 1.0),
        'max_depth': trial.suggest_int("max_depth", 3, 10),
        "random_state": 0
    }
    model = LGBMClassifier(**param_grid_lgb)
    scores = cross_validate(model, X=X_train, y=Y_train, cv=kf)
    return scores['test_score'].mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
lgb_best_param = study.best_params

Ensemble

The learners are ensembled by majority (hard) voting.

from sklearn.ensemble import VotingClassifier
estimators = [
    ('rfc', RandomForestClassifier(**rfc_best_param)),
    ('xgb', XGBClassifier(**xgb_best_param)),
    ('lgb', LGBMClassifier(**lgb_best_param)),
]
voting = VotingClassifier(estimators)
scores = cross_validate(voting, X=X_train, y=Y_train, cv=kf)
print(scores["test_score"].mean())
print(scores["test_score"].std())

Since the ensemble gives the best evaluation result, I use that model to predict on the test data and create the submission file.

voting.fit(X_train, Y_train)
data_test['Survived'] = voting.predict(X_test)
submit = data_test[['PassengerId','Survived']]
submit.to_csv("./submit.csv", index=False)

In closing

The cross-validation score was around 0.85, which looked fairly good, but the public leaderboard score was not as good. There are still many things that could be tried: how the features are created, the learner used for RFE, the number of folds in K-fold, the choice of learners, the hyperparameter search ranges, the ensembling method, and so on. I think the result can be improved further.
