[PYTHON] Parameter tuning with GridSearchCV / RandomizedSearchCV while using Voting Classifier

Overview

scikit-learn provides a VotingClassifier that combines the predictions of multiple models according to a voting rule. You can either pass in models whose parameters have already been set individually, or run the parameter tuning and the VotingClassifier at the same time.
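
For reference, the first alternative looks like the following (a minimal sketch with hypothetical, pre-chosen parameter values):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# Each model's parameters are fixed up front (hypothetical values)
eclf = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, max_depth=5)),
                ("lr", LogisticRegression(C=1.0, max_iter=1000))],
    voting="soft")
eclf.fit(X, y)
print(eclf.predict(X[:5]))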

In this post, I summarize how to tune the parameters of the individual models while using VotingClassifier. The method is essentially the same as the one described in the following documentation.

1.11. Ensemble methods — scikit-learn 0.18.1 documentation

How to specify parameters

In the param_grid / param_distributions passed to GridSearchCV / RandomizedSearchCV, write each key as {model name}__{parameter name}, e.g. lr__C. For the model name, use the name given in the estimators argument of VotingClassifier.

# http://scikit-learn.org/stable/modules/ensemble.html#using-the-votingclassifier-with-gridsearch
params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200],}

Precautions when tuning parameters

VotingClassifier does not tune the parameters of each model separately; the search runs over all combinations of parameters across all models. In other words, if xgboost has 100 candidate combinations and Random Forest has 100, the search does not try 100 per model and then vote with the tuned results; it tries 100 x 100 = 10,000 voting classifiers. Consequently, when there are many candidate parameters, GridSearchCV can run into a combinatorial explosion, so if possible it is more realistic to start with RandomizedSearchCV and a fixed number of trials (n_iter).

# Bad example: an abnormal number of combinations from running GridSearchCV over 3 models
Fitting 5 folds for each of 324000 candidates, totalling 1620000 fits
[CV] xg__colsample_bytree=0.5, rf__random_state=0, xg__learning_rate=0.5, rf__n_estimators=5, rf__n_jobs=1, xg__n_estimators=50, rf__max_depth=3, rf__min_samples_split=3, rf__max_features=3, xg__max_depth=3, lg__C=1.0 
[...]
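
Before launching GridSearchCV, you can count the candidates it would try; the following sketch (with smaller, hypothetical grids) uses scikit-learn's ParameterGrid for the check:

from sklearn.model_selection import ParameterGrid

# Hypothetical per-model grids, merged with the prefixed key style above
params = {
    "xg__max_depth": [3, 6, 9],
    "xg__learning_rate": [0.1, 0.5, 1.0],
    "rf__n_estimators": [5, 10, 50, 100],
    "rf__max_depth": [3, 5, 10, 20],
    "lr__C": [0.1, 1.0, 10.0],
}

# GridSearchCV takes the Cartesian product over ALL models' parameters:
# 3 * 3 * 4 * 4 * 3 = 432 candidate ensembles, each refit on every CV fold
print(len(ParameterGrid(params)))  # -> 432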

Also, if the dictionary used for parameter tuning contains keys prefixed with a model name that is not among the estimators of the VotingClassifier, an error occurs. The following is the error raised when a "lr__" parameter is included even though no estimator is named lr.

ValueError: Invalid parameter lr for estimator VotingClassifier(estimators=[(
[...]
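
One way to avoid this mistake (a sketch, assuming a merged params dictionary like the one built in the source code below) is to drop any key whose prefix does not match an estimator name:

# Keep only keys whose prefix matches an estimator name registered in eclf
valid_names = {name for name, _ in eclf.estimators}
params = {k: v for k, v in params.items()
          if k.split("__", 1)[0] in valid_names}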

Source code

An example that uses xgboost, RandomForest, and LogisticRegression as the inputs of the VotingClassifier. The parameter names of each model are prefixed when the dictionaries are merged, so that the per-model grids can be written in the same style as when not using voting.

import numpy as np
import xgboost as xgb

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Base models whose parameters are tuned through the VotingClassifier
xg = xgb.XGBClassifier()
rf = RandomForestClassifier()
lr = LogisticRegression()

# Parameter candidates per model, written as when tuning each model alone
xg_param = {
    "n_estimators": [50, 100, 150],
    "max_depth": [3, 6, 9],
    "colsample_bytree": [0.5, 0.9, 1.0],
    "learning_rate": [0.5, 0.9, 1.0]
}
rf_param = {
    "n_estimators": [5, 10, 50, 100, 300],
    "max_features": [3, 5, 10, 15, 20],
    "min_samples_split": [3, 5, 10, 20],
    "max_depth": [3, 5, 10, 20]
}
lr_param = {
    "C": list(np.logspace(0, 4, 10))
}

# Merge the per-model dictionaries, prefixing each key with the estimator name
params = {}
params.update({"xg__" + k: v for k, v in xg_param.items()})
params.update({"rf__" + k: v for k, v in rf_param.items()})
params.update({"lr__" + k: v for k, v in lr_param.items()})

# The estimator names ("xg", "rf", "lr") are the prefixes used in params
eclf = VotingClassifier(estimators=[("xg", xg),
                                    ("rf", rf),
                                    ("lr", lr)],
                        voting="soft")

# n_iter caps the number of sampled combinations, avoiding the
# combinatorial explosion a full grid search would hit here
clf = RandomizedSearchCV(eclf,
                         param_distributions=params,
                         cv=5,
                         n_iter=100,
                         n_jobs=1,
                         verbose=2)

# X_train, y_train, X_test are assumed to be prepared beforehand
clf.fit(X_train, y_train)
predict = clf.predict(X_test)
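
After fitting, the result can be inspected through the standard search attributes (nothing voting-specific here); the winning combination is reported with the prefixed names:

print(clf.best_score_)   # mean cross-validated score of the best candidate
print(clf.best_params_)  # e.g. {'xg__n_estimators': 150, 'rf__max_depth': 10, ...}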

Reference

1.11. Ensemble methods — scikit-learn 0.18.1 documentation
http://scikit-learn.org/stable/modules/ensemble.html#using-the-votingclassifier-with-gridsearch
