scikit-learn has a VotingClassifier that merges the predictions of multiple models according to a chosen rule. You can either pass it models whose parameters have already been tuned individually, or tune the parameters and the VotingClassifier at the same time.
This time, I will summarize how to tune the parameters of the individual models while using VotingClassifier. It is essentially the same as the method described in the following document.
1.11. Ensemble methods — scikit-learn 0.18.1 documentation
In the param_grid / param_distributions passed to GridSearchCV / RandomizedSearchCV, name each parameter as {model name}__{parameter name}, e.g. lr__C. For the model name, use the name given in estimators when constructing the VotingClassifier.
# http://scikit-learn.org/stable/modules/ensemble.html#using-the-votingclassifier-with-gridsearch
params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200],}
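As a minimal runnable sketch of this naming scheme (using the bundled iris dataset, not any particular data from this article), the prefixed keys route each grid entry to the correspondingly named sub-estimator:

```python
# Sketch: grid-searching over the named sub-estimators of a VotingClassifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

eclf = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0))],
    voting="soft",
)

# "lr__C" targets C of the estimator named "lr";
# "rf__n_estimators" targets n_estimators of the estimator named "rf".
params = {"lr__C": [1.0, 100.0], "rf__n_estimators": [20, 200]}

grid = GridSearchCV(eclf, param_grid=params, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

After fitting, `grid.best_params_` holds one winning value per prefixed key, covering both models at once.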
Note that this does not tune each model separately: the search runs over all combinations of parameters across all models. In other words, if xgboost has 100 candidate combinations and RandomForest has 100, the search does not try 100 of each and then merge the winners for voting; it fits 100 × 100 = 10,000 voting classifiers. So with many parameters, GridSearchCV can hit a combinatorial explosion, and if possible it is more realistic to first fix the number of trials (n_iter) with RandomizedSearchCV.
# Bad example: an absurd number of combinations from using GridSearchCV on 3 models
Fitting 5 folds for each of 324000 candidates, totalling 1620000 fits
[CV] xg__colsample_bytree=0.5, rf__random_state=0, xg__learning_rate=0.5, rf__n_estimators=5, rf__n_jobs=1, xg__n_estimators=50, rf__max_depth=3, rf__min_samples_split=3, rf__max_features=3, xg__max_depth=3, lg__C=1.0
[...]
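The candidate count above can be checked in advance with ParameterGrid before committing to a search. A small sketch (the parameter lists here are hypothetical, just to show that the count is the product of the list lengths across all models, not their sum):

```python
# Sketch: counting how many candidates a grid search would try.
from sklearn.model_selection import ParameterGrid

params = {
    "xg__n_estimators": [50, 100, 150],   # 3 values
    "xg__max_depth": [3, 6, 9],           # 3 values
    "rf__n_estimators": [5, 10, 50, 100], # 4 values
    "rf__max_depth": [3, 5, 10],          # 3 values
    "lr__C": [0.1, 1.0, 10.0],            # 3 values
}

# The grid size is the product: 3 * 3 * 4 * 3 * 3 = 324 candidates.
n_candidates = len(ParameterGrid(params))
print(n_candidates)  # 324
```

Multiply by the number of CV folds to get the total number of fits GridSearchCV would run.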
Also, if the dictionary used for parameter tuning contains a parameter prefixed with a model name that is not among the VotingClassifier's estimators, an error occurs. The following is the error raised when a parameter is prefixed with "lr__" even though no estimator is named lr.
ValueError: Invalid parameter lr for estimator VotingClassifier(estimators=[(
[...]
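One way to catch this before fitting is to check the keys that the VotingClassifier actually accepts via get_params(). A sketch (estimator names here are hypothetical, chosen to show a mismatch):

```python
# Sketch: listing the parameter names a VotingClassifier accepts,
# so a stray prefix like "lr__" can be caught before fit() raises.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

eclf = VotingClassifier(
    estimators=[("rf", RandomForestClassifier()),
                ("logreg", LogisticRegression())]
)

valid = eclf.get_params().keys()
print("rf__n_estimators" in valid)  # True: "rf" is a registered name
print("lr__C" in valid)             # False: no estimator named "lr"
```

Any key passed in param_grid / param_distributions must appear in this set, or GridSearchCV / RandomizedSearchCV will raise the ValueError shown above.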
Below is an example that uses xgboost, RandomForest, and LogisticRegression as inputs to VotingClassifier. Each model's parameters are renamed when the dictionaries are merged, so the grids can be written in the same style as when not voting.
import numpy as np
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

xg = xgb.XGBClassifier()
rf = RandomForestClassifier()
lr = LogisticRegression()
xg_param = {
"n_estimators": [50, 100, 150],
"max_depth": [3, 6, 9],
"colsample_bytree": [0.5, 0.9, 1.0],
"learning_rate": [0.5, 0.9, 1.0]
}
rf_param = {
"n_estimators": [5, 10, 50, 100, 300],
"max_features": [3, 5, 10, 15, 20],
"min_samples_split": [3, 5, 10, 20],
"max_depth": [3, 5, 10, 20]
}
lr_param = {
"C": list(np.logspace(0, 4, 10))
}
params = {}
params.update({"xg__" + k: v for k, v in xg_param.items()})
params.update({"rf__" + k: v for k, v in rf_param.items()})
params.update({"lr__" + k: v for k, v in lr_param.items()})
eclf = VotingClassifier(estimators=[("xg", xg),
("rf", rf),
("lr", lr)],
voting="soft")
clf = RandomizedSearchCV(eclf,
param_distributions=params,
cv=5,
n_iter=100,
n_jobs=1,
verbose=2)
clf.fit(X_train, y_train)
predict = clf.predict(X_test)
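After fit(), the search object also exposes the winning combination and its cross-validated score. A self-contained sketch (xgboost omitted and the bundled iris data used, so it runs anywhere; only the inspection pattern is the point):

```python
# Sketch: inspecting a finished RandomizedSearchCV over a VotingClassifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

eclf = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",
)
params = {"rf__n_estimators": [5, 10, 50], "lr__C": [0.1, 1.0, 10.0]}

clf = RandomizedSearchCV(eclf, param_distributions=params,
                         cv=3, n_iter=5, random_state=0)
clf.fit(X, y)
print(clf.best_params_)  # one winning value per model-prefixed key
print(clf.best_score_)   # mean CV accuracy of that combination
```

best_estimator_ is the refitted VotingClassifier with those parameters, ready for predict().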