[PYTHON] Random forest (classification) and hyperparameter tuning

Introduction

This article builds a classifier for the Breast Cancer Wisconsin dataset that predicts whether a tumor is benign or malignant, using a random forest with tuned hyperparameters. The dataset ships with scikit-learn and contains 569 samples with 30 features each; 212 samples are malignant and 357 are benign.
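The dataset size and class balance can be checked directly from scikit-learn. The following is a minimal sketch; the variable names n_samples, n_features and class_counts are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# data has one row per sample and one column per feature
n_samples, n_features = cancer_data.data.shape

# target is 0 or 1; target_names maps those integers to class labels
counts = np.bincount(cancer_data.target).tolist()
class_counts = dict(zip(cancer_data.target_names.tolist(), counts))

print(n_samples, n_features)  # 569 30
print(class_counts)           # {'malignant': 212, 'benign': 357}
```

In this dataset label 0 means malignant and label 1 means benign, which is worth keeping in mind when reading per-class metrics.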

Series

- Calculation of coefficient of determination by linear multiple regression and model selection
- Calculation of coefficient of determination by linear multiple regression and selection of model part_2
- Calculation of contribution rate by simple regression analysis
- Linear regression and narrowing down features
- Logistic regression (classification) and tuning of hyperparameters
- Linear SVC (classification) and hyperparameter tuning
- Nonlinear SVC (classification) and tuning of hyperparameters
- Decision tree (classification) and hyperparameter tuning
- Decision tree (classification) and hyperparameter tuning 2
- Random forest (classification) and hyperparameter tuning

What is Random Forest?

Random forest is a machine learning algorithm proposed by Leo Breiman in 2001 [1] and used for classification, regression, and clustering. It is an ensemble learning algorithm that uses decision trees as weak learners; the name comes from its use of a large number of decision trees, each trained on randomly sampled training data. (From Wikipedia)
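To make the idea concrete, here is a rough hand-rolled sketch of the two ingredients described above: each tree is trained on a bootstrap sample of the training data, and the trees then vote. This is an illustration only, not how scikit-learn implements it (RandomForestClassifier averages the trees' predicted probabilities rather than counting hard votes). The tree count of 25 and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote cannot tie (illustrative choice)
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many rows as the training set has, with replacement
    idx = rng.integers(0, len(train_X), size=len(train_X))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(
        max_features="sqrt",
        random_state=int(rng.integers(0, 2**31 - 1)),
    )
    tree.fit(train_X[idx], train_y[idx])
    trees.append(tree)

# Majority vote over the trees; labels are 0/1, so the mean vote works as a tally
votes = np.stack([t.predict(test_X) for t in trees])  # shape (n_trees, n_test)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
ensemble_acc = float((ensemble_pred == test_y).mean())
print("ensemble accuracy:", ensemble_acc)
```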

Random forest hyperparameters

For details, see the scikit-learn documentation for RandomForestClassifier.

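| Hyperparameter | Allowed values | Default |
|---|---|---|
| n_estimators | int | 10 |
| criterion | gini, entropy | gini |
| max_depth | int or None | None |
| min_samples_split | int or float | 2 |
| min_samples_leaf | int or float | 1 |
| min_weight_fraction_leaf | float | 0 |
| max_features | int, float, None, auto, sqrt, log2 | auto |
| max_leaf_nodes | int or None | None |
| min_impurity_decrease | float | 0 |
| min_impurity_split | float | 1e-7 |
| bootstrap | bool | True |
| oob_score | bool | False |
| n_jobs | int or None | None |
| random_state | int, RandomState instance or None | None |
| verbose | int | 0 |
| warm_start | bool | False |
| class_weight | dict, balanced, balanced_subsample or None | None |

These defaults reflect the scikit-learn version used when this article was written. In newer releases some of them differ: the default n_estimators is 100 from version 0.22 onward, min_impurity_split was removed in 1.0, and the "auto" option of max_features was removed in 1.3, where the classifier default is "sqrt".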
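Because defaults differ between scikit-learn releases, it can help to print them from the installed version instead of relying on a static list. A minimal sketch using get_params(), which every scikit-learn estimator provides:

```python
import sklearn
from sklearn.ensemble import RandomForestClassifier

# get_params() returns the constructor arguments and their current values
defaults = RandomForestClassifier().get_params()

print("scikit-learn", sklearn.__version__)
for name in sorted(defaults):
    print(f"{name}: {defaults[name]!r}")
```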

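Procedure

- Reading the breast cancer data
- Separation of training data and test data
- Grid search over the random forest hyperparameters
- Comparison with the default hyperparameters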
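The first two steps can be sketched on their own as follows. With no test_size argument, train_test_split holds out 25% of the samples, so the 569 samples are divided into 426 for training and 143 for testing.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Step 1: read the data
cancer_data = load_breast_cancer()

# Step 2: split; random_state=0 makes the split reproducible
train_X, test_X, train_y, test_y = train_test_split(
    cancer_data.data, cancer_data.target, random_state=0
)

print(train_X.shape, test_X.shape)  # (426, 30) (143, 30)
```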

Implementation in Python

%%time
from tqdm import tqdm
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

#Reading breast cancer data
cancer_data = load_breast_cancer()

#Separation of training data and test data
train_X, test_X, train_y, test_y = train_test_split(cancer_data.data, cancer_data.target, random_state=0)

# Search conditions
max_score = 0
# Grid size: 20 x 2 x 4 x 101 = 16,160 parameter combinations, each cross-validated.
# Note that random_state is searched here as if it were a hyperparameter.
RFC_grid = {RandomForestClassifier(): {"n_estimators": list(range(1, 21)),
                                       "criterion": ["gini", "entropy"],
                                       "max_depth": list(range(1, 5)),
                                       "random_state": list(range(0, 101))
                                      }}

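# Run the grid search for each (model, grid) pair; here there is only one
for model, param in tqdm(RFC_grid.items()):
    clf = GridSearchCV(model, param)
    clf.fit(train_X, train_y)
    # By default GridSearchCV refits the best candidate on all training data,
    # so predict() uses that refitted model
    pred_y = clf.predict(test_X)
    # For single-label classification, micro-averaged F1 equals accuracy
    score = f1_score(test_y, pred_y, average="micro")

    # Keep the best test score seen so far and its parameters
    if max_score < score:
        max_score = score
        best_param = clf.best_params_
        best_model = model.__class__.__name__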

print("Best score:{}".format(max_score))
print("model:{}".format(best_model))
print("parameter:{}".format(best_param))

# Baseline for comparison: default hyperparameters, no tuning
model = RandomForestClassifier()
model.fit(train_X, train_y)
score = model.score(test_X, test_y)  # mean accuracy on the test set
print("")
print("Default score:", score)
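
The tuned model is scored with f1_score(average="micro") while the default model is scored with model.score, which returns accuracy. For single-label classification these two numbers coincide, so the comparison is like for like. A small check, using an arbitrarily chosen forest (the parameter values below are assumptions for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# Any fitted classifier would do; these settings are illustrative only
model = RandomForestClassifier(n_estimators=14, max_depth=4, random_state=0)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)

micro_f1 = f1_score(test_y, pred_y, average="micro")
acc = accuracy_score(test_y, pred_y)
print(micro_f1, acc)  # the two values are equal
```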

Result

100%|███████████████████████████████████████████| 1/1 [10:39<00:00, 639.64s/it]
Best score:0.965034965034965
model:RandomForestClassifier
parameter:{'criterion': 'entropy', 'max_depth': 4, 'n_estimators': 14, 'random_state': 62}

Default score: 0.951048951049
Wall time: 10min 39s
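Most of the 10 minutes goes into evaluating all 16,160 grid combinations. One possible way to cut the cost, not used in the original run, is to sample a limited number of combinations with RandomizedSearchCV. The sketch below assumes a fixed random_state on the estimator instead of searching over seeds, and n_iter=20 is an arbitrary budget.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# Same ranges as the grid above, minus the random_state axis
param_distributions = {
    "n_estimators": list(range(1, 21)),
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(1, 5)),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),  # seed fixed, not tuned (assumption)
    param_distributions,
    n_iter=20,        # number of sampled combinations (illustrative budget)
    random_state=0,   # makes the sampling reproducible
)
search.fit(train_X, train_y)

test_score = search.score(test_X, test_y)
print("best params:", search.best_params_)
print("mean CV score of best candidate:", search.best_score_)
print("test score:", test_score)
```

Whether the sampled search finds parameters as good as the full grid depends on the budget, so the result should be compared against the numbers above before relying on it.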

Conclusion

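Tuning the hyperparameters raised the test score from about 0.951 with the defaults to about 0.965, which corresponds to 138 versus 136 correct predictions out of 143 test samples. The gain is small relative to a search that took more than ten minutes, and because the default model was trained without a fixed random_state, its score can vary from run to run.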
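When comparing scores this close, it helps to know how much the default model's score moves with the seed alone. The following sketch, which is not part of the original experiment, trains the default forest with ten different random_state values (an arbitrary number) and reports the spread.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# Default hyperparameters; only the seed changes between runs
scores = []
for seed in range(10):
    model = RandomForestClassifier(random_state=seed)
    model.fit(train_X, train_y)
    scores.append(model.score(test_X, test_y))

print("min:", min(scores))
print("max:", max(scores))
print("mean:", sum(scores) / len(scores))
```

If the spread is comparable to the gap between the tuned and default scores, the improvement should be read with caution.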
