[Python] Machine Learning: Supervised - Random Forest

Target

Understand the Random Forest algorithm and try it with scikit-learn.

Theory

Random forest is an ensemble learning method based on bagging that combines multiple decision trees.

No free lunch theorem

The no free lunch theorem originally comes from combinatorial optimization: it states that if a search algorithm is applied to every possible problem, all algorithms achieve the same average performance.

This is because each algorithm relies on its own assumptions, and not every problem satisfies them: an algorithm may work well on one problem but perform worse than the others on another. In other words, no algorithm is better than all the others across every problem.

From this, the theorem is cited in machine learning to argue that there is no one-size-fits-all learner that gives the best results for every problem.

Bagging

The no free lunch theorem above shows that no single learner is best for every problem. It is therefore natural to come up with methods that combine multiple learners.

A method that takes a majority vote over the outputs of multiple learners as the final output is called ensemble learning. The individual classifiers used in an ensemble are called weak classifiers, because each only needs to perform slightly better than random guessing.

Bagging (Bootstrap AGGregatING) is a typical ensemble learning method. In bagging, as shown in the figure below, multiple classifiers are trained on bootstrap samples of the training data; for new data, the final output is the majority vote of the classifiers in classification and the average of their predictions in regression.

108_bagging.png
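As a minimal sketch (not the article's published program), bagging with decision trees can be tried using scikit-learn's BaggingClassifier. This assumes scikit-learn 0.23.x as listed in the execution environment below; later releases renamed the base_estimator argument to estimator.

```python
# Minimal sketch: bagging decision trees with scikit-learn's BaggingClassifier
# (assumes scikit-learn 0.23.x; later versions renamed base_estimator to estimator)
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees, each trained on a bootstrap sample of the training data;
# predictions for new data are decided by majority vote
bagging = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=0,
)
bagging.fit(X_train, y_train)
print(f"Test accuracy: {bagging.score(X_test, y_test):.2%}")
```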

Bagging allows the individual classifiers to be trained independently and in parallel. However, because bootstrap sampling draws with replacement, the samples overlap; when decision trees are used as weak classifiers, the trees become strongly correlated and all end up similar, so much of the ensemble's benefit can be wasted.

Random forest improves on this problem.

Random forest

In random forest training, when each decision tree is learned from a bootstrap sample, a specified number of features is selected at random instead of using all of them, and the tree is built on that subset.

That is, where bagging builds each decision tree from the full bootstrap sample, random forest also randomly selects the features used within each bootstrap sample to construct the decision tree, as shown in the figure below.

108_random_forest.png

By randomizing the features used in each bootstrap sample in this way, the individual decision trees become diverse, which can be expected to reduce the correlation between trees that was the problem with bagging.

In scikit-learn, the argument n_estimators specifies the number of weak classifiers, and the argument max_features specifies the number of features to use. By default, the number of features used for classification is the square root of the total number of features.
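For example, a minimal sketch (not the article's program) of these two arguments:

```python
# Minimal sketch: the n_estimators and max_features arguments
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,     # number of weak classifiers (decision trees)
    max_features='sqrt',  # features per split: square root of the feature count
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))  # accuracy on the training data
```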

Implementation

Execution environment

Hardware

・CPU: Intel(R) Core(TM) i7-6700K 4.00GHz

Software

・Windows 10 Pro 1909
・Python 3.6.6
・Matplotlib 3.3.1
・NumPy 1.19.2
・scikit-learn 0.23.2

Program to run

The implemented program is published on GitHub.

random_forest.py


Result

Classification by random forest

Here is the result of applying a random forest to the breast cancer dataset that we have been using so far.

Accuracy 92.98%
Precision, Positive predictive value(PPV) 94.03%
Recall, Sensitivity, True positive rate(TPR) 94.03%
Specificity, True negative rate(TNR) 91.49%
Negative predictive value(NPV) 91.49%
F-Score 94.03%
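These values come from the author's program above. As a rough illustration (not the original code), such metrics can be derived from a confusion matrix like this:

```python
# Rough sketch: computing the reported metrics from a confusion matrix
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision, recall = tp / (tp + fp), tp / (tp + fn)
print(f"Accuracy    {(tp + tn) / (tp + tn + fp + fn):.2%}")
print(f"Precision   {precision:.2%}")         # positive predictive value (PPV)
print(f"Recall      {recall:.2%}")            # sensitivity, true positive rate (TPR)
print(f"Specificity {tn / (tn + fp):.2%}")    # true negative rate (TNR)
print(f"NPV         {tn / (tn + fn):.2%}")    # negative predictive value
print(f"F-Score     {2 * precision * recall / (precision + recall):.2%}")
```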

The figure below shows the decision boundaries obtained when performing multiclass classification on the iris dataset.

108_random_forest_classification.png
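A rough sketch of how such a boundary plot can be produced (not the article's actual script; only the first two iris features are used here so the boundary can be drawn in 2-D):

```python
# Rough sketch: decision boundaries of a random forest on the iris dataset
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # first two features only, for a 2-D plot

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Evaluate the forest on a grid and colour each region by its predicted class
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = forest.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', s=20)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()
```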

Random forest regression

The data for the regression problem is a sine wave with random noise added. In regression, the mean of the individual trees' predictions is the final output value.

108_random_forest_regression.png
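A minimal sketch of such a regression experiment (the exact noise level and sample size are assumptions, not taken from the article):

```python
# Minimal sketch: random forest regression on a noisy sine wave
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)  # sine wave plus random noise

# The forest's prediction is the average of the individual trees' outputs
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X, y)

X_test = np.linspace(0, 5, 200).reshape(-1, 1)
plt.scatter(X, y, s=10, label='training data')
plt.plot(X_test, regressor.predict(X_test), color='red', label='prediction')
plt.legend()
plt.show()
```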

References

  1. 1.11.2. Forests of randomized trees, scikit-learn User Guide.
  2. Leo Breiman. "Random Forests", Machine Learning 45.1 (2001): pp. 5-32.
  3. Yuzo Hirai. "First Pattern Recognition", Morikita Publishing, 2012.
