[PYTHON] Random Forest size / processing time comparison

Overview

Articles about the prediction accuracy of random forests are common, but I couldn't find any that compare model size, training time, and prediction time, so I measured them myself. Training and prediction use the Boston housing data included in scikit-learn (506 records, 13 columns). I varied the parameters most commonly tuned in a random forest, n_estimators (number of trees) and max_depth (maximum tree depth), as well as the number of records. The record count was scaled up from the original to n = 506, 5060, and 50600; when bulking up the data, each value was randomly perturbed by up to ±10% to avoid creating exact duplicates as far as possible.
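The bulking-up step described above can be sketched as follows. This is a minimal illustration, not the author's exact script: since `load_boston` has been removed from recent scikit-learn releases, a synthetic array of the same shape (506 rows, 13 columns) stands in for the Boston data, and the ±10% jitter is applied multiplicatively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the Boston housing data (506 records, 13 columns).
# load_boston is gone from recent scikit-learn, so synthetic data is
# used here purely to demonstrate the bulking-up procedure.
X = rng.normal(size=(506, 13))
y = rng.normal(size=506)

def bulk_up(X, y, factor, rng):
    """Replicate the data `factor` times, jittering each value by up to
    +/-10% so the copies are not exact duplicates."""
    X_rep = np.tile(X, (factor, 1))
    y_rep = np.tile(y, factor)
    noise = rng.uniform(0.9, 1.1, size=X_rep.shape)
    return X_rep * noise, y_rep

X_5060, y_5060 = bulk_up(X, y, 10, rng)   # n = 5060
X_50600, y_50600 = bulk_up(X, y, 100, rng)  # n = 50600
```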

Model size

[Figure: model size versus n_estimators, max_depth, and record count]

The model size is simply proportional to the number of trees in the forest. It also grew by roughly a factor of 1.5 to 2 for each unit increase in maximum tree depth. This is what you would expect: the number of nodes in a binary tree at most doubles each time its maximum depth increases by one, and the number of splits grows as the number of records increases.
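One simple way to measure model size, and a plausible reading of what the article did, is to serialize the fitted model with pickle and count the bytes. The sketch below uses synthetic data of the Boston data's shape (an assumption, since `load_boston` is unavailable in current scikit-learn):

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))  # synthetic stand-in for the Boston data
y = rng.normal(size=506)

def model_size_bytes(n_estimators, max_depth):
    """Fit a forest and return the size of its pickle serialization."""
    model = RandomForestRegressor(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    ).fit(X, y)
    return len(pickle.dumps(model))

size_10 = model_size_bytes(10, 8)
size_20 = model_size_bytes(20, 8)
# Doubling n_estimators should roughly double the serialized size,
# matching the proportionality observed in the article.
```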

Processing time when creating a model

[Figure: training time versus n_estimators, max_depth, and record count]

The training time appears to be proportional to the number of trees, the maximum depth, and the size of the training data.
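A minimal sketch of how such training times can be measured, using `time.perf_counter` around `fit` (the dataset here is a synthetic stand-in at n = 5060, an assumption rather than the author's exact setup):

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5060, 13))  # synthetic stand-in, n = 5060
y = rng.normal(size=5060)

def fit_seconds(n_estimators, max_depth=8):
    """Return the wall-clock time to fit one forest."""
    model = RandomForestRegressor(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

t_10 = fit_seconds(10)
t_100 = fit_seconds(100)
# With 10x the trees, training should take noticeably longer.
```

For stable numbers one would average several runs; a single timing is shown here for brevity.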

Processing time at the time of prediction

For prediction, I used the model trained with n = 5060 and measured the time taken by the predict function.

[Figure: prediction time versus n_estimators and max_depth]

Prediction time is also proportional to the number of trees and their depth. On the other hand, surprisingly, it did not depend on the number of records being predicted at all: in my measurements, the time barely changed even when predicting on roughly 100 million rows.
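The prediction-time measurement can be sketched the same way: fit once, then time `predict` on batches of different sizes. As above, the data is a synthetic stand-in for the bulked-up Boston set, and the row counts are kept small here so the example runs quickly:

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5060, 13))  # synthetic stand-in, n = 5060
y = rng.normal(size=5060)

# Fit once; only predict() is timed below.
model = RandomForestRegressor(
    n_estimators=100, max_depth=8, random_state=0
).fit(X, y)

def predict_seconds(n_rows):
    """Return the wall-clock time to predict on n_rows fresh records."""
    X_new = rng.normal(size=(n_rows, 13))
    start = time.perf_counter()
    model.predict(X_new)
    return time.perf_counter() - start

for n in (1_000, 100_000):
    print(n, predict_seconds(n))
```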

Finally

I would also have liked to vary the number of columns, but I couldn't find a good way to do it, so I skipped it this time. Of course, these results will vary with the characteristics of the data, so please treat them as a rough reference.
