[PYTHON] [Machine learning] OOB (Out-Of-Bag) and its ratio

This post explains OOB (Out-Of-Bag), a term you often encounter when reading about Random Forest, in detail.

Bootstrap sampling

Bootstrap sampling is the method of creating a training sample set by randomly drawing $N$ samples, with replacement, from the $N$ original training samples $\{\boldsymbol{x}_i, y_i\}_{i=1}^N$. Random Forest is called a "forest" because it builds a large number of decision trees, one from each of the $M$ training sample sets created by this bootstrap sampling.
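As a minimal sketch (not the article's code), one bootstrap sample set can be drawn with NumPy's `np.random.choice` using `replace=True`; the names and sizes below are arbitrary.

Python

import numpy as np

rng = np.random.default_rng(71)

N = 100                    # number of training samples
indices = np.arange(N)     # stand-in for the sample indices

# Draw N indices with replacement: this is one bootstrap sample set
bootstrap_idx = rng.choice(indices, size=N, replace=True)

# Samples never drawn are the Out-Of-Bag (OOB) samples for this set
oob = np.setdiff1d(indices, bootstrap_idx)
print("Number of OOB samples:", len(oob))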


At this time, since $N$ samples are drawn with replacement from $N$, some of the data may never be selected. These unselected samples are called OOB (Out-Of-Bag). They are also used to evaluate the error of a Random Forest (as in the scikit-learn example in the references). Focusing on the $i$-th data point $(\boldsymbol{x}_i, y_i)$, some of the $M$ bootstrap sample sets do not contain it. The OOB error rate is therefore obtained by collecting only the trees that did not use this sample, forming a sub-forest from them, and evaluating its accuracy on that sample.
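As a small illustration (not from the article itself), scikit-learn's RandomForestClassifier exposes this kind of OOB evaluation through the oob_score=True option; the toy dataset and parameters below are arbitrary.

Python

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Arbitrary toy dataset, just for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=71)

# With oob_score=True, each sample is scored only by the trees
# whose bootstrap sample set did not contain it
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=71)
clf.fit(X, y)

print("OOB accuracy:", clf.oob_score_)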

How large is this OOB fraction? It is often said that about 36% of the samples are not used when creating a single bootstrap sample set. Is that true? Let's find out.

Try with simulation

The figure below shows bar graphs obtained by repeating the following 12 times: randomly draw 100 data points, with replacement, from 100 data points and count how often each value was chosen. The number of values that were never selected, i.e. the number of OOB samples, is shown in the title of each graph. The counts are all in the mid-30s, and the average over the 12 trials was about 34.83. So it is indeed in the mid-30% range.


Python


import numpy as np
import numpy.random as rd
import matplotlib.pyplot as plt
from collections import Counter

rd.seed(71)
n_data = 100
result = np.zeros((12, n_data))

OOBs = []
plt.figure(figsize=(16, 8))
for i in range(12):
    ax = plt.subplot(4, 3, i + 1)
    # Draw n_data values with replacement from 1..n_data
    result[i] = rd.randint(1, n_data + 1, n_data)
    # One bin per value; bins whose count is 0 correspond to OOB samples
    res = plt.hist(result[i], bins=range(1, n_data + 2))
    cnt = Counter(res[0])
    plt.title("# of OOB = {}".format(cnt[0]))
    OOBs.append(cnt[0])
    plt.xlim(0, n_data)

plt.tight_layout()
print("Average of OOB = {}".format(np.mean(OOBs)))

Think theoretically

Consider the probability that a particular sample $\boldsymbol{x}_i$ is not chosen in a single draw. There are $N$ data points in total, so not picking the one we are focusing on means picking one of the remaining $N-1$, and the probability is

\left( { N-1 \over N} \right)

Since we draw $N$ samples, each time choosing from the full set of $N$ data points, the sample must be missed in all $N$ independent draws, so the probability of never being chosen is this raised to the $N$-th power:

\left( { N-1 \over N} \right)^{N} = \left( 1-{ 1 \over N} \right)^{N}

If you compute this, it turns out to be about 36% as $N$ grows. Let's check.

Python


import pandas as pd
from tabulate import tabulate

res_list = []
for i in range(6):
    n_data = 10**i
    # Expected number of samples never selected: N * ((N-1)/N)^N
    result = (((n_data - 1) / n_data)**n_data) * n_data
    res_list.append([n_data, result, result / n_data])
df = pd.DataFrame(res_list)
print(tabulate(df, ["Number of data", "Expected number unselected", "Unselected rate"], tablefmt="pipe"))
out

|    |   Number of data |   Expected number unselected |   Unselected rate |
|---:|-----------------:|-----------------------------:|------------------:|
|  0 |                1 |                      0       |          0        |
|  1 |               10 |                      3.48678 |          0.348678 |
|  2 |              100 |                     36.6032  |          0.366032 |
|  3 |             1000 |                    367.695   |          0.367695 |
|  4 |            10000 |                   3678.61    |          0.367861 |
|  5 |           100000 |                  36787.8     |          0.367878 |

It converges nicely to around 36.8%.

Relationship with Euler's number $e$

**@nykergoto pointed out this relationship with Euler's number $e$. Thank you!**

By the way, this probability,

\left( 1-{ 1 \over N} \right)^{N}

looks similar to the definition of Euler's number $e$,

\lim_{N \rightarrow \infty} \left( 1 + { 1 \over N} \right)^{N} = e

If we change variables here with $t = N - 1$, so that $N = t + 1$, then

\left( { N-1 \over N} \right)^{N} = \left( { t \over t+1} \right)^{t+1} = \left( { t+1 \over t} \right)^{-(t+1)} = \left( 1 + { 1 \over t} \right)^{-(t+1)} = \left( \left( 1 + { 1 \over t} \right)^{t+1} \right)^{-1}

Taking $t \rightarrow \infty$, the definition of $e$ gives

\lim_{t \rightarrow \infty} \left( \left( 1 + { 1 \over t} \right)^{t+1} \right)^{-1} = e^{-1}

The OOB ratio, the fraction of the bootstrap sample that is never selected, turns out to be the reciprocal of Euler's number $e$!

Computing it, it is indeed about 36.8%!

Python


np.e**-1

out


0.36787944117144233
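As an extra check (not in the original article), we can compare $(1 - 1/N)^N$ directly with $e^{-1}$ for a large, arbitrarily chosen $N$.

Python

import numpy as np

N = 10**6                # arbitrary large N
print((1 - 1/N)**N)      # close to 1/e
print(np.e**-1)          # 0.36787944117144233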

References

Python code for this article (GitHub)  https://github.com/matsuken92/Qiita_Contents/blob/master/General/OOB-test.ipynb

Scikit-Learn OOB Errors for Random Forests  http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html
