[PYTHON] [Machine learning] OOB (Out-Of-Bag) and its ratio

This post explains OOB (Out-Of-Bag), a term you often encounter when reading about Random Forest, in detail.

Bootstrap sampling

Bootstrap sampling is the method of creating a training sample set by randomly drawing $N$ samples, with replacement, from the $N$ original training samples $\{\boldsymbol{x}_i, y_i\}_{i=1}^N$. Random Forest is called a "forest" because it builds a large number of decision trees, one from each of the $M$ training sample sets created by this bootstrap sampling.
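As a minimal sketch (not the article's code), one bootstrap sample set can be drawn with NumPy's `np.random.choice` using `replace=True`; the names and sizes below are arbitrary.

Python

import numpy as np

rng = np.random.default_rng(71)

N = 100                    # number of training samples
indices = np.arange(N)     # stand-in for the sample indices

# Draw N indices with replacement: this is one bootstrap sample set
bootstrap_idx = rng.choice(indices, size=N, replace=True)

# Samples never drawn are the Out-Of-Bag (OOB) samples for this set
oob = np.setdiff1d(indices, bootstrap_idx)
print("Number of OOB samples:", len(oob))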


At this time, since $N$ samples are drawn with replacement from $N$, some of the data may never be selected. These unselected samples are called OOB (Out-Of-Bag). They are also used to evaluate the error of a Random Forest (as in the scikit-learn example in the references). Focusing on the $i$-th data point $(\boldsymbol{x}_i, y_i)$, some of the $M$ bootstrap sample sets do not contain it. The OOB error rate is therefore obtained by collecting only the trees that did not use this sample, forming a sub-forest from them, and evaluating its accuracy on that sample.
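As a small illustration (not from the article itself), scikit-learn's RandomForestClassifier exposes this kind of OOB evaluation through the oob_score=True option; the toy dataset and parameters below are arbitrary.

Python

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Arbitrary toy dataset, just for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=71)

# With oob_score=True, each sample is scored only by the trees
# whose bootstrap sample set did not contain it
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=71)
clf.fit(X, y)

print("OOB accuracy:", clf.oob_score_)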

How large is this OOB fraction? It is often said that about 36% of the samples are not used when creating a single bootstrap sample set. Is that true? Let's find out.

Try with simulation

The figure below shows bar graphs obtained by repeating the following 12 times: randomly draw 100 data points, with replacement, from 100 data points and count how often each value was chosen. The number of values that were never selected, i.e. the number of OOB samples, is shown in the title of each graph. The counts are all in the mid-30s, and the average over the 12 trials was about 34.83. So it is indeed in the mid-30% range.


Python


import numpy as np
import numpy.random as rd
import matplotlib.pyplot as plt
from collections import Counter

rd.seed(71)
n_data = 100
result = np.zeros((12, n_data))

OOBs = []
plt.figure(figsize=(16, 8))
for i in range(12):
    ax = plt.subplot(4, 3, i + 1)
    # Draw n_data values with replacement from 1..n_data
    result[i] = rd.randint(1, n_data + 1, n_data)
    # One bin per value; bins whose count is 0 correspond to OOB samples
    res = plt.hist(result[i], bins=range(1, n_data + 2))
    cnt = Counter(res[0])
    plt.title("# of OOB = {}".format(cnt[0]))
    OOBs.append(cnt[0])
    plt.xlim(0, n_data)

plt.tight_layout()
print("Average of OOB = {}".format(np.mean(OOBs)))

Think theoretically

Consider the probability that a particular sample $\boldsymbol{x}_i$ is not chosen in a single draw. There are $N$ data points in total, so not picking the one we are focusing on means picking one of the remaining $N-1$, and the probability is

\left( { N-1 \over N} \right)

Since we draw $N$ samples, each time choosing from the full set of $N$ data points, the sample must be missed in all $N$ independent draws, so the probability of never being chosen is this raised to the $N$-th power:

\left( { N-1 \over N} \right)^{N} = \left( 1-{ 1 \over N} \right)^{N}

If you compute this, it turns out to be about 36% as $N$ grows. Let's check.

Python


import pandas as pd
from tabulate import tabulate

res_list = []
for i in range(6):
    n_data = 10**i
    # Expected number of samples never selected: N * ((N-1)/N)^N
    result = (((n_data - 1) / n_data)**n_data) * n_data
    res_list.append([n_data, result, result / n_data])
df = pd.DataFrame(res_list)
print(tabulate(df, ["Number of data", "Expected number unselected", "Unselected rate"], tablefmt="pipe"))
out

|    |   Number of data |   Expected number unselected |   Unselected rate |
|---:|-----------------:|-----------------------------:|------------------:|
|  0 |                1 |                      0       |          0        |
|  1 |               10 |                      3.48678 |          0.348678 |
|  2 |              100 |                     36.6032  |          0.366032 |
|  3 |             1000 |                    367.695   |          0.367695 |
|  4 |            10000 |                   3678.61    |          0.367861 |
|  5 |           100000 |                  36787.8     |          0.367878 |

It converges nicely to around 36.8%.

Relationship with Euler's number $e$

**@nykergoto pointed out this relationship with Euler's number $e$. Thank you!**

By the way, this probability,

\left( 1-{ 1 \over N} \right)^{N}

looks similar to the definition of Euler's number $e$,

\lim_{N \rightarrow \infty} \left( 1 + { 1 \over N} \right)^{N} = e

If we change variables here with $t = N - 1$, so that $N = t + 1$, then

\left( { N-1 \over N} \right)^{N} = \left( { t \over t+1} \right)^{t+1} = \left( { t+1 \over t} \right)^{-(t+1)} = \left( 1 + { 1 \over t} \right)^{-(t+1)} = \left( \left( 1 + { 1 \over t} \right)^{t+1} \right)^{-1}

Taking $t \rightarrow \infty$, the definition of $e$ gives

\lim_{t \rightarrow \infty} \left( \left( 1 + { 1 \over t} \right)^{t+1} \right)^{-1} = e^{-1}

The OOB ratio, the fraction of the bootstrap sample that is never selected, turns out to be the reciprocal of Euler's number $e$!

Computing it, it is indeed about 36.8%!

Python


np.e**-1

out


0.36787944117144233
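As an extra check (not in the original article), we can compare $(1 - 1/N)^N$ directly with $e^{-1}$ for a large, arbitrarily chosen $N$.

Python

import numpy as np

N = 10**6                # arbitrary large N
print((1 - 1/N)**N)      # close to 1/e
print(np.e**-1)          # 0.36787944117144233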

References

Python code for this article (GitHub)  https://github.com/matsuken92/Qiita_Contents/blob/master/General/OOB-test.ipynb

Scikit-Learn OOB Errors for Random Forests  http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html
