[PYTHON] A comment on Boruta algorithm

Omission

Boruta is one of the feature selection methods using feature importance.

I studied by referring to the explanation on the following site. It is a very easy-to-understand explanation, and it is very helpful with experiments.

-Feature selection method Boruta using random forest and test

comment

Boruta's algorithm uses the null hypothesis to determine if a feature is useful for prediction. The importance of this feature is the same as the importance of the feature that does not contribute to discrimination (regression). ", And perform the test. If the null hypothesis is correct, the importance of a feature and the capacity of a feature that does not contribute to discrimination (regression) should be random, and the probability that a feature will become important is $ p = 0.5. It should follow the binomial distribution of $.

What I was interested in here is not one Shadow Feature, but the same number as the original feature dimension. Then, the size comparison is performed by "the highest importance of a certain feature VS Shadow Feature".

スクリーンショット 2020-05-16 0.07.35.png The above image is taken from [Feature selection method Boruta using random forest and test](https://aotamasaki.hatenablog.com/entry/2019/01/05/195813).

Implementation of boruta_py also looks like that.

    def _do_tests(self, dec_reg, hit_reg, _iter):
        active_features = np.where(dec_reg >= 0)[0]
        hits = hit_reg[active_features]
        # get uncorrected p values based on hit_reg
        to_accept_ps = sp.stats.binom.sf(hits - 1, _iter, .5).flatten()
        to_reject_ps = sp.stats.binom.cdf(hits, _iter, .5).flatten()
        ...

However, if the random variable representing the maximum feature importance of Shadow Feature is $ S_ {\ mathrm {max}} $,

S_{\mathrm{max}} \leq F_1 \Leftrightarrow S_1\leq F_1 \wedge S_2\leq F_1 \wedge S_3\leq F_1 \wedge S_4\leq F_1 

Therefore,

\mathrm{Pr}(S_{\mathrm{max}} \leq F_1) = \mathrm{Pr}(S_1 \leq F_1)\mathrm{Pr}(S_2 \leq F_1)\mathrm{Pr}(S_3 \leq F_1)\mathrm{Pr}(S_4 \leq F_1) = \left(\frac{1}{2}\right)^4

If the null hypothesis is correct, then the probability that a feature will be greater than the largest feature importance of the Shadow Feature should follow the binomial distribution of $ p = (1/2) ^ 4 $.

%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import binom

plt.rcParams['font.family'] = 'IPAPGothic' 
plt.rcParams["figure.figsize"] = [12, 6]
plt.rcParams['font.size'] = 20 
plt.rcParams['xtick.labelsize'] = 15
plt.rcParams['ytick.labelsize'] = 15

#Number of trials
N = 100
#Number of occurrences during the number of trials(Array)   
k = np.arange(N+1)

#Plot on graph
fig, ax = plt.subplots(1,1)
ax.plot(k, binom.pmf(k, N, p=0.5), 'bo', ms=8, label='p=0.5')
ax.vlines(k, 0, binom.pmf(k, N, p), colors='b', lw=1, alpha=0.2)
ax.plot(k, binom.pmf(k, N, p=0.5**4), 'x', ms=8, color='r', label='p=0.5^4')
ax.vlines(k, 0, binom.pmf(k, N, p=0.5**4), colors='r', lw=1, alpha=0.2)
ax.set_xlabel('iterarion')
ax.set_ylabel('probability')
ax.legend()
plt.show()
スクリーンショット 2020-05-16 1.36.19.png

As you can see in the graph, it is quite different from the rejection area.

Grass

I wrote so far, but when I actually use Boruta, does it work properly? So, is there something wrong with the way of thinking?

Recommended Posts

A comment on Boruta algorithm
A * algorithm (Python edition)
Create a classroom on Jupyterhub
Building a Python environment on Mac
matplotlib: Insert comment on timeline graph
Algorithm Gymnastics 24 Reverse a Linked List
Building a Python environment on Ubuntu
Create a Python environment on Mac (2017/4)
Install Arch Linux on DeskMini A300
Write A * (A-star) algorithm in Python
Run a Linux server on GCP
Create a SlackBot service on Pepper
Implement a Django app on Hy
Run Matplotlib on a Docker container
Create a Linux environment on Windows 10
Create a python environment on centos
Run headless-chrome on a Debian-based image
Comment on Pull Request from CircleCI
Using a serial console on Ubuntu 20.04
Run TensorFlow2 on a VPS server
Implementing a simple algorithm in Python 2
Run a simple algorithm in Python
Register as a package on PyPI
Build a python3 environment on CentOS7
Algorithm Gymnastics 22 Squaring a Sorted Array
Cause a bus error on x86