Random Forest is an ensemble learning algorithm that uses decision trees as weak learners, and it is one of the standard machine learning methods. Each decision tree is built from a sample of the features and of the observations. There is plenty of material on the algorithm itself, so I will not go into it here.
For example, if you try to classify a dataset with 1,000 positive examples and 50 negative examples using Random Forest, it will not work well as a classifier unless you take some countermeasures. One option is to weight the imbalanced classes (adjustable via the class_weight parameter of scikit-learn's RandomForestClassifier), but if the weight placed on some data becomes too strong, overfitting can occur.
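As a rough sketch of that weighting approach (the dataset here is synthetic and the parameter values are just placeholders):

```python
# Hypothetical sketch: handling imbalance via class weights instead of sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Roughly 1,000 vs. 50 examples, as in the example above.
X, y = make_classification(n_samples=1050, weights=[0.95, 0.05], random_state=0)
print(np.bincount(y))  # show the class imbalance

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X, y)
```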
Another way to deal with imbalanced data is to adjust the number of samples drawn for each decision tree. (In the example above, each decision tree would be built from, say, 50 positive and 50 negative examples; the imbalance is handled by sampling rather than by weighting.) This approach is apparently called a Balanced Random Forest. In R it can be run simply by setting a parameter, but scikit-learn has nothing equivalent, so I implemented it myself. (If I have made a mistake, please leave a comment.)
balanced_random_forest.py
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from collections import Counter
from random import sample
from math import sqrt

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


def _build_tree(train: np.ndarray, label: np.ndarray):
    tree = DecisionTreeClassifier()
    tree.fit(train, label)
    return tree


def _sampling_equal(y_values: np.ndarray, n_samples: int, bootstrap: bool = True):
    """
    :param y_values: label data
    :param n_samples: number of samples per class
    :param bootstrap: whether to sample with replacement
    :return: sampled indices (flattened over all classes)
    """
    if bootstrap:
        return [i for v in [resample(pd.Series(y_values)[pd.Series(y_values) == uq].index,
                                     n_samples=n_samples) for uq in np.unique(y_values)] for i in v]
    else:
        return [i for v in [sample(list(pd.Series(y_values)[pd.Series(y_values) == uq].index),
                                   n_samples) for uq in np.unique(y_values)] for i in v]


class BalancedRandomForestClassifier():
    """Handles imbalanced data by building each tree from a fixed number of samples per class"""

    def __init__(self, n_estimator: int, n_samples: int, bootstrap: bool = True, max_features: int = 0):
        """
        :param n_estimator: number of trees
        :param n_samples: number of samples per class for each tree
        :param bootstrap: whether to resample with replacement
        :param max_features: number of features per tree (0 = sqrt of the feature count)
        """
        self.n_estimator = n_estimator
        self.n_samples = n_samples
        self.bootstrap = bootstrap
        self.max_features = max_features

    def fit(self, x_values: np.ndarray, y_values: np.ndarray):
        """
        :param x_values: training data
        :param y_values: label data
        """
        if self.max_features == 0:
            self.max_features = round(sqrt(x_values.shape[1]))
        # Select the rows each tree will be trained on (bootstrap, etc.)
        index_list = [_sampling_equal(y_values, self.n_samples, self.bootstrap)
                      for i in range(0, self.n_estimator)]
        # Select the features used by each tree
        self.feature_list = [sample(range(0, x_values.shape[1]), self.max_features)
                             for i in range(0, self.n_estimator)]
        # Build the trees in parallel from the rows/features chosen above
        self.forest = Parallel(n_jobs=-1, backend="threading")(
            delayed(_build_tree)(x_values[np.ix_(index, feature)], y_values[index])
            for index, feature in zip(index_list, self.feature_list))
        # Predict by majority vote over the trees
        self.predict = lambda x: [Counter(item).most_common(1)[0][0]
                                  for item in np.array([tree.predict(x[:, feature])
                                                        for tree, feature in zip(self.forest, self.feature_list)]).T]
        # Feature importance: average each feature's importance over the trees that used it
        count = np.zeros(x_values.shape[1])
        feature_importances = np.zeros(x_values.shape[1])
        for tree, feature in zip(self.forest, self.feature_list):
            count[feature] += 1
            feature_importances[feature] += tree.feature_importances_
        self.feature_importances_ = feature_importances / count
```
```python
def _build_tree(train: np.ndarray, label: np.ndarray):
    tree = DecisionTreeClassifier()
    tree.fit(train, label)
    return tree
```
This is just scikit-learn's decision tree used as is. A Balanced Random Forest only changes how the data are sampled, so there is no reason not to reuse what already exists (my implementation power is 53).
```python
def _sampling_equal(y_values: np.ndarray, n_samples: int, bootstrap: bool = True):
    """
    :param y_values: label data
    :param n_samples: number of samples per class
    :param bootstrap: whether to sample with replacement
    :return: sampled indices (flattened over all classes)
    """
    if bootstrap:
        return [i for v in [resample(pd.Series(y_values)[pd.Series(y_values) == uq].index,
                                     n_samples=n_samples) for uq in np.unique(y_values)] for i in v]
    else:
        return [i for v in [sample(list(pd.Series(y_values)[pd.Series(y_values) == uq].index),
                                   n_samples) for uq in np.unique(y_values)] for i in v]
```
This samples the same number of data points for each label using the bootstrap (optionally sampling without replacement, though that is not normally used). The reason the list comprehension is so convoluted is that sampling per label produces a nested list of indices, so it is flattened into a one-dimensional list.
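For example (assuming the listing above is saved as `balanced_random_forest.py`, as the filename suggests; the label array is made up), the helper returns a flat list of row indices:

```python
import numpy as np
from balanced_random_forest import _sampling_equal

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced labels: six 0s, two 1s
idx = _sampling_equal(y, n_samples=2)    # two indices per class, with replacement
print(idx)  # a flat list of four row indices, not a nested list per class
```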
```python
def fit(self, x_values: np.ndarray, y_values: np.ndarray):
    """
    :param x_values: training data
    :param y_values: label data
    """
    if self.max_features == 0:
        self.max_features = round(sqrt(x_values.shape[1]))
    # Select the rows each tree will be trained on (bootstrap, etc.)
    index_list = [_sampling_equal(y_values, self.n_samples, self.bootstrap)
                  for i in range(0, self.n_estimator)]
    # Select the features used by each tree
    self.feature_list = [sample(range(0, x_values.shape[1]), self.max_features)
                         for i in range(0, self.n_estimator)]
    # Build the trees in parallel from the rows/features chosen above
    self.forest = Parallel(n_jobs=-1, backend="threading")(
        delayed(_build_tree)(x_values[np.ix_(index, feature)], y_values[index])
        for index, feature in zip(index_list, self.feature_list))
    # Predict by majority vote over the trees
    self.predict = lambda x: [Counter(item).most_common(1)[0][0]
                              for item in np.array([tree.predict(x[:, feature])
                                                    for tree, feature in zip(self.forest, self.feature_list)]).T]
    # Feature importance: average each feature's importance over the trees that used it
    count = np.zeros(x_values.shape[1])
    feature_importances = np.zeros(x_values.shape[1])
    for tree, feature in zip(self.forest, self.feature_list):
        count[feature] += 1
        feature_importances[feature] += tree.feature_importances_
    self.feature_importances_ = feature_importances / count
```
- First, decide how many features each decision tree will use. (If the feature space is $M$-dimensional, $\sqrt{M}$ for classification and $\frac{M}{3}$ for regression seem to be the recommended values.)
- Next, draw bootstrap subsamples evenly from each label.
- The number of features used by each tree is the number decided above.
- Finally, build the trees in parallel and store them in the variable `forest` (the row/column selection with `np.ix_` is illustrated just below).
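As a small illustration of that row/column selection (the array and index values are made up), `np.ix_` picks the sampled rows and the chosen feature columns in one step:

```python
import numpy as np

x_values = np.arange(20).reshape(5, 4)   # 5 samples, 4 features
index = [0, 2, 2]                        # bootstrap-sampled row indices (duplicates allowed)
feature = [1, 3]                         # features assigned to this tree
print(x_values[np.ix_(index, feature)])  # a (3, 2) training matrix for one tree
```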
The predicted value is decided by a majority vote over the trees: each tree predicts using only the features assigned to it (`self.feature_list`), and `Counter(item).most_common(1)[0][0]` picks the most frequent label.
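For instance, with made-up per-tree predictions the vote works like this:

```python
from collections import Counter
import numpy as np

# Rows: predictions of three trees for three samples (dummy values).
per_tree = np.array([[0, 1, 1],
                     [0, 1, 0],
                     [1, 1, 0]])
votes = [Counter(item).most_common(1)[0][0] for item in per_tree.T]
print(votes)  # the majority label for each of the three samples is 0, 1, 0
```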
Feature importance is computed as the average importance of each feature over the decision trees, so the summed importance of each feature is divided by the number of trees in which that feature appears.
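Finally, a usage sketch (again assuming the listing is saved as `balanced_random_forest.py`; the data and parameter values are placeholders):

```python
from sklearn.datasets import make_classification
from balanced_random_forest import BalancedRandomForestClassifier

# Synthetic imbalanced data, roughly 1,000 vs. 50 examples.
X, y = make_classification(n_samples=1050, weights=[0.95, 0.05], random_state=0)

brf = BalancedRandomForestClassifier(n_estimator=50, n_samples=30)  # 30 samples per class per tree
brf.fit(X, y)
print(brf.predict(X[:5]))          # majority-vote predictions for the first five samples
print(brf.feature_importances_)    # averaged importances per feature
```

Note that `predict` only becomes available after calling `fit`, since it is defined inside `fit` as a lambda.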