Random Forest is an ensemble learning algorithm that uses decision trees as weak learners, and it is one of the standard machine learning methods. Each decision tree is built from a sample of the features and of the observations. There is plenty of material on the algorithm itself, so I will not go into it here.
For example, if you try to classify a dataset with 1,000 positive examples and 50 negative examples using Random Forest, it will not work well as a classifier unless you take some countermeasures. One option is to weight the imbalanced classes (adjustable via the class_weight parameter of scikit-learn's RandomForestClassifier), but if the weight placed on some data becomes too strong, overfitting can occur.
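As a rough sketch of that weighting approach (the dataset here is synthetic and the parameter values are just placeholders):

```python
# Hypothetical sketch: handling imbalance via class weights instead of sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Roughly 1,000 vs. 50 examples, as in the example above.
X, y = make_classification(n_samples=1050, weights=[0.95, 0.05], random_state=0)
print(np.bincount(y))  # show the class imbalance

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X, y)
```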
Another way to deal with imbalanced data is to adjust the number of samples drawn for each decision tree. (In the example above, each decision tree would be built from, say, 50 positive and 50 negative examples; the imbalance is handled by sampling rather than by weighting.) This approach is apparently called a Balanced Random Forest. In R it can be run simply by setting a parameter, but scikit-learn has nothing equivalent, so I implemented it myself. (If I have made a mistake, please leave a comment.)
balanced_random_forest.py
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from collections import Counter
from random import sample
from math import sqrt

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


def _build_tree(train: np.ndarray, label: np.ndarray):
    tree = DecisionTreeClassifier()
    tree.fit(train, label)
    return tree


def _sampling_equal(y_values: np.ndarray, n_samples: int, bootstrap: bool = True):
    """
    :param y_values: label data
    :param n_samples: number of samples per class
    :param bootstrap: whether to sample with replacement
    :return: sampled indices (flattened over all classes)
    """
    if bootstrap:
        return [i for v in [resample(pd.Series(y_values)[pd.Series(y_values) == uq].index,
                                     n_samples=n_samples) for uq in np.unique(y_values)] for i in v]
    else:
        return [i for v in [sample(list(pd.Series(y_values)[pd.Series(y_values) == uq].index),
                                   n_samples) for uq in np.unique(y_values)] for i in v]


class BalancedRandomForestClassifier():
    """Handles imbalanced data by building each tree from a fixed number of samples per class"""

    def __init__(self, n_estimator: int, n_samples: int, bootstrap: bool = True, max_features: int = 0):
        """
        :param n_estimator: number of trees
        :param n_samples: number of samples per class for each tree
        :param bootstrap: whether to resample with replacement
        :param max_features: number of features per tree (0 = sqrt of the feature count)
        """
        self.n_estimator = n_estimator
        self.n_samples = n_samples
        self.bootstrap = bootstrap
        self.max_features = max_features

    def fit(self, x_values: np.ndarray, y_values: np.ndarray):
        """
        :param x_values: training data
        :param y_values: label data
        """
        if self.max_features == 0:
            self.max_features = round(sqrt(x_values.shape[1]))
        # Select the rows each tree will be trained on (bootstrap, etc.)
        index_list = [_sampling_equal(y_values, self.n_samples, self.bootstrap)
                      for i in range(0, self.n_estimator)]
        # Select the features used by each tree
        self.feature_list = [sample(range(0, x_values.shape[1]), self.max_features)
                             for i in range(0, self.n_estimator)]
        # Build the trees in parallel from the rows/features chosen above
        self.forest = Parallel(n_jobs=-1, backend="threading")(
            delayed(_build_tree)(x_values[np.ix_(index, feature)], y_values[index])
            for index, feature in zip(index_list, self.feature_list))
        # Predict by majority vote over the trees
        self.predict = lambda x: [Counter(item).most_common(1)[0][0]
                                  for item in np.array([tree.predict(x[:, feature])
                                                        for tree, feature in zip(self.forest, self.feature_list)]).T]
        # Feature importance: average each feature's importance over the trees that used it
        count = np.zeros(x_values.shape[1])
        feature_importances = np.zeros(x_values.shape[1])
        for tree, feature in zip(self.forest, self.feature_list):
            count[feature] += 1
            feature_importances[feature] += tree.feature_importances_
        self.feature_importances_ = feature_importances / count
```
```python
def _build_tree(train: np.ndarray, label: np.ndarray):
    tree = DecisionTreeClassifier()
    tree.fit(train, label)
    return tree
```
This is just scikit-learn's decision tree used as is. A Balanced Random Forest only changes how the data are sampled, so there is no reason not to reuse what already exists (my implementation power is 53).
```python
def _sampling_equal(y_values: np.ndarray, n_samples: int, bootstrap: bool = True):
    """
    :param y_values: label data
    :param n_samples: number of samples per class
    :param bootstrap: whether to sample with replacement
    :return: sampled indices (flattened over all classes)
    """
    if bootstrap:
        return [i for v in [resample(pd.Series(y_values)[pd.Series(y_values) == uq].index,
                                     n_samples=n_samples) for uq in np.unique(y_values)] for i in v]
    else:
        return [i for v in [sample(list(pd.Series(y_values)[pd.Series(y_values) == uq].index),
                                   n_samples) for uq in np.unique(y_values)] for i in v]
```
This samples the same number of data points for each label using the bootstrap (optionally sampling without replacement, though that is not normally used). The reason the list comprehension is so convoluted is that sampling per label produces a nested list of indices, so it is flattened into a one-dimensional list.
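For example (assuming the listing above is saved as `balanced_random_forest.py`, as the filename suggests; the label array is made up), the helper returns a flat list of row indices:

```python
import numpy as np
from balanced_random_forest import _sampling_equal

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced labels: six 0s, two 1s
idx = _sampling_equal(y, n_samples=2)    # two indices per class, with replacement
print(idx)  # a flat list of four row indices, not a nested list per class
```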
```python
def fit(self, x_values: np.ndarray, y_values: np.ndarray):
    """
    :param x_values: training data
    :param y_values: label data
    """
    if self.max_features == 0:
        self.max_features = round(sqrt(x_values.shape[1]))
    # Select the rows each tree will be trained on (bootstrap, etc.)
    index_list = [_sampling_equal(y_values, self.n_samples, self.bootstrap)
                  for i in range(0, self.n_estimator)]
    # Select the features used by each tree
    self.feature_list = [sample(range(0, x_values.shape[1]), self.max_features)
                         for i in range(0, self.n_estimator)]
    # Build the trees in parallel from the rows/features chosen above
    self.forest = Parallel(n_jobs=-1, backend="threading")(
        delayed(_build_tree)(x_values[np.ix_(index, feature)], y_values[index])
        for index, feature in zip(index_list, self.feature_list))
    # Predict by majority vote over the trees
    self.predict = lambda x: [Counter(item).most_common(1)[0][0]
                              for item in np.array([tree.predict(x[:, feature])
                                                    for tree, feature in zip(self.forest, self.feature_list)]).T]
    # Feature importance: average each feature's importance over the trees that used it
    count = np.zeros(x_values.shape[1])
    feature_importances = np.zeros(x_values.shape[1])
    for tree, feature in zip(self.forest, self.feature_list):
        count[feature] += 1
        feature_importances[feature] += tree.feature_importances_
    self.feature_importances_ = feature_importances / count
```
- First, decide how many features each decision tree will use. (If the feature space is $M$-dimensional, $\sqrt{M}$ for classification and $\frac{M}{3}$ for regression seem to be the recommended values.)
- Next, draw bootstrap subsamples evenly from each label.
- The number of features used by each tree is the number decided above.
- Finally, build the trees in parallel and store them in the variable `forest` (the row/column selection with `np.ix_` is illustrated just below).
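As a small illustration of that row/column selection (the array and index values are made up), `np.ix_` picks the sampled rows and the chosen feature columns in one step:

```python
import numpy as np

x_values = np.arange(20).reshape(5, 4)   # 5 samples, 4 features
index = [0, 2, 2]                        # bootstrap-sampled row indices (duplicates allowed)
feature = [1, 3]                         # features assigned to this tree
print(x_values[np.ix_(index, feature)])  # a (3, 2) training matrix for one tree
```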
The predicted value is decided by a majority vote over the trees: each tree predicts using only the features assigned to it (`self.feature_list`), and `Counter(item).most_common(1)[0][0]` picks the most frequent label.
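For instance, with made-up per-tree predictions the vote works like this:

```python
from collections import Counter
import numpy as np

# Rows: predictions of three trees for three samples (dummy values).
per_tree = np.array([[0, 1, 1],
                     [0, 1, 0],
                     [1, 1, 0]])
votes = [Counter(item).most_common(1)[0][0] for item in per_tree.T]
print(votes)  # the majority label for each of the three samples is 0, 1, 0
```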
Feature importance is computed as the average importance of each feature over the decision trees, so the summed importance of each feature is divided by the number of trees in which that feature appears.
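Finally, a usage sketch (again assuming the listing is saved as `balanced_random_forest.py`; the data and parameter values are placeholders):

```python
from sklearn.datasets import make_classification
from balanced_random_forest import BalancedRandomForestClassifier

# Synthetic imbalanced data, roughly 1,000 vs. 50 examples.
X, y = make_classification(n_samples=1050, weights=[0.95, 0.05], random_state=0)

brf = BalancedRandomForestClassifier(n_estimator=50, n_samples=30)  # 30 samples per class per tree
brf.fit(X, y)
print(brf.predict(X[:5]))          # majority-vote predictions for the first five samples
print(brf.feature_importances_)    # averaged importances per feature
```

Note that `predict` only becomes available after calling `fit`, since it is defined inside `fit` as a lambda.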