\newcommand{\argmax}{\mathop{\rm argmax}\limits}
\newcommand{\argmin}{\mathop{\rm argmin}\limits}

Target person

This time, I am writing a memorandum on the gradient boosting decision tree, which is a method of machine learning rather than deep learning. If you want to get a rough idea of the machine learning methods used in XGBoost and LightGBM, which are (likely) very active in Kaggle, please take a look.

-[What is a gradient boosting decision tree](#What is a gradient boosting decision tree) -[Gradient descent method](# Gradient descent method) -

[Ensemble learning](#Ensemble learning) </ summary> -Bagging -Boosting -Stacking

[Decision tree](#Decision tree) </ summary> - Level-wise tree growth - Leaf-wise tree growth

-[Gradient Boosting Decision Tree Algorithm](# Gradient Boosting Decision Tree Algorithm) -[Characteristics of Gradient Boosting Decision Tree](#Characteristics of Gradient Boosting Decision Tree)

XGBoost
LightGBM
- [GOSS](#goss)

-[Overview of GOSS algorithm](Overview of #goss algorithm) -

EFB

-[Overview of EFB algorithm](Overview of #efb algorithm)

Catboost -Try -Conclusion

What is a gradient boosting decision tree?

Gradient Boosting Decision Tree (GBDT) is a machine that combines the three methods of "Gradient", "Ensemble Learning (Boosting)", and "Decision Tree". It is a learning method. First, I will outline each of them.

Gradient descent

About the gradient descent method, I will summarize in the past articles (this and this). Please refer to that as well. Simply put, it is a method for minimizing a certain objective function, and it is an effective method that can be used when the objective function is a continuous function. In deep learning, the objective function is given as an error function. The algorithm is

Find the gradient (partial differential) of the objective function at your current location
Move according to a certain rule ** according to the obtained gradient
If the increase / decrease in the value of the objective function falls below the threshold value, learning ends, otherwise it returns to 1.

It looks like. The "certain rule" here is a learning rule represented by SGD and Adam.

Ensemble learning

Ensemble Learning is a method of learning multiple models at the same time and interacting them to improve the accuracy of the model as one large learning model. For example, "it is faster and more accurate for multiple people to work on a problem with wisdom than to work on a problem alone." ** It's the wisdom of Manjushri ** if three people approach. The actual ensemble learning algorithm is as follows.

Prepare multiple models
Train each model ** independently **
Get prediction results for unknown data from each of the trained models
Integrate the prediction results using ** majority vote, average **, etc. and output as the overall prediction result

The advantage of ensemble learning is that the overall prediction accuracy is high even if each model is a weak learner [^ 1]. To briefly explain with mathematical formulas, assuming that the misjudgment rate of each model is uniformly $ \ theta $ in binary classification, the overall misjudgment rate is

P(k) = {}_mC_k \theta^k (1-\theta)^{m-k}

It is calculated by. here

-$ m : Number of models - k $: Number of misjudged models

is. From this formula, we can see that "the larger the value of $ m $, the smaller the value of $ P (k) $". This is rather simple, or a discussion based on many assumptions, so it only helps intuitive understanding, but it can be said that it shows the power of ensemble learning.

By the way, I have said several times such as "improving prediction accuracy", but what does it mean to improve prediction accuracy in the first place? Important terms here are ** Bias ** and ** Variance **. To briefly explain

--Bias: Average error between predicted and actual values --Variant: In Japanese, it's "distributed." Shows how the data is scattered. This time, I will show you the scattered tools of the predicted values.

It looks like. "Increasing prediction accuracy" is synonymous with "aiming for low bias and low variance". However, in reality, bias and variance are in a trade-off relationship, and aiming for low bias results in high variance. What's wrong with that is that the low-bias, high-variance state represents the so-called ** overfitting ** state. It fits too much into the training data and cannot be predicted well for unknown data. By the way, this is a ** case with one model **. It is said that such a result may be brought about by training to fit all unknown data with the expressive ability of one model. Therefore, by training with multiple models, for example, even if each model is in a state of low bias and high variance (that is, overfitting), by taking a majority vote or averaging them, low bias and low variance can be obtained as a result. It can be realized. Of course, the above figure is a very convenient example, so it may not be exactly like this, but you can understand that the prediction accuracy can be improved like this.

So far, we have provided an overview of ensemble learning. Next, I will move on to the introduction of representative methods.

Bagging

Bagging (Bootstrap Aggregating) generally tends to ** lower the variance of the model's predictions **. As an overview of the method, the bootstrap method is used to add diversity to the data, leading to a decrease in variance. The bootstrap method is a method for estimating the statistics of the population itself from statistics such as the mean value of the sample population obtained from the population. Extract data from the sample population, allowing duplication, and generate a slightly different dataset. For more information, click here (https://qiita.com/saltcooky/items/451b96b3f346cb93a0b4) and here Please refer to.

Boosting

This is about boosting that appears in the gradient boosting decision tree in this article. Boosting, unlike bagging, tends to ** lower the bias **. The outline of the method is to repeat the process of "learning one model first as the base model and then learning the next model by adding the erroneous judgment of the base model", and finally using all the models as a whole. Configure the output. This will gradually overfit the training data and make the overall decision, including the non-overfitted model, which tends to reduce bias. However, of course, increasing the number of models too much will lead to overfitting.

Stacking

Stacking is a complex technique, but it is a ** technique that can successfully reduce both bias and variance. As the outline of the method, as the name of the method suggests, the prediction accuracy is improved by stacking multiple models. More specifically, first, various models of the first layer are trained like bagging, and in the next layer, some are selected from the output results of the first layer and used. At this time, the second layer is used. Now let's learn which model of the first layer to use the output result (close to boosting). By repeating this, the accuracy of the model as a whole can be improved.

Decision tree

A decision tree is one of the machine learning methods that classifies and diffracts using a tree structure. Even though it is a tree structure, it is upside down from the trees that grow around it. The decision tree is an image that sets data on the node and conditional branching on the edge. For easy understanding, there is a column to write a conditional branch directly under each node. The decision tree is to classify the data as correctly as possible by conditional branching like this. The biggest advantage of decision trees is that they can explain exactly how the algorithm categorized them. Although classification results can be obtained with neural networks such as deep learning, which is popular these days, it is virtually impossible to know the classification process, and it seems that various studies are being conducted to solve this black box problem.

Quiet talk ...

For more detailed topics on decision trees, please refer to here. Note that the decision tree itself is as shown in the previous figure, but it is not an algorithm called a decision tree. The decision tree is just a concept, and there are various algorithms to build it.

Level-wise tree growth There are two ways to grow a decision tree. Let's start with Level-wise. Level-wise is, to put it simply, a feeling of breadth-first growth. (The figure is taken from the official document) When verbalized, it feels like "the depth from the root node is the same everywhere."

Leaf-wise tree growth Leaf-wise, on the other hand, feels like it grows with depth first. (The figure is taken from the official document) It feels like creating the next node only from the necessary nodes (this node is called a child node). It requires less calculation than Level-wise. But it tends to be complicated. Therefore, if not properly controlled, overfitting may occur easily.

Gradient boosting decision tree algorithm

This is the end of the prior knowledge. The gradient boosting decision tree (GBDT) is a combination of the methods introduced above. In other words, it is the gradient boosting decision tree that applies the gradient descent method and ensemble learning boosting to the decision tree. Now let's look at the actual algorithm. For the time being, the procedure is

Calculate the average of the objective variables and use it as the initial prediction value.
Calculate the error
Build a decision tree
Calculate new predicted values using ensemble learning
Repeat 2-4 enough times
Output the final predicted value

It's like that. In the formula

Initialize the model with constants $ F_0 (x) = \ argmin_ {\ gamma} \ sum_ {i = 1} ^ {n} {L (y_i, \ gamma)} $
Repeat the following for the number of models $ \ textrm {For} \ m = 1 \ \ textrm {to} \ M $
Calculate the pseudo-residue $ r_ {im} =-\ left [\ cfrac {\ partial L (y_i, F (x_i))} {\ partial F (x_i)} \ right] \ _ {F (x) ) = F \ _ {m-1} (x)} \ quad \ textrm {for} \ i = 1, \ ldots, n $
Create and fit the dataset $ \ {(x_i, r_ {im}) \} _ {i = 1} ^ n $ using the pseudo-residue of the base model $ h_m (x) $
Solve the following one-dimensional optimization problem to calculate the multiplier $ \ gamma_m = \ argmin_ {\ gamma} \ sum_ {i = 1} ^ {n} {L (y_i, F \ _ {m-1 } (x_i) + \ gamma h_m (x_i))} $
Update the model $ F_m (x) = F \ _ {m-1} (x) + \ gamma_m h_m (x) $
Output the final model $ F_M (x) $

It will be. A concrete calculation example can be found at here, so if you actually follow it, you will deepen your understanding.

Gradient boosting decision tree features

--One of the most powerful methods of supervised learning --Cannot be parallelized because it uses boosting ――It is easy to cause overfitting because it is easily affected by parameters. --No need to perform scale conversion (standardization, etc.) --Easy to be affected by outliers --The depth of the decision tree is often set below $ 5 $ to prevent overfitting.

XGBoost Now, let me introduce a framework that can handle gradient boosting decision trees. Let's start with XGBoost (eXtreme Gradient Boosting). XGBoost seems to have been eaten since the appearance of LightGBM, which will be described later, but it will be a framework for gradient boosting decision trees that are still active. As a feature

--High scale scalability --Propose efficient and theoretically correct weighted quantiles --A new sparseness recognition algorithm for ensemble learning --Efficient cache-enabled block structure

Somehow I wrote it in Paper. I don't know ... I wonder if I should read the dissertation properly if I feel like it ... Histogram-based and Pre-sorted are used as decision tree construction algorithms. By the way, Histgram-based seems to have added an implementation later.

Pre-sorted Algorithm --A method of counting all possible division points from pre-sorted features --The optimum division point can be obtained, but the time and space efficiency is poor.
Histogram-based Algorithm --A method of converting continuous value features to discrete values and constructing a histogram based on them. --In general, the optimum division point cannot be obtained, but time and space efficiency is good.

For details, please refer to here. For XGBoost, is Level Wise the way the decision tree grows? I searched a lot, but it's not specified, so it's a subtle point ... However, there was a description that it could be Leaf-wise depending on the parameters, so it's probably Level-wise.

LightGBM Next is LightGBM (Light Gradient Boosting Machine). Released in 2016, it is a popular framework for its accuracy and overwhelmingly high speed processing. Its feature is

--Reduction of time / space calculation --Speed up histogram construction of sparse data --Change the method of finding the division point of the histogram to the next best solution instead of the optimal solution to speed up --Parallelizable

etc. For the time being, is it okay to feel that we have succeeded in speeding up with high accuracy? The decision tree construction algorithm is only Histogram-based in LightGBM. Roughly speaking, it seems that LightGBM can be recognized as GBDT incorporating GOSS and EFB. GOSS and EFB are new techniques to improve the following problems with gradient boosting decision trees:

--In downsampling to reduce the data set, the accuracy deteriorates because the distribution of the data changes. --The Histgram-based algorithm cannot reduce features even for sparse datasets.

I will briefly summarize each method.

GOSS GOSS (Gradient-based One-Side Sampling)

In a technique called data downsampling [^ 2], which is often used to reduce the size of training data, downsampling below a frequency that can restore the distribution of the original dataset will change the distribution, resulting in accuracy. become worse*

It is a method to avoid the problem. Frequency is the word Aya. Since the story of sampling frequency and so on corresponds to signal processing, you can understand it by google around it (such as the sampling theorem in the field of signal processing).

Overview of GOSS algorithm

Note that the AdaBoost sampling method cannot be used as-is for gradient boosting decision trees, but instead gradients can play a similar role. If the gradient is small, the error is small and it means that you have already learned enough. I think the idea that comes to mind here is "discard data with a small gradient", but this has the problem of changing the distribution of the data. Therefore, GOSS adopts the method of "random sampling of data with a small gradient while keeping the data with a large gradient". Sort the data by the absolute value of the gradient, keep the top $ a \ times 100% $ data, and randomly sample the remaining $ b \ times 100% $ data. Then the small gradient ($ b \ times 100% $ data) is multiplied by $ \ frac {1-a} {b} $ when calculating the data gain [^ 3], resulting in undertrained data. You can pay attention to it without compromising the distribution of the data. See Original Paper for detailed theoretical analysis.

EFB EFB (Exclusive Feature Bundling)

Data sets targeting the real world are usually sparse, and calculation costs can be reduced by ignoring negligible features, but the Histgram-based algorithm has zero features due to the characteristics of the algorithm. Since it is necessary to access the bin that holds the feature regardless of / or not, the feature reduction as described above cannot be performed *

This is a new remedy for the problem.

Overview of EFB algorithm

High-dimensional data is usually very sparse, which shows that there is a possibility of feature reduction. In addition, the sparseness of high-dimensional data is often exclusive, and it is unlikely that multiple features will take non-zero values at the same time. From such characteristics, we can safely bundle exclusive features into a single feature, and we will call this operation the Exclusive Feature Bundle. There were two problems in constructing such an algorithm.

――How to decide which features to bundle ――How to build a bundle

Regarding the first point, this problem can be regarded as equivalent to the graph coloring problem [^ 4], but it is classified as a problem called NP-hardness. By the way, to put it simply, classification such as NP-hardness

--P: The class to which the problem belongs that shows an algorithm that can be solved in polynomial time --NP: The class to which the problem belongs that shows an algorithm that can prove the existence of a solution in polynomial time --NP-hard: The class to which the problem (not necessarily deterministic) can transform the solution algorithm into any NP problem in polynomial time. --NP-complete: The class to which the problem belongs to both NP and NP-hard

It is divided into classes like this. Regarding the second point, it has been shown that the optimal bundle strategy cannot be found in polynomial time, so to find the suboptimal strategy.

Correspond each vertex and feature in the graph coloring problem, and if it is not exclusive, add an edge every 2 features *

By reducing the scale to the graph color coding problem, you can use the greedy algorithm to obtain the next best bundle strategy. Furthermore, even if the features are not completely exclusive, further improvement in calculation efficiency can be expected if some collisions are allowed. The calculation order is $ \ mathcal {O} (\ {(1- \ gamma) n \} ^ {-2/3}) $. $ \ Gamma $ is the collision rate threshold for each bundle. This is the outline of Greedy Bundling's algorithm.

From the discussion so far, the EFB algorithm can be designed from Greedy Bundling.

Weight the edges of the graph corresponding to the total number of collisions between each feature
Sort features in descending order by degree of graph
Check each feature in the sorted list and assign it to an existing bundle with conflicts controlled by $ \ gamma $ or create a new bundle

By the way, the calculation order of Greedy Bundling is $ \ mathcal {O} (# feature ^ 2) $, but if the number of features reaches millions, the search cost is not light, so it is more in the paper. As a good method, we propose a method of sorting by the number of non-zero elements instead of constructing a graph. It would also be nice to have a way to merge features into the same bundle to reduce complexity. At that time, it must be guaranteed that the value of the original feature can be identified. In the Histgram-based algorithm, features are stored in bins as discrete values instead of continuous values, so you can guarantee that by keeping exclusive features in different bins. Such an algorithm is Merge Exclusive Features. In addition to applying these algorithms, LightGBM also implements optimizations that require additional memory and computational costs, but can be more effective.

Catboost I'm exhausted with LightGBM, so I'll write it soon ...

Try out

Now, let's actually experiment with the classification of the MNIST dataset with XGBoost and LightGBM.

By the way, the installation of LightGBM did not work for some reason when I installed it with pip. I also tried installing with gcc, but it didn't work ... but when I did it with brew, it went smoothly. I wonder what ... If you have any problems, please see github repository.

As in the example, the execution environment is jupyter notebook. Experiment with a file called test.ipynb.

We have a pip installation in the first cell. Not required if you have already installed the required packages.

`test.py`


#%pip install lightgbm
%pip install xgboost
%pip install tensorflow
%pip install sklearn

I added the second cell because I got a mysterious error when installing the MNIST dataset.

`test.py`


import requests
requests.packages.urllib3.disable_warnings()
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context

In the third cell, we are preparing a dataset of experimental code.

`test.py`


import time


import xgboost as xgb
import lightgbm as lgb
import numpy as np
from tensorflow.keras.datasets import mnist
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


#Preparing the MNIST dataset
(x_train, t_train), (x_test, t_test) = mnist.load_data()
#Divided into training data and verification data
split_ratio = 0.2
x_train, x_validation, t_train, t_validation = train_test_split(x_train, t_train, test_size=split_ratio)
#Smoothing
x_train = x_train.reshape(-1, 784)
x_validation = x_validation.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)
#Normalization
x_train = x_train.astype(float) / 255
x_validation = x_validation.astype(float) / 255
x_test = x_test.astype(float) / 255
#Set data
xgb_train_data = xgb.DMatrix(x_train, label=t_train)
xgb_eval_data = xgb.DMatrix(x_validation, label=t_validation)
xgb_test_data = xgb.DMatrix(x_test, label=t_test)
lgb_train_data = lgb.Dataset(x_train, label=t_train)
lgb_eval_data = lgb.Dataset(x_validation, label=t_validation, reference=lgb_train_data)

The fourth cell contains the XGBoost experimental code itself.

`test.py`


#XGBoost model construction
start = time.time()
xgb_params = {"objective": "multi:softmax",
              "num_class": 10,
              "eval_metric": "mlogloss"}
evals = [(xgb_train_data, "train"), (xgb_eval_data, "eval")]
gbm = xgb.train(xgb_params, xgb_train_data,
                num_boost_round=100,
                early_stopping_rounds=10,
                evals=evals)
preds = gbm.predict(xgb_test_data)
print("accuracy score: {}".format(accuracy_score(t_test, preds)))
print("elapsed time: {}".format(time.time() - start))

The result looks like this.

Fifth, the experimental code body of LightGBM is described.

`test.py`


#LightGBM model construction
start = time.time()
lgb_params = {"task": "train",
             "boosting_type": "gbdt",
             "objective": "multiclass",
             "num_class": 10}
gbm = lgb.train(lgb_params, lgb_train_data,
                valid_sets=lgb_eval_data, 
                num_boost_round=100, 
                early_stopping_rounds=10)
preds = gbm.predict(x_test)
y_pred = []
for x in preds:
    y_pred.append(np.argmax(x))

print("accuracy score: {}".format(accuracy_score(t_test, y_pred)))
print("elapsed time: {}".format(time.time() - start))

The result looks like this.

The accuracy doesn't change that much, but the speed is different. Please note that ** does not mean that "LightGBM is always good" **. In the MNIST dataset, it just happened to work ** faster with similar accuracy, and in fact XGBoost may be better, or LightGBM may be better. ** The point is that it is important to use them properly **. There is something similar to the story around here gradient descent method.

in conclusion

Until now, I have focused solely on deep learning, but I thought that there are various other machine learning methods. After all AI-related talk is a difficult field to study ~ I can not understand the theory at all, but laughter I will devote myself ...

reference

-[What is a weak learner](https://toukei-lab.com/weak-learning#:~:text=%E5%BC%B1%E5%AD%A6%E7%BF%92%E5%99 % A8% E3% 81% A8% E3% 81% AF% E3% 80% 81% E3% 81% 9D% E3% 81% AE% E5% 90% 8D% E3% 81% AE% E9% 80% 9A % E3% 82% 8A% E3% 80% 8C% E5% BC% B1% E3% 81% 84,% E3% 83% 90% E3% 83% AA% E3% 82% A2% E3% 83% B3% E3% 82% B9% E3% 80% 8D% E3% 81% A8% E3% 81% 84% E3% 81% 86% E8% 80% 83% E3% 81% 88% E6% 96% B9% E3% 81% 8C% E3% 81% 82% E3% 82% 8A% E3% 81% BE% E3% 81% 99% E3% 80% 82 & text =% E5% BC% B1% E5% AD% A6% E7% BF % 92% E5% 99% A8% E3% 81% AF% E3% 80% 81% E5% 8D% 98% E7% B4% 94,% E4% BD% 8E% E3% 81% 8F% E3% 81% AA% E3% 82% 8B% E5% 82% BE% E5% 90% 91% E3% 81% AB% E3% 81% 82% E3% 82% 8A% E3% 81% BE% E3% 81% 99% E3% 80% 82) -[Introduction] Two typical methods and algorithms for ensemble learning -Are all advanced machine learning users using it? !! I will explain the mechanism of ensemble learning and the three types -[Introduction] Decision tree analysis for beginners by beginners -Intuitive understanding of GBDT mechanism and procedure with diagrams and concrete examples

Implementation reference

-What is XGboost? Theory and practice in Python and R! -Implement the gradient boosting method in Python and compare it! -Thorough introduction to LightGBM-How to use LightGBM, how it works, and how it differs from XGBoost -Move your hands to understand GBDT -I tried using LightGBM, which is popular in machine learning competitions, with Python

[^ 1]: A model with low prediction accuracy by itself. It seems that there is no clear definition, but decision trees and simple neural networks can be regarded as weak learners. [^ 2]: Generally, it is a process to reduce the number of data by lowering the sampling frequency from the original sampling frequency. It's just like thinning out the dataset here. [^ 3]: Translated as "gain" in Japanese. The nuances vary considerably depending on the context. What they have in common is something like "a dominant index in a certain formula". Here, it refers to the weight imposed on the data. [^ 4]: [This](https://ja.wikipedia.org/wiki/%E3%82%B0%E3%83%A9%E3%83%95%E5%BD%A9%E8%89% Please refer to B2) etc.

[PYTHON] What is a gradient boosting decision tree?

Target person

table of contents

What is a gradient boosting decision tree?

Gradient descent

Ensemble learning

Bagging

Boosting

Stacking

Decision tree

Gradient boosting decision tree algorithm

Gradient boosting decision tree features

Overview of GOSS algorithm

Overview of EFB algorithm

Try out

`test.py`

`test.py`

`test.py`

`test.py`

`test.py`

in conclusion

reference

Implementation reference