[PYTHON] I tried to understand it carefully while implementing the algorithm Adaboost in machine learning (+ I deepened my understanding of array calculation)

Introduction

Here is a summary of the original boosting such as XGBoost, which is often used in Kaggle. Among the boosting algorithms, I would like to understand Adaboost while implementing it.   I used this article by @amber_kshz and implemented it.

Implemented PRML Chapter 14 Bagging and AdaBoost in Python https://qiita.com/amber_kshz/items/8869d8ef7f9743b16d8d

** In particular, I was able to learn how to calculate efficiently and quickly using an array. ** I would like to summarize how to do that.

The main points are as follows.

Understand Boosting

The method of combining learning devices called weak learners is called ** ensemble learning **. Typical ensemble learning can be roughly divided into four.

  1. Bagging (discrimination from training data by multiple weak learners and averaging (sampling ** with duplication **))
  2. Paceting (discriminate from training data with multiple weak learners and average (sampling ** no duplication **))
  3. Stacking (load averaging considering the importance of weak learners)
  4. ** (This time) Boosting (improve the accuracy of weak learners to make strong learners) **

In the ensemble learning, the algorithm that continuously learns the weak learner and modifies it to improve the accuracy is called ** boosting **. We aim to make the learning device more accurate by paying attention to the wrong (= misclassified) sample and performing the next learning.

The figure below shows the flow. $ M $ is the number of trials, and the number of trials increases from the upper left to the lower right. I would like to classify blue and red circles, but the misclassified samples are largely ** circled **. You can see that the green classification line moves to somehow separate the misclassified sample.

image.png

A similar learner is ** Bagging **. For bagging, the results of independent learning are averaged. While each of these learners is independent, the difference is that in the case of boosting, the results learned earlier are used in later learning.

Understand Adaboost's algorithm

Now, let's understand the algorithm of Adaboost. Adaboost is an abbreviation for Adaptive Boosting. This is an algorithm announced in 1997. Adaboost applies the idea of using the new predictor to make corrections that pay extra attention to the parts that were misjudged by the previous predictor.

A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting https://www.sciencedirect.com/science/article/pii/S002200009791504X

The specific algorithm is as follows.

** Premise **

algorithm

  1. Input: The training data is $ (X, T) $, the base classifier is $ y $, and the number of classifiers is $ M $.
  2. Initialize the training data weight $ (w_0, w_1, \ dots, w_ {N-1}) $ to $ w ^ {(0)} _ {n} = 1 / N $.
  3. Repeat the following for $ m = 0,1, .... M-1 $. (a) The classifier $ y_m (x) $ is represented below. The error function $ J_m $ is adjusted to the test data so as to be minimized. $ \begin{align} J_m = \sum_{n=0}^{N-1} w^{(m)}\_{n} I \left[ y(x_n, \theta_m) \neq t_n \right] \end{align} $

⇒ At this time, $ I \ left [y (x_n, \ theta_m) \ neq t_n \ right] $ is ** 0 (= correct answer) if the test data and the predicted value match, and 1 if wrong. I will. Therefore, if the number of correct answers is large, $ J_m $ will be small ** (= it plays the role of an error function and is OK).

(b) Calculate $ \ epsilon_m and \ alpha_m $ from the error of the base classifier. $ \ Epsilon_m $ represents the error rate, and $ \ alpha_m $ is the coefficient to be reflected in the weight parameter. $ \begin{align} \varepsilon_m &:= \frac{\sum_{n=0}^{N-1} w^{(m)}\_{n} I \left[ y(x_n, \theta_m) \neq t_n \right]}{\sum_{n=0}^{N-1} w^{(m)}\_{n}} \\\ \alpha_m &:= \log \left(\frac{1 - \varepsilon_m}{\varepsilon_m} \right) \end{align} $

(c) Use the $ \ alpha_m $ found in the previous step to set the weight $ w $ $ \begin{align} \tilde{w}^{(m+1)}\_{n} &= w^{(m)}\_{n} \exp\left\\{ \alpha_m I \left[ y(x_n, \theta_m) \neq t_n \right] \right\\} \\\ \end{align} $ Update like.

⇒ At this time, if the predicted value and the test data match, $ I $ indicates $ 0 $. On the contrary, if it is wrong, it will be $ 1 $, so the weighting will be updated together with the previous $ \ alpha $.

Deepen your understanding by implementing Adaboost

Now, I would like to actually move on to implementation.

Understand Decision Stump

This time, the weak learner uses the determined strain. A deciding stock is to select one of the input variables of ** certain data, set a threshold value according to the value, and classify. ** In short, I understand that it is a decision tree with a depth of 1. Now, the program below defines the determined stock class.

adaboost.ipynb



    def fit_onedim(self, X, y, sample_weight, axis):

        N = len(X)

        # Here we sort everything according the axis-th coordinate of X
        sort_ind = np.argsort(X[:, axis])
        sorted_label = y[sort_ind]
        sorted_input = X[sort_ind]
        sorted_sample_weight = sample_weight[sort_ind]

        pred = -2*np.tri(N-1, N, k=0, dtype='int') + 1 
        mistakes = (pred != sorted_label ).astype('int')

        # The (weighted) error is calculated for each classifier
        errs = np.zeros((2, N-1))
        errs[0] = mistakes @ sorted_sample_weight
        errs[1] = (1 - mistakes) @ sorted_sample_weight

        # Here, we select the best threshold and sign
        ind = np.unravel_index(np.argmin(errs, axis=None), errs.shape)
        sign = -2*ind[0] + 1
        threshold = ( sorted_input[ind[1], axis] + sorted_input[ind[1] + 1, axis] ) / 2
        err = errs[ind]
        return sign, threshold, err

Learn more about calculations

This time, as shown below, we are using the makemoons dataset. Consider the case where the number of sample data is $ 100 $ as shown in the image below, the data with $ 1 $ label is $ 50 $, and the data with $ -1 $ label is $ 50 $.

image.png

This dataset is stored in $ X = (x_1, x_2) $. For convenience, define the index before sorting in ascending order by the following np.argsort.

image.png

Next, perform the np.tri method (upper and lower triangular matrix generation) as the matrix pred that gives a tentative prediction value. image.png

Then define this pred-y as a mistake matrix. As a result, 99 sets of errata were created. image.png

Finally, multiply this errata with the weight to find the loss function. image.png

The boundary line is the average value of $ x_1 ^ m, x_1 ^ {m + 1} $, which is the boundary when the minimum value of this loss function is taken. In the case of Adaboost, there is also a process to update the weight for the incorrect label here.

(Supplement) Determine if they match

Here are some of the things I thought were a good way to do this. Whether the predicted value is correct or not is determined by the truth value (bool type). In the contents of the error function earlier, it was $ 0 $ if it was hit, and $ 1 $ if it was not, so we will work it that way.

sample.ipynb


a = np.array([[1,1,1,1],[1,1,1,1],[1,1,0,1]])
b = np.array([1,1,1,0])
c1 = (a != b )
c2 = (a != b ).astype('int')

print(c1)
print(c2)
[[False False False  True]
 [False False False  True]
 [False False  True  True]]
[[0 0 0 1]
 [0 0 0 1]
 [0 0 1 1]]

Here is a simple example. Let $ a $ be the predicted value and $ b $ be the correct label to determine. You can see that the program returns $ 0 $ for False this time, and $ 1 $ for True if it is off.

Actually move + evaluate the difference due to the number of trials

Let's actually move it. Try going from $ 1 to $ 9 as the number of trials.

adaboost.ipynb



num_clf_list = [1, 2, 3, 4, 5, 6,7,8,9]
clfs = [AdaBoost(DecisionStump, num_clfs=M) for M in num_clf_list]

fig, axes = plt.subplots(3, 3, figsize=(15,15))

for i in range(len(clfs)):
    clfs[i].fit(X, y)
    plot_result(ax=np.ravel(axes)[i], clf=clfs[i], xx=xx, yy=yy, X=X, y=y)
    np.ravel(axes)[i].set_title(f'M = {num_clf_list[i]}')

002.png

You can see that the line to classify gradually separates, and it seems that it will become an accurate classifier. In the case of a determined stock, as you can see from the definition, it is difficult to classify with a non-linear curve because it tries to classify with a straight line.

At the end

This time, I deepened my understanding of Adaboost with implementation. Although the algorithm itself is easy to understand, when it comes to implementing it, I found that I could not proceed without a good understanding of matrix calculations such as numpy. From here, I would like to improve my analytical skills by understanding algorithms that are often used in Kaggle such as XG boost.

The full program is here. https://github.com/Fumio-eisan/adaboost_20200427

Recommended Posts

I tried to understand it carefully while implementing the algorithm Adaboost in machine learning (+ I deepened my understanding of array calculation)
(Machine learning) I tried to understand the EM algorithm in a mixed Gaussian distribution carefully with implementation.
[Machine learning] I tried to summarize the theory of Adaboost
Note that I understand the algorithm of the machine learning naive Bayes classifier. And I wrote it in Python.
I tried to understand the learning function in the neural network carefully without using the machine learning library (second half).
I tried to understand the learning function of neural networks carefully without using a machine learning library (first half).
I tried to make Othello AI with tensorflow without understanding the theory of machine learning ~ Introduction ~
I tried to understand supervised learning of machine learning in an easy-to-understand manner even for server engineers 1
I tried to understand supervised learning of machine learning in an easy-to-understand manner even for server engineers 2
I tried to make Othello AI with tensorflow without understanding the theory of machine learning ~ Implementation ~
(Machine learning) I tried to understand Bayesian linear regression carefully with implementation.
I tried to make Othello AI with tensorflow without understanding the theory of machine learning ~ Battle Edition ~
I tried to organize the evaluation indexes used in machine learning (regression model)
I tried to predict the presence or absence of snow by machine learning.
I tried to predict the change in snowfall for 2 years by machine learning
I didn't understand the Resize of TensorFlow so I tried to summarize it visually.
I tried to compress the image using machine learning
[Linux] I learned LPIC lv1 in 10 days and tried to understand the mechanism of Linux.
[Introduction] I tried to implement it by myself while explaining to understand the binary tree
I tried to visualize the lyrics of GReeeen, which I used to listen to crazy in my youth but no longer listen to it.
I tried to compare the accuracy of machine learning models using kaggle as a theme.
I tried to verify the yin and yang classification of Hololive members by machine learning
I wrote it in Go to understand the SOLID principle
I tried "Implementing a genetic algorithm (GA) in python to solve the traveling salesman problem (TSP)"
How to use machine learning for work? 01_ Understand the purpose of machine learning
I tried how to improve the accuracy of my own Neural Network
I tried to classify guitar chords in real time using machine learning
I tried to understand the decision tree (CART) that makes the classification carefully
I tried to visualize the model with the low-code machine learning library "PyCaret"
I tried to display the altitude value of DTM in a graph
I tried the common story of using Deep Learning to predict the Nikkei 225
Record the steps to understand machine learning
I wanted to know the number of lines in multiple files, so I tried to get it with a command
I tried to understand the support vector machine carefully (Part 1: I tried the polynomial / RBF kernel using MakeMoons as an example).
Note that I understand the least squares algorithm. And I wrote it in Python.
[Azure] I tried to create a Linux virtual machine in Azure of Microsoft Learn
I tried to process and transform the image and expand the data for machine learning
I tried to rescue the data of the laptop by booting it on Ubuntu
I tried to optimize while drying the laundry
I tried to touch the API of ebay
I tried to correct the keystone of the image
I tried to predict the price of ETF
I tried to vectorize the lyrics of Hinatazaka46!
[Introduction] I tried to implement it by myself while explaining the binary search tree.
I tried to extract the text in the image file using Tesseract of the OCR engine
I tried to put HULFT IoT (Agent) in the gateway Rooster of Sun Electronics
[First data science ⑥] I tried to visualize the market price of restaurants in Tokyo
I tried to make it easy to change the setting of authenticated Proxy on Jupyter
I tried to move machine learning (ObjectDetection) with TouchDesigner
I tried to graph the packages installed in Python
I tried to summarize the basic form of GPLVM
About testing in the implementation of machine learning models
I want to fully understand the basics of Bokeh
I tried to implement GA (genetic algorithm) in Python
I tried to visualize the spacha information of VTuber
I tried to erase the negative part of Meros
I investigated the reinforcement learning algorithm of algorithmic trading
I tried to implement automatic proof of sequence calculation
I tried to classify the voices of voice actors
I tried to summarize the string operations of Python
I tried to visualize the power consumption of my house with Nature Remo E lite