[PYTHON] [Machine learning] Let's summarize random forest in an easy-to-understand manner

Introduction

In this article, I will summarize the random forest algorithm.

Random forests are a combination of many decision trees, so you need to understand the decision tree algorithm first.

Please refer to the article here for the decision tree.

Random forest is a type of ensemble learning, so let's first talk about ensemble learning.

What is ensemble learning?

Ensemble learning is a technique that attempts to obtain better predictions by combining multiple learners.

In many cases, you will get better results than using a single model.

As for how the learners are combined: for classification, a majority vote of the learners is taken, and for regression, the average of the learners' predictions is taken.

Commonly used techniques in ensemble learning include bagging, boosting, stacking, and bumping.

Random forest can be described as ensemble learning that uses decision trees as the learners and a technique called bagging to combine them.

A lot of terms have appeared at once, which makes this hard to follow, so I will explain each technique in turn.

I referred to the article here.

About bagging

Bagging is an abbreviation for bootstrap aggregating.

Using a technique called bootstrapping, several datasets are created from a single dataset, one learner is generated for each of these duplicated datasets, and the final prediction is made by taking a majority vote of the learners created in this way.

Bootstrapping is a method of sampling n data points from a dataset while allowing duplicates (sampling with replacement).

Let the dataset be $S_0 = (d_1, d_2, d_3, d_4, d_5)$. When sampling n = 5 data points, you might create datasets such as $S_1 = (d_1, d_1, d_3, d_4, d_5)$ or $S_2 = (d_2, d_2, d_3, d_4, d_5)$.

As you can see, bootstrapping lets you create many different datasets from a single dataset.
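
As a minimal sketch (not from the original article), bootstrap sampling can be done with numpy's random generator; the dataset below is a toy stand-in for $(d_1, \dots, d_5)$.

import numpy as np

rng = np.random.default_rng(0)
S0 = np.array(['d1', 'd2', 'd3', 'd4', 'd5'])  # the original dataset

# Sample n = 5 elements with replacement (duplicates are allowed)
S1 = rng.choice(S0, size=5, replace=True)
print(S1)  # one possible result, e.g. ['d4' 'd1' 'd3' 'd3' 'd5']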

Let's consider the predicted value with a concrete example.

Generate N bootstrap datasets of size n from the training dataset.

Create N prediction models using those datasets, and let each model's prediction be $ y_n(X) $.

Since the final prediction is the average of these N predicted values, the prediction of a model using bagging is as follows.

y(X) = \frac{1}{N}\sum_{n=1}^{N}y_n(X)
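
As a hedged sketch of this idea (my own illustration, not code from the article), bagging for regression can be written as follows, assuming scikit-learn's DecisionTreeRegressor as the base learner:

import numpy as np
from sklearn.tree import DecisionTreeRegressor


def bagging_predict(X_train, y_train, X_test, N=10):
    rng = np.random.default_rng(0)
    predictions = []
    for _ in range(N):
        # Bootstrap sample of the training set
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_test))  # y_n(X)
    # y(X) = (1/N) * sum_n y_n(X)
    return np.mean(predictions, axis=0)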

This is the end of the explanation of bagging. Next, let's look at boosting.

About boosting

In boosting, weak learners are not created independently as in bagging; instead, they are constructed one at a time. The (k+1)-th weak learner is constructed based on the k-th weak learner (to compensate for its weaknesses).

Unlike bagging, which generates weak learners independently, boosting generates them one by one, so it takes more time. In exchange, boosting tends to be more accurate than bagging.
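
For example, scikit-learn's AdaBoostClassifier builds weak learners sequentially in this way. A minimal usage sketch (not part of the original article; the fit/score calls assume train/test splits like the ones used later):

from sklearn.ensemble import AdaBoostClassifier

# Each new weak learner focuses on the samples the previous ones got wrong
boost_clf = AdaBoostClassifier(n_estimators=50, random_state=0)
# boost_clf.fit(X_train, Y_train)
# boost_clf.score(X_test, Y_test)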

Stacking

For bagging, we considered a simple average of N predicted values.

This algorithm evaluates individual predictions equally and does not take into account the importance of each model.

Stacking instead weights each prediction according to its importance and uses the weighted sum as the final prediction.

It is expressed by the following formula.

y(X) = \sum_{n=1}^{N}W_ny_n(X)
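
A minimal numpy sketch of this weighted combination; the predictions and the weights $W_n$ below are hypothetical placeholders (in practice the weights are learned, for example by a meta-model):

import numpy as np

# Hypothetical predictions y_n(X) from N = 3 learners for the same three inputs
y_1 = np.array([1.0, 2.0, 3.0])
y_2 = np.array([1.2, 1.8, 3.1])
y_3 = np.array([0.9, 2.2, 2.9])

# Hypothetical importance weights W_n
weights = np.array([0.5, 0.3, 0.2])

y_final = np.average(np.vstack([y_1, y_2, y_3]), axis=0, weights=weights)
print(y_final)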

Bumping

Bumping is a technique for finding the best-fitting model among multiple learners.

Generate N models using bootstrap datasets, apply each of those learners to the original data, and select the one with the smallest prediction error as the best model.

This may not seem very beneficial, but it helps you avoid a model that happened to be trained on poor-quality data.
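
A rough sketch of bumping (my own illustration, using scikit-learn's DecisionTreeClassifier as the learner): train each model on a bootstrap sample, evaluate every model on the original data, and keep the best one.

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def bumping(X, y, N=10):
    rng = np.random.default_rng(0)
    best_model, best_score = None, -np.inf
    for _ in range(N):
        # Train on a bootstrap sample
        idx = rng.choice(len(X), size=len(X), replace=True)
        model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        # Evaluate on the original data and keep the best-fitting model
        score = model.score(X, y)
        if score > best_score:
            best_model, best_score = model, score
    return best_model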

About the random forest algorithm

So far we have dealt with ensemble learning.

Random forest is a method that uses bagging from ensemble learning, with decision trees as the base learners.

The algorithm is as follows.

  1. Create N bootstrap datasets from the training data.

  2. Use these datasets to generate N decision trees. At this time, m features are randomly selected from the p available features.

  3. In the case of classification, the majority vote of N decision trees is used, and in the case of regression, the average of the predictions of N decision trees is the final prediction.

There is a reason why only a subset of the features is used in step 2.

This is because in ensemble learning, the lower the correlation between models, the more accurate the predictions.

The intuition is that it is better to gather people with different ideas than to gather many similar people.

Bootstrapping already trains each tree on different data, but by also changing the features, the trees are trained on even more varied data, which lowers the correlation between the models.
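
To make these three steps concrete, here is a rough from-scratch sketch (my own simplification, not the article's code). The bootstrap sampling is done manually, the random selection of m features at each split is delegated to DecisionTreeClassifier's max_features option, and the majority vote assumes 0/1 labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier


class SimpleRandomForest:
    def __init__(self, n_trees=10, max_features='sqrt', random_state=0):
        self.n_trees = n_trees
        self.max_features = max_features  # m features tried at each split
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        self.trees = []
        for _ in range(self.n_trees):
            # 1. bootstrap dataset: sample len(X) rows with replacement
            idx = rng.choice(len(X), size=len(X), replace=True)
            # 2. each tree considers only a random subset of features per split
            tree = DecisionTreeClassifier(max_features=self.max_features)
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        # 3. majority vote over the N trees (assumes 0/1 labels)
        votes = np.array([tree.predict(X) for tree in self.trees])
        return (votes.mean(axis=0) >= 0.5).astype(int)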

Random forest implementation

Now let's implement it.

This time, let's classify the data generated by make_moons in sklearn.

Let's draw the data with the following code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from matplotlib.colors import ListedColormap
import mglearn

moons = make_moons(n_samples=200, noise=0.2, random_state=0)
X = moons[0]
Y = moons[1]
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.show()

(Figure: scatter plot of the make_moons data drawn with mglearn.discrete_scatter)

mglearn.discrete_scatter draws the data by taking (x coordinates, y coordinates, correct labels) as arguments.

Let's draw the same data using a normal ax.plot instead of mglearn. I created the following function.

def plot_datasets(x, y):
    figure = plt.figure(figsize=(12, 8))
    ax = figure.add_subplot(111)
    ax.plot(x[:, 0][y == 0], x[:, 1][y == 0], 'bo', ms=15)
    ax.plot(x[:, 0][y == 1], x[:, 1][y == 1], 'r^', ms=15)
    ax.set_xlabel('$x_0$', fontsize=15)
    ax.set_ylabel('$x_1$', fontsize=15)


plot_datasets(X, Y)
plt.show()

(Figure: the same data drawn with ax.plot)

'bo' means a blue circle and 'r^' means a red triangle.

Let's summarize these format strings. The first character specifies the color, using the first letter of color names such as 'red', 'blue', 'green', and 'cyan'.

The second character specifies the marker shape: 's', 'x', 'o', '^', and 'v' are squares, crosses, circles, upward triangles, and downward triangles, in that order.
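
For example, the following small illustration (not from the original article) draws green squares and cyan downward triangles:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 2], 'gs', ms=15)  # 'gs': green squares
ax.plot([0, 1, 2], [2, 1, 0], 'cv', ms=15)  # 'cv': cyan downward triangles
plt.show()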

We will classify the above data using a random forest.

Below is the code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


def plot_decision_boundary(model, x, y, ax, margin=0.3):
    # Build a 100 x 100 grid covering the data range plus a margin
    _x = np.linspace(x[:, 0].min() - margin, x[:, 0].max() + margin, 100)
    _y = np.linspace(x[:, 1].min() - margin, x[:, 1].max() + margin, 100)
    xx, yy = np.meshgrid(_x, _y)
    X = np.hstack((xx.ravel().reshape(-1, 1), yy.ravel().reshape(-1, 1)))
    # Predict a class for every grid point and color the regions
    y_pred = model.predict(X).reshape(yy.shape)
    custom_cmap = ListedColormap(['green', 'cyan'])
    ax.contourf(xx, yy, y_pred, alpha=0.3, cmap=custom_cmap)


def plot_datasets(x, y, ax):
    ax.plot(x[:, 0][y == 0], x[:, 1][y == 0], 'gs', ms=15)
    ax.plot(x[:, 0][y == 1], x[:, 1][y == 1], 'c^', ms=15)
    ax.set_xlabel('$x_0$', fontsize=15)
    ax.set_ylabel('$x_1$', fontsize=15)


moons = make_moons(n_samples=200, noise=0.2, random_state=0)
X = moons[0]
Y = moons[1]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
random_clf = RandomForestClassifier()
random_clf.fit(X_train, Y_train)
figure = plt.figure(figsize=(12, 8))
ax = figure.add_subplot(111)
plot_datasets(X, Y, ax)
plot_decision_boundary(random_clf, X, Y, ax)
plt.show()

(Figure: the random forest's decision boundary overlaid on the data)

You can see that the data is classified quite well.

I will explain the code.

_x = np.linspace(x[:, 0].min() - margin, x[:, 0].max() + margin, 100)
_y = np.linspace(x[:, 1].min() - margin, x[:, 1].max() + margin, 100)
xx, yy = np.meshgrid(_x, _y)

This code creates the grid points. Please refer to the article here for more about grid points.

The grid points cover the plotted data range with a margin beyond its minimum and maximum values.

X = np.hstack((xx.ravel().reshape(-1, 1), yy.ravel().reshape(-1, 1)))
y_pred = model.predict(X).reshape(yy.shape)

After flattening each 100 × 100 grid into a one-dimensional array with `ravel()`, it is converted into a 10000 × 1 column vector with `reshape(-1, 1)`, and the two are joined horizontally with `np.hstack`.

`y_pred = model.predict(X).reshape(yy.shape)` makes predictions for the 10000 × 2 data. The model returns 0 on one side of the boundary and 1 on the other, so the result is reshaped back into 100 × 100 data.
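
As a quick sanity check of the shapes involved (my own annotation, assuming the 100 × 100 grid above):

# xx.shape                          -> (100, 100)
# xx.ravel().reshape(-1, 1).shape   -> (10000, 1)
# X.shape                           -> (10000, 2)
# model.predict(X).shape            -> (10000,)
# y_pred.shape                      -> (100, 100)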

custom_cmap = ListedColormap(['green', 'cyan'])
ax.contourf(xx, yy, y_pred, alpha=0.3, cmap=custom_cmap)

The colors used for the filled contours are specified by `custom_cmap`, and the contours are drawn by `ax.contourf(xx, yy, y_pred, alpha=0.3, cmap=custom_cmap)`.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
random_clf = RandomForestClassifier()
random_clf.fit(X_train, Y_train)

This code splits the data into training and test sets, creates a random forest model, and trains it. Now let's evaluate the model with the code below.

print(random_clf.score(X_test, Y_test))

0.96
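
RandomForestClassifier was used with its default settings above. The number of trees N and the number of candidate features m discussed earlier correspond to the n_estimators and max_features parameters; a hedged example of setting them explicitly (the particular values here are just illustrative):

# n_estimators: the number of trees N, max_features: features considered at each split (m)
random_clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
random_clf.fit(X_train, Y_train)
print(random_clf.score(X_test, Y_test))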

At the end

That's all for this article.

Thank you for reading.
