A method of calculating the probability of belonging to an objective variable by combining multiple explanatory variables. The image below calculates the probability based on whether it belongs to conditions such as Yes / No.
Random forest is one of the ensemble learning methods (classifiers composed of multiple classifiers). Since multiple decision trees are collected and used, the trees are collected and used as a forest.
Prepare randomly created data.
import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn %matplotlib inline from sklearn.datasets import make_blobs #For generating dummy data X, y = make_blobs(n_samples=500, centers=4, random_state=8, cluster_std=2.4) # n_samples:Number of samples centers:Number of center points random_state:seed value cluster_std:Degree of variation
plt.figure(figsize =(10,10)) plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='jet')
The distribution of 500 data generated from the four center points looks like this.
The code for visualize_tree is written at the bottom.
from sklearn.tree import DecisionTreeClassifier #For decision trees clf = DecisionTreeClassifier(max_depth=2, random_state = 0) #Instance creation max_depth:Tree depth visualize_tree(clf, X, y) #Draw execution
You can see how it can be classified into four using straight lines. The accuracy changes depending on the number of depths (max_depth) of the decision tree, so when I set max_depth to 4, it became as follows.
From depth 2, you can see that you are trying to make a finer classification.
However, the greater the depth, the higher the accuracy of the training data, but the easier it is to overfit. To avoid this, try running a random forest.
from sklearn.ensemble import RandomForestClassifier #For random forest clf = RandomForestClassifier(n_estimators=100, random_state=0) #Instance creation n_estimators:Specifying the number of decision trees to make visualize_tree(clf, X, y, boundaries=False)
You can see that the classification is not a simple straight line, but it seems to be more accurate than one decision tree. By the way, doing this does not always prevent overfitting. For example, the red circle in the lower right of the above figure seems to be outlier, but it is grouped with red.
A function that visualizes the result.
#Try drawing a decision tree def visualize_tree(classifier, X, y, boundaries=True,xlim=None, ylim=None): """Visualization function of decision tree. INPUTS:Classification model, X, y, optional x/y limits. OUTPUTS:Visualization of decision trees using Meshgrid """ #Building a model using fit classifier.fit(X, y) #Automatic adjustment of axis if xlim is None: xlim = (X[:, 0].min() - 0.1, X[:, 0].max() + 0.1) if ylim is None: ylim = (X[:, 1].min() - 0.1, X[:, 1].max() + 0.1) x_min, x_max = xlim y_min, y_max = ylim #Create a mesh grid. xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),np.linspace(y_min, y_max, 100)) #Perform classifier predictions Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()]) #Shaped using meshgrid. Z = Z.reshape(xx.shape) #Colored by classification. plt.figure(figsize=(10,10)) plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='jet') #Drawing of training data. plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='jet') plt.xlim(x_min, x_max) plt.ylim(y_min, y_max) def plot_boundaries(i, xlim, ylim): ''' Draw a border. ''' if i < 0: return tree = classifier.tree_ #Call recursively to draw the boundary. if tree.feature[i] == 0: plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k') plot_boundaries(tree.children_left[i], [xlim, tree.threshold[i]], ylim) plot_boundaries(tree.children_right[i], [tree.threshold[i], xlim], ylim) elif tree.feature[i] == 1: plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k') plot_boundaries(tree.children_left[i], xlim, [ylim, tree.threshold[i]]) plot_boundaries(tree.children_right[i], xlim, [tree.threshold[i], ylim]) if boundaries: plot_boundaries(0, plt.xlim(), plt.ylim())
Random forests can also regress.
Use sin to prepare data that makes small waves move in large waves
from sklearn.ensemble import RandomForestRegressor x = 10 * np.random.rand(100) def sin_model(x, sigma=0.2): """Dummy data consisting of large waves + small waves + noise.""" noise = sigma * np.random.randn(len(x)) return np.sin(5 * x) + np.sin(0.5 * x) + noise #Calculate y from x y = sin_model(x) #Try Plot. plt.figure(figsize=(16,8)) plt.errorbar(x, y, 0.1, fmt='o')
Wave-like dummy data. It gets smaller little by little.
from sklearn.ensemble import RandomForestRegressor #For random forest regression #Prepare 1000 pieces of data from 0 to 10 for confirmation xfit = np.linspace(0, 10, 1000) #1000 pieces from 0 to 10 #Random forest execution rfr = RandomForestRegressor(100) #Instance generation Specify the number of trees to 100 rfr.fit(x[:, None], y) #Learning execution yfit = rfr.predict(xfit[:, None]) #Predictive execution #Get the actual value for result comparison. ytrue = sin_model(xfit,0) #Feed xfit to the wave generation function and get the result #Check the result plt.figure(figsize = (16,8)) plt.errorbar(x, y, 0.1, fmt='o') plt.plot(xfit, yfit, '-r') #Predicted plot plt.plot(xfit, ytrue, '-k', alpha = 0.5) #Correct answer plot
You can see that the red line is the regression line of the prediction, and the result seems to be reasonably good.