[PYTHON] I tried to carefully understand the decision tree (CART) that performs classification

Introduction

So far, we have summarized logistic regression, support vector machines, neural networks, and so on. This time, I will summarize the decision tree, which is the basis of methods such as XGBoost, LightGBM, and Random Forest.

What is a decision tree?

A decision tree is an algorithm that "separates data step by step and outputs tree-like analysis results."

021.png

The great advantage of performing an analysis with a decision tree is that the results are easy to interpret.

Other classifiers (support vector machines, neural networks, etc.) are black boxes: the calculations performed inside are very complicated, and it is difficult for anyone who is not familiar with the internals of the model to understand the results. The decision tree, on the other hand, makes the grounds for every split clear, as shown in the figure above. I think this **ease of understanding** is a huge advantage. This is because I don't think it is a good attitude for an engineer to produce results using a model that **he or she does not understand**.

The decision tree is an algorithm that can be applied to both classification and regression, but this time I would like to summarize classification. There are several decision-tree algorithms, such as CART and C4.5; this article deals with CART.

How to decide the criteria for division

We split the data so that, after the split, each group contains as much as possible only elements of a single class (= the impurity after the split is minimized). Impurity is an indicator of how mixed together elements of different classes are.

I would like to consider this briefly with an example. Let's load scikit-learn's make_blobs() function and create a scattered sample.

RF.ipynb


#Imports
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=500, centers=4,
                  random_state=8, cluster_std=2.2)

#Scatter plot the points
plt.figure(figsize=(10,10))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='jet')

The original scatter plot is here.

022.png

Now, let's think about how to divide this scatter plot.

image.png

With the dividing line on the left, the data can be split into a group that contains many elements of classes $[0]$ and $[1]$ (and few of $[2]$ and $[3]$) and a group that contains many elements of classes $[2]$ and $[3]$ (and few of $[0]$ and $[1]$). A split like this is one with low impurity. With the dividing line on the right, on the other hand, you can see that elements of classes $[0]$ through $[3]$ remain mixed within the groups after the split; this is expressed as having high impurity. We classify the data by finding dividing lines like the one on the left and splitting repeatedly (the number of splits is called the depth of the decision tree).

How to express impurity

Now, let's summarize how to express this impurity. Here we focus on the Gini Diversity Index, which is commonly called the Gini impurity.

Consider a node $t$ of the decision tree (a node is one of the groups produced by a split). Suppose the node contains $n$ samples belonging to $c$ classes. If the number of samples in node $t$ that belong to class $i$ is $n_i$, then the proportion $p(i|t)$ of samples belonging to class $i$ can be expressed as follows.

p(i|t) = \frac{n_i}{n} \tag{1}

At this time, the Gini impurity $I_G(t)$ can be expressed as follows.

I_G(t) = 1 - \sum_{i=1}^c {p(i|t)}^2 \tag{2}

If a good split is made, the sum of ${p(i|t)}^2$ becomes large, and therefore $I_G(t)$ becomes small. We will build a good classifier using this evaluation index.
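To get a feel for equation (2), here is a minimal sketch (the gini function and the sample counts are my own, not part of the original notebook) that computes the Gini impurity from per-class sample counts:

RF.ipynb


def gini(counts):
    #Gini impurity from per-class sample counts (equation (2))
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

#A node dominated by one class is nearly pure
print(gini([98, 1, 1, 0]))     # ~0.04
#An evenly mixed node over four classes has the maximum impurity
print(gini([25, 25, 25, 25]))  # 0.75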

Preparing a helper to make the graph easier to see

Color each class to make it easier to understand.

RF.ipynb


def visualize_tree(classifier, X, y, boundaries=True, xlim=None, ylim=None):
    '''
    Visualize the decision tree.
    INPUTS: classification model, X, y, optional x/y limits.
    OUTPUTS: visualization of the decision tree using a meshgrid
    '''
    #Building a model using fit
    classifier.fit(X, y)
    
    #Automatic adjustment of axis
    if xlim is None:
        xlim = (X[:, 0].min() - 0.1, X[:, 0].max() + 0.1)
    if ylim is None:
        ylim = (X[:, 1].min() - 0.1, X[:, 1].max() + 0.1)

    x_min, x_max = xlim
    y_min, y_max = ylim
    
    
    #Make a mesh grid.
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    
    #Save classifier predictions as Z
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])

    #Use meshgrid to shape it.
    Z = Z.reshape(xx.shape)
    
    #Color each category.
    plt.figure(figsize=(10,10))
    plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='jet')
    
    #It also draws training data.
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='jet')
    
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)        
    
    def plot_boundaries(i, xlim, ylim):
        '''
Draw a border.
        '''
        if i < 0:
            return

        tree = classifier.tree_
        
        #Call recursively to draw the boundary.
        if tree.feature[i] == 0:
            plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k')
            plot_boundaries(tree.children_left[i],
                            [xlim[0], tree.threshold[i]], ylim)
            plot_boundaries(tree.children_right[i],
                            [tree.threshold[i], xlim[1]], ylim)
        
        elif tree.feature[i] == 1:
            plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k')
            plot_boundaries(tree.children_left[i], xlim,
                            [ylim[0], tree.threshold[i]])
            plot_boundaries(tree.children_right[i], xlim,
                            [tree.threshold[i], ylim[1]])
    
    if boundaries:
        plot_boundaries(0, plt.xlim(), plt.ylim())

Try to classify by decision tree

RF.ipynb


from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='entropy',max_depth=2,random_state=0)

visualize_tree(clf,X,y)

023.png

It split things up nicely. In the code above, the depth of the decision tree is set with max_depth (note that criterion='entropy' is used here; to split by the Gini impurity discussed above, pass criterion='gini', which is scikit-learn's default). Increasing max_depth (= deepening the tree) can lead to overfitting. Let's try max_depth=6, as sketched below.
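A minimal sketch of that experiment, reusing the visualize_tree helper above (the name clf6 is my own; only max_depth changes from the previous cell):

RF.ipynb


clf6 = DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=0)
visualize_tree(clf6, X, y)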

024.png

You can see that the space is over-divided (especially for the red class). The point is that you need to tune this depth yourself.
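As an aside (this is my own addition, not in the original notebook), one common way to choose the depth instead of eyeballing it is cross-validation, for example with scikit-learn's GridSearchCV:

RF.ipynb


from sklearn.model_selection import GridSearchCV

#Search depths 1 to 10 with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': list(range(1, 11))},
                      cv=5)
search.fit(X, y)
print(search.best_params_)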

Visualization of the basis of the decision tree

By the way, an image of the decision tree trained with depth 2 is shown below.

image.png

Each node shows the threshold used as the splitting criterion and its impurity. The value row is the number of samples belonging to each of classes [0] to [3].

How to output a decision tree image

By the way, I posted that image casually, but producing it requires importing a library and installing some software:

  1. Install pydotplus
  2. Install graphviz
  3. Pass the path

These three steps were necessary for me to save the image.

  1. Install pydotplus

pydotplus is a library that saves the tree produced by the decision tree classifier to a .dot file.

console


pip install pydotplus

Just like any other library, you can install it with pip.

  2. Install graphviz

I downloaded the installer (graphviz-2.38.msi) from this URL.

https://graphviz.gitlab.io/_pages/Download/Download_windows.html

After the download is complete, double-click "graphviz-2.38.msi" to install it. In addition, install the graphviz package with pip as well.
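The pip command follows the same pattern as before:

console


pip install graphviz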

  3. Pass the path

Then, pass the path so that the tree can be converted to PDF: add the folder where dot.exe is located to your PATH. I have summarized the method before, so please refer to the link below.

https://qiita.com/Fumio-eisan/items/340de9fe220a90607013
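As an alternative sketch (my own, not from the original article), you can also append the folder to PATH inside the notebook before rendering; the install folder below is an assumption, so point it at wherever dot.exe actually lives:

RF.ipynb


import os

#Hypothetical Graphviz install folder; replace with the directory that contains dot.exe
os.environ["PATH"] += os.pathsep + r"C:\Program Files (x86)\Graphviz2.38\bin"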

Finally, you need to rewrite the syntax in graphviz.py as follows:

image.png

With that done, the decision tree can be converted to .dot and PDF as follows.

RF.ipynb



import pydotplus
from graphviz import Source
from sklearn.tree import export_graphviz

#Write the fitted tree to a .dot file with filled, rounded nodes
export_graphviz(
        clf,
        out_file="text_classification.dot",
        class_names=['1', '2', '3', '4'],
        rounded=True,
        filled=True
    )

#Write a plain version of the tree to random.dot
with open("random.dot", 'w') as f:
    export_graphviz(clf, out_file=f)

#Get the dot source as a string and render it to PDF via pydotplus
data = export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(data)
graph.write_pdf("random.pdf")
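As an aside, since graphviz's Source is imported above, the same PDF can also be produced without pydotplus. A minimal sketch (the output file name random_src is my own choice):

RF.ipynb


#Render the dot source string directly; this writes random_src.pdf
Source(data).render("random_src", format="pdf", cleanup=True)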

I referred to this site.

GraphViz error handling (GraphViz's executables not found) https://niwakomablog.com/graphviz-error-handling/

At the end

This time, I summarized the theory and implementation of classification with decision trees. The idea is easy to understand and easy to implement. However, converting the decision tree image to PDF at the end was a bit of a hurdle. Next, I would like to cover regression.

The full program is here. https://github.com/Fumio-eisan/RF_20200423
