[Machine learning] Understanding decision trees from both scikit-learn and mathematics

1. Purpose

Anyone who wants to try machine learning can implement it relatively easily with scikit-learn and similar libraries. However, if you want to achieve results at work or raise your own level, an explanation like **"I don't know the background, but I got this result"** is obviously weak.

In this article, **sections 2 and 3 are aimed at readers who think "never mind the theory, let me try scikit-learn first", while section 4 and later are aimed at those who want to "understand the background from mathematics"**.

2. What is a decision tree?

Wikipedia describes a decision tree as follows:

A decision tree is created to help make decisions. It is a special form of tree structure.

That is a little hard to parse, but in a nutshell, a decision tree is a model that breaks a person's judgment down into a tree structure and advances the decision step by step.

This is still abstract, so let's look at a concrete example.


Specific example

Suppose you are a real estate agent who wants to pre-select rooms that young women are likely to choose and rooms that they are not. To do so, you collected 13 past records: rooms that were actually rented by young women, rooms that were not, and the attributes of those rooms.

| | Floor | Room size (m²) | Auto-lock | Rented |
|:---|:---:|:---:|:---:|:---:|
| Property 1 | 4 | 30 | Yes | ○ |
| Property 2 | 5 | 45 | No | ○ |
| Property 3 | 3 | 32 | Yes | ○ |
| Property 4 | 1 | 20 | Yes | ○ |
| Property 5 | 6 | 35 | Yes | ○ |
| Property 6 | 3 | 40 | Yes | ○ |
| Property 7 | 4 | 38 | Yes | ○ |
| Property 8 | 1 | 20 | No | × |
| Property 9 | 2 | 18 | No | × |
| Property 10 | 1 | 20 | No | × |
| Property 11 | 1 | 22 | No | × |
| Property 12 | 1 | 24 | Yes | × |
| Property 13 | 3 | 25 | No | × |

(○ = rented, × = not rented)

The example is deliberately simple, but when we choose a room ourselves, I think we reason roughly as follows.

(Figure: an example tree of branching conditions for choosing a room)

Which condition comes first differs from person to person, but take me as an example. I dislike the first floor a little, so I first ask whether the room is on the second floor or higher. If it is, I then ask whether it has an auto-lock, and if it does, I rent it. Conversely, even if it is not on the second floor or higher (i.e., a first-floor room), I might still rent it if it is reasonably large; if it is on the first floor and small, I do not rent it.

**This is exactly the structure of a tree, and a decision tree is a model that branches on conditions in this way to reach a decision.**

So how are these branching conditions determined? The example I just gave was intuitive and had no real basis.

This is where "impurity" comes in. The details are left to the mathematics chapter later, but a decision tree determines its branching conditions based on impurity. In short, we want to split the original data as cleanly as possible (in this example, into rooms that were rented and rooms that were not), and "impurity" is the index that measures how clean a split is.

If that is still vague, it is fine to think of it this way: to classify the objective variable clearly (rented / not rented in this case), the decision tree uses the impurity index to build a tree whose branches separate the data nicely.

3. Decision tree with scikit-learn

(1) Import of required libraries

Import the following libraries needed to build and visualize a decision tree (numpy etc. are imported as well).

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz
# Below are tools for decision tree visualization
import graphviz
import pydotplus
from IPython.display import Image
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn

(2) Data preparation

Set up the data as shown below: the floor, the size of the room, whether it has an auto-lock, and whether or not the room was rented (the contents are the same as the table shown at the beginning).

data = pd.DataFrame({
        "buy(y)":[True,True,True,True,True,True,True,False,False,False,False,False,False],  # rented or not (objective variable)
        "high":[4, 5, 3, 1, 6, 3, 4, 1, 2, 1, 1, 1, 3],  # floor
        "size":[30, 45, 32, 20, 35, 40, 38, 20, 18, 20, 22, 24, 25],  # room size (m^2)
        "autolock":[1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0]  # auto-lock (1 = yes, 0 = no)
    })

(3) Model construction

(i) Data shaping

First, we arrange the data into the shape needed to build the model.

y = data.loc[:,["buy(y)"]]
X = data.loc[:,["high", "size","autolock"]]

Since this is not an article on Python grammar, I will omit the details, but here we split the data into the objective variable y and the features X, the shape scikit-learn's decision tree expects.

(ii) Model construction

It's finally the model building code.

clf = DecisionTreeClassifier()
clf = clf.fit(X, y)

That's all it takes for a simple model. The first line creates a decision tree model in a variable called clf; the second line fits (= trains) clf on the prepared X and y.

(4) Model visualization

◆ Visualization code

A simple model would end at (3), but one of the advantages of decision trees is their "high readability". Simply put, even people who do not know much about machine learning can easily understand why the model produced a given result.

Let's visualize the tree-structured judgment process.


dot_data = StringIO()  # buffer that will hold the dot-file information
export_graphviz(clf, out_file=dot_data,
                     feature_names=["high", "size", "autolock"],  # edit to match your features
                     class_names=["False", "True"],  # edit to match your classes (why the order is False, True is explained below)
                     filled=True, rounded=True,
                     special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

When you run the above code, a figure like the following is displayed. (Figure: the visualized decision tree.)
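As a side note, if installing Graphviz and pydotplus is a hassle, recent versions of scikit-learn (0.21 and later) provide sklearn.tree.plot_tree, which draws the same kind of figure with matplotlib alone. A minimal sketch:

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Draw the fitted tree using matplotlib only (no Graphviz required)
plt.figure(figsize=(10, 6))
plot_tree(clf, feature_names=["high", "size", "autolock"],
          class_names=["False", "True"], filled=True, rounded=True)
plt.show()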

◆ How to see the branch

Many articles and sites end with "I was able to visualize it!", but I did not know how to read this figure at first and had a hard time, so let me add a brief guide as well.

**(a) The light-blue box at the top**

This is the very first state, and the entries in the box describe the current node:

- gini (impurity): 0.497
- samples (number of data points): 13
- value: the data is split into 6 (False) and 7 (True)
- class: True, the more common label

**※ About the order of value**
Since there is little data this time, you can tell that the 7 corresponds to True, but with more data, even though the counts are displayed as value, you still have to know which number is True and which is False. In that case, write as follows.


clf = DecisionTreeClassifier()#This is the same as before
clf = clf.fit(X, y)#This is the same as before
print(clf.classes_)#Add here

Then, in this case, [False, True] is displayed. In other words, the order of value is False first, then True.

This is why I wrote `class_names=["False","True"]` in that order in the visualization code above. Since the class order in DecisionTreeClassifier() is False, True, if you do not pass class_names in the same order, the labels will be swapped when you visualize the tree, so be careful (I stumbled quite a bit here).

**(b) Second row, the blue box on the right** This refers to the case where size (room size) is not less than 27.5 $m^2$ at the first branch (i.e., 27.5 $m^2$ or more). In that case the gini value is 0, samples (the number of data points) is 6, and all 6 are True.

This means that if a room is at least 27.5 $m^2$, it was always rented! Since the gini value is 0, that is, the impurity is 0, no further branching is possible and this node is a leaf.
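Just to convince ourselves of these numbers, the gini values in the figure can be reproduced by hand from the definition covered in section 4 (a quick check, not part of the modeling flow):

# Root node: 6 not-rented and 7 rented out of 13 samples
gini_root = 1 - (6/13)**2 - (7/13)**2
print(gini_root)  # -> 0.497...

# Right node after the first split (size >= 27.5): all 6 samples rented
gini_leaf = 1 - (6/6)**2
print(gini_leaf)  # -> 0.0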

Below, you can see the other branches in the same way.

This is the end of implementing the decision tree with scikit-learn and the visualization flow.

(5) In the real world ...

Building a model is not an end in itself. In the real world, you need to use the prediction model to predict whether a room will be rented whenever data on new rooms comes in. Suppose you noted down the data for two new rooms; store it in a variable as shown below.

z = pd.DataFrame({
        "high":[2, 3],
        "size":[25, 18],
        "autolock":[1, 0]
    })
z2 = z[["high", "size", "autolock"]].values  # same column order as the training data X

What we want to do is feed this additional data into the decision tree model (clf) built with scikit-learn earlier and predict whether each room is likely to be rented.

y_est = clf.predict(z2)
y_est

Running this displays `array([False, False])` as the result. In other words, unfortunately, neither of these two rooms is predicted to be rented, so you would need a strategy of marketing them to target groups other than young women.
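As a side note, besides hard labels, DecisionTreeClassifier also exposes predict_proba, which returns class probabilities (the columns follow the clf.classes_ order we checked earlier):

# Probability of each class for the two new rooms, in [False, True] order
print(clf.predict_proba(z2))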

There are many more details, but I think it is good to first implement an orthodox decision tree like this.

4. Understanding decision trees from mathematics

Up to section 3, we went through the flow of building a decision tree model with scikit-learn → visualizing it → predicting on new data. Here, I would like to clarify how that decision tree model is calculated mathematically.

For that, we need to talk about the "impurity" introduced in the first half, the "amount of information", and "entropy" and the "Gini coefficient" as indices for measuring impurity.

(i) Impurity

In the first half, I said that a decision tree builds its judgment process based on impurity. In a little more detail, impurity is **"an index of how much the data varies"**.

For example, in classification, if the data can be classified perfectly, the impurity is 0; in regression, if the variance of the data within a node is 0, the impurity is 0.

The decision tree therefore tries to make this impurity as small as possible (as close to 0 as possible). "Entropy" and the "Gini coefficient" are the indices used to measure impurity. Before getting into them, let's first cover the amount of information.

(ii) Amount of information

Here I would like to roughly grasp the idea of the amount of information. The amount of information is **"an index that quantifies the degree of surprise when you learn of an event"**.

For example, suppose you were talking to a friend and heard the following two things. Which is more surprising?

First: It seems it will be sunny in Okinawa tomorrow. Second: Mr. Hanyu, the figure skater, is apparently as good as a professional at rugby. (This is just a made-up story.)

Obviously, the second story is more surprising, because it tells you about an event that is very unlikely to happen. In this sense, the second statement carries more information.

In the world of information theory, this amount of information is defined as follows:

I(x) = -\log_{2} P(x)

In the earlier example, if the probability that Okinawa will be sunny is 0.8, and the probability that Hanyu really is as good as a professional at rugby is 0.01, the amounts of information can be calculated as follows.

Amount of information of the event "Okinawa is sunny": $-\log_{2} 0.8 \approx 0.322$
Amount of information of the Hanyu event: $-\log_{2} 0.01 \approx 6.644$

The amount of information of the Hanyu event is overwhelmingly larger!

※Reference※ You can easily compute $\log$ values with Wolfram Alpha (enter "-log2(0.8)" in the input bar and the value is returned): https://www.wolframalpha.com/
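You can also check these values with a couple of lines of Python:

import math

# I(x) = -log2 P(x): the rarer the event, the larger the amount of information
print(-math.log2(0.8))   # ~0.322 (Okinawa being sunny)
print(-math.log2(0.01))  # ~6.644 (the Hanyu story, assuming P = 0.01)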

That covers the amount of information. With this in hand, let's move on to entropy.

(iii) Entropy

◆ About entropy

Entropy is the average (expected value) of the amount of information from (ii); it is an index that shows the degree of variation of events.

Entropy is low when the same event is observed almost every time. Conversely, if a different event occurs each time you observe, entropy is large.

For example, the weather in Okinawa mentioned earlier is almost always sunny, so its entropy is small. The weather in Kanto, on the other hand, is not always sunny but also cloudy or rainy, so its entropy is large.

Entropy is defined as follows.

H = \sum_{i} P(x_i) I(x_i) = -\sum_{i} P(x_i) \log_{2} P(x_i)

Using the earlier example, suppose the probability that Okinawa is sunny is 0.8, cloudy 0.05, and rainy 0.15, while for Kanto the probabilities are sunny 0.6, cloudy 0.2, and rainy 0.2. Then each entropy comes out as follows.

Entropy of the Okinawa weather: ≈ 0.884

Entropy of the Kanto weather: ≈ 1.371

Since the entropy of the Kanto weather is larger, you can see that its events vary more.
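Again, these values are easy to reproduce in Python:

import math

def entropy(probs):
    """H = -sum p * log2(p) over the outcome probabilities."""
    return -sum(p * math.log2(p) for p in probs)

print(entropy([0.8, 0.05, 0.15]))  # Okinawa: ~0.884
print(entropy([0.6, 0.2, 0.2]))    # Kanto:  ~1.371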

◆ About the relationship between entropy and purity

I said that a decision tree tries to reduce impurity as much as possible. As for the relationship between entropy and impurity: **the lowest impurity corresponds to an entropy of 0, and the higher the impurity, the higher the entropy**.

Impurity expressed the degree of variation in the data, and entropy likewise expresses a degree of variation (of events and their amounts of information), so the correspondence should be intuitive.

◆ About decision trees and entropy

So how, concretely, is entropy calculated and used to build the model in a decision tree?

The tree **calculates the difference between the entropy before a split and the entropy after it, searches for the splitting condition that maximizes that difference, and builds the model that way**.

Writing it out in words is lengthy, but the point is this: to build a good decision tree, we want to reduce impurity, i.e. the variation in the data, as much as possible; in other words, to reduce entropy.

For that, we need to search for splitting conditions under which the entropy after a split is as much smaller than before the split as possible, that is, conditions for which the difference between the entropy before and after the split is as large as possible.
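As a concrete illustration, here is that difference (often called information gain) for the root split size < 27.5 that scikit-learn found on our 13 properties, computed with entropy (scikit-learn actually used gini by default here, but the idea is identical):

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Before the split: 7 rented, 6 not rented
h_before = entropy([7/13, 6/13])                       # ~0.996

# After the split on size < 27.5:
#   size >= 27.5 -> 6 samples, all rented           -> entropy 0
#   size <  27.5 -> 7 samples, 1 rented / 6 not     -> entropy ~0.592
h_after = (6/13) * entropy([1.0]) + (7/13) * entropy([1/7, 6/7])

print(h_before - h_after)  # information gain: ~0.68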

(iv) Gini coefficient

◆ About Gini coefficient

Besides entropy, the Gini coefficient is another index for measuring impurity. The Gini coefficient is the average (expected value) of the probability of misclassification.

In a little more detail: if the probability that class $x_i$ is chosen at a node $t$ is $P(x_i \mid t)$, then the Gini coefficient is defined as the expected value of the probability of misclassifying at that node. In other words, it is **an index of how likely you are to misclassify**.

The Gini coefficient is defined as follows.

G(t) = \sum_{i=1}^{K}P(x_i \mid t)\left(1-P(x_i \mid t)\right) = 1-\sum_{i=1}^{K}P(x_i \mid t)^2

◆ About the relationship between Gini coefficient and impureness

I said that a decision tree tries to reduce impurity as much as possible. The relationship between the Gini coefficient and impurity is analogous to entropy: **the lowest impurity corresponds to a Gini coefficient of 0, and the higher the impurity, the larger the Gini coefficient (for $K$ classes it peaks at $1 - 1/K$ when the classes are evenly mixed)**.

◆ About decision tree and Gini coefficient

The details are omitted here because the idea is the same as in "◆ About decision trees and entropy" above: the tree constructs its conditional branches so that the difference between the Gini coefficient before a split and after it is as large as possible.
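For completeness, a sketch of the same root split evaluated with the Gini coefficient (which is what scikit-learn's default criterion="gini" actually optimizes):

def gini(probs):
    """G = 1 - sum p^2 over the class probabilities at a node."""
    return 1 - sum(p**2 for p in probs)

g_before = gini([7/13, 6/13])   # ~0.497, the value in the root box of the figure
g_after = (6/13) * gini([1.0]) + (7/13) * gini([1/7, 6/7])
print(g_before - g_after)       # decrease in impurity: ~0.365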

(v) Summary

As described above, a decision tree is a model that builds a tree-structured judgment process, and its criterion is to minimize the "impurity" that represents the variation in the data.

The indices for measuring impurity are "entropy, which expresses the degree of variation of events" and "the Gini coefficient, which expresses how likely misclassification is"; with either index, the decision tree chooses the splitting conditions that maximize the difference between before and after the split. That is what happens behind the scenes.

We also looked at the concrete formulas for entropy and the Gini coefficient. In scikit-learn, you choose which of the two indices to use through the criterion parameter of DecisionTreeClassifier (the default is "gini"; "entropy" can also be specified):

clf = DecisionTreeClassifier(criterion="gini")  # or criterion="entropy"
clf = clf.fit(X, y)

5. Conclusion

How was it? My own view is that, since no one can interpret extremely complicated code from the start, it is very important to first implement a basic series of steps with scikit-learn and the like, without worrying too much about accuracy.

Once you are used to that, however, I feel it is very important to understand from the mathematical background how these models work behind the scenes.

Some of the content may be difficult to follow, but I hope it helps deepen your understanding.
