A decision tree is an algorithm used for **supervised learning**. By branching the given data like a tree, it makes predictions and summarizes the data. It is a learning model that can be used for both regression and classification.
The figure below shows a decision tree that classifies data into "dogs," "people," "birds," and "mosquitoes." Each branch uses a feature to split the data, and the leaf reached at the end determines the classification result.
Here, when extending each branch, the following points become important:

1. Which criterion to use when splitting the data at each node
2. How deep to grow the tree
First of all, **information gain** is used as the criterion for point 1.
To put it simply, the information gain is a value that indicates **how well the child nodes classify the data compared to the parent node**. Alternatively (in the regression case), it represents **how much the standard deviation is reduced at each node**. A quantity called **impurity** is used to calculate the information gain. There are several kinds of impurity, but this time we introduce the two most common ones: **Gini** and **entropy**.

**Gini**

Gini impurity is expressed by the following formula.
G = 1 - \sum_{i = 1}^{classes}P(i|t)^2
You can see that, at each node, the higher the probability that the data belong to a single class, the closer the Gini is to 0; if a node contains only one class, the Gini is exactly 0. Conversely, when every sample belongs to a different class, the Gini approaches 1. From the Gini, each node then calculates the **information gain (IG)**.
IG = G(parent) - \sum_{children}\frac{N_j}{N}G(child_j)
Here, the information gain is the difference between the Gini of the parent node and the weighted average of the Gini of the child nodes, each child being weighted by the fraction N_j / N of the samples it receives.
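As a concrete illustration, here is a minimal sketch (not from the original article) that computes the Gini impurity of a parent node and the information gain of one hypothetical split; the label arrays and the helper `gini` are made up for the example.

```python
import numpy as np

def gini(labels):
    """G = 1 - sum_i P(i|t)^2 for the class labels at one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical split of 10 samples (5 per class) into two children.
parent = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
left   = np.array([0, 0, 0, 0, 1])
right  = np.array([0, 1, 1, 1, 1])

n = len(parent)
ig = gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
print(ig)  # 0.5 - (0.5 * 0.32 + 0.5 * 0.32) ≈ 0.18
```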
**Entropy**

Entropy is expressed by the following formula.
E = - \sum_{i = 1}^{classes}P(i|t)\log P(i|t)
You can see that the closer P(i|t) is to 0.5 (that is, it is unclear whether a sample belongs to the class or not, so the node cannot classify it), the higher the entropy. Conversely, when P(i|t) is 0 or 1, the entropy is 0.
IG = E(parent) - \sum_{children}\frac{N_j}{N}E(child_j)
As before, the difference between the entropy of the parent node and the weighted average of the entropies of the child nodes is taken as the information gain.
At each node, the split that gives the largest information gain is selected.
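Below is a similar sketch (illustration only) that compares two hypothetical candidate splits by their entropy-based information gain and keeps the one with the larger gain; the arrays `split_a` and `split_b` are made up for the example, and the logarithm is taken in base 2.

```python
import numpy as np

def entropy(labels):
    """E = -sum_i P(i|t) * log(P(i|t)) for the class labels at one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(parent, children):
    """IG = E(parent) - weighted average of the children's entropies."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Candidate A separates the classes perfectly, candidate B does not.
split_a = [np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1])]
split_b = [np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1])]

gains = {"A": info_gain(parent, split_a), "B": info_gain(parent, split_b)}
print(gains)                      # {'A': 1.0, 'B': 0.0}
print(max(gains, key=gains.get))  # A -> the split with the larger gain is chosen
```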
Gini is said to be well suited to regression problems, while entropy is said to be well suited to classification problems.
The deeper the decision tree, the more closely the **model fits the training data**. In fact, if the tree is grown until each leaf node holds a single sample, all of the training data can be classified perfectly. However, such a model **overfits** the sample data and becomes meaningless. Therefore, when creating a learning model, the depth of the tree needs to be limited. In scikit-learn, the depth of the tree is set with the `max_depth` parameter.
```python
from sklearn.tree import DecisionTreeRegressor

# Regression tree ("entropy" is a classification criterion; use e.g. "squared_error").
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)
reg = reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
```
```python
from sklearn.tree import DecisionTreeClassifier

# Classification tree: split quality measured by entropy, depth limited to 3.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
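Once fitted, the tree itself can also be drawn, which makes the model easy to inspect. A small sketch, assuming matplotlib is installed (`plot_tree` is part of scikit-learn):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Each box shows the split condition, the impurity, and the class counts at that node.
plt.figure(figsize=(12, 6))
plot_tree(clf, filled=True)
plt.show()
```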
**Parameters**

| Parameter | Overview | Options | Default |
|---|---|---|---|
| criterion | Split criterion | "gini", "entropy" | "gini" |
| splitter | Split selection strategy | "best", "random" | "best" |
| max_depth | Maximum depth of the tree | int | None |
| min_samples_split | Minimum number of samples a node must contain to be split (smaller values tend to overfit) | int (number of samples) / float (fraction of all samples) | 2 |
| min_samples_leaf | Minimum number of samples required at a leaf (last node) (smaller values tend to overfit) | int / float | 1 |
| max_features | Number of features considered when looking for a split (the larger it is, the more likely the tree is to overfit) | int / float, "auto", "log2" | None |
| class_weight | Class weights | "balanced", None | None |
| presort | Whether to pre-sort the data (the effect on computation speed depends on the data size) | bool | False |
| min_impurity_decrease | Minimum decrease in impurity required for a split (controls how far nodes grow) | float | 0.0 |
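As an example, several of these parameters can be combined when constructing the classifier. The values below are arbitrary, chosen only to show where each parameter goes, not a recommended setting:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",            # split criterion
    max_depth=5,                 # limit the depth of the tree
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    class_weight="balanced",     # reweight classes inversely to their frequency
    min_impurity_decrease=0.01,  # a split must reduce the impurity by at least this amount
)
```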
**Merits**

- Easy to visualize and interpret.
- Can be used even when the data does not follow a linear pattern.
- No normalization is needed as preprocessing.
**Demerits**

- Sensitive to outliers.
- Even a small change in the data can change the result significantly.
- The computation is complex and the time complexity is large.