A decision tree is an algorithm used for **supervised learning**. By branching the given data like a tree, it makes predictions and summarizes the data. It is a learning model that can be used for both regression and classification.
The figure below shows a decision tree that classifies data into "dogs," "people," "birds," and "mosquitoes." Each branch uses a feature to split the data, and the leaf reached at the end determines the classification result.
Here, when extending each branch, the following points become important:

1. Which criterion to use when splitting the data at each node
2. How deep to grow the tree
First of all, **information gain** is used as the criterion for point 1.
To put it simply, the information gain is a value that indicates **how well the child nodes classify the data compared to the parent node**. Alternatively (in the regression case), it represents **how much the standard deviation is reduced at each node**. A quantity called **impurity** is used to calculate the information gain. There are several kinds of impurity, but this time we introduce the two most common ones: **Gini** and **entropy**.

**Gini**

Gini impurity is expressed by the following formula.
G = 1 - \sum_{i = 1}^{classes}P(i|t)^2
You can see that, at each node, the higher the probability that the data belong to a single class, the closer the Gini is to 0; if a node contains only one class, the Gini is exactly 0. Conversely, when every sample belongs to a different class, the Gini approaches 1. From the Gini, each node then calculates the **information gain (IG)**.
IG = G(parent) - \sum_{children}\frac{N_j}{N}G(child_j)
Here, the information gain is the difference between the Gini of the parent node and the weighted average of the Gini of the child nodes, each child being weighted by the fraction N_j / N of the samples it receives.
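As a concrete illustration, here is a minimal sketch (not from the original article) that computes the Gini impurity of a parent node and the information gain of one hypothetical split; the label arrays and the helper `gini` are made up for the example.

```python
import numpy as np

def gini(labels):
    """G = 1 - sum_i P(i|t)^2 for the class labels at one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical split of 10 samples (5 per class) into two children.
parent = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
left   = np.array([0, 0, 0, 0, 1])
right  = np.array([0, 1, 1, 1, 1])

n = len(parent)
ig = gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
print(ig)  # 0.5 - (0.5 * 0.32 + 0.5 * 0.32) ≈ 0.18
```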
**Entropy**

Entropy is expressed by the following formula.
E = - \sum_{i = 1}^{classes}P(i|t)\log P(i|t)
You can see that the closer P(i|t) is to 0.5 (that is, it is unclear whether a sample belongs to the class or not, so the node cannot classify it), the higher the entropy. Conversely, when P(i|t) is 0 or 1, the entropy is 0.
IG = E(parent) - \sum_{children}\frac{N_j}{N}E(child_j)
As before, the difference between the entropy of the parent node and the weighted average of the entropies of the child nodes is taken as the information gain.
At each node, the split that gives the largest information gain is selected.
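Below is a similar sketch (illustration only) that compares two hypothetical candidate splits by their entropy-based information gain and keeps the one with the larger gain; the arrays `split_a` and `split_b` are made up for the example, and the logarithm is taken in base 2.

```python
import numpy as np

def entropy(labels):
    """E = -sum_i P(i|t) * log(P(i|t)) for the class labels at one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(parent, children):
    """IG = E(parent) - weighted average of the children's entropies."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Candidate A separates the classes perfectly, candidate B does not.
split_a = [np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1])]
split_b = [np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1])]

gains = {"A": info_gain(parent, split_a), "B": info_gain(parent, split_b)}
print(gains)                      # {'A': 1.0, 'B': 0.0}
print(max(gains, key=gains.get))  # A -> the split with the larger gain is chosen
```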
Gini is said to be well suited to regression problems, while entropy is said to be well suited to classification problems.
The deeper the decision tree, the more closely the **model fits the training data**. In fact, if the tree is grown until each leaf node holds a single sample, all of the training data can be classified perfectly. However, such a model **overfits** the sample data and becomes meaningless. Therefore, when creating a learning model, the depth of the tree needs to be limited. In scikit-learn, the depth of the tree is set with the `max_depth` parameter.
```python
from sklearn.tree import DecisionTreeRegressor

# Regression tree ("entropy" is a classification criterion; use e.g. "squared_error").
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)
reg = reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
```
```python
from sklearn.tree import DecisionTreeClassifier

# Classification tree: split quality measured by entropy, depth limited to 3.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
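Once fitted, the tree itself can also be drawn, which makes the model easy to inspect. A small sketch, assuming matplotlib is installed (`plot_tree` is part of scikit-learn):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Each box shows the split condition, the impurity, and the class counts at that node.
plt.figure(figsize=(12, 6))
plot_tree(clf, filled=True)
plt.show()
```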
**Parameters**

| Parameter | Overview | Options | Default |
|---|---|---|---|
| criterion | Split criterion | "gini", "entropy" | "gini" |
| splitter | Split selection strategy | "best", "random" | "best" |
| max_depth | Maximum depth of the tree | int | None |
| min_samples_split | Minimum number of samples a node must contain to be split (smaller values tend to overfit) | int (number of samples) / float (fraction of all samples) | 2 |
| min_samples_leaf | Minimum number of samples required at a leaf (last node) (smaller values tend to overfit) | int / float | 1 |
| max_features | Number of features considered when looking for a split (the larger it is, the more likely the tree is to overfit) | int / float, "auto", "log2" | None |
| class_weight | Class weights | "balanced", None | None |
| presort | Whether to pre-sort the data (the effect on computation speed depends on the data size) | bool | False |
| min_impurity_decrease | Minimum decrease in impurity required for a split (controls how far nodes grow) | float | 0.0 |
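As an example, several of these parameters can be combined when constructing the classifier. The values below are arbitrary, chosen only to show where each parameter goes, not a recommended setting:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",            # split criterion
    max_depth=5,                 # limit the depth of the tree
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    class_weight="balanced",     # reweight classes inversely to their frequency
    min_impurity_decrease=0.01,  # a split must reduce the impurity by at least this amount
)
```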
**Merits**

- Easy to visualize and interpret.
- Can be used even when the data does not follow a linear pattern.
- No normalization is needed as preprocessing.
**Demerits**

- Sensitive to outliers.
- Even a small change in the data can change the result significantly.
- The computation is complex and the time complexity is large.