[PYTHON] [Machine learning] Try studying decision trees

Decision Tree

A decision tree is a **supervised learning** algorithm. It makes predictions and summarizes data by branching the given data like a tree, and it can be used as a learning model for both regression and classification.

The figure below shows a decision tree that classifies data into "dogs," "people," "birds," and "mosquitoes." Each branch splits the data on a feature, and the final leaf determines the classification result.

(Figure: a decision tree that classifies data into dogs, people, birds, and mosquitoes)

When growing each branch, two points are important:

  1. **Which features to use, and to what extent**
  2. **How deep to grow the tree**

The criterion used to judge point 1 is **information gain**.

Information Gain

To put it simply, information gain is a value that indicates **how much better the child nodes classify the data compared to the parent node**. For regression, it can equivalently be seen as **how much the variance (standard deviation) of the target is reduced at each node**. A quantity called **impurity** is used to compute the information gain. There are several impurity measures, but here we introduce the two most common ones, **Gini** and **Entropy**.

Gini

The Gini impurity is expressed by the following formula.

G = 1 - \sum_{i = 1}^{classes}P(i|t)^2

At each node, the more the data is concentrated in a single class, the closer the Gini impurity is to 0; if only one class is present, the Gini is exactly 0. Conversely, when the samples are spread evenly across many classes, the Gini approaches its maximum (close to 1 when there are many classes). From the Gini impurity, each node then computes the **information gain (IG)**.

IG = G(parent) - \sum_{children}\frac{N_j}{N}G(child_j)

Here, the information gain is the difference between the parent node's Gini impurity and the weighted average of the child nodes' Gini impurities, where each child is weighted by the fraction of the data (N_j / N) that falls into it.
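As a concrete check of these formulas, here is a minimal NumPy sketch (the class labels and the split below are made-up toy values, not taken from the figure above):

import numpy as np

def gini(labels):
    # Gini impurity of a node: 1 - sum_i P(i|t)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children, impurity=gini):
    # IG = impurity(parent) - sum_j (N_j / N) * impurity(child_j)
    n = len(parent)
    weighted = sum(len(c) / n * impurity(c) for c in children)
    return impurity(parent) - weighted

# Toy parent node with two classes, split into two child nodes
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

print(gini(parent))                             # 0.5 (maximally mixed for 2 classes)
print(information_gain(parent, [left, right]))  # 0.125

A pure node (all labels identical) gives a Gini of 0, and a split that produces purer children than the parent yields a positive information gain.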

Entropy

Entropy is expressed by the following formula.

E = - \sum_{i = 1}^{classes}P(i|t)\log(P(i|t))

The closer P(i|t) is to 0.5 (that is, we cannot tell whether a sample is class 1 or class 0, so it cannot be classified), the higher the entropy. Conversely, when P(i|t) is 0 or 1, the entropy is 0.

IG = E(parent) - \sum_{children}\frac{N_j}{N}E(child_j)

As before, the information gain is obtained as the difference between the parent node's entropy and the weighted average of the child nodes' entropies.
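The entropy-based version is almost identical; here is a small sketch using the same made-up split as above (a base-2 logarithm is used in the code, which only changes the scale of the value):

import numpy as np

def entropy(labels):
    # Entropy of a node: -sum_i P(i|t) * log(P(i|t))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

n = len(parent)
ig = entropy(parent) - sum(len(c) / n * entropy(c) for c in [left, right])
print(entropy(parent))  # 1.0 for a 50/50 node
print(ig)               # about 0.19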

At each node, the split with the largest information gain is selected.

Proper use of Gini and Entropy

In practice, Gini and entropy usually produce very similar trees for classification; Gini is slightly cheaper to compute because it avoids the logarithm. For regression trees, variance-based criteria such as mean squared error are used instead.

Tree depth

The deeper a decision tree grows, the more closely the model **fits the training data**. In fact, if the tree is grown until every leaf contains a single sample, the training data can be classified perfectly. However, such a model **overfits** the sample data and becomes meaningless on new data. Therefore, when building a model, it is necessary to limit the depth of the tree. In scikit-learn, the tree depth is set with a parameter.

Scikit-learn decision tree

Regression

from sklearn.tree import DecisionTreeRegressor

# For regression, the criterion is a variance-based measure (the default),
# not "gini" or "entropy".
reg = DecisionTreeRegressor(max_depth=3)
reg = reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

Classification

from sklearn.tree import DecisionTreeClassifier

# For classification, "gini" (the default) or "entropy" can be used as the criterion
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
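To see the depth/overfitting trade-off described above in practice, here is a small sketch (using a dataset bundled with scikit-learn; the depth values are arbitrary) that compares training and test accuracy as max_depth grows:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [2, 4, 8, None]:  # None = grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, clf.score(X_train, y_train), clf.score(X_test, y_test))

Training accuracy keeps rising with depth, while test accuracy typically stops improving (or drops) once the tree starts to overfit.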

Decision tree parameters

Parameter | Description | Options | Default
--- | --- | --- | ---
criterion | Measure of split quality | "gini", "entropy" | "gini"
splitter | Strategy for choosing the split | "best", "random" | "best"
max_depth | Maximum depth of the tree | int | None
min_samples_split | Minimum number of samples required to split a node (smaller values tend to overfit) | int (number of samples) / float (fraction of all samples) | 2
min_samples_leaf | Minimum number of samples required at a leaf (terminal node) (smaller values tend to overfit) | int / float | 1
max_features | Number of features considered when searching for a split (larger values make overfitting more likely) | int / float, "auto", "log2" | None
class_weight | Class weights | "balanced", None | None
presort | Whether to pre-sort the data (fitting speed changes with data size) | bool | False
min_impurity_decrease | A node is split only if the split decreases the impurity by at least this value (controls tree growth) | float | 0.0
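As an illustration, here is a sketch that combines several of these parameters in one call (the values are arbitrary and only meant to show the syntax):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",         # measure of split quality
    max_depth=5,              # limit the depth of the tree
    min_samples_split=10,     # a node needs at least 10 samples to be split
    min_samples_leaf=5,       # every leaf must keep at least 5 samples
    max_features="sqrt",      # number of features considered per split
    class_weight="balanced",  # weight classes inversely to their frequency
    random_state=0,
)
# clf.fit(X_train, y_train) then clf.predict(X_test), as in the examples above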

Advantages and disadvantages of decision trees

Pros

- Easy to visualize and interpret.
- Works even when the data does not follow a linear pattern.
- No normalization preprocessing is required.

Disadvantages

- Sensitive to outliers.
- Small changes in the data can change the result significantly.
- Training involves a lot of computation and has high time complexity.
