Last time I posted "Machine learning beginners try to reach out to Naive Bayes"; this time I would like to write about decision trees. (As before, I implemented it myself in Python.)
Detailed treatments exist in many other places, so here I would like to break the topic down from the perspective of both explanation and implementation.
A decision tree is a method that classifies input data by building a tree-structured classifier from a dataset of objective variables (targets) and explanatory variables (features).
I didn't understand these terms at first, so I'll start by explaining them.
In general, I think the values above are usually vectors. (This seems to hold for machine learning in general.)
The point is to build up a lot of `if` / `else` branches. That's really all there is to it.
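To give a concrete picture, such a classifier could in principle be written by hand as nested branches, as in the sketch below (the function name and the thresholds 2.45 and 1.75 are my own illustrative choices; the whole point of a decision tree is that it learns such thresholds from the data automatically):

# A hypothetical hand-written "decision tree" for the iris data: just nested if/else.
# The thresholds 2.45 and 1.75 are illustrative guesses, not learned values.
def classify_iris_by_hand(petal_length, petal_width):
    if petal_length < 2.45:
        return 0  # setosa
    else:
        if petal_width < 1.75:
            return 1  # versicolor
        else:
            return 2  # virginica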
Let me describe the flow in a little more detail.
That's all.
I think this part is the heart of this post.
Let me explain how the data is split. The data below is based on sklearn's iris dataset.
from sklearn import datasets
iris = datasets.load_iris()  # load the iris data
iris.data    # explanatory variables (features)
iris.target  # objective variable (class labels)
Consider the following dataset. The fifth column is the objective variable (0, 1, 2: the flower species), and the remaining columns are the explanatory variables (petal length and so on).
[
[5.1, 3.5, 1.4, 0.2, 0],
[4.9, 3.0, 1.4, 0.2, 1],
[4.7, 3.2, 1.3, 0.2, 0]
...
]
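For reference, a combined array in this shape can be produced from iris.data and iris.target, for example with numpy (just one possible way; note that the label column becomes a float here):

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# Attach the target as a fifth column next to the four feature columns.
combined = np.column_stack([iris.data, iris.target])
print(combined[:3])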
Now, the procedure for splitting this data is as follows.
There seem to be several criteria used to evaluate decision tree splits, but this time we will use Gini impurity.
The word "impurity" may be new, but the idea is simple. It's fine to split on an explanatory variable, but is that split actually a good one? That question remains, and Gini impurity is what quantifies it.
By the way, some other sites use the term "Gini coefficient", but be careful when searching for it: something different comes up (graphs of the Lorenz curve of population versus income). I don't know much about that area, so if anyone is familiar with it, please let me know.
Gini impurity is also covered in the reference linked below, which I found easy to understand. It is split into a first part and a second part.
Statistical Honey Gini Coefficient ...? (Part 1)
The Gini impurity is calculated for the sets L and R obtained by splitting a set A, and their average, weighted by the number of samples in each subset, is taken.
G(A) > weighted mean of G(L) and G(R)
If this holds, the split was a good one.
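As a concrete reference before the full source, here is a minimal sketch of the Gini impurity calculation described above (the helper names are my own, chosen just for this example):

from collections import Counter

def gini(labels):
    # G = 1 - sum over classes of p_i^2; 0 means the set contains only one class.
    n = float(len(labels))
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(l_labels, r_labels):
    # Average of G(L) and G(R), weighted by the number of samples on each side.
    n = float(len(l_labels) + len(r_labels))
    return (gini(l_labels) * len(l_labels) + gini(r_labels) * len(r_labels)) / n

# A split that separates the two classes perfectly drives the impurity to 0.
print(gini([0, 0, 1, 1]))             # 0.5
print(weighted_gini([0, 0], [1, 1]))  # 0.0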
It's still rough, but here is the source code for now. Comments are welcome!
# -*- coding: utf-8 -*-
from collections import defaultdict

class Sample:
    def __init__(self, label=0, features=None):
        self.label = label                                         # objective variable (class label)
        self.features = features if features is not None else []  # explanatory variables (feature vector)

# ----------------------------------------------------------------------------
class Node:
    def __init__(self, level=0):
        self.level = level        # depth of this node in the tree
        self.l_node = None        # left child node
        self.r_node = None        # right child node
        self.t_value = None       # threshold used for the split
        self.feature_idx = None   # index of the feature used for the split
        self.label = None         # class label (set only on leaf nodes)
        self.samples = []         # samples belonging to this node

    def __str__(self):
        if self.is_leaf():
            return "[%3s] Samples:%3s" % (self.level, len(self.samples))
        else:
            return "[%3s] Feature Idx:%3s, Threshold: %5s" % (self.level, self.feature_idx, self.t_value)

    def is_leaf(self):
        return (self.l_node is None) and (self.r_node is None)

    def child_nodes(self):
        return [n for n in (self.l_node, self.r_node) if n is not None]

    def classify(self, feature):
        node = self
        label = None
        while node:
            if node.is_leaf():
                label = node.label
                break
            elif feature[node.feature_idx] < node.t_value:
                node = node.l_node
            else:
                node = node.r_node
        return label
    def build_child_nodes(self, n_class, samples, max_depth, min_samples):
        self.samples = samples
        n_features = len(samples[0].features)   # number of explanatory variables

        # Stop once the maximum depth is reached
        if self.level >= max_depth:
            self.build_leaf()
            return

        # Best split found so far
        best_feature_idx = None   # feature index of the best split
        best_t_value = None       # threshold of the best split
        best_gini = 1             # Gini impurity of the best split; closer to 0 is purer
        best_l_samples = []
        best_r_samples = []

        # Try a split on every explanatory variable (idx = feature index)
        for idx in range(n_features):
            # Collect the unique values of this feature, sorted in ascending order
            features = sorted(set(s.features[idx] for s in samples))
            for i in range(len(features) - 1):
                # Use the midpoint between adjacent values as the candidate threshold
                t_value = (features[i] + features[i+1]) / 2.0
                l_samples = []; l_sample_labels = defaultdict(int)
                r_samples = []; r_sample_labels = defaultdict(int)
                for s in samples:
                    if s.features[idx] < t_value:
                        l_samples.append(s)
                        l_sample_labels[s.label] += 1
                    else:
                        r_samples.append(s)
                        r_sample_labels[s.label] += 1
                # A split with everything on one side is useless; try the next threshold
                if len(l_samples) == 0 or len(r_samples) == 0:
                    continue
                # Evaluate the split (here: Gini impurity; cross entropy is another option)
                l_gini = 0; r_gini = 0
                for label in range(n_class):
                    l_gini += (float(l_sample_labels[label]) / len(l_samples)) ** 2
                    r_gini += (float(r_sample_labels[label]) / len(r_samples)) ** 2
                # Overall impurity: average of both sides, weighted by their sizes
                gini = (((1 - l_gini) * len(l_samples)) + ((1 - r_gini) * len(r_samples))) / len(samples)
                # Keep this split if it is better than the best so far
                if gini < best_gini:
                    best_gini = gini
                    best_t_value = t_value
                    best_feature_idx = idx
                    best_l_samples = l_samples
                    best_r_samples = r_samples

        # If no valid split was found, turn this node into a leaf
        if len(best_l_samples) == 0 or len(best_r_samples) == 0:
            self.build_leaf()
            return

        # Do not split if even the larger child would hold fewer than min_samples samples
        # (a measure against overfitting)
        if max(len(best_l_samples), len(best_r_samples)) < min_samples:
            self.build_leaf()
            return

        # Configure the current node as an internal (split) node
        self.samples = []
        self.t_value = best_t_value
        self.feature_idx = best_feature_idx

        # Build child nodes from the best split and recurse
        next_level = self.level + 1
        self.r_node = Node(next_level)
        self.r_node.build_child_nodes(n_class, best_r_samples, max_depth, min_samples)
        self.l_node = Node(next_level)
        self.l_node.build_child_nodes(n_class, best_l_samples, max_depth, min_samples)

    def build_leaf(self):
        # Label the leaf with the majority class among its samples
        counts = defaultdict(int)
        for s in self.samples:
            counts[s.label] += 1
        self.label = max(counts, key=counts.get)
# ----------------------------------------------------------------------------
class DecisionTree:
    def __init__(self, max_depth=10, min_samples=3):
        self.root_node = Node(level=1)   # root node
        self.max_depth = max_depth       # maximum depth of the tree
        self.min_samples = min_samples   # minimum number of samples per node

    def fit(self, data, target):
        # Number of distinct classes
        labels = list(set(target))
        # Build the training samples
        samples = []
        for features, label in zip(data, target):
            samples.append(Sample(features=features, label=label))
        # Grow the tree
        self.root_node.build_child_nodes(n_class=len(labels), samples=samples, max_depth=self.max_depth, min_samples=self.min_samples)

    def features(self):
        # Print the nodes of the tree level by level
        print(self.root_node)
        current_nodes = self.root_node.child_nodes()
        child_nodes = []
        while current_nodes:
            node = current_nodes.pop(0)
            print(node)
            child_nodes.extend(node.child_nodes())
            if len(current_nodes) == 0 and len(child_nodes) > 0:
                current_nodes = child_nodes
                child_nodes = []

    def predict(self, data):
        return self.root_node.classify(data)
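Finally, as a quick sanity check, the classes above can be used against the iris data like this (predicting on the training data itself, just to confirm it runs):

from sklearn import datasets

iris = datasets.load_iris()
tree = DecisionTree(max_depth=10, min_samples=3)
tree.fit(iris.data, iris.target)

# Print the learned splits level by level
tree.features()

# Classify one sample (the first iris sample, whose true class is 0)
print(tree.predict(iris.data[0]))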
The following page was very helpful.