Last time I posted "Machine learning beginners try to reach out to Naive Bayes"; this time I would like to write about decision trees. (As before, I implemented it myself in Python.)
Detailed treatments exist in many other places, so here I would like to break the topic down from the perspective of both explanation and implementation.
A decision tree is a method that classifies input data by building a tree-structured classifier from a dataset of objective variables (targets) and explanatory variables (features).
I didn't understand these terms at first, so I'll start by explaining them.
In general, I think the values above are usually vectors. (This seems to hold for machine learning in general.)
The point is to build up a lot of `if` / `else` branches. That's really all there is to it.
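To give a concrete picture, such a classifier could in principle be written by hand as nested branches, as in the sketch below (the function name and the thresholds 2.45 and 1.75 are my own illustrative choices; the whole point of a decision tree is that it learns such thresholds from the data automatically):

# A hypothetical hand-written "decision tree" for the iris data: just nested if/else.
# The thresholds 2.45 and 1.75 are illustrative guesses, not learned values.
def classify_iris_by_hand(petal_length, petal_width):
    if petal_length < 2.45:
        return 0  # setosa
    else:
        if petal_width < 1.75:
            return 1  # versicolor
        else:
            return 2  # virginica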
Let me describe the flow in a little more detail.
That's all.
I think this part is the heart of this post.
Let me explain how the data is split. The data below is based on sklearn's iris dataset.
from sklearn import datasets
iris = datasets.load_iris()  # load the iris data
iris.data    # explanatory variables (features)
iris.target  # objective variable (class labels)
Consider the following dataset. The fifth column is the objective variable (0, 1, 2: the flower species), and the remaining columns are the explanatory variables (petal length and so on).
[
[5.1, 3.5, 1.4, 0.2, 0],
[4.9, 3.0, 1.4, 0.2, 1],
[4.7, 3.2, 1.3, 0.2, 0]
...
]
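For reference, a combined array in this shape can be produced from iris.data and iris.target, for example with numpy (just one possible way; note that the label column becomes a float here):

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# Attach the target as a fifth column next to the four feature columns.
combined = np.column_stack([iris.data, iris.target])
print(combined[:3])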
Now, the procedure for splitting this data is as follows.
There seem to be several criteria used to evaluate decision tree splits, but this time we will use Gini impurity.
The word "impurity" may be new, but the idea is simple. It's fine to split on an explanatory variable, but is that split actually a good one? That question remains, and Gini impurity is what quantifies it.
By the way, some other sites use the term "Gini coefficient", but be careful when searching for it: something different comes up (graphs of the Lorenz curve of population versus income). I don't know much about that area, so if anyone is familiar with it, please let me know.
Gini impurity is also covered in the reference linked below, which I found easy to understand. It is split into a first part and a second part.
Statistical Honey Gini Coefficient ...? (Part 1)
The Gini impurity is calculated for the sets L and R obtained by splitting a set A, and their average, weighted by the number of samples in each subset, is taken.
G(A) > weighted mean of G(L) and G(R)
If this holds, the split was a good one.
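As a concrete reference before the full source, here is a minimal sketch of the Gini impurity calculation described above (the helper names are my own, chosen just for this example):

from collections import Counter

def gini(labels):
    # G = 1 - sum over classes of p_i^2; 0 means the set contains only one class.
    n = float(len(labels))
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(l_labels, r_labels):
    # Average of G(L) and G(R), weighted by the number of samples on each side.
    n = float(len(l_labels) + len(r_labels))
    return (gini(l_labels) * len(l_labels) + gini(r_labels) * len(r_labels)) / n

# A split that separates the two classes perfectly drives the impurity to 0.
print(gini([0, 0, 1, 1]))             # 0.5
print(weighted_gini([0, 0], [1, 1]))  # 0.0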
It's still rough, but here is the source code for now. Comments are welcome!
# -*- coding: utf-8 -*-
from collections import defaultdict

class Sample:
    def __init__(self, label=0, features=None):
        self.label = label                                         # objective variable (class label)
        self.features = features if features is not None else []  # explanatory variables (feature vector)

# ----------------------------------------------------------------------------
class Node:
    def __init__(self, level=0):
        self.level = level        # depth of this node in the tree
        self.l_node = None        # left child node
        self.r_node = None        # right child node
        self.t_value = None       # threshold used for the split
        self.feature_idx = None   # index of the feature used for the split
        self.label = None         # class label (set only on leaf nodes)
        self.samples = []         # samples belonging to this node

    def __str__(self):
        if self.is_leaf():
            return "[%3s] Samples:%3s" % (self.level, len(self.samples))
        else:
            return "[%3s] Feature Idx:%3s, Threshold: %5s" % (self.level, self.feature_idx, self.t_value)

    def is_leaf(self):
        return (self.l_node is None) and (self.r_node is None)

    def child_nodes(self):
        return [n for n in (self.l_node, self.r_node) if n is not None]

    def classify(self, feature):
        node = self
        label = None
        while node:
            if node.is_leaf():
                label = node.label
                break
            elif feature[node.feature_idx] < node.t_value:
                node = node.l_node
            else:
                node = node.r_node
        return label
    def build_child_nodes(self, n_class, samples, max_depth, min_samples):
        self.samples = samples
        n_features = len(samples[0].features)   # number of explanatory variables

        # Stop once the maximum depth is reached
        if self.level >= max_depth:
            self.build_leaf()
            return

        # Best split found so far
        best_feature_idx = None   # feature index of the best split
        best_t_value = None       # threshold of the best split
        best_gini = 1             # Gini impurity of the best split; closer to 0 is purer
        best_l_samples = []
        best_r_samples = []

        # Try a split on every explanatory variable (idx = feature index)
        for idx in range(n_features):
            # Collect the unique values of this feature, sorted in ascending order
            features = sorted(set(s.features[idx] for s in samples))
            for i in range(len(features) - 1):
                # Use the midpoint between adjacent values as the candidate threshold
                t_value = (features[i] + features[i+1]) / 2.0
                l_samples = []; l_sample_labels = defaultdict(int)
                r_samples = []; r_sample_labels = defaultdict(int)
                for s in samples:
                    if s.features[idx] < t_value:
                        l_samples.append(s)
                        l_sample_labels[s.label] += 1
                    else:
                        r_samples.append(s)
                        r_sample_labels[s.label] += 1
                # A split with everything on one side is useless; try the next threshold
                if len(l_samples) == 0 or len(r_samples) == 0:
                    continue
                # Evaluate the split (here: Gini impurity; cross entropy is another option)
                l_gini = 0; r_gini = 0
                for label in range(n_class):
                    l_gini += (float(l_sample_labels[label]) / len(l_samples)) ** 2
                    r_gini += (float(r_sample_labels[label]) / len(r_samples)) ** 2
                # Overall impurity: average of both sides, weighted by their sizes
                gini = (((1 - l_gini) * len(l_samples)) + ((1 - r_gini) * len(r_samples))) / len(samples)
                # Keep this split if it is better than the best so far
                if gini < best_gini:
                    best_gini = gini
                    best_t_value = t_value
                    best_feature_idx = idx
                    best_l_samples = l_samples
                    best_r_samples = r_samples

        # If no valid split was found, turn this node into a leaf
        if len(best_l_samples) == 0 or len(best_r_samples) == 0:
            self.build_leaf()
            return

        # Do not split if even the larger child would hold fewer than min_samples samples
        # (a measure against overfitting)
        if max(len(best_l_samples), len(best_r_samples)) < min_samples:
            self.build_leaf()
            return

        # Configure the current node as an internal (split) node
        self.samples = []
        self.t_value = best_t_value
        self.feature_idx = best_feature_idx

        # Build child nodes from the best split and recurse
        next_level = self.level + 1
        self.r_node = Node(next_level)
        self.r_node.build_child_nodes(n_class, best_r_samples, max_depth, min_samples)
        self.l_node = Node(next_level)
        self.l_node.build_child_nodes(n_class, best_l_samples, max_depth, min_samples)

    def build_leaf(self):
        # Label the leaf with the majority class among its samples
        counts = defaultdict(int)
        for s in self.samples:
            counts[s.label] += 1
        self.label = max(counts, key=counts.get)
# ----------------------------------------------------------------------------
class DecisionTree:
    def __init__(self, max_depth=10, min_samples=3):
        self.root_node = Node(level=1)   # root node
        self.max_depth = max_depth       # maximum depth of the tree
        self.min_samples = min_samples   # minimum number of samples per node

    def fit(self, data, target):
        # Number of distinct classes
        labels = list(set(target))
        # Build the training samples
        samples = []
        for features, label in zip(data, target):
            samples.append(Sample(features=features, label=label))
        # Grow the tree
        self.root_node.build_child_nodes(n_class=len(labels), samples=samples, max_depth=self.max_depth, min_samples=self.min_samples)

    def features(self):
        # Print the nodes of the tree level by level
        print(self.root_node)
        current_nodes = self.root_node.child_nodes()
        child_nodes = []
        while current_nodes:
            node = current_nodes.pop(0)
            print(node)
            child_nodes.extend(node.child_nodes())
            if len(current_nodes) == 0 and len(child_nodes) > 0:
                current_nodes = child_nodes
                child_nodes = []

    def predict(self, data):
        return self.root_node.classify(data)
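Finally, as a quick sanity check, the classes above can be used against the iris data like this (predicting on the training data itself, just to confirm it runs):

from sklearn import datasets

iris = datasets.load_iris()
tree = DecisionTree(max_depth=10, min_samples=3)
tree.fit(iris.data, iris.target)

# Print the learned splits level by level
tree.features()

# Classify one sample (the first iris sample, whose true class is 0)
print(tree.predict(iris.data[0]))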
The following page was very helpful.