[PYTHON] Machine learning beginners try to make a decision tree

Last time I posted Machine learning beginners try to reach out to Naive Bayes; this time I would like to write about decision trees. (As before, I implemented it myself in Python.)

Detailed explanations exist in many other places, so here I would like to break things down from both the conceptual and the programming side.

What is a decision tree?

This is a method that builds a tree-structured classifier from a dataset of objective and explanatory variables, and then classifies new input data with it.

That definition alone didn't click for me, so let me start with the terms.

Objective variable: the value you want to predict (for iris, the flower type).
Explanatory variables: the values used to make the prediction (petal length and so on).

In general, the explanatory variables form a vector. (That seems to be true of machine learning in general.)

Breaking the decision tree down a little more

The point is that, in the end, a decision tree is a big pile of `if`/`else` branches. That's all it is.

Let me write the flow out in a little more detail.

1. Pick one explanatory variable and a threshold, and split the samples into a "less than" side and a "greater than or equal" side.
2. Score the split with an evaluation function (described below).
3. Adopt the best-scoring split, then repeat the same procedure recursively on each side until a stopping condition (maximum depth, minimum number of samples) is hit.

That's all.
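Purely to illustrate the "pile of `if`/`else`" idea, a learned tree for iris boils down to something like the function below. (The feature indices and thresholds here are hand-picked for illustration, not learned.)

def classify_iris(features):
    # features: [sepal length, sepal width, petal length, petal width]
    if features[2] < 2.5:   # petal length
        return 0            # setosa
    elif features[3] < 1.7: # petal width
        return 1            # versicolor
    else:
        return 2            # virginica

print(classify_iris([5.1, 3.5, 1.4, 0.2])) # 0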

How to split / the evaluation function

I think this part is the heart of this article.

How to split

Let me explain how the data gets split. The data below is based on sklearn's iris dataset.

from sklearn import datasets
iris = datasets.load_iris() # load the iris data
iris.data   # explanatory variables
iris.target # objective variable

Consider the following dataset. The fifth element of each row is the objective variable (0, 1, 2: the flower type), and the rest are the explanatory variables (petal length and so on).

[
[5.1, 3.5, 1.4, 0.2, 0], 
[4.9, 3.0, 1.4, 0.2, 1], 
[4.7, 3.2, 1.3, 0.2, 0]
...
]

Now, splitting this data works as follows.

1. For each explanatory variable, collect its unique values and sort them in ascending order.
2. Take the midpoint between each pair of adjacent values as a candidate threshold.
3. Split the samples into a left set (value < threshold) and a right set (value >= threshold).
4. Score every candidate split with the evaluation function below and keep the best one.
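Here is a minimal sketch of the candidate-threshold part (steps 1 and 2; the values are made up):

values = [5.1, 4.9, 4.7, 5.1]  # one explanatory variable across all samples
uniq = sorted(set(values))     # [4.7, 4.9, 5.1]
thresholds = [(a + b) / 2.0 for a, b in zip(uniq, uniq[1:])]
print(thresholds)              # [4.8, 5.0]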

Evaluation function

There seem to be several evaluation functions in use for decision trees, but this time we will use "Gini impurity".

Impurity??

The word "impurity" just appeared, but the idea is simple. Splitting on an explanatory variable is all well and good, but is that split actually a good one? That question remains, and Gini impurity is what quantifies the answer.

By the way, other sites used the term "Gini coefficient", but be careful when searching for it: what mostly comes up is a different, economics concept (graphs of the Lorenz curve of population and income. * I don't know much about that area, so if you're familiar with it, please let me know.)

As a reference on Gini impurity, the following site was easy to understand. It is split into Part 1 / Part 2.

Statistical Honey Gini Coefficient ...? (Part 1)

Gini impurity is computed for the sets L and R obtained by splitting a set A, and the average of the two (weighted by the number of samples on each side) is compared with the impurity of A. If

G(A) > weighted mean of G(L) and G(R)

then the split made things purer, i.e. it was a good split.
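As a concrete check, here is a minimal sketch of that comparison (the gini helper and the toy labels are my own, written to match the description above):

from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    n = float(len(labels))
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

A = [0, 0, 1, 1]      # the mixed set: G(A) = 0.5
L, R = [0, 0], [1, 1] # each side is pure: G = 0.0

# Weighted average of the children, as in the source code below
mean_lr = (gini(L) * len(L) + gini(R) * len(R)) / len(A)
print(gini(A), mean_lr) # 0.5 0.0 -> impurity dropped, so this was a good split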

Source code

It's still rough, but I'll post the source code for now. If you have any comments, please let me know!

# -*- coding: utf-8 -*-

from collections import defaultdict

class Sample:

    def __init__(self, label=0, features=None):
        self.label    = label
        self.features = features if features is not None else []

# ----------------------------------------------------------------------------

class Node:

    def __init__(self, level=0):
        self.level       = level # depth in the tree
        self.l_node      = None  # left child node
        self.r_node      = None  # right child node
        self.t_value     = None  # threshold for the split
        self.feature_idx = None  # index of the feature to split on
        self.label       = None  # label assigned at a leaf
        self.samples     = []    # samples belonging to this node


    def __str__(self):
        if self.is_leaf():
            return "[%3s] Samples:%3s" % (self.level, len(self.samples))
        else:
            return "[%3s] Feature Idx:%3s, Threshold: %5s" % (self.level, self.feature_idx, self.t_value)


    def is_leaf(self):
        return (self.l_node is None) and (self.r_node is None)


    def child_nodes(self):
        return [n for n in (self.l_node, self.r_node) if n is not None]


    def classify(self, feature):
        node  = self
        label = None

        while node:
            if node.is_leaf():
                label = node.label
                break
            # Look up the split feature of the *current* node on every step
            elif feature[node.feature_idx] < node.t_value:
                node = node.l_node
            else:
                node = node.r_node

        return label


    def build_child_nodes(self, n_class, samples, max_depth, min_samples):
        self.samples = samples
        n_features   = len(samples[0].features) # number of explanatory variables

        # Stop once the maximum depth is reached
        if self.level >= max_depth:
            self.build_leaf()
            return

        # The best split found so far
        best_feature_idx = None # feature index of the best split
        best_t_value     = None # threshold of the best split
        best_gini        = 1    # weighted impurity; closer to 0 is purer
        best_l_samples   = []
        best_r_samples   = []

        # Try splits on every explanatory variable
        # (idx = index of the explanatory variable)
        for idx in range(n_features):

            # Collect this feature's unique values, sorted in ascending order
            features = sorted(set(s.features[idx] for s in samples))

            for i in range(len(features) - 1):

                # Split at the midpoint between each pair of adjacent values
                t_value = (features[i] + features[i+1]) / 2.0

                l_samples = []; l_sample_labels = defaultdict(int)
                r_samples = []; r_sample_labels = defaultdict(int)
                for s in samples:

                    if s.features[idx] < t_value:
                        l_samples.append(s)
                        l_sample_labels[s.label] += 1
                    else:
                        r_samples.append(s)
                        r_sample_labels[s.label] += 1

                # A split with everything on one side is useless; try the next threshold
                if len(l_samples) == 0 or len(r_samples) == 0:
                    continue

                # Evaluate the split (here: Gini impurity)
                l_gini = 0; r_gini = 0
                for label in range(n_class):
                    l_gini += (float(l_sample_labels[label]) / len(l_samples)) ** 2
                    r_gini += (float(r_sample_labels[label]) / len(r_samples)) ** 2

                # Overall impurity: the average of the left/right Gini impurity,
                # weighted by the number of samples on each side
                gini = (((1-l_gini) * len(l_samples)) + ((1-r_gini) * len(r_samples))) / float(len(samples))

                # Keep this split if it beats the best so far
                if gini < best_gini:
                    best_gini        = gini
                    best_t_value     = t_value
                    best_feature_idx = idx
                    best_l_samples   = l_samples
                    best_r_samples   = r_samples

        # If no valid split was found, make this node a leaf
        if len(best_l_samples) == 0 or len(best_r_samples) == 0:
            self.build_leaf()
            return

        # If the children stay below min_samples, stop splitting (a guard against overfitting)
        if max(len(best_l_samples), len(best_r_samples)) < min_samples:
            self.build_leaf()
            return

        # Configure this node as a split node; samples are kept only at leaves
        self.samples     = []
        self.t_value     = best_t_value
        self.feature_idx = best_feature_idx

        # Build child nodes from the best split and recurse into them
        next_level = self.level + 1

        self.r_node = Node(next_level)
        self.r_node.build_child_nodes(n_class, best_r_samples, max_depth, min_samples)

        self.l_node = Node(next_level)
        self.l_node.build_child_nodes(n_class, best_l_samples, max_depth, min_samples)


    def build_leaf(self):
        # Assign the most common label among the samples at this leaf
        labels = [s.label for s in self.samples]
        self.label = max(set(labels), key=labels.count)

# ----------------------------------------------------------------------------

class DecisionTree:

    def __init__(self, max_depth=10, min_samples=3):
        self.root_node   = Node(level=1) # root node
        self.max_depth   = max_depth     # maximum depth of the tree
        self.min_samples = min_samples   # minimum number of samples per node


    def fit(self, data, target):
        # Number of unique classes
        labels = list(set(target))

        # Build the training samples
        samples = []
        for features, label in zip(data, target):
            samples.append(Sample(features=features, label=label))

        # Train the tree
        self.root_node.build_child_nodes(n_class=len(labels), samples=samples, max_depth=self.max_depth, min_samples=self.min_samples)


    def features(self):
        # Print each node's split information, level by level
        print(self.root_node)

        current_nodes = self.root_node.child_nodes()
        child_nodes   = []
        while current_nodes:
            node = current_nodes.pop(0)
            print(node)
            child_nodes.extend(node.child_nodes())

            if len(current_nodes) == 0 and len(child_nodes) > 0:
                current_nodes = child_nodes
                child_nodes   = []


    def predict(self, data):
        return self.root_node.classify(data)
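For completeness, here is how I try the class out on the iris data (this usage snippet is my own addition, not part of the original code):

from sklearn import datasets

iris = datasets.load_iris()

tree = DecisionTree(max_depth=10, min_samples=3)
tree.fit(iris.data.tolist(), iris.target.tolist())

tree.features() # dump the learned splits, level by level
print(tree.predict([5.1, 3.5, 1.4, 0.2])) # a class-0 training sample; expect 0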

Reference

The following page was very helpful.
