Make a decision tree from 0 with Python and understand it (4. Data structure)

** Create and understand decision trees from scratch in Python ** 1. Overview-2. Python Program Basics-3. Data Analysis Library Pandas --4 Data Structure

For learning AI (machine learning) and data mining, we will understand by creating a decision tree from scratch in Python.

4.1 Data structure

A data structure is a representation of how individual data are arranged.


The individual data are lined up in a row. To identify one piece of data, you need one identifier, for example, what number of data.

#Implementation example of array in python
a = [2,6,4,5,1,8]

Table, two-dimensional array

There are multiple columns of data, with individual data lined up on a plane. To identify one piece of data, you need two identifiers, like the number in the column.
#Implementation example of table in python
a = [

Tree, tree structure

It is a data structure that connects individual data with a line. However, the line does not circulate, for example, when considering the route from one data to another, the one with one route is the data of the tree structure. It is often illustrated as a tree that extends from top to bottom, as shown below. Lines are called edges and branches, data is called nodes, data with lines below them is called trunks and nodes, data without lines is called leaves and leaves, and top data is called root nodes and roots. I will call you. The line may be one-way with an arrow.

#Implementation example of tree in python Holds a list of child nodes.
# [value,Array of child node list]For example, the upper tree is implemented from top to bottom and from left to right as follows.
#Other than this implementation method, there are methods such as using a class and holding a parent node.
tree = \

#Tree structure display function
def tstr(node,indent=""):
    for c in node[1]: #Loop on child node
# 2
# +-6
# +-4
# +-+-5
# +-+-+-6
# +-+-8
# +-+-1

#If you don't want to make the whole tree at once, but one by one
#Create all leaf nodes that have no child nodes. The variable name is the number from the row (column) and the left.
n10 = [6,[]]
n21 = [8,[]]
n22 = [1,[]]
n40 = [6,[]]
#Create all the child nodes in order from the generated node.
n20 = [5,[n40]]
n11 = [4,[n20,n21,n22]]
n00 = [2,[n10,n11]]

#Display tree structure, display the specified node as the root node.
# 4
# +-5
# +-+-6
# +-8
# +-1

Network, graph

It is a data structure that connects data with a circular line with a line. Similar to the tree structure, the line can also be one-way with an arrow. Individual data are called nodes, and lines are called edges. Since there is often no starting point for data like a tree structure, we don't often call routes, trunks, leaves, etc.
#Implementation example of network in python
import pandas as pd

#It is assumed that the node name and the value match.
#If the name and value do not match, you need data to subtract the value from the name.
nodes = [2,6,4,5,8,1]
#Define the connection status of nodes in the form of a matrix. Node 2(The first line)From node 6(2nd line)If there is an edge
#The 1-by-2 value of the matrix is 1, and 0 if there are no edges. This matrix is called an adjacency matrix.
df = pd.DataFrame(
    # 2,6,4,5,8,Is there a connection to one node?
    [ 0,1,1,0,0,0], #From 2 nodes
    [ 1,0,0,1,0,0], #From 6 nodes
    [ 1,0,0,1,1,1], #From 4 nodes
    [ 0,1,1,0,0,0], #From 5 nodes
    [ 0,0,1,0,0,0], #From 8 nodes
    [ 0,0,1,0,0,0], #From one node
#    2  6  4  5  8  1
# 2  0  1  1  0  0  0
# 6  1  0  0  1  0  0
# 4  1  0  0  1  1  1
# 5  0  1  1  0  0  0
# 8  0  0  1  0  0  0
# 1  0  0  1  0  0  0

#The network is drawn by matplotlib and a library called networkx.
import matplotlib.pyplot as plt
import networkx as nx

Network output example

4.2 Example of implementation of decision tree by python

Decision trees, as the name implies, can be represented by a tree structure. The data held by the node includes the rules for branching and the data list that reaches this node in the decision tree, in addition to the list of child nodes that have a tree structure.

Place an empty node on the root node and associate all the data as shown below. The [...] number attached to the node represents the data number of the original data from which this decision tree is created. Then, from the root node, only the data that meets the conditions of the child node can be expressed as going down the tree. The nodes in the decision tree that go golf and don't go golf can be found by looking at the data associated with the node.


The implementation in python is as follows, for example. One node is an associative array, name is a character representation of the condition of that node, df is the data associated with that node, and edges is a list of child nodes.

#Tree structure data
tree = {
    # name:This node(stem)s name
    "name":"decision tree "+df0.columns[-1]+" "+str(cstr(df0.iloc[:,-1])),
    # df:Data associated with this node
    # edges:Edge coming out of this node(branch)In the list, if the leaf node has no edge below, it will be an empty array.

The function tstr that characterizes this tree structure looks like this:

#Lambda expression for value distribution, argument is pandas.Series, return value is an array containing the number of each value
#Input s to value_counts()Get the frequency of each value with, and loop items of dictionary type data()To call.
#sorted sorts in ascending order of frequency so that the output result does not change with each execution.
#And the element is the key(k)And value(v)Generate an array that is a character string of.
cstr = lambda s:[k+":"+str(v) for k,v in sorted(s.value_counts().items())]

#Method to convert tree to characters, argument is tree:Tree data structure, indent:Indentation on child nodes,
#The return value is a character representation of tree. This method is called recursively to convert everything in the tree to characters.
def tstr(tree,indent=""):
    #Create a character representation for this node. If this node is a leaf node(The number of elements in the edges array is 0)To
    #Characterize the frequency distribution in the last column of the data df associated with the tree.
    s = indent+tree["name"]+str(cstr(tree["df"].iloc[:,-1]) if len(tree["edges"])==0 else "")+"\n"
    #Loop on all edges from this node.
    for e in tree["edges"]:
        #Add the character representation of the child node to the character representation of this node.
        #For indent, add more characters to the indent of this node.
        s += tstr(e,indent+"  ")
    return s

The decision tree transcribed by this tstr function looks like this: The root node shows the character (decision tree golf) set when the tree variable was first created and the frequency distribution of going / not going to golf for all data. In each node below it, the rules used for branching and, in the case of leaf nodes, the frequency distribution of going / not going golf in the data associated with that node (eg ['○: 2']). Is displayed.

decision tree golf['×:5', '○:9']

Recommended Posts

Make a decision tree from 0 with Python and understand it (4. Data structure)
Create a decision tree from 0 with Python and understand it (5. Information Entropy)
Create a decision tree from 0 with Python and understand it (3. Data analysis library Pandas edition)
Create a decision tree from 0 with Python (1. Overview)
Get data from MySQL on a VPS with Python 3 and SQLAlchemy
Extract data from a web page with Python
WEB scraping with python and try to make a word cloud from reviews
Let's make a simple game with Python 3 and iPhone
Get mail from Gmail and label it with Python3
Make a fortune with Python
Quickly create a Python data analysis dashboard with Streamlit and deploy it to AWS
Hash with python and escape from a certain minister's egosa
Creating a decision tree with scikit-learn
Let's make a GUI with python.
python / Make a dict from a list.
I made a server with Python socket and ssl and tried to access it from a browser
Make a recommender system with python
Let's make a graph with python! !!
Precautions when inputting from CSV with Python and outputting to json to make it an exe
How to make a surveillance camera (Security Camera) with Opencv and Python
Make a thermometer with Raspberry Pi and make it viewable with a browser Part 4
I tried to make a periodical process with Selenium and Python
I made a segment tree with python, so I will introduce it
Data pipeline construction with Python and Luigi
Receive textual data from mysql with python
[Note] Get data from PostgreSQL with Python
Add a Python data source with Redash
Fractal to make and play with Python
A memo with Python2.7 and Python3 on CentOS
Python data structure and internal implementation ~ List ~
Let's make a voice slowly with Python
Python data structure and operation (Python learning memo ③)
Understand the Decision Tree and classify documents
Let's make a web framework with Python! (1)
Make a desktop app with Python with Electron
Let's make a Twitter Bot with Python!
Let's make a web framework with Python! (2)
Algorithm learned with Python 11th: Tree structure
Clogged when getting data from DB and making it a return value
Deploy a Python app on Google App Engine and integrate it with GitHub
[Personal memo] Get data on the Web and make it a DataFrame
[Concept] A strategy to analyze data with python and aim for a decline after shareholder benefits to make a profit
I tried to make a periodical process with CentOS7, Selenium, Python and Chrome
Make a thermometer with Raspberry Pi and make it visible on the browser Part 3
[Python] I introduced Word2Vec and played with it.
Make a Python program a daemon and run it automatically when the OS starts
Make a Twitter trend bot with heroku + Python
[Python] Make a game with Pyxel-Use an editor-
Building a python environment with virtualenv and direnv
Divide each PowerPoint slide into a JPG file and output it with python
Python --Read data from a numeric data file and find the multiple regression line.
How to make a dictionary with a hierarchical structure.
I want to make a game with Python
[Python] How to read data from CIFAR-10 and CIFAR-100
Try to make a "cryptanalysis" cipher with Python
A story stuck with handling Python binary data
[Python] Make a simple maze game with Pyxel
Folium: Visualize data on a map with Python
Launch a web server with Python and Flask
Implementation of TRIE tree with Python and LOUDS
Let's replace UWSC with Python (5) Let's make a Robot