[Python] Decision Tree Personal Tutorial

It is meant to serve as a personal cheat sheet.

What is a Decision Tree?

Decision trees are machine learning models widely used for classification and regression. A decision tree has a **hierarchical structure** consisting of questions that can be answered yes/no, so you can see how much each explanatory variable affects the objective variable. The tree branches by repeatedly splitting the data, and a variable that is split on earlier can be regarded as having a larger influence.

(Figure: an example decision tree)

This figure can be read as a **classification model** that discriminates four classes of data using three features. With a machine learning algorithm, such a model can be learned from training data, and a tree like the one above can actually be drawn.
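Under the hood, a fitted decision tree is nothing more than a set of nested if/else rules. A minimal sketch (the feature names and thresholds here are invented for illustration, not taken from a real model):

[In]


#A fitted tree reduces to nested yes/no questions.
#Feature names and thresholds are invented for illustration.
def classify(petal_length, petal_width):
    if petal_length <= 2.45:  #first split: the most influential feature
        return 'class A'
    if petal_width <= 1.75:  #second split
        return 'class B'
    return 'class C'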

Characteristics of decision trees

Merits

  • Relatively easy to interpret, since a decision tree can **visualize** its results
  • **Not affected by differences in feature scale**, so preprocessing such as standardization is not required

Demerits

  • **Depends heavily on the training data**: no matter how you tune the parameters, you may not get the tree structure you want
  • **Easy to overfit**, so it tends to have low generalization performance (see the sketch after this list)
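As a sketch of the overfitting point above: limiting the tree depth is one common countermeasure. The values below are illustrative, not tuned:

[In]


#Illustrative sketch: an unconstrained tree usually fits the training data
#almost perfectly but tends to generalize worse than a depth-limited one.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

deep = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

print('deep:    train {:.3f} / test {:.3f}'.format(deep.score(X_tr, y_tr), deep.score(X_te, y_te)))
print('shallow: train {:.3f} / test {:.3f}'.format(shallow.score(X_tr, y_tr), shallow.score(X_te, y_te)))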

Evaluation method

Confusion matrix

The confusion matrix is the basis for evaluating a classification model; it represents the relationship between the model's predicted values and the observed values. Specifically, as shown in the figure below, it has four categories: **true positive** (TP), **true negative** (TN), **false positive** (FP), and **false negative** (FN).

(Figure: confusion matrix layout)

Accuracy

The proportion of all predictions that are correct, calculated as follows: $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

Precision

The proportion of the data predicted to be positive that is actually positive, calculated as follows: $Precision = \frac{TP}{TP + FP}$

Recall

The proportion of the actually positive data that the model predicted to be positive, calculated as follows: $Recall = \frac{TP}{TP + FN}$
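A minimal worked example with made-up counts, showing how all three metrics come out of one confusion matrix:

[In]


#Made-up counts for illustration: TP=8, FP=2, FN=1, TN=9
TP, FP, FN, TN = 8, 2, 1, 9

print((TP + TN) / (TP + TN + FP + FN))  #accuracy: 17/20 = 0.85
print(TP / (TP + FP))  #precision: 8/10 = 0.80
print(TP / (TP + FN))  #recall: 8/9 = 0.889 (rounded)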

Implemented in Python

Dataset

The breast cancer diagnostic dataset bundled with scikit-learn. The objective variable is benign (1) or malignant (0).

Library to use

[In]


#Library used for data processing
import pandas as pd
import numpy as np

#Library used for data visualization
import matplotlib.pyplot as plt; plt.style.use('ggplot')
import matplotlib.gridspec as gridspec
import seaborn as sns
%matplotlib inline

#Machine learning library
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import metrics

Constants

[In]


#constant
RESPONSE_VARIABLE = 'cancer' #Objective variable
TEST_SIZE = 0.2
RANDOM_STATE = 42

Data reading

[In]


#Data reading(scikit-learn cancer data)
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
cancer = pd.DataFrame(data=data.data, columns=data.feature_names)
cancer[RESPONSE_VARIABLE] = data.target

#Show first 5 lines
cancer.head()
[Out]


mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... cancer
0 17.99 10.38 122.8 1001 0.1184 0.2776 0.3001 0.1471 0.2419 0.07871 ... 0
1 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 0
2 19.69 21.25 130 1203 0.1096 0.1599 0.1974 0.1279 0.2069 0.05999 ... 0
3 11.42 20.38 77.58 386.1 0.1425 0.2839 0.2414 0.1052 0.2597 0.09744 ... 0
4 20.29 14.34 135.1 1297 0.1003 0.1328 0.198 0.1043 0.1809 0.05883 ... 0

Basic summary statistics

[In]


#Check statistics
cancer.describe()
[Out]


mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... cancer
count 569 569 569 569 569 569 569 569 569 569 ... 569
mean 14.12729 19.28965 91.96903 654.8891 0.09636 0.104341 0.088799 0.048919 0.181162 0.062798 ... 0.627417
std 3.524049 4.301036 24.29898 351.9141 0.014064 0.052813 0.07972 0.038803 0.027414 0.00706 ... 0.483918
min 6.981 9.71 43.79 143.5 0.05263 0.01938 0 0 0.106 0.04996 ... 0
25% 11.7 16.17 75.17 420.3 0.08637 0.06492 0.02956 0.02031 0.1619 0.0577 ... 0
50% 13.37 18.84 86.24 551.1 0.09587 0.09263 0.06154 0.0335 0.1792 0.06154 ... 1
75% 15.78 21.8 104.1 782.7 0.1053 0.1304 0.1307 0.074 0.1957 0.06612 ... 1
max 28.11 39.28 188.5 2501 0.1634 0.3454 0.4268 0.2012 0.304 0.09744 ... 1

[In]


#Objective variable count
cancer[RESPONSE_VARIABLE].value_counts()

[Out]


1    357
0    212
Name: cancer, dtype: int64

[In]


#Confirmation of missing values
cancer.isnull().sum()

[Out]


mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
cancer                     0
dtype: int64

Data split

[In]


#Split into training data and test data
train, test = train_test_split(cancer, test_size=TEST_SIZE, random_state=RANDOM_STATE)

#Separate into explanatory variables and the objective variable
X_train = train.drop(RESPONSE_VARIABLE, axis=1)
y_train = train[RESPONSE_VARIABLE].copy()

X_test = test.drop(RESPONSE_VARIABLE, axis=1)
y_test = test[RESPONSE_VARIABLE].copy()
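Since the classes are imbalanced (357 benign vs. 212 malignant), it may also be worth passing stratify to train_test_split so that both splits keep the same class ratio. A hedged alternative to the split above:

[In]


#Optional: preserve the benign/malignant ratio in both splits
train, test = train_test_split(cancer, test_size=TEST_SIZE,
                               random_state=RANDOM_STATE,
                               stratify=cancer[RESPONSE_VARIABLE])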

Data visualization

[In]


#Visualize the distribution of objective variables for each feature
features = X_train.columns
legend = ['Malignant', 'Benign']  #cancer==0 (malignant) is plotted first
plt.figure(figsize=(20, len(features)*4))
gs = gridspec.GridSpec(len(features), 1)
for i, col in enumerate(train[features]):
    ax = plt.subplot(gs[i])
    sns.distplot(train[col][train.cancer == 0],bins=50, color='crimson')
    sns.distplot(train[col][train.cancer == 1],bins=50, color='royalblue')
    plt.legend(legend)

(Figure: per-feature distributions, malignant vs. benign)

Feature selection

Scikit-learn's RandomForestClassifier() exposes the "importance" of each feature as feature_importances_.

[In]


#Feature selection
RF = RandomForestClassifier(n_estimators = 250, random_state = 42)
RF.fit(X_train, y_train)

[Out]

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

[In]


#Features are output in descending order of importance
features = X_train.columns
importances = RF.feature_importances_

importances_features = sorted(zip(map(lambda x: round(x, 2), RF.feature_importances_), features), reverse=True)

for i in importances_features:
    print(i)

[Out]


(0.13, 'worst perimeter')
(0.13, 'worst concave points')
(0.13, 'worst area')
(0.11, 'mean concave points')
(0.07, 'worst radius')
(0.05, 'mean radius')
(0.05, 'mean concavity')
(0.04, 'worst concavity')
(0.04, 'mean perimeter')
(0.04, 'mean area')
(0.02, 'worst texture')
(0.02, 'worst compactness')
(0.02, 'radius error')
(0.02, 'mean compactness')
(0.02, 'area error')
(0.01, 'worst symmetry')
(0.01, 'worst smoothness')
(0.01, 'worst fractal dimension')
(0.01, 'perimeter error')
(0.01, 'mean texture')
(0.01, 'mean smoothness')
(0.01, 'fractal dimension error')
(0.01, 'concavity error')
(0.0, 'texture error')
(0.0, 'symmetry error')
(0.0, 'smoothness error')
(0.0, 'mean symmetry')
(0.0, 'mean fractal dimension')
(0.0, 'concave points error')
(0.0, 'compactness error')

Top 5 results of random forest feature selection

[In]


#Get the top 5 as a list
feature_list = [value for key, value in importances_features if key >= 0.06]
feature_list

[Out]


['worst perimeter',
 'worst concave points',
 'worst area',
 'mean concave points',
 'worst radius']

[In]


#Restrict the training and test data to the most important features
X_train = X_train[feature_list]
X_test = X_test[feature_list]

[In]


#Check the distribution of the objective variable again
legend = ['Malignant', 'Benign']  #cancer==0 (malignant) is plotted first
plt.figure(figsize=(20, len(feature_list)*4))
gs = gridspec.GridSpec(len(feature_list), 1)
for i, col in enumerate(train[feature_list]):
    ax = plt.subplot(gs[i])
    sns.distplot(train[col][train.cancer == 0],bins=50, color='crimson')
    sns.distplot(train[col][train.cancer == 1],bins=50, color='royalblue')
    plt.legend(legend)

(Figure: distributions of the top five features, malignant vs. benign)

Learning / prediction / evaluation

[In]


#Learning
clf = DecisionTreeClassifier(max_depth=4)
clf = clf.fit(X_train, y_train)
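As promised at the beginning, the fitted tree itself can be drawn. One way to do this, assuming scikit-learn 0.21 or later, is sklearn.tree.plot_tree:

[In]


#Draw the fitted tree (requires scikit-learn 0.21+)
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=feature_list,
          class_names=['malignant', 'benign'],  #target: 0=malignant, 1=benign
          filled=True)
plt.show()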

[In]


#Prediction using features of training data
y_pred = clf.predict(X_train)

[In]


def drawing_confusion_matrix(y: pd.Series, pre: np.ndarray) -> None:
    """
    A function that draws the confusion matrix.

    @param y: observed values (objective variable)
    @param pre: predicted values
    """
    confmat = confusion_matrix(y, pre)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
    for i in range(confmat.shape[0]):
        for j in range(confmat.shape[1]):
            ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
    plt.title('Predicted value')
    plt.ylabel('Measured value')
    plt.rcParams["font.size"] = 15
    plt.tight_layout() 
    plt.show()

[In]


def calculation_evaluations(y: pd.Series, pre: np.ndarray) -> None:
    """
    A function that calculates and prints the accuracy, precision, and recall.

    @param y: observed values (objective variable)
    @param pre: predicted values
    """
    print('Accuracy: {:.3f}'.format(metrics.accuracy_score(y, pre)))
    print('Precision: {:.3f}'.format(metrics.precision_score(y, pre)))
    print('Recall: {:.3f}'.format(metrics.recall_score(y, pre)))

[In]


drawing_confusion_matrix(y_train, y_pred)
calculation_evaluations(y_train, y_pred)

(Figure: confusion matrix for the training data)

[Out]


Accuracy: 0.969
Precision: 0.979
Recall: 0.972

The 163 in the upper left (TP) is the number of actually malignant tumors that the model also predicted to be malignant. The 9 in the lower left (FP) is the number predicted to be malignant that is not actually malignant. The 6 in the upper right (FN) is actually malignant but was predicted to be benign.
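Note that precision_score and recall_score in scikit-learn treat label 1 (benign here) as the positive class by default, whereas the reading above takes malignant as positive. To score with malignant (0) as the positive class, pos_label can be passed:

[In]


#Score with malignant (label 0) treated as the positive class
print(metrics.precision_score(y_train, y_pred, pos_label=0))
print(metrics.recall_score(y_train, y_pred, pos_label=0))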

[In]


#Predict test data with a trained model
y_pred_test = clf.predict(X_test)

[In]


drawing_confusion_matrix(y_test, y_pred_test)
calculation_evaluations(y_test, y_pred_test)

(Figure: confusion matrix for the test data)

[Out]


Accuracy: (TP + TN) / (TP + TN + FP + FN)
Accuracy: 0.939
Precision: TP / (TP + FP)
Precision: 0.944
Recall: TP / (TP + FN)
Recall: 0.958
