[Python] Decision Tree Personal Tutorial

I wrote this for my own use as a cheat sheet.

What is a Decision Tree?

Decision trees are machine learning models widely used for classification and regression. A tree has a **hierarchical structure** made up of questions that can be answered Yes / No. A decision tree also shows how strongly each explanatory variable affects the objective variable: the tree grows by splitting the data repeatedly, and the variables used in the earliest splits can be regarded as having the greatest influence.

決定木.png (figure: example decision tree)

The tree in the figure can be read as a classification model that **discriminates** four classes of data using three features. A machine learning algorithm can learn such a model from training data and draw it as a tree like the one above.
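To make the idea of a hierarchy of Yes / No questions concrete, here is a minimal hand-written sketch of a tree that separates four classes using three features. The feature names (`has_feathers`, `can_fly`, `has_fins`) and the class labels are hypothetical and chosen only for illustration; a real tree is learned from training data rather than written by hand.

```python
def classify_animal(has_feathers: bool, can_fly: bool, has_fins: bool) -> str:
    """Walk a hand-written decision tree made of Yes / No questions."""
    if has_feathers:          # first split: the question asked of every sample
        if can_fly:           # second level, reached only on the "Yes" branch
            return "hawk"
        return "penguin"
    if has_fins:              # second level on the "No" branch
        return "dolphin"
    return "bear"


# Each call follows one path from the root to a leaf.
print(classify_animal(has_feathers=True, can_fly=False, has_fins=False))  # penguin
```

Every sample passes through the first question, which is one way to see why the variable used at the root is treated as the most influential.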

Characteristics of decision trees

Advantages

Relatively easy to interpret, because a decision tree lets you **visualize** the result

Unaffected by differences in feature scale, so **no preprocessing such as standardization is required**
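Scale does not matter because each split only compares one feature against a threshold, so rescaling a feature moves the threshold but not which samples fall on each side. The following is a minimal pure-Python sketch of that idea, assuming a Gini-based threshold search on a single feature. The helper names `gini` and `best_split` and the sample values are hypothetical, and this is a simplification rather than scikit-learn's actual implementation.

```python
from typing import List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values: List[float], labels: List[int]) -> Tuple[float, List[bool]]:
    """Find the threshold with the lowest weighted Gini impurity.

    Returns the threshold and, for each sample, whether it goes to the left
    branch. Assumes `values` contains at least two distinct numbers.
    """
    best_score, best_threshold = float("inf"), float("nan")
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_score, best_threshold = score, threshold
    return best_threshold, [x <= best_threshold for x in values]


radius = [6.9, 11.4, 13.3, 17.9, 20.5, 28.1]  # hypothetical feature values
labels = [1, 1, 1, 0, 0, 0]                   # hypothetical class labels

t_raw, split_raw = best_split(radius, labels)
t_scaled, split_scaled = best_split([x * 1000 for x in radius], labels)

print(t_raw, t_scaled)            # the thresholds differ by the scale factor
print(split_raw == split_scaled)  # True: the same samples go left and right
```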

Disadvantages

**Heavily dependent on the training data**: no matter how you tune the parameters, you may not get the tree structure you want

**Prone to overfitting**, so generalization performance tends to be low
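One way to observe the overfitting tendency is to compare training and test accuracy for trees of different depths. The sketch below assumes the scikit-learn breast cancer dataset and an 80/20 split. The depth values and the `scores` dictionary are illustrative choices, and the exact numbers depend on the library version and the split, so none are quoted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for depth in (None, 4, 2):  # None lets the tree grow until the leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    # score() returns the accuracy on the given data
    scores[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between the training and test columns for the unrestricted tree is the usual sign of overfitting; limiting `max_depth` is one common way to narrow it.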

Evaluation method

Confusion matrix

The confusion matrix is the basis for evaluating a classification model. It represents the relationship between the values the model predicts and the values actually observed. Specifically, as shown in the figure below, it has four categories: **true positive** (TP), **true negative** (TN), **false positive** (FP), and **false negative** (FN).

混同行列.png (figure: confusion matrix)
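As a small illustration of the four categories, the counts can be tallied directly from pairs of observed and predicted labels. The label lists below are hypothetical, with 1 standing for the positive class and 0 for the negative class.

```python
from collections import Counter

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # observed values
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # predicted values

# Count each (observed, predicted) combination
counts = Counter(zip(y_true, y_pred))

tp = counts[(1, 1)]  # observed positive, predicted positive
tn = counts[(0, 0)]  # observed negative, predicted negative
fp = counts[(0, 1)]  # observed negative, predicted positive
fn = counts[(1, 0)]  # observed positive, predicted negative

print(tp, tn, fp, fn)  # 3 3 1 1
```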

Accuracy

It is the proportion of all predictions that are correct, and can be calculated as follows. $\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$

Precision

It is the proportion of the data predicted to be positive that is actually positive, and can be calculated as follows. $\text{Precision} = \frac{TP}{TP + FP}$

Recall

It is the proportion of the actually positive data that is predicted to be positive, and can be calculated as follows. $\text{Recall} = \frac{TP}{TP + FN}$
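The three formulas can be checked with a short worked example. The counts below are hypothetical and the function names are illustrative; the arithmetic follows the definitions of accuracy, precision, and recall given in this section.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(TP + TN) / (TP + FP + FN + TN)"""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP)"""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN)"""
    return tp / (tp + fn)


# Hypothetical counts for 100 samples
tp, tn, fp, fn = 40, 45, 5, 10

print(f"Accuracy:  {accuracy(tp, tn, fp, fn):.3f}")  # 0.850
print(f"Precision: {precision(tp, fp):.3f}")         # 0.889
print(f"Recall:    {recall(tp, fn):.3f}")            # 0.800
```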

Implemented in Python

The code below assumes it is run in Jupyter Notebook.

Dataset

This tutorial uses the breast cancer diagnostic dataset bundled with scikit-learn. The objective variable is benign (1) or malignant (0).
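The label encoding can be confirmed from the dataset object itself. This short check assumes only `load_breast_cancer` from scikit-learn; the variable name `label_names` is illustrative.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is indexed by the integer label: position 0 is label 0, and so on
label_names = {label: str(name) for label, name in enumerate(data.target_names)}
print(label_names)

# Number of samples and number of features
print(data.data.shape)
```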

Library to use

`[In]`


#Library used for data processing
import pandas as pd
import numpy as np

#Library used for data visualization
import matplotlib.pyplot as plt; plt.style.use('ggplot')
import matplotlib.gridspec as gridspec
import seaborn as sns
%matplotlib inline

#Machine learning library
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import metrics

Constants

`[In]`


#constant
RESPONSE_VARIABLE = 'cancer' #Objective variable
TEST_SIZE = 0.2
RANDOM_STATE = 42

Data reading

`[In]`


#Data reading(scikit-learn cancer data)
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
cancer = pd.DataFrame(data=data.data, columns=data.feature_names)
cancer[RESPONSE_VARIABLE] = data.target

#Show first 5 lines
cancer.head()

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...
0	17.99	10.38	122.8	1001	0.1184	0.2776	0.3001	0.1471	0.2419	0.07871	...
1	20.57	17.77	132.9	1326	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...
2	19.69	21.25	130	1203	0.1096	0.1599	0.1974	0.1279	0.2069	0.05999	...
3	11.42	20.38	77.58	386.1	0.1425	0.2839	0.2414	0.1052	0.2597	0.09744	...
4	20.29	14.34	135.1	1297	0.1003	0.1328	0.198	0.1043	0.1809	0.05883	...

Basic tabulation

`[In]`


#Check statistics
cancer.describe()

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	cancer
count	569	569	569	569	569	569	569	569	569	569	...	569
mean	14.12729	19.28965	91.96903	654.8891	0.09636	0.104341	0.088799	0.048919	0.181162	0.062798	...	0.627417
std	3.524049	4.301036	24.29898	351.9141	0.014064	0.052813	0.07972	0.038803	0.027414	0.00706	...	0.483918
min	6.981	9.71	43.79	143.5	0.05263	0.01938	0	0	0.106	0.04996	...	0
25%	11.7	16.17	75.17	420.3	0.08637	0.06492	0.02956	0.02031	0.1619	0.0577	...	0
50%	13.37	18.84	86.24	551.1	0.09587	0.09263	0.06154	0.0335	0.1792	0.06154	...	1
75%	15.78	21.8	104.1	782.7	0.1053	0.1304	0.1307	0.074	0.1957	0.06612	...	1
max	28.11	39.28	188.5	2501	0.1634	0.3454	0.4268	0.2012	0.304	0.09744	...	1

`[In]`


#Objective variable count
cancer[RESPONSE_VARIABLE].value_counts()

`[Out]`


1    357
0    212
Name: cancer, dtype: int64

`[In]`


#Confirmation of missing values
cancer.isnull().sum()

`[Out]`


mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
cancer                     0
dtype: int64

Data split

`[In]`


#Divided into training data and test data
train, test = train_test_split(cancer, test_size=TEST_SIZE, random_state=RANDOM_STATE)

#Divide into explanatory variables and objective variables
X_train = train.drop(RESPONSE_VARIABLE, axis=1)
y_train = train[RESPONSE_VARIABLE].copy()

X_test = test.drop(RESPONSE_VARIABLE, axis=1)
y_test = test[RESPONSE_VARIABLE].copy()

Data visualization

`[In]`


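#Visualize the distribution of each feature, split by objective variable
#Note: sns.distplot is deprecated since seaborn 0.11 (sns.histplot with kde=True is the newer equivalent)
features = X_train.columns
legend = ['Malignant', 'Benign']  #Same order as the plot calls below (0 = malignant, 1 = benign)
plt.figure(figsize=(20, 4 * len(features)))
gs = gridspec.GridSpec(len(features), 1)
for i, col in enumerate(features):
    ax = plt.subplot(gs[i])
    sns.distplot(train[col][train[RESPONSE_VARIABLE] == 0], bins=50, color='crimson')
    sns.distplot(train[col][train[RESPONSE_VARIABLE] == 1], bins=50, color='royalblue')
    plt.legend(legend)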

ダウンロード.png

Feature selection

After fitting scikit-learn's RandomForestClassifier, the "importance" of each feature can be read from its feature_importances_ attribute.

`[In]`


#Feature selection
RF = RandomForestClassifier(n_estimators = 250, random_state = 42)
RF.fit(X_train, y_train)

`[Out]`


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

`[In]`


#Features are output in descending order of importance
features = X_train.columns
importances = RF.feature_importances_

#Pair each rounded importance with its feature name, largest first
importances_features = sorted(zip((round(x, 2) for x in importances), features), reverse=True)

for i in importances_features:
    print(i)

`[Out]`


(0.13, 'worst perimeter')
(0.13, 'worst concave points')
(0.13, 'worst area')
(0.11, 'mean concave points')
(0.07, 'worst radius')
(0.05, 'mean radius')
(0.05, 'mean concavity')
(0.04, 'worst concavity')
(0.04, 'mean perimeter')
(0.04, 'mean area')
(0.02, 'worst texture')
(0.02, 'worst compactness')
(0.02, 'radius error')
(0.02, 'mean compactness')
(0.02, 'area error')
(0.01, 'worst symmetry')
(0.01, 'worst smoothness')
(0.01, 'worst fractal dimension')
(0.01, 'perimeter error')
(0.01, 'mean texture')
(0.01, 'mean smoothness')
(0.01, 'fractal dimension error')
(0.01, 'concavity error')
(0.0, 'texture error')
(0.0, 'symmetry error')
(0.0, 'smoothness error')
(0.0, 'mean symmetry')
(0.0, 'mean fractal dimension')
(0.0, 'concave points error')
(0.0, 'compactness error')

Top 5 results of random forest feature selection

worst perimeter
worst concave points
worst area
mean concave points
worst radius

`[In]`


#Get the top 5 as a list (a threshold of 0.06 on the rounded importance keeps exactly the top 5 here)
feature_list = [name for importance, name in importances_features if importance >= 0.06]
feature_list

`[Out]`


['worst perimeter',
 'worst concave points',
 'worst area',
 'mean concave points',
 'worst radius']
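The 0.06 threshold works here only because of how the rounded importances happen to fall. As an alternative sketch, the top entries can be taken by position instead of by threshold. The pairs below are copied from the first seven rows of the importance output above, and `TOP_K` is a hypothetical constant introduced for this example.

```python
# (importance, feature) pairs copied from the importance output (first 7 rows only)
importances_features = [
    (0.13, 'worst perimeter'),
    (0.13, 'worst concave points'),
    (0.13, 'worst area'),
    (0.11, 'mean concave points'),
    (0.07, 'worst radius'),
    (0.05, 'mean radius'),
    (0.05, 'mean concavity'),
]

TOP_K = 5  # hypothetical constant: how many features to keep

# Sort by (importance, name) in descending order, then keep the first TOP_K names
ranked = sorted(importances_features, reverse=True)
top_features = [name for _, name in ranked[:TOP_K]]
print(top_features)
```

Slicing always returns exactly `TOP_K` features, whereas a threshold can return more or fewer when the importances change.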

`[In]`


#Focus training and test data on only the most important features
X_train = X_train[feature_list]
X_test = X_test[feature_list]

`[In]`


#Check the distribution of the objective variable again
legend = ['Malignant', 'Benign']  #Same order as the plot calls below (0 = malignant, 1 = benign)
plt.figure(figsize=(20, 4 * len(feature_list)))
gs = gridspec.GridSpec(len(feature_list), 1)
for i, col in enumerate(feature_list):
    ax = plt.subplot(gs[i])
    sns.distplot(train[col][train[RESPONSE_VARIABLE] == 0], bins=50, color='crimson')
    sns.distplot(train[col][train[RESPONSE_VARIABLE] == 1], bins=50, color='royalblue')
    plt.legend(legend)

ダウンロード (1).png

Learning / prediction / evaluation

`[In]`


#Learning
clf = DecisionTreeClassifier(max_depth=4)
clf = clf.fit(X_train, y_train)
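Since interpretability is one of the advantages listed earlier, it can help to look at the rules a fitted tree has learned. The sketch below assumes scikit-learn's `export_text` helper and fits a separate, shallower tree on the full dataset so that the printed rules stay short. The `max_depth=2` and `random_state=42` settings are choices made for this example, not the settings used in the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# A shallow tree keeps the printed rules short; random_state fixes tie-breaking
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(data.data, data.target)

# Render the learned Yes / No questions as indented text
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

Each indented line is one threshold question, and the `class:` lines are the leaves.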

`[In]`


#Prediction using features of training data
y_pred = clf.predict(X_train)

`[In]`


def drawing_confusion_matrix(y: pd.Series, pre: np.ndarray) -> None:
    """
    Draw the confusion matrix.

    @param y: Objective variable (observed values)
    @param pre: Predicted values
    """
    #Rows are observed classes, columns are predicted classes (labels sorted as 0, 1)
    confmat = confusion_matrix(y, pre)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
    for i in range(confmat.shape[0]):
        for j in range(confmat.shape[1]):
            ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
    plt.title('Predicted value')
    plt.ylabel('Measured value')
    plt.rcParams["font.size"] = 15
    plt.tight_layout()
    plt.show()

`[In]`


def calculation_evaluations(y: pd.Series, pre: np.ndarray) -> None:
    """
    Calculate and print the accuracy, precision, and recall.

    @param y: Objective variable (observed values)
    @param pre: Predicted values
    """
    #precision_score and recall_score treat label 1 (benign) as the positive class by default
    print('Accuracy: {:.3f}'.format(metrics.accuracy_score(y, pre)))
    print('Precision: {:.3f}'.format(metrics.precision_score(y, pre)))
    print('Recall: {:.3f}'.format(metrics.recall_score(y, pre)))

`[In]`


drawing_confusion_matrix(y_train, y_pred)
calculation_evaluations(y_train, y_pred)

ダウンロード (2).png

`[Out]`


Accuracy: 0.969
Precision: 0.979
Recall: 0.972

Reading the matrix with malignant as the positive class: the 163 in TP (upper left) is the number of tumors that are actually malignant and that the model predicted to be malignant. The 9 in FP (lower left) is the number predicted to be malignant that are not actually malignant. The 6 in FN (upper right) is the number that are actually malignant but were predicted to be benign. Note that the precision and recall printed above are computed by scikit-learn with label 1 (benign) as the positive class, which is its default.

`[In]`


#Predict test data with a trained model
y_pred_test = clf.predict(X_test)

`[In]`


drawing_confusion_matrix(y_test, y_pred_test)
calculation_evaluations(y_test, y_pred_test)

ダウンロード (3).png

`[Out]`


Accuracy: 0.939
Precision: 0.944
Recall: 0.958
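Because this dataset encodes malignant as 0 and benign as 1, it matters which class is treated as positive when reading the matrix and the scores. The sketch below uses a small set of hypothetical labels to show how scikit-learn lays out `confusion_matrix` and how the `pos_label` argument of `precision_score` and `recall_score` changes the result. The label lists and variable names are made up for this example.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = malignant, 1 = benign
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are observed classes, columns are predicted classes, both ordered [0, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# With malignant (0) as the positive class, the cells read as follows
tp, fn = cm[0, 0], cm[0, 1]  # observed malignant: predicted malignant / benign
fp, tn = cm[1, 0], cm[1, 1]  # observed benign:    predicted malignant / benign

# Default: label 1 (benign) is the positive class
precision_benign = precision_score(y_true, y_pred)
recall_benign = recall_score(y_true, y_pred)

# Explicitly treat label 0 (malignant) as the positive class
precision_malignant = precision_score(y_true, y_pred, pos_label=0)
recall_malignant = recall_score(y_true, y_pred, pos_label=0)

print(f"benign as positive:    precision={precision_benign:.3f}, recall={recall_benign:.3f}")
print(f"malignant as positive: precision={precision_malignant:.3f}, recall={recall_malignant:.3f}")
```

For a diagnostic task, recall with malignant as the positive class is often the figure of interest, since it measures how many malignant cases the model catches.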