This post is meant to be used as a cheat sheet.
Decision trees are machine learning models widely used for classification and regression. A decision tree has a **hierarchical structure** made up of questions that can be answered Yes or No. The tree also shows how strongly each explanatory variable affects the objective variable: it grows by splitting the data repeatedly, and a variable used in an earlier split can be regarded as having a larger influence.
Such a classifier can be expressed as a classification model that **discriminates** four classes of data using three features. A machine learning algorithm can learn a model like this from training data and actually draw it as a tree like the one described above.
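As a rough illustration of how one yes/no question in such a tree might be chosen, the sketch below searches every feature and threshold for the split with the lowest weighted Gini impurity. It assumes Gini impurity as the criterion (the `criterion='gini'` setting that appears in the estimator output later in this post); the helper names `gini` and `best_split` and the toy data are hypothetical, and real libraries use more efficient search.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) for the best 'x[f] <= t' question."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a question that sends everything one way is useless
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (8.0, 2.0), (9.0, 5.0), (10.0, 1.0)]
labels = [0, 0, 0, 1, 1, 1]

feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)  # 0 3.0 0.0
```

Repeating the same search on each resulting subset yields the hierarchy of questions; the feature chosen at the root is the one that separates the data best on its own.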
- Relatively easy to interpret, because a decision tree can **visualize** its results
- Not affected by differences in feature scale, so **no preprocessing such as standardization is required**
- **Relies heavily on the training data**: no matter how you tune the parameters, you may not get the tree structure you want
- **Easy to overfit**, so generalization performance tends to be low
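The point about feature scale can be checked with a small sketch: because a split only asks whether a value is at or below a threshold, rescaling a feature moves the threshold but leaves the grouping of samples unchanged. The helper `left_indices_of_best_split`, the Gini criterion, and the sample values are assumptions made for illustration only.

```python
def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def left_indices_of_best_split(values, labels):
    """Indices of samples with value <= the threshold that minimizes weighted Gini impurity."""
    best_score, best_left = None, None
    # Skip the largest value: it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [i for i, v in enumerate(values) if v <= t]
        right = [i for i, v in enumerate(values) if v > t]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if best_score is None or score < best_score:
            best_score, best_left = score, left
    return best_left

radius = [11.4, 12.0, 13.1, 17.9, 19.7, 20.6]  # hypothetical feature values
labels = [1, 1, 1, 0, 0, 0]
rescaled = [v * 1000 for v in radius]          # same feature expressed in other units

same_partition = (left_indices_of_best_split(radius, labels)
                  == left_indices_of_best_split(rescaled, labels))
print(same_partition)  # True
```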
The confusion matrix is the basis for evaluating a classification model; it represents the relationship between the model's predicted values and the observed values. Specifically, as shown in the figure below, it has four categories: **true positive** (TP), **true negative** (TN), **false positive** (FP), and **false negative** (FN).
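A minimal sketch of how the four categories are counted, assuming binary labels where `1` is treated as the positive class. The function name `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # observed values (hypothetical)
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]  # predicted values (hypothetical)
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 4, 'TN': 2, 'FP': 1, 'FN': 1}
```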
The correct answer rate (accuracy) is the ratio of correct predictions to all predictions: (TP + TN) / (TP + TN + FP + FN).
The compliance rate (precision) is the ratio of data predicted to be positive that is actually positive: TP / (TP + FP).
Recall is the ratio of actually positive data that is predicted to be positive: TP / (TP + FN).
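The three ratios can be written out directly from the four confusion matrix counts. This is only a sketch with hypothetical counts; the function names are chosen here for illustration and are not scikit-learn APIs.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that are predicted positive."""
    return tp / (tp + fn)

# Hypothetical counts, chosen only to exercise the formulas.
tp, tn, fp, fn = 40, 45, 5, 10
print('Accuracy:  {:.3f}'.format(accuracy(tp, tn, fp, fn)))  # 0.850
print('Precision: {:.3f}'.format(precision(tp, fp)))         # 0.889
print('Recall:    {:.3f}'.format(recall(tp, fn)))            # 0.800
```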
The data is the breast cancer dataset bundled with scikit-learn, which summarizes diagnostic data for breast cancer. The objective variable is benign (1) or malignant (0).
[In]
#Library used for data processing
import pandas as pd
import numpy as np
#Library used for data visualization
import matplotlib.pyplot as plt; plt.style.use('ggplot')
import matplotlib.gridspec as gridspec
import seaborn as sns
%matplotlib inline
#Machine learning library
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import metrics
[In]
#constant
RESPONSE_VARIABLE = 'cancer' #Objective variable
TEST_SIZE = 0.2
RANDOM_STATE = 42
[In]
#Data reading(scikit-learn cancer data)
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
cancer = pd.DataFrame(data=data.data, columns=data.feature_names)
cancer[RESPONSE_VARIABLE] = data.target
#Show first 5 lines
cancer.head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | cancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 0 |
| 2 | 19.69 | 21.25 | 130 | 1203 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 0 |
[In]
#Check statistics
cancer.describe()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | cancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569 | 569 | 569 | 569 | 569 | 569 | 569 | 569 | 569 | 569 | ... | 569 |
| mean | 14.12729 | 19.28965 | 91.96903 | 654.8891 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 0.627417 |
| std | 3.524049 | 4.301036 | 24.29898 | 351.9141 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 0.483918 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0 | 0 | 0.106 | 0.04996 | ... | 0 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 0 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 1 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 1 |
| max | 28.11 | 39.28 | 188.5 | 2501 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 1 |
[In]
#Objective variable count
cancer[RESPONSE_VARIABLE].value_counts()
[Out]
1 357
0 212
Name: cancer, dtype: int64
[In]
#Confirmation of missing values
cancer.isnull().sum()
[Out]
mean radius 0
mean texture 0
mean perimeter 0
mean area 0
mean smoothness 0
mean compactness 0
mean concavity 0
mean concave points 0
mean symmetry 0
mean fractal dimension 0
radius error 0
texture error 0
perimeter error 0
area error 0
smoothness error 0
compactness error 0
concavity error 0
concave points error 0
symmetry error 0
fractal dimension error 0
worst radius 0
worst texture 0
worst perimeter 0
worst area 0
worst smoothness 0
worst compactness 0
worst concavity 0
worst concave points 0
worst symmetry 0
worst fractal dimension 0
cancer 0
dtype: int64
[In]
#Divided into training data and test data
train, test = train_test_split(cancer, test_size=TEST_SIZE, random_state=RANDOM_STATE)
#Divide into explanatory variables and objective variables
X_train = train.drop(RESPONSE_VARIABLE, axis=1)
y_train = train[RESPONSE_VARIABLE].copy()
X_test = test.drop(RESPONSE_VARIABLE, axis=1)
y_test = test[RESPONSE_VARIABLE].copy()
[In]
#Visualize the distribution of objective variables for each feature
#Note: distplot is deprecated in recent seaborn releases
features = X_train.columns
plt.figure(figsize=(20,32*4))
gs = gridspec.GridSpec(32, 1)
for i, col in enumerate(train[features]):
    ax = plt.subplot(gs[i])
    #cancer == 0 is malignant and cancer == 1 is benign, so label each distribution explicitly
    sns.distplot(train[col][train.cancer == 0], bins=50, color='crimson', label='Malignant')
    sns.distplot(train[col][train.cancer == 1], bins=50, color='royalblue', label='Benign')
    plt.legend()
By using RandomForestClassifier() of scikit-learn, it is possible to check the "importance" of each feature through feature_importances_.
[In]
#Feature selection
RF = RandomForestClassifier(n_estimators = 250, random_state = 42)
RF.fit(X_train, y_train)
[Out]
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=None,
oob_score=False, random_state=42, verbose=0, warm_start=False)
[In]
#Features are output in descending order of importance
features = X_train.columns
importances = RF.feature_importances_
importances_features = sorted(zip(map(lambda x: round(x, 2), RF.feature_importances_), features), reverse=True)
for i in importances_features:
    print(i)
[Out]
(0.13, 'worst perimeter')
(0.13, 'worst concave points')
(0.13, 'worst area')
(0.11, 'mean concave points')
(0.07, 'worst radius')
(0.05, 'mean radius')
(0.05, 'mean concavity')
(0.04, 'worst concavity')
(0.04, 'mean perimeter')
(0.04, 'mean area')
(0.02, 'worst texture')
(0.02, 'worst compactness')
(0.02, 'radius error')
(0.02, 'mean compactness')
(0.02, 'area error')
(0.01, 'worst symmetry')
(0.01, 'worst smoothness')
(0.01, 'worst fractal dimension')
(0.01, 'perimeter error')
(0.01, 'mean texture')
(0.01, 'mean smoothness')
(0.01, 'fractal dimension error')
(0.01, 'concavity error')
(0.0, 'texture error')
(0.0, 'symmetry error')
(0.0, 'smoothness error')
(0.0, 'mean symmetry')
(0.0, 'mean fractal dimension')
(0.0, 'concave points error')
(0.0, 'compactness error')
Top 5 results of random forest feature selection
[In]
#Get the top 5 as a list (importance >= 0.06 keeps exactly five features here)
feature_list = [value for key, value in importances_features if key >= 0.06]
feature_list
[Out]
['worst perimeter',
'worst concave points',
'worst area',
'mean concave points',
'worst radius']
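The 0.06 cutoff selects five features only because of how the rounded importances happen to fall. As a hedged alternative, the sketch below takes the first k entries of the sorted list instead; `TOP_K` is a hypothetical constant, and the pairs are copied from the top of the output above rather than recomputed.

```python
# (importance, feature) pairs copied from the ranking printed earlier (top 7 only).
importances_features = [
    (0.13, 'worst perimeter'),
    (0.13, 'worst concave points'),
    (0.13, 'worst area'),
    (0.11, 'mean concave points'),
    (0.07, 'worst radius'),
    (0.05, 'mean radius'),
    (0.05, 'mean concavity'),
]

# Selection by threshold, as in the post.
by_threshold = [name for score, name in importances_features if score >= 0.06]

# Selection by rank: sort in descending order and keep the first TOP_K names.
TOP_K = 5  # hypothetical constant
by_rank = [name for score, name in sorted(importances_features, reverse=True)[:TOP_K]]

print(by_rank)
print(by_threshold == by_rank)  # True for this ranking
```

Selecting by rank keeps the number of features fixed even if the importances shift slightly between runs, while a threshold keeps the minimum importance fixed.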
[In]
#Focus training and test data on only the most important features
X_train = X_train[feature_list]
X_test = X_test[feature_list]
[In]
#Check the distribution of the objective variable again
#Note: distplot is deprecated in recent seaborn releases
plt.figure(figsize=(20,32*4))
gs = gridspec.GridSpec(32, 1)
for i, col in enumerate(train[feature_list]):
    ax = plt.subplot(gs[i])
    #cancer == 0 is malignant and cancer == 1 is benign, so label each distribution explicitly
    sns.distplot(train[col][train.cancer == 0], bins=50, color='crimson', label='Malignant')
    sns.distplot(train[col][train.cancer == 1], bins=50, color='royalblue', label='Benign')
    plt.legend()
[In]
#Learning
clf = DecisionTreeClassifier(max_depth=4)
clf = clf.fit(X_train, y_train)
[In]
#Prediction using features of training data
y_pred = clf.predict(X_train)
[In]
def drawing_confusion_matrix(y: pd.Series, pre: np.ndarray) -> None:
    """
    A function that draws the confusion matrix
    @param y:Objective variable
    @param pre:Predicted value
    """
    confmat = confusion_matrix(y, pre)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
    for i in range(confmat.shape[0]):
        for j in range(confmat.shape[1]):
            ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
    plt.title('Predicted value')
    plt.ylabel('Measured value')
    plt.rcParams["font.size"] = 15
    plt.tight_layout()
    plt.show()
[In]
def calculation_evaluations(y: pd.Series, pre: np.ndarray) -> None:
    """
    A function that calculates and outputs the correct answer rate (accuracy), compliance rate (precision), and recall.
    @param y:Objective variable
    @param pre:Predicted value
    """
    #precision_score and recall_score treat label 1 (benign) as the positive class by default
    print('Correct answer rate: {:.3f}'.format(metrics.accuracy_score(y, pre)))
    print('Compliance rate: {:.3f}'.format(metrics.precision_score(y, pre)))
    print('Recall: {:.3f}'.format(metrics.recall_score(y, pre)))
[In]
drawing_confusion_matrix(y_train, y_pred)
calculation_evaluations(y_train, y_pred)
[Out]
Correct answer rate: 0.969
Compliance rate: 0.979
Recall: 0.972
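In the plotted matrix, rows are the actual classes and columns are the predicted classes, in label order (0 = malignant, 1 = benign). Taking malignant as the positive class, the 163 in TP (upper left) is the number of actually malignant tumors that the model predicted to be malignant. The 6 in FN (upper right) is the number that are actually malignant but were predicted to be benign. The 9 in FP (lower left) is the number that were predicted to be malignant but are not actually malignant. Note that the compliance rate and recall printed above treat benign (1) as the positive class, because scikit-learn's precision_score and recall_score default to pos_label=1.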
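Which class counts as "positive" changes precision and recall, so it can help to read both views from the same matrix. The sketch below assumes the scikit-learn layout, where `confmat[i][j]` is the number of samples of actual class `i` predicted as class `j`; the function `precision_recall` and the matrix values are hypothetical.

```python
def precision_recall(confmat, positive):
    """Precision and recall for one class of a square confusion matrix.

    confmat[i][j] = number of samples with actual class i predicted as class j.
    """
    n_classes = len(confmat)
    tp = confmat[positive][positive]
    # Predicted as the positive class but actually another class: same column, other rows.
    fp = sum(confmat[i][positive] for i in range(n_classes) if i != positive)
    # Actually the positive class but predicted as another: same row, other columns.
    fn = sum(confmat[positive][j] for j in range(n_classes) if j != positive)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical matrix: row/column 0 = malignant, row/column 1 = benign.
confmat = [[50, 5],
           [10, 35]]

p0, r0 = precision_recall(confmat, positive=0)  # malignant as positive
p1, r1 = precision_recall(confmat, positive=1)  # benign as positive
print('Malignant as positive: precision {:.3f}, recall {:.3f}'.format(p0, r0))
print('Benign as positive:    precision {:.3f}, recall {:.3f}'.format(p1, r1))
```

With scikit-learn itself, passing `pos_label=0` to `metrics.precision_score` and `metrics.recall_score` gives the malignant-as-positive view.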
[In]
#Predict test data with a trained model
y_pred_test = clf.predict(X_test)
[In]
drawing_confusion_matrix(y_test, y_pred_test)
calculation_evaluations(y_test, y_pred_test)
[Out]
Correct answer rate: 0.939
Compliance rate: 0.944
Recall: 0.958
For reference, each score is defined as follows.
Correct answer rate: (TP + TN)/(TP + TN + FP + FN)
Compliance rate: TP/(TP + FP)
Recall: TP/(TP + FN)
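Since decision trees overfit easily, one quick check is to compare the training scores with the test scores. The sketch below only subtracts the values printed in this post; how large a gap is acceptable is a judgment call, not something the post defines.

```python
# Scores copied from the outputs in this post.
train_scores = {"accuracy": 0.969, "precision": 0.979, "recall": 0.972}
test_scores = {"accuracy": 0.939, "precision": 0.944, "recall": 0.958}

# A positive gap means the model does better on the data it was trained on.
gaps = {name: round(train_scores[name] - test_scores[name], 3) for name in train_scores}
for name, gap in gaps.items():
    print('{}: train - test = {:.3f}'.format(name, gap))
```

Here every score drops by a few points on the test data, which suggests that `max_depth=4` holds overfitting to a modest level for this split.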