[PYTHON] Evaluating the performance of machine learning classification models

1. Introduction

In this post, I evaluate the performance of classification models used in machine learning, writing the code as I go.

2. Data set

The dataset used is the breast cancer data that comes with sklearn.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score

# -----------Data set preparation--------------
dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.Series(dataset.target, name='y')
X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.3, random_state=1)
print('X_train.shape = ', X_train.shape)
print('X_test.shape = ', X_test.shape)
print(y_test.value_counts())

[Image: Screenshot 2019-12-03 13.21.14.png]

The dataset is split so that training data : test data = 7 : 3. The test set contains 171 samples with 30 features each, and of those 171 samples **108 are labeled 1 (normal patients) and 63 are labeled 0 (cancer patients)**.

3. Classification model

Eight classification models are used this time. For ease of use, each one is wrapped in a pipeline that standardizes the features before passing them to the estimator. Hyperparameters are left at their defaults except for the few set explicitly in the code.

# ----------Pipeline settings----------
pipelines = {    
    '1.KNN':
        Pipeline([('scl',StandardScaler()),
                  ('est',KNeighborsClassifier())]),    
    '2.Logistic':
        Pipeline([('scl',StandardScaler()),
                  ('est',LogisticRegression(solver='lbfgs', random_state=1))]),  # solver specified explicitly
    '3.SVM':
        Pipeline([('scl',StandardScaler()),
                  ('est',SVC(C=1.0, kernel='linear', class_weight='balanced', random_state=1, probability=True))]),
    '4.K-SVM':
        Pipeline([('scl',StandardScaler()),
                  ('est',SVC(C=1.0, kernel='rbf', class_weight='balanced', random_state=1, probability=True))]),
    '5.Tree':
        Pipeline([('scl',StandardScaler()),
                  ('est',DecisionTreeClassifier(random_state=1))]),
    '6.Random':
        Pipeline([('scl',StandardScaler()),
                  ('est',RandomForestClassifier(random_state=1, n_estimators=100))]),
    '7.GBoost':
        Pipeline([('scl',StandardScaler()),
                  ('est',GradientBoostingClassifier(random_state=1))]),
    '8.MLP':
        Pipeline([('scl',StandardScaler()),
                  ('est',MLPClassifier(hidden_layer_sizes=(3,3),
                                       max_iter=1000,
                                       random_state=1))])
}
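Every pipeline above starts with a StandardScaler step ('scl'). As a rough illustration of what that step does, the plain-Python sketch below learns the mean and standard deviation from the training values only and then reuses them for the test values. The toy numbers and the helper names `fit_scaler` and `transform` are made up for this illustration and are not part of sklearn.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and the population standard deviation (ddof=0) of a feature."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values using previously learned statistics."""
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 14.0, 16.0, 18.0]   # toy feature column (training data)
test = [14.0, 20.0]                      # toy feature column (test data)

mean, std = fit_scaler(train)             # statistics come from the training data only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # the test data reuses the training statistics

print(mean, std)
print(train_scaled)
print(test_scaled)
```

When a pipeline's fit is called on the training data and predict on the test data, the scaler step is handled in this same fit-then-reuse manner, which is one reason the pipeline form is convenient.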

1.KNN The **k-nearest neighbor method**, which finds the k training samples closest to the data point you want to classify and assigns the class by majority vote among those k samples.

2.Logistic **Logistic Regression**, which converts the inner product of the feature vector and the weight vector into a probability and classifies based on it.

3.SVM A **Support Vector Machine**, which chooses the decision boundary that maximizes the margin.

4.K-SVM A **kernel Support Vector Machine**, which maps the training data into a higher-dimensional feature space using a projection function and classifies it there with an SVM.

5.Tree A classification model based on a **Decision Tree**.

6.Random A **Random Forest**, which builds multiple decision trees from randomly selected features and outputs the average prediction of all the trees.

7.GBoost **Gradient Boosting**, which improves prediction accuracy by having each succeeding tree try to explain the information (residual) that the existing group of trees cannot explain.

8.MLP A **multi-layer perceptron**, which is a type of feedforward neural network.
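To make the first description more concrete, here is a minimal plain-Python sketch of k-nearest neighbor classification by majority vote. The toy points, the function name `knn_predict`, and the choice k=3 are assumptions for illustration only; this is not how KNeighborsClassifier is implemented internally.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Sort the training samples by Euclidean distance to the query point
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    # Collect the labels of the k closest samples and return the most common one
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D training data: three points per class
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
train_y = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_X, train_y, (1.1, 1.0))  # lies near the class-0 cluster
pred_b = knn_predict(train_X, train_y, (2.9, 3.0))  # lies near the class-1 cluster
print(pred_a, pred_b)
```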

4. Accuracy

# -------- accuracy ---------
scores = {}
for pipe_name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    scores[(pipe_name,'train')] = accuracy_score(y_train, pipeline.predict(X_train))
    scores[(pipe_name,'test')] = accuracy_score(y_test, pipeline.predict(X_test))
print(pd.Series(scores).unstack())  

[Image: Screenshot 2019-12-03 10.59.17.png]

The table shows the accuracy on the training data and on the test data. Looking at the test accuracy, which reflects generalization performance, **2.Logistic** is the best at 0.970760.

5. Confusion Matrix

The classification results are divided into four categories: **true positive (TP), false negative (FN), false positive (FP), and true negative (TN)**. The square matrix that tabulates them is called the **Confusion Matrix**.

The following is an example of cancer screening predictions expressed as a Confusion Matrix.

[Image: Screenshot 2019-12-03 11.58.27.png]

Accuracy is expressed as (TP + TN) / (TP + FN + FP + TN). In cancer screening, however, the important viewpoint is how far FN (cancer patients classified as normal) can be lowered, while keeping an eye on the accompanying increase in FP.
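As a small worked example of these four categories, the sketch below counts TP, FN, FP, and TN by hand and computes accuracy from them. The ten labels are made up for illustration; following the convention of this dataset, 0 (cancer) is treated as the positive class.

```python
# Toy labels: 0 = cancer (treated as the positive class), 1 = normal
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 0 and p == 0)  # cancer predicted as cancer
fn = sum(1 for t, p in pairs if t == 0 and p == 1)  # cancer predicted as normal (missed)
fp = sum(1 for t, p in pairs if t == 1 and p == 0)  # normal predicted as cancer
tn = sum(1 for t, p in pairs if t == 1 and p == 1)  # normal predicted as normal

accuracy = (tp + tn) / (tp + fn + fp + tn)
print(tp, fn, fp, tn, accuracy)
```

In this toy case one cancer patient is missed (FN = 1), and accuracy alone would not show that.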

# ---------- Confusion Matrix ---------
from sklearn.metrics import confusion_matrix
import seaborn as sns
for pipe_name, pipeline in pipelines.items():   
    cmx_data = confusion_matrix(y_test, pipeline.predict(X_test)) 
    df_cmx = pd.DataFrame(cmx_data)
    plt.figure(figsize = (3,3))  
    sns.heatmap(df_cmx, fmt='d', annot=True, square=True)
    plt.title(pipe_name)  
    plt.xlabel('predicted label')
    plt.ylabel('true label')
    plt.show()

[Image: Screenshot 2019-12-03 12.23.15.png]

The code produces eight figures. Taking 2.Logistic as a representative, its Confusion Matrix shows that two of the 63 cancer patients are mistakenly classified as normal.

6. Accuracy, recall, precision, f1-score

The following four indicators can be obtained from the Confusion Matrix.

[Image: Screenshot 2019-12-02 22.08.28.png]

accuracy = (TP + TN) / (TP + FN + FP + TN)
recall = TP / (TP + FN)
precision = TP / (TP + FP)
f1-score = 2 * precision * recall / (precision + recall)

The code below displays the four indicators. In this dataset, 0 is a cancer patient and 1 is a normal person, so pos_label=0 is passed to recall, precision, and f1-score.

# ------- accuracy, precision, recall, f1_score for test_data------
from sklearn.metrics import precision_score  
from sklearn.metrics import recall_score   
from sklearn.metrics import f1_score  

scores = {}
for pipe_name, pipeline in pipelines.items():
    y_pred = pipeline.predict(X_test)  # predict once and reuse for all four indicators
    scores[(pipe_name,'1.accuracy')] = accuracy_score(y_test, y_pred)
    scores[(pipe_name,'2.recall')] = recall_score(y_test, y_pred, pos_label=0)
    scores[(pipe_name,'3.precision')] = precision_score(y_test, y_pred, pos_label=0)
    scores[(pipe_name,'4.f1_score')] = f1_score(y_test, y_pred, pos_label=0)
print(pd.Series(scores).unstack())

[Image: Screenshot 2019-12-03 14.18.52.png]

In this comparison, the first requirement is that cancer patients are rarely mistaken for normal patients (high recall), which makes models 2 to 4 the candidates. Among them, **2.Logistic** seems to be the best, because it also rarely mistakes normal patients for cancer patients (high precision).
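The indicators can also be computed by hand from the four cells of a Confusion Matrix. The sketch below does this for 2.Logistic. Note that the cell counts are an assumption inferred from figures quoted in this post (2 of the 63 cancer patients missed, and a test accuracy of 0.970760 on 171 samples, which corresponds to 166 correct predictions), not values read from the screenshots.

```python
# Counts for 2.Logistic, inferred from the text (an assumption, not read from the screenshots):
# 63 cancer patients of which 2 were missed, and 166 of the 171 test samples correct.
tp, fn = 61, 2     # cancer predicted as cancer / cancer predicted as normal
fp, tn = 3, 105    # normal predicted as cancer / normal predicted as normal

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)        # share of cancer patients that were detected
precision = tp / (tp + fp)     # share of cancer predictions that were correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.6f} recall={recall:.6f} "
      f"precision={precision:.6f} f1={f1:.6f}")
```

Because 0 (cancer) is the positive class here, recall drops with every missed cancer patient, while precision drops with every normal patient flagged as cancer.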

7. ROC curve, AUC

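First, I will explain the ROC curve and AUC with a concrete example. The ROC curve plots the true positive rate against the false positive rate as the classification threshold is varied, and AUC is the area under that curve (1.0 for a perfect classifier, 0.5 for random guessing).

[Image: Screenshot 2019-12-02 22.13.44.png]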
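As one possible concrete example, the plain-Python sketch below builds an ROC curve point by point and computes AUC with the trapezoidal rule. The six labels and scores are made up for illustration, and for simplicity label 1 is the positive class here (in the code of this post, 0 is the positive class).

```python
# Toy data: 1 = positive, 0 = negative; scores are predicted probabilities of the positive class
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

n_pos = sum(y_true)
n_neg = len(y_true) - n_pos

# Start at (0, 0), then lower the threshold to each distinct score in turn
points = [(0.0, 0.0)]
for threshold in sorted(set(y_score), reverse=True):
    predicted_pos = [s >= threshold for s in y_score]
    tp = sum(1 for t, p in zip(y_true, predicted_pos) if t == 1 and p)
    fp = sum(1 for t, p in zip(y_true, predicted_pos) if t == 0 and p)
    points.append((fp / n_neg, tp / n_pos))  # (false positive rate, true positive rate)

# Area under the curve by the trapezoidal rule over consecutive ROC points
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))

for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print("AUC =", round(auc, 4))
```

The only ranking mistake in this toy data is the negative sample scored 0.7 above the positive sample scored 0.6, which is why the AUC falls slightly short of 1.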

# --------ROC curve, AUC -----------
for pipe_name, pipeline in pipelines.items():    
    fpr, tpr, thresholds = metrics.roc_curve(y_test, pipeline.predict_proba(X_test)[:, 0], pos_label=0) # 0:Classification of cancer patients
    auc = metrics.auc(fpr, tpr)
    plt.figure(figsize=(3, 3), dpi=100)
    plt.plot(fpr, tpr, label='ROC curve (AUC = %.4f)'%auc)  
    x = np.arange(0, 1, 0.01)  
    plt.plot(x, x, c = 'red', linestyle = '--')  
    plt.legend()
    plt.title(pipe_name)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.grid(True)
    plt.show()

[Image: Screenshot 2019-12-03 14.06.52.png]

The code produces eight figures. Taking **2.Logistic** as a representative, its ROC curve and AUC show that it is quite close to an ideal classifier.
