[PYTHON] [Scikit-learn] I played with the ROC curve

0. Introduction

In machine learning, the **ROC curve** and the **AUC (Area Under the Curve)** of the ROC curve are used as indicators of how good a classifier is when it separates two classes.

Roughly speaking, the ROC curve shows how well the classifier separates the score distributions of the two classes. The AUC condenses the curve into a single number, so you can also use it to compare multiple ROC curves.

- Easy-to-understand explanation of the meaning and properties of AUC and ROC curves - Mathematics learned with concrete examples
- [Machine learning evaluation index - ROC curve and AUC](https://techblog.gmo-ap.jp/2018/12/14/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E8%A9%95%E4%BE%A1%E6%8C%87%E6%A8%99-roc%E6%9B%B2%E7%B7%9A%E3%81%A8auc/)

Whether a ROC curve can be drawn depends on the model used for training. A model whose output (return value) is a probability, for example from model.predict() or a similar method, can be used to draw a ROC curve. A model that only outputs binary labels does not give a useful ROC curve, because there is no score threshold to vary.
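As a rough illustration of the difference, the sketch below trains a logistic regression on synthetic data and passes both kinds of output to roc_curve(). The model and the dataset are assumptions chosen only for this example and are not part of the original experiment. In scikit-learn, class probabilities usually come from predict_proba(), while predict() returns hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic binary data; a hypothetical stand-in for a real dataset
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]  # probability of class 1
labels = model.predict(X)             # hard 0/1 labels

fpr_p, tpr_p, _ = roc_curve(y, proba)
fpr_l, tpr_l, _ = roc_curve(y, labels)

# Probabilities give many operating points; hard labels give only
# (0, 0), one real operating point, and (1, 1)
print(len(fpr_p), len(fpr_l))
```

With hard labels the "curve" is just two straight segments through a single point, which is why a probabilistic output is assumed in the rest of this article.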

This time, I'm going to play with this ROC curve using scikit-learn.

1. Prepare some data for now

As mentioned at the beginning, a ROC curve can be drawn only when the output is a probability, so we assume such an output here. That gives us a y_true and a y_pred like the ones below.

In[1]


%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics

In[2]


y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

y_pred = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.65, 0.7,
         0.35, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.9]

df = pd.DataFrame({'y_true':y_true, 'y_pred':y_pred})
df

image.png

The data frame looks like the above.

Let's visualize the distribution of y_pred for each class.

In[3]


x0 = df[df['y_true']==0]['y_pred']
x1 = df[df['y_true']==1]['y_pred']

fig = plt.figure(figsize=(6, 5))  # Create a 6 x 5 inch figure
ax = fig.add_subplot(1, 1, 1)
ax.hist([x0, x1], bins=10, stacked=True)

plt.xticks(np.arange(0, 1.1, 0.1), fontsize = 13)  # Tick positions are easy to generate with np.arange
plt.yticks(np.arange(0, 6, 1), fontsize = 13)

plt.ylim(0, 4)
plt.show()

image.png

The result looks like this. As is often the case, it is an example where the two distributions partially overlap.

2. Draw an ROC curve

ROC: The documentation for roc_curve() is here (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html). It has three return values.

  1. fpr (False Positive Rate)
  2. tpr (True Positive Rate)
  3. thres (Thresholds)

Once you fix a threshold and classify each sample as positive or negative based on it, you can compute the false positive rate and the true positive rate. The three return values above are arrays that hold these values for each threshold.
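To make the threshold idea concrete, here is a small sketch that computes FPR and TPR by hand for one threshold on the data from section 1 and compares the result with roc_curve(). The threshold of 0.5 is an arbitrary value chosen for illustration, and the sketch assumes the usual convention that a sample is predicted positive when its score is greater than or equal to the threshold.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0] * 10 + [1] * 10)
y_pred = np.array([0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.65, 0.7,
                   0.35, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.9])

threshold = 0.5  # arbitrary example value
predicted_positive = y_pred >= threshold

# FPR = false positives / all negatives, TPR = true positives / all positives
fpr_manual = np.sum(predicted_positive & (y_true == 0)) / np.sum(y_true == 0)
tpr_manual = np.sum(predicted_positive & (y_true == 1)) / np.sum(y_true == 1)
print(fpr_manual, tpr_manual)

# The same point taken from roc_curve; keep every threshold so 0.5 is present
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred, drop_intermediate=False)
i = np.flatnonzero(np.isclose(thres, threshold))[0]
print(fpr[i], tpr[i])
```

Both lines should print the same pair, since roc_curve() simply repeats this calculation for every distinct score.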

AUC: auc() computes the area under the ROC curve obtained above. It returns a value from 0 to 1.

In[4]


fpr, tpr, thres = metrics.roc_curve(y_true, y_pred)
auc = metrics.auc(fpr, tpr)
print('auc:', auc)

Out[4]


auc: 0.8400000000000001
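As a cross-check of this value, AUC can also be read as the fraction of (positive, negative) pairs in which the positive sample gets the higher score, counting ties as one half. The sketch below computes that by brute force and also calls metrics.roc_auc_score(), which goes from labels and scores to AUC in one step.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0] * 10 + [1] * 10)
y_pred = np.array([0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.65, 0.7,
                   0.35, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.9])

neg = y_pred[y_true == 0]
pos = y_pred[y_true == 1]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc)
print(metrics.roc_auc_score(y_true, y_pred))
```

Both values should agree with the result of metrics.auc(fpr, tpr) above, up to floating-point rounding.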

In[5]


plt.figure(figsize = (5, 5))  # Set the figure size for a single graph
plt.plot(fpr, tpr, marker='o')
plt.xlabel('FPR: False Positive Rate', fontsize = 13)
plt.ylabel('TPR: True Positive Rate', fontsize = 13)
plt.grid()
plt.show()

image.png

Now you can draw the ROC curve.

3. Try different distributions

Now, let's draw a ROC curve with various distributions.

3.1. Completely separable distribution (AUC = 1.0)

Let's draw a ROC curve with a distribution that can be completely separated by setting a certain threshold.

In[6]


y_pred = [0, 0.15, 0.2, 0.2, 0.25,
          0.3, 0.35, 0.4, 0.4, 0.45,
          0.5, 0.55, 0.55, 0.65, 0.7,
          0.75, 0.8, 0.85, 0.9, 0.95]
          
y_true = [0, 0, 0, 0, 0,
          0, 0, 0, 0, 0,
          1, 1, 1, 1, 1,
          1, 1, 1, 1, 1]

df = pd.DataFrame({'y_true':y_true, 'y_pred':y_pred})
x0 = df[df['y_true']==0]['y_pred']
x1 = df[df['y_true']==1]['y_pred']

In[7]


# Calculate FPR, TPR, thresholds, and AUC
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred)
auc = metrics.auc(fpr, tpr)

# Create the figure (fig)
fig = plt.figure(figsize = (12, 4))
fig.suptitle(' AUC = ' + str(auc), fontsize = 16)
fig.subplots_adjust(wspace=0.5, hspace=0.6)  # Adjust the spacing between graphs

# Create the graph on the left (ax1): histogram of the scores
ax1 = fig.add_subplot(1, 2, 1)
ax1.hist([x0, x1], bins=10, stacked = True)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 5)

# Create the graph on the right (ax2): ROC curve
ax2 = fig.add_subplot(1, 2, 2)
ax2.plot(fpr, tpr, marker='o')
ax2.set_xlabel('FPR: False Positive Rate', fontsize = 13)
ax2.set_ylabel('TPR: True Positive Rate', fontsize = 13)
ax2.set_aspect('equal')
ax2.grid()

plt.show()

# Alternative: create the figure and both axes at once
## fig, ax = plt.subplots(1, 2, figsize=(6, 4))
## ...

# Left side
## ax[0].set_
## ...

# Right side
## ax[1].set_
## ...  # This style is also possible

image.png

The ROC curve looks like this.
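A quick way to confirm that this data is completely separable is to check that the lowest score among the positives is above the highest score among the negatives. A minimal sketch, reusing the y_true and y_pred from this section:

```python
import numpy as np
from sklearn import metrics

y_pred = np.array([0, 0.15, 0.2, 0.2, 0.25,
                   0.3, 0.35, 0.4, 0.4, 0.45,
                   0.5, 0.55, 0.55, 0.65, 0.7,
                   0.75, 0.8, 0.85, 0.9, 0.95])
y_true = np.array([0] * 10 + [1] * 10)

lowest_positive = y_pred[y_true == 1].min()
highest_negative = y_pred[y_true == 0].max()

# Any threshold between these two values separates the classes perfectly
print(highest_negative, lowest_positive)
print(metrics.roc_auc_score(y_true, y_pred))
```

When such a gap exists, every positive outranks every negative, which is exactly the case where the ROC curve passes through the top-left corner.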

3.2. Distribution that is very difficult to separate (AUC ≈ 0.5)

Next, let's draw a ROC curve with a distribution that is difficult to separate.

In[8]


y_pred = [0, 0.15, 0.2, 0.2, 0.25,
          0.3, 0.35, 0.4, 0.4, 0.45,
          0.5, 0.55, 0.55, 0.65, 0.7,
          0.75, 0.8, 0.85, 0.9, 0.95]
          
y_true = [0, 1, 0, 1, 0,
          1, 0, 1, 0, 1,
          0, 1, 0, 1, 0,
          1, 0, 1, 0, 1]

df = pd.DataFrame({'y_true':y_true, 'y_pred':y_pred})
x0 = df[df['y_true']==0]['y_pred']
x1 = df[df['y_true']==1]['y_pred']

In[9]


# Calculate FPR, TPR, thresholds, and AUC
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred)
auc = metrics.auc(fpr, tpr)

# Create the figure (fig)
fig = plt.figure(figsize = (12, 4))
fig.suptitle(' AUC = ' + str(auc), fontsize = 16)
fig.subplots_adjust(wspace=0.5, hspace=0.6)  # Adjust the spacing between graphs

# Create the graph on the left (ax1): histogram of the scores
ax1 = fig.add_subplot(1, 2, 1)
ax1.hist([x0, x1], bins=10, stacked = True)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 5)

# Create the graph on the right (ax2): ROC curve
ax2 = fig.add_subplot(1, 2, 2)
ax2.plot(fpr, tpr, marker='o')
ax2.set_xlabel('FPR: False Positive Rate', fontsize = 13)
ax2.set_ylabel('TPR: True Positive Rate', fontsize = 13)
ax2.set_aspect('equal')
ax2.grid()

plt.show()

image.png

The ROC curve looks like this.
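Sections 3.1 and 3.2 use the same plotting code twice, so one option is to wrap it in a small helper. The function name plot_hist_and_roc below is hypothetical and introduced only for this sketch; it draws the histogram and the ROC curve side by side and returns the figure together with the AUC.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import metrics


def plot_hist_and_roc(y_true, y_pred):
    """Draw the score histogram and the ROC curve; return (fig, auc)."""
    fpr, tpr, _ = metrics.roc_curve(y_true, y_pred)
    auc = metrics.auc(fpr, tpr)

    # Split the scores by true class for the histogram
    df = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
    x0 = df[df['y_true'] == 0]['y_pred']
    x1 = df[df['y_true'] == 1]['y_pred']

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    fig.suptitle('AUC = ' + str(auc), fontsize=16)

    ax1.hist([x0, x1], bins=10, stacked=True)
    ax1.set_xlim(0, 1)
    ax1.set_ylim(0, 5)

    ax2.plot(fpr, tpr, marker='o')
    ax2.set_xlabel('FPR: False Positive Rate', fontsize=13)
    ax2.set_ylabel('TPR: True Positive Rate', fontsize=13)
    ax2.set_aspect('equal')
    ax2.grid()
    return fig, auc


y_pred = [0, 0.15, 0.2, 0.2, 0.25,
          0.3, 0.35, 0.4, 0.4, 0.45,
          0.5, 0.55, 0.55, 0.65, 0.7,
          0.75, 0.8, 0.85, 0.9, 0.95]
separable = [0] * 10 + [1] * 10   # labels from section 3.1
alternating = [0, 1] * 10         # labels from section 3.2

fig1, auc1 = plot_hist_and_roc(separable, y_pred)
fig2, auc2 = plot_hist_and_roc(alternating, y_pred)
print(auc1, auc2)
```

In a notebook with %matplotlib inline the two figures appear at the end of the cell; in a script, call plt.show() afterwards. Returning the AUC also makes it easy to compare the two distributions numerically.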

4. Examine the return value of the function roc_curve()

Let's examine what roc_curve() returns. As explained earlier, it returns fpr, tpr, and thresholds. The 0th threshold is 1.95, which is the 1st threshold (0.95) plus 1; it seems to be added so that the result includes the point where both fpr and tpr are 0. (In scikit-learn 1.3 and later, this first threshold is np.inf instead.)

In[10]


print(fpr.shape, tpr.shape, thres.shape)
ROC_df = pd.DataFrame({'fpr':fpr, 'tpr':tpr, 'thresholds':thres})
ROC_df

image.png
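The extra first entry can be checked directly. This sketch assumes that its value depends on the installed scikit-learn version (max(y_pred) + 1 in older releases, np.inf from 1.3 onward), so the code reports which case applies rather than hard-coding one.

```python
import numpy as np
from sklearn import metrics

y_pred = [0, 0.15, 0.2, 0.2, 0.25,
          0.3, 0.35, 0.4, 0.4, 0.45,
          0.5, 0.55, 0.55, 0.65, 0.7,
          0.75, 0.8, 0.85, 0.9, 0.95]
y_true = [0, 1] * 10  # labels from section 3.2

fpr, tpr, thres = metrics.roc_curve(y_true, y_pred)

# The first point is always (fpr, tpr) = (0, 0): nothing is predicted positive
print(fpr[0], tpr[0])

# The matching threshold only has to exceed every score
if np.isinf(thres[0]):
    print('first threshold is inf')
else:
    print('first threshold is max(y_pred) + 1 =', thres[0])
print(thres[0] > max(y_pred))
```

Either way, the first threshold is an artificial value rather than one of the scores, so it is usually skipped when picking an operating threshold.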

Next, using the data from section 3.1, let's look at the drop_intermediate argument. According to the scikit-learn documentation it is True by default, which removes points that do not affect the shape of the ROC curve; set it to False to keep every threshold.

In[11]


y_pred = [0, 0.15, 0.2, 0.2, 0.25,
          0.3, 0.35, 0.4, 0.4, 0.45,
          0.5, 0.55, 0.55, 0.65, 0.7,
          0.75, 0.8, 0.85, 0.9, 0.95]
          
y_true = [0, 0, 0, 0, 0,
          0, 0, 0, 0, 0,
          1, 1, 1, 1, 1,
          1, 1, 1, 1, 1]

fpr, tpr, thres = metrics.roc_curve(y_true, y_pred, drop_intermediate=True)
print(fpr.shape, tpr.shape, thres.shape)

Out[11]


(10,) (10,) (10,)

The data has 17 distinct scores, but only 10 points are returned, so the actual number of points is indeed reduced.
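To see the effect of the argument side by side, the sketch below runs roc_curve() on the same data with drop_intermediate set to False and to True, then compares the number of points and the resulting AUC. It assumes a scikit-learn version whose roc_curve() accepts this keyword, which has been the case for a long time.

```python
from sklearn import metrics

y_pred = [0, 0.15, 0.2, 0.2, 0.25,
          0.3, 0.35, 0.4, 0.4, 0.45,
          0.5, 0.55, 0.55, 0.65, 0.7,
          0.75, 0.8, 0.85, 0.9, 0.95]
y_true = [0] * 10 + [1] * 10

# Each call returns (fpr, tpr, thresholds)
kept = metrics.roc_curve(y_true, y_pred, drop_intermediate=False)
dropped = metrics.roc_curve(y_true, y_pred, drop_intermediate=True)

print(len(kept[0]), len(dropped[0]))

# Removing collinear points does not change the area under the curve
print(metrics.auc(kept[0], kept[1]), metrics.auc(dropped[0], dropped[1]))
```

On this data the first line should show 18 points without dropping and 10 with it, matching Out[11], while the AUC stays the same.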

5. Summary

This time, I went through how to draw and inspect ROC curves when visualizing machine learning results.

Questions and article requests are welcome!
