[PYTHON] Considering the difference between the ROC curve and the PR curve

Introduction

In machine learning classification tasks, the area under the curve (AUC) of the "ROC curve" and of the "Precision-Recall curve" (hereinafter, PR curve) is used to measure an algorithm's performance. To be honest, I had not really distinguished between the two, but @ogamiki's article (here) gives a hint on how to use them properly:

The PR curve is generally suitable when the TN count is likely to be large, or when there are many negative cases. In such cases, the PR curve can express differences in performance more clearly.

I was a little curious about the reason for this, so I gave it some thought.

What are the ROC curve and the PR curve?

For an introduction to the ROC curve and the PR curve, first see the article here.

Both the ROC curve and the PR curve can be viewed as indices of ranking accuracy: when the test samples are sorted in the order in which they are predicted to be positive, are the actually positive samples concentrated at the top?

Ranking   Truth
1         1 (positive)
2         0 (negative)
3         1
4         1
5         0
6         0
7         0

In a ranking like this, there is one misprediction: a sample that is actually negative sits at rank 2, above some of the positive samples. Given such a ranking, we compute the TPR and FPR, or the Precision and Recall, at each cutoff down the ranking.

Ranking   Truth          TPR = Recall   FPR           Precision
1         1 (positive)   1/3 = 0.333    0/4 = 0.000   1/1 = 1.000
2         0 (negative)   1/3 = 0.333    1/4 = 0.250   1/2 = 0.500
3         1              2/3 = 0.667    1/4 = 0.250   2/3 = 0.667
4         1              3/3 = 1.000    1/4 = 0.250   3/4 = 0.750
5         0              3/3 = 1.000    2/4 = 0.500   3/5 = 0.600
6         0              3/3 = 1.000    3/4 = 0.750   3/6 = 0.500
7         0              3/3 = 1.000    4/4 = 1.000   3/7 = 0.429

The ROC curve plots FPR on the horizontal axis against TPR on the vertical axis; the PR curve plots Recall on the horizontal axis against Precision on the vertical axis. In both cases the line is traced in order from the top of the ranking.
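
As a minimal sketch (my addition, not code from the article), the table above can be reproduced with NumPy, and the two AUCs computed with scikit-learn, by encoding the ranking as descending scores:

```python
# Reproduce the per-rank table above, then compute the two AUCs.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([1, 0, 1, 1, 0, 0, 0])   # truth labels in ranking order
tp = np.cumsum(y_true)                      # true positives within top-k
fp = np.cumsum(1 - y_true)                  # false positives within top-k
k = np.arange(1, len(y_true) + 1)

tpr = tp / y_true.sum()                     # = Recall
fpr = fp / (1 - y_true).sum()
precision = tp / k

for row in zip(k, y_true, tpr, fpr, precision):
    print("rank %d  truth %d  TPR %.3f  FPR %.3f  Precision %.3f" % row)

scores = -np.arange(len(y_true))            # descending scores encode the ranking
print("ROC-AUC:", roc_auc_score(y_true, scores))
print("PR-AUC :", average_precision_score(y_true, scores))
```

For this toy ranking the ROC-AUC comes out to about 0.833 and the PR-AUC (via average precision, a standard approximation of the area under the PR curve) to about 0.806.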

[Figure: ROC curve for the sample ranking (sample_roc.png)]

[Figure: PR curve for the sample ranking (sample_pr.png)]

My conclusion about the difference between the ROC curve and the PR curve

Regarding the difference between the ROC curve and the PR curve, my conclusion is as follows.

Intuitively, the AUC of the PR curve magnifies the accuracy of the top of the ranking, as if viewed through a magnifying glass. The sections below explain why, and what practical implications follow.

Qualitative commentary

A hint for thinking about the difference between the ROC curve and the PR curve is that both curves share TPR = Recall as one of their axes. However, TPR = Recall is placed on the vertical axis of the ROC curve and on the horizontal axis of the PR curve. I think this is the key point.

For example, in the ranking above, consider the moment when the TPR reaches 0.667 (rank 3).

Ranking   Truth          TPR = Recall   FPR           Precision
1         1 (positive)   1/3 = 0.333    0/4 = 0.000   1/1 = 1.000
2         0 (negative)   1/3 = 0.333    1/4 = 0.250   1/2 = 0.500
3         1              2/3 = 0.667    1/4 = 0.250   2/3 = 0.667
4         1              3/3 = 1.000    1/4 = 0.250   3/4 = 0.750
5         0              3/3 = 1.000    2/4 = 0.500   3/5 = 0.600
6         0              3/3 = 1.000    3/4 = 0.750   3/6 = 0.500
7         0              3/3 = 1.000    4/4 = 1.000   3/7 = 0.429

When the TPR reaches 0.667, the ROC curve is at the coordinates (0.250, 0.667), and the region up to that point accounts for at most 1/4 of the total attainable area. So no matter how bad the prediction is before reaching (0.250, 0.667), the effect on the ROC-AUC is small. On the PR curve, by contrast, the same moment corresponds to the coordinates (0.667, 0.667), and the region up to that point accounts for up to 2/3 of the total AUC. Doing poorly before reaching (0.667, 0.667) is therefore (2/3) / (1/4) = 8/3 times as influential as it is on the ROC curve.

In fact, here are the ROC and PR curves, with their AUCs, when ranks 1 and 2 in the example are swapped.

[Figure: ROC curve with ranks 1 and 2 swapped (sample_roc_2.png)]

[Figure: PR curve with ranks 1 and 2 swapped (sample_pr_2.png)]
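
To put numbers on this, here is a minimal check of my own (not from the original article) comparing the two rankings with scikit-learn; average_precision_score serves as a standard approximation of the PR-AUC:

```python
# A minimal check (my addition): how does swapping ranks 1 and 2
# affect the ROC-AUC versus the PR-AUC?
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

original = np.array([1, 0, 1, 1, 0, 0, 0])  # truth labels in ranking order
swapped = np.array([0, 1, 1, 1, 0, 0, 0])   # ranks 1 and 2 exchanged

for name, y in (("original", original), ("swapped", swapped)):
    scores = -np.arange(len(y))  # descending scores encode the ranking
    print(f"{name}: ROC-AUC {roc_auc_score(y, scores):.3f}, "
          f"PR-AUC {average_precision_score(y, scores):.3f}")
```

On this toy example the swap costs the ROC-AUC about 0.08 (0.833 to 0.750), but it costs the PR-AUC about twice as much (0.806 to 0.639), consistent with the magnifying-glass intuition.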

Experimental verification

Next, I tried to back up this theory with an experiment. The procedure is as follows (a minimal sketch of the procedure appears after the list).

  1. The program is written in Python, using the "boston house-prices dataset" bundled with scikit-learn as test data. (The entire verification program is here (GitHub).)
  2. The dataset has 506 samples in total; the 84 samples (17%) priced at $30,000 or more are labeled positive, and the rest (under $30,000) are labeled negative.
  3. A logistic regression model predicts this positive/negative label from the explanatory variables, and the samples are ranked in order of prediction score. The ROC-AUC of the curve built from this ranking is 0.985 and the PR-AUC is 0.928 (quite high, because the training data itself is reused for prediction).
  4. A new ranking is then created by randomly shuffling only ranks 0 to 100 of the prediction-score ranking, and the ROC-AUC and PR-AUC are computed in the same way. Since ranks 0 to 100 are now completely random, the AUCs should degrade. (To suppress stochastic fluctuation, each AUC is averaged over 10 shuffles.)
  5. Likewise, AUCs (averaged) are computed for rankings in which ranks 5 to 105, ranks 10 to 110, and so on are reshuffled.
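
A minimal sketch of this procedure follows. It is not the author's script (that is in the linked GitHub repository): the Boston dataset has been removed from recent scikit-learn releases, so a synthetic dataset with a similar size and positive rate (about 17%) stands in for it.

```python
# Sketch of the shuffle experiment on a synthetic stand-in dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# ~17% positives, 506 samples, mimicking the article's setup.
X, y = make_classification(n_samples=506, weights=[0.83], random_state=0)

# Rank all samples by predicted score (training data reused, as in the article).
model = LogisticRegression(max_iter=10_000).fit(X, y)
order = np.argsort(-model.predict_proba(X)[:, 1])  # best rank first

for start in range(0, 200, 5):            # shuffle window [start, start + 100)
    roc, pr = [], []
    for _ in range(10):                    # average over 10 shuffles
        shuffled = order.copy()
        rng.shuffle(shuffled[start:start + 100])
        s = np.empty(len(y))
        s[shuffled] = -np.arange(len(y))   # re-encode the ranking as scores
        roc.append(roc_auc_score(y, s))
        pr.append(average_precision_score(y, s))
    print(f"shuffle from rank {start}: ROC-AUC {np.mean(roc):.3f}, "
          f"PR-AUC {np.mean(pr):.3f}")
```

Plotting the printed averages against the window start should reproduce the qualitative behavior reported below, even though the exact numbers differ from the article's Boston-based run.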

If the hypothesis is correct, the PR-AUC should degrade more than the ROC-AUC when the top of the ranking is shuffled. The verification results are shown in the figure below.

[Figure: AUC degradation versus the rank at which shuffling starts (result.png)]

The horizontal axis shows the rank at which the shuffle window starts, and the vertical axis shows how far the AUC deteriorated as a percentage of its original value. As hypothesized, when the top of the ranking is shuffled (the left side of the graph), the PR-AUC deteriorates noticeably more than the ROC-AUC (by up to about 3%). Conversely, this suggests that the PR-AUC improves much more dramatically than the ROC-AUC when the top of the ranking is predicted accurately.

Conclusion and impression

To repeat the conclusion from the verification above: the AUC of the PR curve magnifies the accuracy of the top of the ranking, as if viewed through a magnifying glass.

This conclusion gave me a real sense of conviction. On the other hand, as I wrote in a comment on @ogamiki's article, the ROC curve and the PR curve also have different practical advantages and disadvantages.

(The following is quoted from that comment.)

1. Axis interpretability

First of all, Precision and Recall are in a trade-off, and the interpretation of the axes is easy to understand even for people who are not familiar with statistics.

For example, suppose you must decide which customers to approach first out of all customers. "High Precision but low Recall" means "little wasted effort, but many good prospects are missed, so opportunity loss is occurring"; "low Precision but high Recall" means "few prospects are missed, but there are many wasted shots, so the approach budget is likely being wasted". In this way, you can discuss the PR curve in business terms.

With ROC, on the contrary, the FPR in particular is hard to grasp; in my experience, no matter how much you explain it, many people never quite get it. In the end, understanding tends to settle at the level of "it is a diagram for measuring accuracy, and the closer the curve hugs the upper left, the better", so the decision-maker's conviction never reaches the level the PR curve achieves.

2. Interpretability of the absolute level

On the other hand, I think ROC has the advantage that it is easier to give a clear meaning to the absolute level of the AUC than it is with the PR curve. For any prediction problem, the ROC-AUC has a maximum value of 1, and random prediction yields 0.5. The maximum of the PR-AUC is likewise 1, but its value under random prediction depends on the ratio of positive to negative examples in the problem (it is approximately the fraction of positive examples).

If someone says "I got a ROC-AUC of 0.9!", you can answer "that was a good prediction" for any problem; but if someone says "I got a PR-AUC of 0.4!", it is hard to judge how good that is without some additional information.

In that sense, I think the ROC curve is better suited as a common language in settings where you must judge in a short time whether a prediction is sufficiently accurate.

(End of quote.)
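
To illustrate the quoted point about random baselines (my addition, not part of the quote), here is a quick check: for random scores the ROC-AUC settles near 0.5 regardless of class balance, while the PR-AUC settles near the positive rate.

```python
# Random prediction scores: ROC-AUC ~ 0.5 for any class balance,
# PR-AUC ~ the positive rate.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 100_000
for pos_rate in (0.5, 0.17, 0.01):
    y = (rng.random(n) < pos_rate).astype(int)
    scores = rng.random(n)  # random prediction scores
    print(f"positive rate {pos_rate:.2f}: "
          f"ROC-AUC {roc_auc_score(y, scores):.3f}, "
          f"PR-AUC {average_precision_score(y, scores):.3f}")
```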

Which metric is adopted as the common language for classification accuracy can matter more than which algorithm is used. I hope this article gives a sense of conviction to those who fight on the front lines of analysis.
