[PYTHON] Considering the difference between the ROC curve and the PR curve

Introduction

In machine learning classification tasks, the area under the curve (AUC) of the "ROC curve" and of the "Precision-Recall curve" (hereinafter, PR curve) is used to measure an algorithm's performance. To be honest, I had not really distinguished between the two, but @ogamiki's article (here) gives a hint on how to use them properly:

The PR curve is generally suitable when the TN count is likely to be large, or when there are many negative cases. In such cases, the PR curve can express differences in performance more clearly.

I was a little curious about the reason for this, so I gave it some thought.

What are the ROC curve and the PR curve?

For an introduction to the ROC curve and the PR curve, first see the article here.

Both the ROC curve and the PR curve can be viewed as indices of ranking accuracy: when the test samples are sorted in the order in which they are predicted to be positive, are the actually positive samples concentrated at the top?

Ranking   Truth
1         1 (positive)
2         0 (negative)
3         1
4         1
5         0
6         0
7         0

In a ranking like this, there is one misprediction: a sample that is actually negative sits at rank 2, above some of the positive samples. Given such a ranking, we compute the TPR and FPR, or the Precision and Recall, at each cutoff down the ranking.

Ranking   Truth          TPR = Recall   FPR           Precision
1         1 (positive)   1/3 = 0.333    0/4 = 0.000   1/1 = 1.000
2         0 (negative)   1/3 = 0.333    1/4 = 0.250   1/2 = 0.500
3         1              2/3 = 0.667    1/4 = 0.250   2/3 = 0.667
4         1              3/3 = 1.000    1/4 = 0.250   3/4 = 0.750
5         0              3/3 = 1.000    2/4 = 0.500   3/5 = 0.600
6         0              3/3 = 1.000    3/4 = 0.750   3/6 = 0.500
7         0              3/3 = 1.000    4/4 = 1.000   3/7 = 0.429

The ROC curve plots FPR on the horizontal axis against TPR on the vertical axis; the PR curve plots Recall on the horizontal axis against Precision on the vertical axis. In both cases the line is traced in order from the top of the ranking.
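
As a minimal sketch (my addition, not code from the article), the table above can be reproduced with NumPy, and the two AUCs computed with scikit-learn, by encoding the ranking as descending scores:

```python
# Reproduce the per-rank table above, then compute the two AUCs.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([1, 0, 1, 1, 0, 0, 0])   # truth labels in ranking order
tp = np.cumsum(y_true)                      # true positives within top-k
fp = np.cumsum(1 - y_true)                  # false positives within top-k
k = np.arange(1, len(y_true) + 1)

tpr = tp / y_true.sum()                     # = Recall
fpr = fp / (1 - y_true).sum()
precision = tp / k

for row in zip(k, y_true, tpr, fpr, precision):
    print("rank %d  truth %d  TPR %.3f  FPR %.3f  Precision %.3f" % row)

scores = -np.arange(len(y_true))            # descending scores encode the ranking
print("ROC-AUC:", roc_auc_score(y_true, scores))
print("PR-AUC :", average_precision_score(y_true, scores))
```

For this toy ranking the ROC-AUC comes out to about 0.833 and the PR-AUC (via average precision, a standard approximation of the area under the PR curve) to about 0.806.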

[Figure: ROC curve for the sample ranking (sample_roc.png)]

[Figure: PR curve for the sample ranking (sample_pr.png)]

My conclusion about the difference between the ROC curve and the PR curve

Regarding the difference between the ROC curve and the PR curve, my conclusion is as follows.

Intuitively, the AUC of the PR curve magnifies the accuracy of the top of the ranking, as if viewed through a magnifying glass. The sections below explain why, and what practical implications follow.

Qualitative commentary

A hint for thinking about the difference between the ROC curve and the PR curve is that both curves share TPR = Recall as one of their axes. However, TPR = Recall is placed on the vertical axis of the ROC curve and on the horizontal axis of the PR curve. I think this is the key point.

For example, in the ranking above, consider the moment when the TPR reaches 0.667 (rank 3).

Ranking   Truth          TPR = Recall   FPR           Precision
1         1 (positive)   1/3 = 0.333    0/4 = 0.000   1/1 = 1.000
2         0 (negative)   1/3 = 0.333    1/4 = 0.250   1/2 = 0.500
3         1              2/3 = 0.667    1/4 = 0.250   2/3 = 0.667
4         1              3/3 = 1.000    1/4 = 0.250   3/4 = 0.750
5         0              3/3 = 1.000    2/4 = 0.500   3/5 = 0.600
6         0              3/3 = 1.000    3/4 = 0.750   3/6 = 0.500
7         0              3/3 = 1.000    4/4 = 1.000   3/7 = 0.429

When the TPR reaches 0.667, the ROC curve is at the coordinates (0.250, 0.667), and the region up to that point accounts for at most 1/4 of the total attainable area. So no matter how bad the prediction is before reaching (0.250, 0.667), the effect on the ROC-AUC is small. On the PR curve, by contrast, the same moment corresponds to the coordinates (0.667, 0.667), and the region up to that point accounts for up to 2/3 of the total AUC. Doing poorly before reaching (0.667, 0.667) is therefore (2/3) / (1/4) = 8/3 times as influential as it is on the ROC curve.

In fact, here are the ROC and PR curves, with their AUCs, when ranks 1 and 2 in the example are swapped.

[Figure: ROC curve with ranks 1 and 2 swapped (sample_roc_2.png)]

[Figure: PR curve with ranks 1 and 2 swapped (sample_pr_2.png)]
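
To put numbers on this, here is a minimal check of my own (not from the original article) comparing the two rankings with scikit-learn; average_precision_score serves as a standard approximation of the PR-AUC:

```python
# A minimal check (my addition): how does swapping ranks 1 and 2
# affect the ROC-AUC versus the PR-AUC?
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

original = np.array([1, 0, 1, 1, 0, 0, 0])  # truth labels in ranking order
swapped = np.array([0, 1, 1, 1, 0, 0, 0])   # ranks 1 and 2 exchanged

for name, y in (("original", original), ("swapped", swapped)):
    scores = -np.arange(len(y))  # descending scores encode the ranking
    print(f"{name}: ROC-AUC {roc_auc_score(y, scores):.3f}, "
          f"PR-AUC {average_precision_score(y, scores):.3f}")
```

On this toy example the swap costs the ROC-AUC about 0.08 (0.833 to 0.750), but it costs the PR-AUC about twice as much (0.806 to 0.639), consistent with the magnifying-glass intuition.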

Experimental verification

Next, I tried to back up this theory with an experiment. The procedure is as follows (a minimal sketch of the procedure appears after the list).

  1. The program is written in Python, using the "boston house-prices dataset" bundled with scikit-learn as test data. (The entire verification program is here (GitHub).)
  2. The dataset has 506 samples in total; the 84 samples (17%) priced at $30,000 or more are labeled positive, and the rest (under $30,000) are labeled negative.
  3. A logistic regression model predicts this positive/negative label from the explanatory variables, and the samples are ranked in order of prediction score. The ROC-AUC of the curve built from this ranking is 0.985 and the PR-AUC is 0.928 (quite high, because the training data itself is reused for prediction).
  4. A new ranking is then created by randomly shuffling only ranks 0 to 100 of the prediction-score ranking, and the ROC-AUC and PR-AUC are computed in the same way. Since ranks 0 to 100 are now completely random, the AUCs should degrade. (To suppress stochastic fluctuation, each AUC is averaged over 10 shuffles.)
  5. Likewise, AUCs (averaged) are computed for rankings in which ranks 5 to 105, ranks 10 to 110, and so on are reshuffled.
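
A minimal sketch of this procedure follows. It is not the author's script (that is in the linked GitHub repository): the Boston dataset has been removed from recent scikit-learn releases, so a synthetic dataset with a similar size and positive rate (about 17%) stands in for it.

```python
# Sketch of the shuffle experiment on a synthetic stand-in dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# ~17% positives, 506 samples, mimicking the article's setup.
X, y = make_classification(n_samples=506, weights=[0.83], random_state=0)

# Rank all samples by predicted score (training data reused, as in the article).
model = LogisticRegression(max_iter=10_000).fit(X, y)
order = np.argsort(-model.predict_proba(X)[:, 1])  # best rank first

for start in range(0, 200, 5):            # shuffle window [start, start + 100)
    roc, pr = [], []
    for _ in range(10):                    # average over 10 shuffles
        shuffled = order.copy()
        rng.shuffle(shuffled[start:start + 100])
        s = np.empty(len(y))
        s[shuffled] = -np.arange(len(y))   # re-encode the ranking as scores
        roc.append(roc_auc_score(y, s))
        pr.append(average_precision_score(y, s))
    print(f"shuffle from rank {start}: ROC-AUC {np.mean(roc):.3f}, "
          f"PR-AUC {np.mean(pr):.3f}")
```

Plotting the printed averages against the window start should reproduce the qualitative behavior reported below, even though the exact numbers differ from the article's Boston-based run.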

If the hypothesis is correct, the PR-AUC should degrade more than the ROC-AUC when the top of the ranking is shuffled. The verification results are shown in the figure below.

[Figure: AUC degradation versus the rank at which shuffling starts (result.png)]

The horizontal axis shows the rank at which the shuffle window starts, and the vertical axis shows how far the AUC deteriorated as a percentage of its original value. As hypothesized, when the top of the ranking is shuffled (the left side of the graph), the PR-AUC deteriorates noticeably more than the ROC-AUC (by up to about 3%). Conversely, this suggests that the PR-AUC improves much more dramatically than the ROC-AUC when the top of the ranking is predicted accurately.

Conclusion and impression

To repeat the conclusion from the verification above: the AUC of the PR curve magnifies the accuracy of the top of the ranking, as if viewed through a magnifying glass.

This conclusion gave me a real sense of conviction. On the other hand, as I wrote in a comment on @ogamiki's article, the ROC curve and the PR curve also have different practical advantages and disadvantages.

(The following is quoted from that comment.)

1. Axis interpretability

First of all, Precision and Recall are in a trade-off, and the interpretation of the axes is easy to understand even for people who are not familiar with statistics.

For example, suppose you must decide which customers to approach first out of all customers. "High Precision but low Recall" means "little wasted effort, but many good prospects are missed, so opportunity loss is occurring"; "low Precision but high Recall" means "few prospects are missed, but there are many wasted shots, so the approach budget is likely being wasted". In this way, you can discuss the PR curve in business terms.

With ROC, on the contrary, the FPR in particular is hard to grasp; in my experience, no matter how much you explain it, many people never quite get it. In the end, understanding tends to settle at the level of "it is a diagram for measuring accuracy, and the closer the curve hugs the upper left, the better", so the decision-maker's conviction never reaches the level the PR curve achieves.

2. Interpretability of the absolute level

On the other hand, I think ROC has the advantage that it is easier to give a clear meaning to the absolute level of the AUC than it is with the PR curve. For any prediction problem, the ROC-AUC has a maximum value of 1, and random prediction yields 0.5. The maximum of the PR-AUC is likewise 1, but its value under random prediction depends on the ratio of positive to negative examples in the problem (it is approximately the fraction of positive examples).

If someone says "I got a ROC-AUC of 0.9!", you can answer "that was a good prediction" for any problem; but if someone says "I got a PR-AUC of 0.4!", it is hard to judge how good that is without some additional information.

In that sense, I think the ROC curve is better suited as a common language in settings where you must judge in a short time whether a prediction is sufficiently accurate.

(End of quote.)
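
To illustrate the quoted point about random baselines (my addition, not part of the quote), here is a quick check: for random scores the ROC-AUC settles near 0.5 regardless of class balance, while the PR-AUC settles near the positive rate.

```python
# Random prediction scores: ROC-AUC ~ 0.5 for any class balance,
# PR-AUC ~ the positive rate.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 100_000
for pos_rate in (0.5, 0.17, 0.01):
    y = (rng.random(n) < pos_rate).astype(int)
    scores = rng.random(n)  # random prediction scores
    print(f"positive rate {pos_rate:.2f}: "
          f"ROC-AUC {roc_auc_score(y, scores):.3f}, "
          f"PR-AUC {average_precision_score(y, scores):.3f}")
```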

Which metric is adopted as the common language for classification accuracy can matter more than which algorithm is used. I hope this article gives a sense of conviction to those who fight on the front lines of analysis.
