[PYTHON] Why you can't use confusion matrix evaluations such as Accuracy

It's been a while since my last post, but I'll keep updating this blog from time to time. Metrics such as Accuracy are often used to evaluate the accuracy of machine learning models, yet they are rarely used in the financial field. Before explaining why, let's review the confusion matrix and metrics such as Accuracy. If you don't need the explanation, skip ahead to the problems and the conclusion.

Confusion Matrix

The confusion matrix is a matrix of predictions versus actual results in binary classification. A binary classifier outputs a predicted probability and a predicted class derived from it. In credit scoring, for example, the model outputs the probability of a payment being overdue and a 0/1 classification of whether it will be overdue. The matrix of matches between predictions and results looks like this:

| | Forecast (Not overdue) - Positive | Forecast (Overdue) - Negative |
| --- | --- | --- |
| Result (Not overdue) - Positive | TP (True Positive) | FN (False Negative) |
| Result (Overdue) - Negative | FP (False Positive) | TN (True Negative) |

If we predict that a payment will not be overdue and it is indeed not overdue, that case is a TP; if we predict it will not be overdue but it ends up overdue, it is an FP. Conversely, predicting overdue for a payment that turns out not to be overdue is an FN. Whenever the prediction matches the result, the case is a TP or a TN.
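
As an illustration, here is a minimal sketch of how the matrix above can be computed with scikit-learn. The labels are made up: 1 stands for Positive (not overdue), 0 for Negative (overdue), and `labels=[1, 0]` orders the rows and columns the same way as the table.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical example: 1 = Positive (not overdue), 0 = Negative (overdue)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # actual results
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]   # predicted classes

# labels=[1, 0] puts Positive first, matching the table above:
# rows are actual results, columns are predictions
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = cm

print(cm)
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```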

Evaluation metrics

The following evaluation metrics are built from the confusion matrix, and each has its own characteristics. In general, the higher the value, the better the performance is considered to be. (A code sketch of these formulas follows the list.)

  1. Accuracy (accuracy rate) The most commonly used metric; it shows what fraction of all predictions matched the results.

    Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

  2. Precision The fraction of cases predicted to be Positive (not overdue) that actually turned out to be Positive.

    Precision = \frac{TP}{TP + FP}

  3. Recall The fraction of actually Positive cases that were correctly predicted as Positive.

    Recall = \frac{TP}{TP + FN}

  4. Specificity The fraction of actually Negative cases that were correctly predicted as Negative.

    Specificity = \frac{TN}{FP + TN}
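
Here is a minimal sketch of these four formulas as plain Python functions, taking the TP/FP/FN/TN counts from a confusion matrix such as the one computed above; the counts at the bottom are placeholders.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all cases where the prediction matched the result."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """Fraction of predicted Positives that were actually Positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual Positives that were predicted Positive."""
    return tp / (tp + fn)

def specificity(fp, tn):
    """Fraction of actual Negatives that were predicted Negative."""
    return tn / (fp + tn)

# Placeholder counts for illustration
tp, fp, fn, tn = 4, 1, 1, 2
print(accuracy(tp, fp, fn, tn), precision(tp, fp), recall(tp, fn), specificity(fp, tn))
```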

Problems and characteristics of the evaluation metrics

For example, suppose the prediction results are as follows:

| | Forecast (Not overdue) - Positive | Forecast (Overdue) - Negative |
| --- | --- | --- |
| Result (Not overdue) - Positive | 980 (TP) | 0 (FN) |
| Result (Overdue) - Negative | 20 (FP) | 0 (TN) |

Accuracy = \frac{TP + TN}{TP + FP + FN + TN} = \frac{980}{1000} = 0.98

so the Accuracy value is high. However, looking at the breakdown, every single prediction is Positive, and not one of the Negative results was predicted. Depending on how biased the data is, Accuracy can be high even if every case is blindly predicted to be Positive. Calculating the Specificity instead,

Specificity = \frac{TN}{FP + TN} = \frac{0}{20} = 0

you can see that the model never predicts Negative at all.
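
The same calculation in code, using the counts from the table above, shows how the two metrics disagree:

```python
# Counts from the biased example above
tp, fn, fp, tn = 980, 0, 20, 0

accuracy = (tp + tn) / (tp + fp + fn + tn)
specificity = tn / (fp + tn)

print(f"Accuracy:    {accuracy:.2f}")    # 0.98 - looks excellent
print(f"Specificity: {specificity:.2f}")  # 0.00 - no Negative case was caught
```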

Other metrics

There are several evaluation metrics, each with its own characteristics, which tends to complicate evaluation. For that reason, a metric called the F value (F-score, F1 score, F-measure) is sometimes used: it is the harmonic mean of Precision and Recall.

F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
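
As a quick sketch, the F1 score can be computed from Precision and Recall directly, or with scikit-learn's `f1_score` on the raw labels; the label lists below are the same made-up ones used earlier.

```python
from sklearn.metrics import f1_score

def f1(precision, recall):
    """Harmonic mean of Precision and Recall."""
    return 2 * precision * recall / (precision + recall)

# Same made-up labels as before: 1 = Positive (not overdue), 0 = Negative (overdue)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

print(f1(precision=4 / 5, recall=4 / 5))      # from TP=4, FP=1, FN=1
print(f1_score(y_true, y_pred, pos_label=1))  # same result via scikit-learn
```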

Problems

I have covered various metrics that use the confusion matrix, but in practice I do not use them. All of them reduce the problem to a two-class split into Positive and Negative and then measure how correct that split is. The classification itself works by comparing the predicted probability with a threshold: below the threshold the case is classified as Positive, above it as Negative.

The first problem is biased data. If the original data is 99% Positive and 1% Negative, the metric values tend to be distorted, as in the example above.

The second problem is setting the threshold. The ratio of Positive to Negative obviously changes depending on where the threshold is placed, but whether a given threshold is appropriate tends to be ambiguous. In credit scoring, for example, suppose company A refuses contracts when the predicted probability of being overdue is 5% or more, so it sets the threshold at 5%, while company B uses 10% and likewise refuses contracts for cases classified as Negative. It is not the case that everyone classified as Positive pays on time; some of them will still become overdue. Even when all the available information (attributes, transaction history, and so on) is identical, some customers end up overdue and others do not. What matters most in credit scoring is product design that matches the degree of risk, so the important output is the predicted probability (the degree of risk), not whether a case is classified as Positive or Negative.
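
To make the threshold issue concrete, here is a small sketch with made-up predicted probabilities of being overdue, showing how a 5% threshold (company A) and a 10% threshold (company B) split the same customers differently:

```python
# Hypothetical predicted probabilities of being overdue for five customers
overdue_prob = [0.02, 0.04, 0.07, 0.12, 0.30]

def classify(probabilities, threshold):
    """Positive (not overdue, contract) if the overdue probability is below
    the threshold, Negative (overdue, no contract) otherwise."""
    return ["Positive" if p < threshold else "Negative" for p in probabilities]

print(classify(overdue_prob, threshold=0.05))  # company A: 5% threshold
print(classify(overdue_prob, threshold=0.10))  # company B: 10% threshold
# The customer with a 7% risk flips between Positive and Negative, even though
# the underlying probability (the degree of risk) is exactly the same.
```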

Conclusion

Of course, the metrics above can matter when the goal is simply to classify cases into two groups as accurately as possible, without considering the degree of risk. The point is that if you do not choose the metric according to how the model will be used, you end up measuring the accuracy of the learning model against a meaningless number and judging it good or bad on that basis. Learn how to evaluate your model and how to explain why it made its predictions. See also: A summary of how to evaluate and explain machine-learned models for creating an understood POC.
