Machine learning classification tasks have several performance metrics depending on their purpose. The AUC (area under the curve) of the ROC curve and the PR curve is a standard evaluation metric for binary classification, but as a first step toward understanding it, this article covers "recall" and "precision".
I referred to the following in understanding precision and recall.
- Discussion on the difference between ROC curve and PR curve
- Hands-On Machine Learning with Scikit-Learn and TensorFlow
This article explains performance evaluation methods using a concrete document classification task as an example. As a first step, this section briefly describes how the classification task is performed, but since this is not an article about the classification task itself, a detailed explanation of the model is omitted.
This time, the dataset is the "livedoor news corpus". Please refer to the previously posted article for details of the dataset and the morphological analysis method.
In the case of Japanese, preprocessing that decomposes sentences into morphemes is required in advance, so after decomposing all sentences into morphemes, they are stored in the following data frame. The rightmost column contains the text of each article, morphologically analysed and with the morphemes separated by half-width spaces; this column is used for the classification task.
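As a minimal illustrative sketch of how such a space-separated column can be produced (the actual preprocessing follows the earlier article; Janome and the sample sentence are assumptions used purely for illustration):

```python
# Illustrative sketch only: the real preprocessing is described in the earlier article.
# Janome is assumed here as one example of a morphological analyser.
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()
text = '独女通信の記事を分類します'  # sample sentence for illustration
# wakati=True yields only the surface forms; join them with half-width spaces
wakati_text = ' '.join(tokenizer.tokenize(text, wakati=True))
print(wakati_text)
```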
This time, we will classify "Peachy" articles and "dokujo-tsushin" articles (both are media aimed at women). Since this is a binary classification, it is equivalent to the task of judging whether an article is a "dokujo-tsushin" article. The dataset is split 7:3, with 7 for training and 3 for evaluation.
```python
import pandas as pd
import numpy as np
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumes the data frame after morphological analysis has already been pickled and saved
with open('df_wakati.pickle', 'rb') as f:
    df = pickle.load(f)

# This time we check whether the two types of articles can be classified
ddf = df[(df[1]=='peachy') | (df[1]=='dokujo-tsushin')].reset_index(drop=True)

# Vectorise the space-separated text with TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(ddf[3])

# Encode the labels: "peachy" -> 0, "dokujo-tsushin" -> 1
def convert(x):
    if x == 'peachy':
        return 0
    elif x == 'dokujo-tsushin':
        return 1

target = ddf[1].apply(lambda x: convert(x))

# Split the dataset 7:3 into training and evaluation sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, target, train_size=0.7, random_state=0)
```
```python
import lightgbm as lgb
from sklearn.metrics import classification_report

# Wrap the sparse matrices in LightGBM Dataset objects
train_data = lgb.Dataset(X_train, label=y_train)
eval_data = lgb.Dataset(X_test, label=y_test)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'random_state': 0
}

# Train a gradient-boosting model for binary classification
gbm = lgb.train(
    params,
    train_data,
    valid_sets=[eval_data],
)

# Predicted probability that each evaluation document is a "dokujo-tsushin" article
y_preds = gbm.predict(X_test)
```
The prediction is complete. y_preds contains, for each document, the predicted probability that it is a "dokujo-tsushin" article.
Below, we will summarize the important ideas in the performance evaluation of binary classification tasks.
A confusion matrix is a matrix that summarizes the output of a binary classification task and is used to evaluate the performance of binary classification.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | TP (true positive) | FN (false negative) |
| Actually Negative | FP (false positive) | TN (true negative) |
** "TP / FP / FN / TN" ** The following is a verbal explanation of each. ** It is very important because we will evaluate the performance using these four values. ** **
- TP (true positive): actually Positive, and the model predicted Positive
- FP (false positive): actually Negative, but the model predicted Positive
- FN (false negative): actually Positive, but the model predicted Negative
- TN (true negative): actually Negative, and the model predicted Negative
When applied to the classification task carried out this time, it is as follows.
| | Predicted to be a "dokujo-tsushin" article | Predicted not to be a "dokujo-tsushin" article (= predicted to be a "Peachy" article) |
|---|---|---|
| Actually a "dokujo-tsushin" article | TP (true positive) | FN (false negative) |
| Actually not a "dokujo-tsushin" article (= actually a "Peachy" article) | FP (false positive) | TN (true negative) |
This matrix is easy to create with scikit-learn. Since the model outputs a probability, for now an article with a value of $0.5$ or more is treated as predicted to be a "dokujo-tsushin" article.
```python
from sklearn.metrics import confusion_matrix

# Binarise the probabilities at 0.5 and build the confusion matrix
print(confusion_matrix(y_test, y_preds > 0.5))
```
The output is as follows.

```
[[237  11]
 [ 15 251]]
```
Applying this to the table above gives the following. Note that scikit-learn's `confusion_matrix` puts the actual class on the rows and the predicted class on the columns, with the Negative class (label 0) first, so in the printed matrix the top-left value is TN and the bottom-right is TP. Since both TP and TN are large, the performance looks good. Let's use these values to look at recall and precision.
| | Predicted to be a "dokujo-tsushin" article | Predicted not to be a "dokujo-tsushin" article (= predicted to be a "Peachy" article) |
|---|---|---|
| Actually a "dokujo-tsushin" article | 251 (TP) | 15 (FN) |
| Actually not a "dokujo-tsushin" article (= actually a "Peachy" article) | 11 (FP) | 237 (TN) |
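To pull these four values out of the scikit-learn output by name, `ravel()` can be used; this is a small sketch relying on the documented fact that, for binary labels 0/1, the counts come out in the order TN, FP, FN, TP:

```python
from sklearn.metrics import confusion_matrix

# For binary labels 0/1, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_preds > 0.5).ravel()
print(tp, fn, fp, tn)  # should reproduce the values in the table above
```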
\text{recall} = \frac{TP}{TP + FN}
The above is the formula for **recall**, also known as **sensitivity** or **true positive rate (TPR)**. Applied to this concrete example, it is as follows.
{{\begin{eqnarray*}
\text{recall} &=& \frac{\text{Number of articles the model correctly predicted to be "dokujo-tsushin"}}{\text{Total number of actual "dokujo-tsushin" articles}} \\
&=& \frac{251}{251+15} \\
&\simeq& 94\%
\end{eqnarray*}}}
Recall indicates how much of the data you want to find (here, the "dokujo-tsushin" articles) the classifier actually finds. In other words, **it is a measure of completeness**.
On the other hand, **the weakness of this metric is that it says nothing about how many misclassifications are made**. To give an extreme example, if the model predicts that every document is a "dokujo-tsushin" article, the recall will be $100\%$, because **every "dokujo-tsushin" article is covered**. Therefore, to measure performance it is essential to look at recall together with the precision introduced below.
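As a quick cross-check, scikit-learn's `recall_score` computes the same value directly; this is a small sketch, and `y_pred_label` is just a name introduced here for the binarised predictions:

```python
from sklearn.metrics import recall_score

# Binarise the predicted probabilities with the same 0.5 threshold as above
y_pred_label = (y_preds > 0.5).astype(int)
# Should match the hand calculation TP / (TP + FN)
print(recall_score(y_test, y_pred_label))
```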
Also, documents whose predicted probability is $0.5$ or more are classified as "dokujo-tsushin" articles this time, but the threshold of $0.5$ is not fixed; sometimes it needs to be changed to a more appropriate value.
Check the graph with the recall on the y-axis and the threshold on the x-axis.
```python
import matplotlib.pyplot as plt
from sklearn import metrics

# precision_recall_curve returns precision/recall for every candidate threshold;
# precision and recall have one more element than thresholds, so append 1 to align the lengths
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_preds)
plt.plot(np.append(thresholds, 1), recall, label='Recall')
plt.legend()
plt.xlabel('Thresholds')
plt.ylabel('Recall')
plt.grid(True)
```
It should be clear that the lower the threshold, the higher the recall. On the other hand, misclassifications will increase, so an appropriate threshold needs to be set while watching the relationship with the precision introduced below.
\text{precision} = \frac{TP}{TP + FP}
The above is the formula for **precision**. Applied to this concrete example, it is as follows.
{{\begin{eqnarray*}
\text{precision} &=& \frac{\text{Number of articles the model correctly predicted to be "dokujo-tsushin"}}{\text{Total number of articles the model predicted to be "dokujo-tsushin"}} \\
&=& \frac{251}{251+11} \\
&\simeq& 96\%
\end{eqnarray*}}}
Precision indicates how much of the data the classifier judged to be the data you want to find (here, "dokujo-tsushin" articles) really is that data. **In other words, it represents how trustworthy the classifier's Positive judgements are** (note that the certainty of its Negative judgements is ignored).
On the other hand, **the weakness of this metric is that it says nothing about how wrong the Negative judgements are**. For example, if the classifier predicts that only one article is a "dokujo-tsushin" article and that prediction is correct, the precision is $100\%$, even though many of the articles judged not to be "dokujo-tsushin" are actually "dokujo-tsushin" articles.
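Precision can be cross-checked in the same way with `precision_score` (again a small sketch reusing the binarised `y_pred_label`):

```python
from sklearn.metrics import precision_score

y_pred_label = (y_preds > 0.5).astype(int)
# Should match the hand calculation TP / (TP + FP)
print(precision_score(y_test, y_pred_label))
```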
Therefore, it is essential to look at precision together with recall. However, **recall and precision are in a trade-off relationship.** Let's overlay the precision curve on the threshold-versus-recall graph above.
```python
import matplotlib.pyplot as plt
from sklearn import metrics

# Overlay precision and recall against the threshold
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_preds)
plt.plot(np.append(thresholds, 1), recall, label='Recall')
plt.plot(np.append(thresholds, 1), precision, label='Precision')
plt.legend()
plt.xlabel('Thresholds')
plt.ylabel('Rate')
plt.grid(True)
```
When the threshold is small, recall is high but precision is low; when the threshold is large, precision is high but recall is low. It is necessary to decide which matters more for the purpose of the task and set the threshold accordingly.
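As one possible way to pick a threshold, the arrays returned by `precision_recall_curve` can be searched directly. The sketch below assumes a hypothetical requirement of precision of at least 0.95 and then keeps recall as high as possible; the target value is only an example.

```python
import numpy as np
from sklearn import metrics

precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_preds)

target_precision = 0.95  # hypothetical requirement; adjust to the purpose of the task
# precision and recall have one more element than thresholds, so drop the last point
mask = precision[:-1] >= target_precision
if mask.any():
    # the lowest threshold meeting the precision target keeps recall as high as possible
    chosen_threshold = thresholds[mask].min()
    print('threshold:', chosen_threshold)
    print('recall at that threshold:', recall[:-1][mask].max())
else:
    print('no threshold reaches the target precision')
```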
The **F value (F1-score)** is a way to measure performance by combining recall and precision. The F value is the harmonic mean of recall and precision and is given by the following formula.
{{\begin{eqnarray*}
F_{1} &=& \frac{2}{\frac{1}{\text{recall}} + \frac{1}{\text{precision}}} \\
&=& 2\times\frac{\text{recall}\times\text{precision}}{\text{recall}+\text{precision}}
\end{eqnarray*}}}
The F value of this classification task (threshold is $ 0.5 $) is as follows.
{{\begin{eqnarray*}
F_{1} &=& \frac{2}{\frac{1}{0.94} + \frac{1}{0.96}} \\
&\simeq& 0.95
\end{eqnarray*}}}
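The same value can be obtained with `f1_score`, and the `classification_report` imported earlier prints precision, recall and F1 for both classes at once (a small sketch at the 0.5 threshold):

```python
from sklearn.metrics import f1_score, classification_report

y_pred_label = (y_preds > 0.5).astype(int)
print(f1_score(y_test, y_pred_label))
# precision, recall and F1 for both the "Peachy" (0) and "dokujo-tsushin" (1) classes
print(classification_report(y_test, y_pred_label))
```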
It can be said that this classification is quite accurate. The F value favours classifiers whose precision and recall are both high and similar, but that is not always what you want: in some cases precision matters more, in others recall does, so the evaluation metric should be chosen according to the purpose of the classification.
Next time, I would like to summarize the ROC curve and the PR curve.