[PYTHON] Check the words that affect document classification

Introduction

Have you ever used machine learning and wondered, "Why did I get this result?" Depending on the method, the contributing factors can be deciphered. In this article, taking Naive Bayes as the theme, I introduce the "likelihood" of words in document classification.

Please refer to the scikit-learn verification code **published on GitHub**.

Document classification in Naive Bayes

What is document classification?

Document classification is the task of learning from training data (documents) given in advance and assigning new documents to predefined categories. To learn from documents, we must decide what features the classification is based on. This article explains the approach of using words as features.

Likelihood as a feature of the document

Even when treating words as document features, one approach is to use their frequency of occurrence. During learning, the frequency of occurrence is expressed as a likelihood for each category/word pair (strictly speaking, a likelihood is not a probability, but you can think of it as one here).
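As a rough sketch of what this looks like in practice, the snippet below turns word counts into smoothed per-category log likelihoods. The vocabulary, counts, and smoothing value are all made-up assumptions for illustration.

```python
import numpy as np

# Toy word counts per category (hypothetical numbers for illustration).
vocab = ["apple", "release", "mandarin orange"]
counts = {
    "IT":          np.array([40, 45, 10]),
    "Agriculture": np.array([25, 20, 50]),
}

alpha = 1.0  # Laplace smoothing so unseen words never get probability zero
for category, c in counts.items():
    likelihood = (c + alpha) / (c.sum() + alpha * len(vocab))
    # Logarithms turn later products into sums and avoid numerical underflow.
    print(category, dict(zip(vocab, np.log(likelihood).round(2))))
```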

Let's confirm how the likelihood relates to classification through Bayes' theorem, on which Naive Bayes is built.

Bayes' theorem

Classification is decided by the probability that a document belongs to each category: the category with the highest probability is the estimated result. This probability is calculated with Bayes' theorem, in which the likelihood appears.

$$
P(C \mid F_i) = \frac{P(C)\,P(F_i \mid C)}{P(F_i)}
$$

($C$: category, $F_i$: feature set ≒ word set, that is, the document)

■ Explanation of each element

$P(C \mid F_i)$
The probability that document $F_i$ belongs to category $C$ (posterior probability).
$P(C)$
The probability that category $C$ occurs (prior probability). It is the proportion of the training data belonging to that category, without considering the contents of the document. Example: if 1,400 out of 2,000 training documents are in the IT category, the prior is 0.7.
$P(F_i \mid C)$
The probability that a document of category $C$ appears with feature set $F_i$ (the likelihood).
$P(F_i)$
The probability that a document appears as $F_i$. Since it does not involve $C$, this denominator is the same for every category; classification only compares the posteriors across categories, so it is usually dropped from the calculation.
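The decision rule can be sketched in a few lines of Python. Since $P(F_i)$ is common to all categories, we compare $\log P(C) + \sum \log P(w \mid C)$ and take the argmax; all category names, words, and probability values below are hypothetical.

```python
import math

# Minimal sketch of the Naive Bayes decision rule (made-up values).
log_prior = {"IT": math.log(0.7), "Agriculture": math.log(0.3)}
log_likelihood = {
    "IT":          {"apple": -2.0, "server": -1.5},
    "Agriculture": {"apple": -1.2, "server": -3.0},
}

def classify(words):
    # P(Fi) is omitted: it is identical for every category, so it cannot
    # change which category gets the highest score.
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in words)
        for c in log_prior
    }
    return max(scores, key=scores.get)

print(classify(["apple", "server"]))  # -> "IT" with these made-up numbers
```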

Checking the likelihood calculation with an example

Suppose that, as a result of learning, the following log likelihoods were obtained for each word in each category.

| word | IT | Agriculture |
|:--|:--|:--|
| apple | -0.3 | -0.4 |
| release | -0.3 | -0.4 |
| mandarin orange | -0.8 | -0.3 |

If the feature set of the document to be classified is "apple, release, mandarin orange", the likelihood $P(F_i \mid C)$ of each category is calculated as follows. $P(F_i \mid C)$ is the product of the individual word likelihoods, but since we are working in logarithms it can be computed by addition.

IT: $-0.3 + (-0.3) + (-0.8) = -1.4$
Agriculture: $-0.4 + (-0.4) + (-0.3) = -1.1$

Looking only at the likelihood, this document would be estimated as the Agriculture category. As shown above, the posterior probability increases as the likelihood of each word increases. In this example, the word "mandarin orange" has a large likelihood gap between the categories, so it strongly affects the classification.

Although the prior probabilities also play a part, words with a larger likelihood difference between categories can be said to influence this classifier's decisions more, as the sketch below illustrates.
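Here is a quick sketch reproducing the numbers above and showing how a prior can change the outcome; the prior values are hypothetical, reusing the $P(\text{IT}) = 0.7$ from the earlier example.

```python
import math

# Likelihood-only scores from the example: Agriculture wins.
log_lik = {"IT": -0.3 + -0.3 + -0.8, "Agriculture": -0.4 + -0.4 + -0.3}
print(max(log_lik, key=log_lik.get))  # -> "Agriculture" (-1.1 > -1.4)

# Adding a hypothetical prior (P(IT) = 0.7 vs P(Agriculture) = 0.3)
# flips the decision: -1.4 + log(0.7) ≈ -1.76 beats -1.1 + log(0.3) ≈ -2.30.
log_prior = {"IT": math.log(0.7), "Agriculture": math.log(0.3)}
posterior = {c: log_prior[c] + log_lik[c] for c in log_lik}
print(max(posterior, key=posterior.get))  # -> "IT"
```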

Checking with actual data

Let's check the likelihoods on real data. We classified positive and negative reviews using Movie Review Data, a movie review dataset. After training, the ten words with the largest likelihood difference are listed below (a code sketch for producing this kind of table follows the results). In addition to proper nouns such as "mulan", adjectives such as "worst" appear.

| word | negative (log likelihood) | positive (log likelihood) | difference (absolute value) |
|:--|:--|:--|:--|
| mulan | -10.83863242 | -9.33020901 | 1.50842341 |
| truman | -10.42987203 | -9.000858011 | 1.429014015 |
| worst | -8.809010658 | -10.1341868 | 1.325176141 |
| shrek | -10.87230098 | -9.598985497 | 1.273315479 |
| seagal | -9.529290176 | -10.78823673 | 1.258946555 |
| godzilla | -9.264337631 | -10.47190374 | 1.207566113 |
| flynt | -10.81220934 | -9.627421483 | 1.184787854 |
| lebowski | -10.82237984 | -9.664010458 | 1.158369383 |
| waste | -9.193245829 | -10.34277587 | 1.149530044 |
| stupid | -8.96333841 | -10.10326246 | 1.139924046 |
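A table like this can be read off a trained scikit-learn model through its `feature_log_prob_` attribute, which holds $\log P(\text{word} \mid \text{class})$. The sketch below uses a made-up toy corpus in place of the movie review dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the movie review dataset (made-up examples).
docs = ["worst stupid movie ever", "a waste of time",
        "great fun from start to finish", "a truly great movie"]
labels = ["neg", "neg", "pos", "pos"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # a scipy.sparse document-term matrix
clf = MultinomialNB().fit(X, labels)

# clf.classes_ is sorted, so row 0 is "neg" and row 1 is "pos";
# feature_log_prob_[i, j] is log P(word_j | class_i).
words = vectorizer.get_feature_names_out()
diff = np.abs(clf.feature_log_prob_[0] - clf.feature_log_prob_[1])
for idx in np.argsort(diff)[::-1][:5]:
    print(words[idx], clf.feature_log_prob_[0, idx],
          clf.feature_log_prob_[1, idx], diff[idx])
```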

Discussion and impressions

It seems reasonable that proper nouns such as the names of popular movies and actors influence the classification, and we found that clearly positive or negative adjectives do as well. I think it is convincing that these words drive the classifier's decisions.

In conclusion

By checking the likelihoods of the words, we were able to confirm which words affected the document classification. The factors may not always be interpretable, but I felt it is important to verify, by examining them, that the model makes the estimates we intend.

References

For an explanation of Naive Bayes → Text classification using naive Bayes
On the difference between likelihood and probability (easy to understand) → Who is the likelihood?
On the sparse matrices that came up in the implementation → Internal data structure of scipy.sparse
