Have you ever applied machine learning and wondered why it produced a particular result? Depending on the method, the contributing factors can be deciphered. In this article, using Naive Bayes as the theme, I introduce the "likelihood" of words in document classification.
The scikit-learn code used for this verification is **published on github**, so please refer to it.
Document classification is the task of learning from training documents given in advance and assigning new documents to predefined categories. When training on documents, you need to decide which features the classification will be based on. This time, I explain the approach of using words as features.
Even when words are used as the features of a document, one common approach is to use their frequency of occurrence. During learning, the frequency of occurrence is expressed as a likelihood for each category and word (strictly speaking the two differ, but you can think of it as a probability).
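As a minimal sketch of this idea (the corpus below is made up for illustration and is not the data used in this article; it assumes scikit-learn 1.0 or later for `get_feature_names_out`), `CountVectorizer` turns documents into word-count features, and `MultinomialNB` stores the learned per-category word log-likelihoods in its `feature_log_prob_` attribute:

```python
# Toy sketch (hypothetical corpus): word counts as features, and the
# per-category word log-likelihoods learned by Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "apple releases a new phone",      # IT
    "the new framework release",       # IT
    "orange harvest in the orchard",   # agriculture
    "apple orchard harvest season",    # agriculture
]
labels = ["IT", "IT", "agriculture", "agriculture"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)     # sparse matrix of word counts
clf = MultinomialNB().fit(X, labels)

# feature_log_prob_[c, w] = log P(word w | category c) (smoothed), i.e. the
# per-word "likelihood" discussed above, stored as a logarithm.
for category, row in zip(clf.classes_, clf.feature_log_prob_):
    print(category, dict(zip(vectorizer.get_feature_names_out(), row.round(2))))
```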
We will confirm how the likelihood relates to classification through Bayes' theorem, on which Naive Bayes is based. A document is classified by the probability that it belongs to each category: the category with the highest probability is the estimation result. That probability is calculated by Bayes' theorem, and the likelihood appears in that calculation.
P(C \mid F_i) = \frac{P(C)\,P(F_i \mid C)}{P(F_i)}
(C: category, Fi: feature set ≈ word set, i.e., the document)
■ Explanation of each element
- P(C|Fi): posterior probability that the document belongs to category C (the value compared for discrimination)
- P(C): prior probability of category C
- P(Fi|C): likelihood of the feature set Fi given category C (the quantity this article focuses on)
- P(Fi): probability of the feature set itself, which is common to all categories
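Since the denominator P(Fi) does not depend on the category, and Naive Bayes assumes the words in Fi occur independently given the category, comparing categories amounts to comparing the logarithm of the numerator, where the product of word likelihoods becomes a sum. This is the additive form used in the example below:

\log P(C \mid F_i) \propto \log P(C) + \sum_{w \in F_i} \log P(w \mid C)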
Suppose that, as a result of training, the likelihood of each word has been obtained for each of two categories, IT and agriculture. If the feature set of the document to be classified is "apple release orange", the likelihood P(Fi|C) for each category is calculated as follows: P(Fi|C) is the product of the likelihoods of the individual words, but since the values are stored as logarithms, it is computed by adding the log-likelihoods of "apple", "release", and "orange" for each category.
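As a concrete illustration with made-up log-likelihood values (the numbers below are hypothetical, chosen only to show the additive computation, not the ones from the original example):

```python
# Hypothetical per-word log-likelihoods for the two categories.
log_likelihood = {
    "IT":          {"apple": -2.0, "release": -1.5, "orange": -5.0},
    "agriculture": {"apple": -2.5, "release": -4.5, "orange": -1.0},
}

doc = ["apple", "release", "orange"]   # the document to classify
for category, table in log_likelihood.items():
    # log P(Fi|C): the log of a product of likelihoods is a sum of logs
    score = sum(table[word] for word in doc)
    print(category, score)             # IT: -8.5, agriculture: -8.0
```

With these numbers, the agriculture category gets the larger (less negative) log-likelihood, which matches the estimation described next.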
In this example, looking only at the likelihood, the document is estimated to belong to the agriculture category. As the formula above shows, the posterior probability grows as the likelihood of each word grows. Here the likelihood of the word "orange" differs greatly between the two categories, so it strongly affects the discrimination.
Although the result also depends on the prior probabilities, with this classifier words whose likelihoods differ greatly between categories can be said to have a strong influence on the discrimination.
Let's check the likelihoods on actual data. I classified positive and negative reviews using the Movie Review Data, a movie review dataset. After training, the ten words with the largest difference in likelihood between the two classes are shown below. Besides proper nouns such as "mulan", adjectives such as "worst" appear.
word | log-likelihood (negative class) | log-likelihood (positive class) | difference (absolute value) |
---|---|---|---|
mulan | -10.83863242 | -9.33020901 | 1.50842341 |
truman | -10.42987203 | -9.000858011 | 1.429014015 |
worst | -8.809010658 | -10.1341868 | 1.325176141 |
shrek | -10.87230098 | -9.598985497 | 1.273315479 |
seagal | -9.529290176 | -10.78823673 | 1.258946555 |
godzilla | -9.264337631 | -10.47190374 | 1.207566113 |
flynt | -10.81220934 | -9.627421483 | 1.184787854 |
lebowski | -10.82237984 | -9.664010458 | 1.158369383 |
waste | -9.193245829 | -10.34277587 | 1.149530044 |
stupid | -8.96333841 | -10.10326246 | 1.139924046 |
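As a sketch of how such a ranking can be obtained (assuming a `MultinomialNB` classifier `clf` and a `CountVectorizer` `vectorizer` already trained on the review data, as in the earlier sketch; the row order of `feature_log_prob_` follows `clf.classes_`):

```python
# Rank words by the absolute difference of their log-likelihoods between
# the two classes of a trained binary MultinomialNB classifier.
import numpy as np

words = vectorizer.get_feature_names_out()
class_a, class_b = clf.feature_log_prob_      # one row of log P(word|class) per class
diff = np.abs(class_a - class_b)

for i in np.argsort(diff)[::-1][:10]:         # top 10 largest differences
    print(f"{words[i]:12s} {class_a[i]:.4f} {class_b[i]:.4f} {diff[i]:.4f}")
```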
It seems reasonable that proper nouns such as the titles of popular movies and the names of actors influence the discrimination, and the positive and negative adjectives that appear are also convincing factors.
By checking the likelihoods of the words, we could confirm which words affected the classification of a document. The factors may not always be interpretable, but I felt it is important to verify, by examining them, that the estimation works as intended.
For an explanation of Naive Bayes, refer to: → Text classification using naive Bayes
The difference between likelihood and probability is explained clearly in: → Who is the likelihood?
The sparse matrix format that appears in the implementation is explained in: → Internal data structure of scipy.sparse