This article explains the algorithm of the naive Bayes classifier (Bayesian filter) using concrete numerical values, together with a Python implementation. I wrote it as a study memo for myself, but I hope it is useful to other people as well.
Naive Bayes is a supervised machine learning method that determines which category a piece of data (a sentence) belongs to. It is often used for spam email filters and for categorizing web news articles.
It is a simple method based on Bayes' theorem, and the difficulty is low. I have tried to explain it with as few mathematical formulas as possible.
I will explain the calculation logic for determining which category the target sentence falls into, using concrete numerical values.
Given the following training data, let's calculate which category the target document falls into.

**Training data**

* Football [ball | sports | world cup | ball]
* Baseball [ball | sports | glove | bat]
* Tennis [ball | racket | coat]
* Football [ball | sports]

Each [...] is treated as one document.

**Target document**: "Ball sports"
The number of documents in each category of the training data divided by the total number of documents is the "**category appearance rate P(C)**".

| | Football | baseball | tennis |
|---|---|---|---|
| Category appearance rate P(C) | $\frac{2}{4}$ | $\frac{1}{4}$ | $\frac{1}{4}$ |
Count the number of words in each category of the training data. Duplicates are not excluded, and the two football documents are totaled together.

| | Football | baseball | tennis |
|---|---|---|---|
| Number of words | 6 | 4 | 3 |
Count the occurrences of "ball" and "sports" in each category and divide by the number of words in that category counted above. This is the "**word appearance rate in the category P(Wn|C)**".

| | Football | baseball | tennis |
|---|---|---|---|
| ball | $\frac{3}{6}$ | $\frac{1}{4}$ | $\frac{1}{3}$ |
| sports | $\frac{2}{6}$ | $\frac{1}{4}$ | $\frac{0}{3}$ |
The product of the "ball" and "sports" values just calculated is the "**document appearance rate in the category P(D|C)**".

| | Football | baseball | tennis |
|---|---|---|---|
| Document appearance rate in the category P(D\|C) | $\frac{6}{36}$ | $\frac{1}{16}$ | $\frac{0}{9}$ |
The product of the "**category appearance rate P(C)**" from step 1 and the "**document appearance rate in the category P(D|C)**" from step 2 is the probability that the document belongs to each category (the category appearance rate in the document P(C|D)).
The category with the highest value becomes the classification result for the target document.

| | Football | baseball | tennis |
|---|---|---|---|
| Category appearance rate in the document P(C\|D) | $\frac{2}{4} \times \frac{6}{36}$ | $\frac{1}{4} \times \frac{1}{16}$ | $\frac{1}{4} \times \frac{0}{9}$ |
| Calculation result | $\frac{1}{12} \approx 0.083$ | $\frac{1}{64} \approx 0.016$ | $0$ |
As a result, "Ball sports" was classified as "football"!!
(Since this is just an example, please overlook the fact that all the categories are ball sports.)
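The hand calculation above can be sketched in a few lines of Python. The tokens and category names are taken directly from the training data tables; this mirrors the worked example, not the full implementation later in the article.

```python
from collections import Counter

# Training documents: (word list, category), as in the tables above
docs = [
    (["ball", "sports", "world cup", "ball"], "football"),
    (["ball", "sports", "glove", "bat"], "baseball"),
    (["ball", "racket", "coat"], "tennis"),
    (["ball", "sports"], "football"),
]
target = ["ball", "sports"]

doc_count = Counter(cat for _, cat in docs)       # documents per category
word_count = {}                                   # word frequencies per category
for words, cat in docs:
    word_count.setdefault(cat, Counter()).update(words)

total_docs = sum(doc_count.values())
scores = {}
for cat in doc_count:
    p_c = doc_count[cat] / total_docs             # category appearance rate P(C)
    n_words = sum(word_count[cat].values())       # words in the category
    p_d_c = 1.0
    for w in target:
        p_d_c *= word_count[cat][w] / n_words     # word appearance rate P(Wn|C)
    scores[cat] = p_c * p_d_c                     # P(C) * P(D|C)

print(scores)  # football ≈ 0.083, baseball ≈ 0.016, tennis = 0.0
```

The numbers match the table: football wins, and tennis drops to zero because "sports" never appears in its training data.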
In this example, the probability of tennis became $0$. This is because the word "sports" did not appear in the training data for "tennis".
Whenever the target document contains a word that does not exist in a category's training data, the probability of that category becomes $0$. This is called the "**zero frequency problem**".
To solve it, we apply a method called additive smoothing and recalculate.
We simply change the calculation of the "word appearance rate in the category P(Wn|C)" from step 2 as shown in bold below.
Count the occurrences of "ball" and "sports" in each category and **add 1**, then divide by the number of words in that category counted above **plus the total number of words in the training data**.
The total number of words in the training data is 8, since duplicates are excluded.
The "word appearance rate in the category P(Wn|C)" is summarized in a table again.

| | Football | baseball | tennis |
|---|---|---|---|
| ball | $\frac{3+1}{6+8}$ | $\frac{1+1}{4+8}$ | $\frac{1+1}{3+8}$ |
| sports | $\frac{2+1}{6+8}$ | $\frac{1+1}{4+8}$ | $\frac{0+1}{3+8}$ |
Calculating the fractions:

| | Football | baseball | tennis |
|---|---|---|---|
| ball | $\frac{4}{14}$ | $\frac{2}{12}$ | $\frac{2}{11}$ |
| sports | $\frac{3}{14}$ | $\frac{2}{12}$ | $\frac{1}{11}$ |
This shows that even though the word "sports" is not in the training data for "tennis", the probability no longer becomes zero.
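The smoothing step just described, add 1 to the numerator and the vocabulary size to the denominator, fits in a one-line helper. The counts 0, 3, and 8 below come from the tennis column and the stated vocabulary size.

```python
def smoothed_word_prob(word_occurrences, words_in_category, vocabulary_size):
    # Additive (Laplace) smoothing: (count + 1) / (category words + vocabulary size)
    return (word_occurrences + 1) / (words_in_category + vocabulary_size)

# "sports" occurs 0 times among tennis's 3 words; the vocabulary has 8 words
p = smoothed_word_prob(0, 3, 8)
print(p)  # 1/11 ≈ 0.0909, no longer zero
```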
From here the calculation continues as before.
The "category appearance rate P(C)" calculated in step 1 is unchanged. The "document appearance rate in the category P(D|C)" is the product of the "ball" and "sports" fractions above.

| | Football | baseball | tennis |
|---|---|---|---|
| Category appearance rate P(C) | $\frac{2}{4}$ | $\frac{1}{4}$ | $\frac{1}{4}$ |
| Document appearance rate in the category P(D\|C) | $\frac{4}{14} \times \frac{3}{14} = \frac{12}{196}$ | $\frac{2}{12} \times \frac{2}{12} = \frac{4}{144}$ | $\frac{2}{11} \times \frac{1}{11} = \frac{2}{121}$ |
| Category appearance rate in the document P(C\|D) | $\frac{2}{4} \times \frac{12}{196}$ | $\frac{1}{4} \times \frac{4}{144}$ | $\frac{1}{4} \times \frac{2}{121}$ |
| Calculation result | $\approx 0.031$ | $\approx 0.007$ | $\approx 0.004$ |
The document is again classified as football, and this time the probability of tennis is no longer $0$!!
However, the value of the "category appearance rate P(C)" has a large influence, so I think additive smoothing can hurt accuracy in some cases. This needs to be considered carefully when preparing the training data.
The example kept the number of words and the dataset small for clarity, but in practice many more words are involved. The denominators in the calculation therefore become very large numbers, and there is a high possibility that the products will underflow.
The workaround is to compare **logarithms** instead.
$0.001$ can be expressed as $\frac{1}{10} \times \frac{1}{10} \times \frac{1}{10}$, that is, $10$ to the power $-3$, so **-3** is the logarithm of $0.001$ (base $10$). Likewise, $0.0001$ can be expressed as $\frac{1}{10} \times \frac{1}{10} \times \frac{1}{10} \times \frac{1}{10}$, that is, $10$ to the power $-4$, so **-4** is the logarithm of $0.0001$ (base $10$).
The magnitude relationship between $X$, $Y$, and $Z$ is the same as that between their logarithms (as long as the base is the same and greater than $1$). For values less than $1$, the logarithm is a modestly sized negative number rather than a vanishingly small positive one, which is exactly what avoids underflow.
Also, the order of a product of values matches the order of the sum of their logarithms. For example, in base $2$: $2 \times 2 \times 2$ has logarithm $3$; $2 \times 2 \times 2 \times 2$ has logarithm $3 + 1$.
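A quick sketch of why this works: multiplying two hundred small probabilities underflows to $0.0$ in double-precision floats, while the sums of their logarithms remain perfectly comparable. The probability values here are made up purely for illustration.

```python
import math

probs_a = [0.001] * 200   # product would be 10^-600, far below float range
probs_b = [0.01] * 200    # product would be 10^-400, also below float range

def product(ps):
    # Naive multiplication: underflows to 0.0 for long runs of small values
    out = 1.0
    for p in ps:
        out *= p
    return out

def log_sum(ps):
    # Sum of logarithms: stays in a comfortable range
    return sum(math.log(p) for p in ps)

print(product(probs_a), product(probs_b))    # 0.0 0.0 -- the comparison is lost
print(log_sum(probs_a) < log_sum(probs_b))   # True -- the ordering is recovered
```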
Returning to the example: the "document appearance rate in the category P(D|C)" is calculated as the sum of the logarithms of the "word appearance rates in the category P(Wn|C)" for "ball" and "sports", and that sum is added to the logarithm of the "category appearance rate P(C)". (The table below uses base-$10$ logarithms.)

| | Football | baseball | tennis |
|---|---|---|---|
| Category appearance rate P(C) | $\log\frac{2}{4} \approx -0.301$ | $\log\frac{1}{4} \approx -0.602$ | $\log\frac{1}{4} \approx -0.602$ |
| Word appearance rate in the category (ball) | $\log\frac{4}{14} \approx -0.544$ | $\log\frac{2}{12} \approx -0.778$ | $\log\frac{2}{11} \approx -0.740$ |
| Word appearance rate in the category (sports) | $\log\frac{3}{14} \approx -0.669$ | $\log\frac{2}{12} \approx -0.778$ | $\log\frac{1}{11} \approx -1.041$ |
| Category appearance rate in the document P(C\|D) | $\approx -1.514$ | $\approx -2.158$ | $\approx -2.384$ |

Football still has the largest value (closest to $0$), so the classification result is unchanged.
This completes the naive Bayes classifier!!
I actually wrote it in Python, referring to the following site: http://yaju3d.hatenablog.jp/entry/2016/10/31/222307
I have tried to note in the comments where each of the values explained above appears. Janome is used for morphological analysis.
naivebayes.py

```python
import math
import sys

# MorphologicalAnalysis is a small janome-based wrapper (defined elsewhere in
# the project) whose get_word_list(document) returns the words in a document.


class NaiveBayes:
    def __init__(self):
        # Set of all distinct words in the training data (for additive smoothing)
        self.vocabularies = set()
        # Word counts for each category of the training data
        self.word_count = {}
        # Document counts for each category of the training data
        self.category_count = {}

    # Learning
    def train(self, document, category):
        # Morphological analysis of the training document
        ma = MorphologicalAnalysis()
        word_list = ma.get_word_list(document)
        for word in word_list:
            # Increase the word's occurrence count in the category
            self.__word_count_up(word, category)
        # Increase the number of documents in the category
        self.__category_count_up(category)

    # Increase a word's occurrence count in a training-data category
    def __word_count_up(self, word, category):
        # Add the category if it is new
        self.word_count.setdefault(category, {})
        # Add the word to the category if it is new
        self.word_count[category].setdefault(word, 0)
        # Increase the word's occurrence count in the category
        self.word_count[category][word] += 1
        # Add the word to the deduplicated set of all training-data words
        self.vocabularies.add(word)

    # Increase the number of documents in a training-data category
    def __category_count_up(self, category):
        # Add the category if it is new
        self.category_count.setdefault(category, 0)
        # Increase the number of documents in the category
        self.category_count[category] += 1

    # Classification
    def classifier(self, document):
        # Best-matching category so far
        best_category = None
        # Start from the smallest usable score
        max_prob = -sys.maxsize
        # Morphological analysis of the target document
        ma = MorphologicalAnalysis()
        word_list = ma.get_word_list(document)
        # Find the category appearance rate in the document P(C|D) for each category
        for category in self.category_count.keys():
            prob = self.__score(word_list, category)
            if prob > max_prob:
                max_prob = prob
                best_category = category
        return best_category

    # Calculate the category appearance rate in the document P(C|D)
    def __score(self, word_list, category):
        # Category appearance rate P(C) (logarithm taken against underflow, then added)
        score = math.log(self.__prior_prob(category))
        # Word appearance rate in the category for every word in the document
        for word in word_list:
            # Word appearance rate in the category P(Wn|C) (logarithm, added)
            score += math.log(self.__word_prob(word, category))
        return score

    # Calculate the category appearance rate P(C)
    def __prior_prob(self, category):
        # Documents in the target category / total number of training documents
        return float(self.category_count[category]) / sum(self.category_count.values())

    # Calculate the word appearance rate in the category P(Wn|C)
    def __word_prob(self, word, category):
        # (occurrences of the word in the category + 1) /
        # (words in the category + total distinct words) -- additive smoothing
        prob = (self.__in_category(word, category) + 1.0) / \
               (sum(self.word_count[category].values()) + len(self.vocabularies) * 1.0)
        return prob

    # Return the number of occurrences of a word in a category
    def __in_category(self, word, category):
        if word in self.word_count[category]:
            return float(self.word_count[category][word])
        return 0.0
```
Learning and classification with the above class are called from a Django view.
view.py

```python
from django.shortcuts import render

from naivebayes import NaiveBayes


def matching(request):
    if request.method == 'POST':
        # Get the target document
        words = request.POST['words']
        nb = NaiveBayes()
        # Training dataset
        nb.train('Ball sports world cup ball', 'Football')
        nb.train('Ball sports glove bat', 'baseball')
        nb.train('Ball racket coat', 'tennis')
        nb.train('Ball sports', 'Football')
        # Get the classification result
        category = nb.classifier(words)
        dictionary = {'category': category}
        return render(request, 'matching.html', dictionary)
    elif request.method == 'GET':
        return render(request, 'matching.html')
```
I have only just started studying machine learning, so there is a lot I don't understand. I have also only been using Python for a few days, so my coding style may be strange.
If you find any mistakes in the explanation above, I would appreciate it if you could point them out.
Also, janome did not split words made of consecutive katakana. I intended this as a note for myself, but I hope it helps someone :blush: