This article explains the algorithm of the naive Bayes classifier (Bayesian filter) using concrete numerical values, together with a Python implementation. I wrote it as a study memo for myself, but I hope it is useful to other people as well.
Naive Bayes is a supervised machine learning method that determines which category a piece of data (a sentence) belongs to. It is often used for spam email filters and for categorizing web news articles.
It is a simple method based on Bayes' theorem, and the difficulty is low. I have tried to explain it with as few mathematical formulas as possible.
I will explain the calculation logic for determining which category the target sentence falls into, using concrete numerical values.
Given the following training data, let's calculate which category the target document falls into.

**Training data**

* Football [ball | sports | world cup | ball]
* Baseball [ball | sports | glove | bat]
* Tennis [ball | racket | coat]
* Football [ball | sports]

Each [...] is treated as one document.

**Target document**: "Ball sports"
The number of documents in each category of the training data divided by the total number of documents is the "**category appearance rate P(C)**".

| | Football | baseball | tennis |
|---|---|---|---|
| Category appearance rate P(C) | $\frac{2}{4}$ | $\frac{1}{4}$ | $\frac{1}{4}$ |
Count the number of words in each category of the training data. Duplicates are not excluded, and the two football documents are totaled together.

| | Football | baseball | tennis |
|---|---|---|---|
| Number of words | 6 | 4 | 3 |
Count the occurrences of "ball" and "sports" in each category and divide by the number of words in that category counted above. This is the "**word appearance rate in the category P(Wn|C)**".

| | Football | baseball | tennis |
|---|---|---|---|
| ball | $\frac{3}{6}$ | $\frac{1}{4}$ | $\frac{1}{3}$ |
| sports | $\frac{2}{6}$ | $\frac{1}{4}$ | $\frac{0}{3}$ |
The product of the "ball" and "sports" values just calculated is the "**document appearance rate in the category P(D|C)**".

| | Football | baseball | tennis |
|---|---|---|---|
| Document appearance rate in the category P(D\|C) | $\frac{6}{36}$ | $\frac{1}{16}$ | $\frac{0}{9}$ |
The product of the "**category appearance rate P(C)**" from step 1 and the "**document appearance rate in the category P(D|C)**" from step 2 is the probability that the document belongs to each category (the category appearance rate in the document P(C|D)).
The category with the highest value becomes the classification result for the target document.

| | Football | baseball | tennis |
|---|---|---|---|
| Category appearance rate in the document P(C\|D) | $\frac{2}{4} \times \frac{6}{36}$ | $\frac{1}{4} \times \frac{1}{16}$ | $\frac{1}{4} \times \frac{0}{9}$ |
| Calculation result | $\frac{1}{12} \approx 0.083$ | $\frac{1}{64} \approx 0.016$ | $0$ |
As a result, "Ball sports" was classified as "football"!!
(Since this is just an example, please overlook the fact that all the categories are ball sports.)
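The hand calculation above can be sketched in a few lines of Python. The tokens and category names are taken directly from the training data tables; this mirrors the worked example, not the full implementation later in the article.

```python
from collections import Counter

# Training documents: (word list, category), as in the tables above
docs = [
    (["ball", "sports", "world cup", "ball"], "football"),
    (["ball", "sports", "glove", "bat"], "baseball"),
    (["ball", "racket", "coat"], "tennis"),
    (["ball", "sports"], "football"),
]
target = ["ball", "sports"]

doc_count = Counter(cat for _, cat in docs)       # documents per category
word_count = {}                                   # word frequencies per category
for words, cat in docs:
    word_count.setdefault(cat, Counter()).update(words)

total_docs = sum(doc_count.values())
scores = {}
for cat in doc_count:
    p_c = doc_count[cat] / total_docs             # category appearance rate P(C)
    n_words = sum(word_count[cat].values())       # words in the category
    p_d_c = 1.0
    for w in target:
        p_d_c *= word_count[cat][w] / n_words     # word appearance rate P(Wn|C)
    scores[cat] = p_c * p_d_c                     # P(C) * P(D|C)

print(scores)  # football ≈ 0.083, baseball ≈ 0.016, tennis = 0.0
```

The numbers match the table: football wins, and tennis drops to zero because "sports" never appears in its training data.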
In this example, the probability of tennis became $0$. This is because the word "sports" did not appear in the training data for "tennis".
Whenever the target document contains a word that does not exist in a category's training data, the probability of that category becomes $0$. This is called the "**zero frequency problem**".
To solve it, we apply a method called additive smoothing and recalculate.
We simply change the calculation of the "word appearance rate in the category P(Wn|C)" from step 2 as shown in bold below.
Count the occurrences of "ball" and "sports" in each category and **add 1**, then divide by the number of words in that category counted above **plus the total number of words in the training data**.
The total number of words in the training data is 8, since duplicates are excluded.
The "word appearance rate in the category P(Wn|C)" is summarized in a table again.

| | Football | baseball | tennis |
|---|---|---|---|
| ball | $\frac{3+1}{6+8}$ | $\frac{1+1}{4+8}$ | $\frac{1+1}{3+8}$ |
| sports | $\frac{2+1}{6+8}$ | $\frac{1+1}{4+8}$ | $\frac{0+1}{3+8}$ |
Calculating the fractions:

| | Football | baseball | tennis |
|---|---|---|---|
| ball | $\frac{4}{14}$ | $\frac{2}{12}$ | $\frac{2}{11}$ |
| sports | $\frac{3}{14}$ | $\frac{2}{12}$ | $\frac{1}{11}$ |
This shows that even though the word "sports" is not in the training data for "tennis", the probability no longer becomes zero.
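The smoothing step just described, add 1 to the numerator and the vocabulary size to the denominator, fits in a one-line helper. The counts 0, 3, and 8 below come from the tennis column and the stated vocabulary size.

```python
def smoothed_word_prob(word_occurrences, words_in_category, vocabulary_size):
    # Additive (Laplace) smoothing: (count + 1) / (category words + vocabulary size)
    return (word_occurrences + 1) / (words_in_category + vocabulary_size)

# "sports" occurs 0 times among tennis's 3 words; the vocabulary has 8 words
p = smoothed_word_prob(0, 3, 8)
print(p)  # 1/11 ≈ 0.0909, no longer zero
```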
From here the calculation continues as before.
The "category appearance rate P(C)" calculated in step 1 is unchanged. The "document appearance rate in the category P(D|C)" is the product of the "ball" and "sports" fractions above.

| | Football | baseball | tennis |
|---|---|---|---|
| Category appearance rate P(C) | $\frac{2}{4}$ | $\frac{1}{4}$ | $\frac{1}{4}$ |
| Document appearance rate in the category P(D\|C) | $\frac{4}{14} \times \frac{3}{14} = \frac{12}{196}$ | $\frac{2}{12} \times \frac{2}{12} = \frac{4}{144}$ | $\frac{2}{11} \times \frac{1}{11} = \frac{2}{121}$ |
| Category appearance rate in the document P(C\|D) | $\frac{2}{4} \times \frac{12}{196}$ | $\frac{1}{4} \times \frac{4}{144}$ | $\frac{1}{4} \times \frac{2}{121}$ |
| Calculation result | $\approx 0.031$ | $\approx 0.007$ | $\approx 0.004$ |
The document is again classified as football, and this time the probability of tennis is no longer $0$!!
However, the value of the "category appearance rate P(C)" has a large influence, so I think additive smoothing can hurt accuracy in some cases. This needs to be considered carefully when preparing the training data.
The example kept the number of words and the dataset small for clarity, but in practice many more words are involved. The denominators in the calculation therefore become very large numbers, and there is a high possibility that the products will underflow.
The workaround is to compare **logarithms** instead.
$0.001$ can be expressed as $\frac{1}{10} \times \frac{1}{10} \times \frac{1}{10}$, that is, $10$ to the power $-3$, so **-3** is the logarithm of $0.001$ (base $10$). Likewise, $0.0001$ can be expressed as $\frac{1}{10} \times \frac{1}{10} \times \frac{1}{10} \times \frac{1}{10}$, that is, $10$ to the power $-4$, so **-4** is the logarithm of $0.0001$ (base $10$).
The magnitude relationship between $X$, $Y$, and $Z$ is the same as that between their logarithms (as long as the base is the same and greater than $1$). For values less than $1$, the logarithm is a modestly sized negative number rather than a vanishingly small positive one, which is exactly what avoids underflow.
Also, the order of a product of values matches the order of the sum of their logarithms. For example, in base $2$: $2 \times 2 \times 2$ has logarithm $3$; $2 \times 2 \times 2 \times 2$ has logarithm $3 + 1$.
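A quick sketch of why this works: multiplying two hundred small probabilities underflows to $0.0$ in double-precision floats, while the sums of their logarithms remain perfectly comparable. The probability values here are made up purely for illustration.

```python
import math

probs_a = [0.001] * 200   # product would be 10^-600, far below float range
probs_b = [0.01] * 200    # product would be 10^-400, also below float range

def product(ps):
    # Naive multiplication: underflows to 0.0 for long runs of small values
    out = 1.0
    for p in ps:
        out *= p
    return out

def log_sum(ps):
    # Sum of logarithms: stays in a comfortable range
    return sum(math.log(p) for p in ps)

print(product(probs_a), product(probs_b))    # 0.0 0.0 -- the comparison is lost
print(log_sum(probs_a) < log_sum(probs_b))   # True -- the ordering is recovered
```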
Returning to the example: the "document appearance rate in the category P(D|C)" is calculated as the sum of the logarithms of the "word appearance rates in the category P(Wn|C)" for "ball" and "sports", and that sum is added to the logarithm of the "category appearance rate P(C)". (The table below uses base-$10$ logarithms.)

| | Football | baseball | tennis |
|---|---|---|---|
| Category appearance rate P(C) | $\log\frac{2}{4} \approx -0.301$ | $\log\frac{1}{4} \approx -0.602$ | $\log\frac{1}{4} \approx -0.602$ |
| Word appearance rate in the category (ball) | $\log\frac{4}{14} \approx -0.544$ | $\log\frac{2}{12} \approx -0.778$ | $\log\frac{2}{11} \approx -0.740$ |
| Word appearance rate in the category (sports) | $\log\frac{3}{14} \approx -0.669$ | $\log\frac{2}{12} \approx -0.778$ | $\log\frac{1}{11} \approx -1.041$ |
| Category appearance rate in the document P(C\|D) | $\approx -1.514$ | $\approx -2.158$ | $\approx -2.384$ |

Football still has the largest value (closest to $0$), so the classification result is unchanged.
This completes the naive Bayes classifier!!
I actually wrote it in Python, referring to the following site: http://yaju3d.hatenablog.jp/entry/2016/10/31/222307
I have tried to note in the comments where each of the values explained above appears. Janome is used for morphological analysis.
naivebayes.py

```python
import math
import sys

# MorphologicalAnalysis is a small janome-based wrapper (defined elsewhere in
# the project) whose get_word_list(document) returns the words in a document.


class NaiveBayes:
    def __init__(self):
        # Set of all distinct words in the training data (for additive smoothing)
        self.vocabularies = set()
        # Word counts for each category of the training data
        self.word_count = {}
        # Document counts for each category of the training data
        self.category_count = {}

    # Learning
    def train(self, document, category):
        # Morphological analysis of the training document
        ma = MorphologicalAnalysis()
        word_list = ma.get_word_list(document)
        for word in word_list:
            # Increase the word's occurrence count in the category
            self.__word_count_up(word, category)
        # Increase the number of documents in the category
        self.__category_count_up(category)

    # Increase a word's occurrence count in a training-data category
    def __word_count_up(self, word, category):
        # Add the category if it is new
        self.word_count.setdefault(category, {})
        # Add the word to the category if it is new
        self.word_count[category].setdefault(word, 0)
        # Increase the word's occurrence count in the category
        self.word_count[category][word] += 1
        # Add the word to the deduplicated set of all training-data words
        self.vocabularies.add(word)

    # Increase the number of documents in a training-data category
    def __category_count_up(self, category):
        # Add the category if it is new
        self.category_count.setdefault(category, 0)
        # Increase the number of documents in the category
        self.category_count[category] += 1

    # Classification
    def classifier(self, document):
        # Best-matching category so far
        best_category = None
        # Start from the smallest usable score
        max_prob = -sys.maxsize
        # Morphological analysis of the target document
        ma = MorphologicalAnalysis()
        word_list = ma.get_word_list(document)
        # Find the category appearance rate in the document P(C|D) for each category
        for category in self.category_count.keys():
            prob = self.__score(word_list, category)
            if prob > max_prob:
                max_prob = prob
                best_category = category
        return best_category

    # Calculate the category appearance rate in the document P(C|D)
    def __score(self, word_list, category):
        # Category appearance rate P(C) (logarithm taken against underflow, then added)
        score = math.log(self.__prior_prob(category))
        # Word appearance rate in the category for every word in the document
        for word in word_list:
            # Word appearance rate in the category P(Wn|C) (logarithm, added)
            score += math.log(self.__word_prob(word, category))
        return score

    # Calculate the category appearance rate P(C)
    def __prior_prob(self, category):
        # Documents in the target category / total number of training documents
        return float(self.category_count[category]) / sum(self.category_count.values())

    # Calculate the word appearance rate in the category P(Wn|C)
    def __word_prob(self, word, category):
        # (occurrences of the word in the category + 1) /
        # (words in the category + total distinct words) -- additive smoothing
        prob = (self.__in_category(word, category) + 1.0) / \
               (sum(self.word_count[category].values()) + len(self.vocabularies) * 1.0)
        return prob

    # Return the number of occurrences of a word in a category
    def __in_category(self, word, category):
        if word in self.word_count[category]:
            return float(self.word_count[category][word])
        return 0.0
```
Learning and classification with the above class are called from a Django view.
view.py

```python
from django.shortcuts import render

from naivebayes import NaiveBayes


def matching(request):
    if request.method == 'POST':
        # Get the target document
        words = request.POST['words']
        nb = NaiveBayes()
        # Training dataset
        nb.train('Ball sports world cup ball', 'Football')
        nb.train('Ball sports glove bat', 'baseball')
        nb.train('Ball racket coat', 'tennis')
        nb.train('Ball sports', 'Football')
        # Get the classification result
        category = nb.classifier(words)
        dictionary = {'category': category}
        return render(request, 'matching.html', dictionary)
    elif request.method == 'GET':
        return render(request, 'matching.html')
```
I have only just started studying machine learning, so there is a lot I don't understand. I have also only been using Python for a few days, so my coding style may be strange.
If you find any mistakes in the explanation above, I would appreciate it if you could point them out.
Also, janome did not split words made of consecutive katakana. I intended this as a note for myself, but I hope it helps someone :blush: