A note on understanding the algorithm of the naive Bayes machine learning classifier, with an implementation in Python.

Overview

This article explains the algorithm of the naive Bayes classifier (Bayesian filter) using concrete numerical values, and then implements it in Python. I wrote it as a study memo for myself, but I hope it is useful to others as well.

What is a naive Bayes classifier?

It is a supervised machine learning method that determines which category a piece of data (a document) belongs to. It is often used for spam email filters and for categorizing web news articles.

Difficulty

It is a simple method based on Bayes' theorem, so the difficulty is low. I have tried to explain it using as few mathematical formulas as possible.

Naive Bayes classifier calculation

I will explain the calculation logic for determining which category a target document falls into, using concrete numerical values.

Given the following training data, we calculate which category the target document falls into.

**Training data**

* Football: [ball | sports | world cup | ball]
* Football: [ball | sports]
* baseball: [ball | sports | glove | bat]
* tennis: [ball | racket | court]

**Target document**: "ball sports"

**1. Calculating the category appearance rate P(C)**

The number of documents in each category of the training data divided by the total number of documents is the "**category appearance rate P(C)**".

| | Football | baseball | tennis |
|:---|:---:|:---:|:---:|
| Category appearance rate $P(C)$ | $\frac{2}{4}$ | $\frac{1}{4}$ | $\frac{1}{4}$ |
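As a quick check, step 1 can be reproduced in a few lines of Python. This is a sketch assuming the four training documents listed above; `Fraction` keeps the results exact, so $\frac{2}{4}$ appears in reduced form as $\frac{1}{2}$:

```python
from fractions import Fraction

# The categories of the four training documents, as in this article
labels = ['Football', 'Football', 'baseball', 'tennis']

# Category appearance rate P(C) = documents in the category / total documents
p_c = {c: Fraction(labels.count(c), len(labels)) for c in set(labels)}

print(p_c['Football'])  # 1/2 (i.e. 2/4 in reduced form)
```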

**2. Calculating the document appearance rate in the category P(D|C)**

Count the number of words in each category of the training data. Duplicates are not excluded, and Football's total covers both of its documents.

| | Football | baseball | tennis |
|:---|:---:|:---:|:---:|
| Number of words | 6 | 4 | 3 |

Count the occurrences of "ball" and "sports" in each category and divide by the number of words in that category counted above. This is the "**word appearance rate in the category P(Wn|C)**".

| | Football | baseball | tennis |
|:---|:---:|:---:|:---:|
| ball | $\frac{3}{6}$ | $\frac{1}{4}$ | $\frac{1}{3}$ |
| sports | $\frac{2}{6}$ | $\frac{1}{4}$ | $\frac{0}{3}$ |

The product of the "ball" and "sports" values just calculated is the "**document appearance rate in the category P(D|C)**".

| | Football | baseball | tennis |
|:---|:---:|:---:|:---:|
| Document appearance rate in the category | $\frac{6}{36}$ | $\frac{1}{16}$ | $\frac{0}{9}$ |
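Step 2 can be sketched as follows. This assumes the word lists from the training data above, with "world cup" treated as a single token to match the article's count of 6 words for Football:

```python
from fractions import Fraction

# Words per category, duplicates kept, as counted in the article
words_per_cat = {
    'Football': ['ball', 'sports', 'world cup', 'ball', 'ball', 'sports'],
    'baseball': ['ball', 'sports', 'glove', 'bat'],
    'tennis':   ['ball', 'racket', 'court'],
}
target = ['ball', 'sports']  # the target document

# P(Wn|C) = occurrences of the word in the category / words in the category
# P(D|C)  = product of the word rates over the target document
p_d_c = {}
for cat, words in words_per_cat.items():
    rates = [Fraction(words.count(w), len(words)) for w in target]
    p_d_c[cat] = rates[0] * rates[1]

print(p_d_c['Football'])  # 1/6 (i.e. 6/36 in reduced form)
```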

**3. Calculating the category appearance rate in the document P(C|D)**

Multiplying the "category appearance rate P(C)" from step 1 by the "document appearance rate in the category P(D|C)" from step 2 gives the probability that the document belongs to each category (the "**category appearance rate in the document P(C|D)**").

The category with the highest value becomes the classification of the target document.

| | Football | baseball | tennis |
|:---|:---:|:---:|:---:|
| Category appearance rate in the document | $\frac{2}{4} \times \frac{6}{36}$ | $\frac{1}{4} \times \frac{1}{16}$ | $\frac{1}{4} \times \frac{0}{9}$ |
| Calculation result | $\frac{1}{12}$ | $\frac{1}{64}$ | $\frac{0}{36}$ |

As a result, "ball sports" is classified as "Football"!

(Since this is just an example, never mind that all three categories are ball sports.)
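Step 3 can be sketched end to end, reusing the fractions from the tables above:

```python
from fractions import Fraction

# Values from steps 1 and 2
p_c   = {'Football': Fraction(2, 4), 'baseball': Fraction(1, 4), 'tennis': Fraction(1, 4)}
p_d_c = {'Football': Fraction(6, 36), 'baseball': Fraction(1, 16), 'tennis': Fraction(0, 9)}

# Category appearance rate in the document: P(C) * P(D|C); the largest wins
scores = {cat: p_c[cat] * p_d_c[cat] for cat in p_c}
best = max(scores, key=scores.get)
print(best)  # Football
```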

**4. The zero frequency problem**

In this example, the probability for tennis came out to $0$. This is because the word "sports" does not appear in the training data for "tennis."

Whenever the target document contains a word that does not exist in a category's training data, the probability of that category becomes $0$. This is called the "**zero frequency problem**".

To solve this, we apply a method called additive smoothing and recalculate.

**Recalculating with additive smoothing**

We simply change the calculation of the "word appearance rate in the category P(Wn|C)" in step 2 as follows (changes in bold):

Count the occurrences of "ball" and "sports" in each category and add **1** to each, then divide by the number of words in the category counted above **plus the total number of distinct words in the training data**.

The total number of words in the training data is 8 after deduplication.

The "word appearance rate in the category P(Wn|C)" is summarized in a table again.

| | Football | baseball | tennis |
|:---|:---:|:---:|:---:|
| ball | $\frac{3+1}{6+8}$ | $\frac{1+1}{4+8}$ | $\frac{1+1}{3+8}$ |
| sports | $\frac{2+1}{6+8}$ | $\frac{1+1}{4+8}$ | $\frac{0+1}{3+8}$ |

Evaluating the fractions:

| | Football | baseball | tennis |
|:---|:---:|:---:|:---:|
| ball | $\frac{4}{14}$ | $\frac{2}{12}$ | $\frac{2}{11}$ |
| sports | $\frac{3}{14}$ | $\frac{2}{12}$ | $\frac{1}{11}$ |

This shows that even though the word "sports" is not in the training data for "tennis", the probability no longer becomes zero.
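The smoothed word rates can be checked like this. This sketch hardcodes the article's word counts and its deduplicated vocabulary size of 8, since the exact vocabulary depends on how the morphological analyzer tokenizes:

```python
from fractions import Fraction

# Word counts per category from the tables above (duplicates kept)
counts  = {'Football': {'ball': 3, 'sports': 2},
           'baseball': {'ball': 1, 'sports': 1},
           'tennis':   {'ball': 1, 'sports': 0}}
n_words = {'Football': 6, 'baseball': 4, 'tennis': 3}
V = 8  # total number of distinct words in the training data

# Additive smoothing: (count + 1) / (category word total + V)
def word_prob(cat, word):
    return Fraction(counts[cat][word] + 1, n_words[cat] + V)

print(word_prob('tennis', 'sports'))  # 1/11, no longer zero
```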

The rest of the calculation proceeds exactly as before.

The "category appearance rate P(C)" calculated in step 1 is unchanged. The "document appearance rate in the category P(D|C)" is the product of the "ball" and "sports" values calculated above.

| | Football | baseball | tennis |
|:---|:---:|:---:|:---:|
| Category appearance rate $P(C)$ | $\frac{2}{4}$ | $\frac{1}{4}$ | $\frac{1}{4}$ |
| Document appearance rate in the category | $\frac{12}{196}$ | $\frac{4}{144}$ | $\frac{2}{121}$ |
| Category appearance rate in the document | $\frac{2}{4} \times \frac{12}{196}$ | $\frac{1}{4} \times \frac{4}{144}$ | $\frac{1}{4} \times \frac{2}{121}$ |
| Calculation result | $\frac{3}{98}$ | $\frac{1}{144}$ | $\frac{1}{242}$ |

The document is again classified as Football, and this time the tennis probability is not $0$!

Note, however, that with additive smoothing the value of the "category appearance rate P(C)" has a larger influence on the result, which can hurt accuracy. This needs to be considered carefully when preparing the training data.

**5. Underflow countermeasures**

The example uses few words and a small dataset for clarity, but real applications deal with many more words. The product of many small fractions then becomes an extremely small number, and underflow is likely to occur.

The countermeasure is to **compare logarithms** instead.

For example, $0.001$ can be written as $\frac{1}{10} \times \frac{1}{10} \times \frac{1}{10}$, i.e. $10$ to the power $-3$, so $-3$ is the logarithm of $0.001$ (base $10$). Likewise, $0.0001$ can be written as $\frac{1}{10} \times \frac{1}{10} \times \frac{1}{10} \times \frac{1}{10}$, i.e. $10$ to the power $-4$, so $-4$ is the logarithm of $0.0001$ (base $10$).

The ordering of $X$, $Y$, and $Z$ is the same as the ordering of their logarithms (as long as the base is the same and greater than $1$). For values between $0$ and $1$ the logarithm is negative, but it stays in a range that is easy to represent, so tiny products cause no trouble.

Also, the ordering of products matches the ordering of sums of logarithms: $2 \times 2 \times 2$ has logarithm $3$ (base $2$), and $2 \times 2 \times 2 \times 2$ has logarithm $3 + 1$.
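Both properties are easy to check in Python (a small demo, base $10$ for the first and base $2$ for the second):

```python
import math

# Order is preserved: 0.001 > 0.0001 and log10(0.001) > log10(0.0001)
a, b = 0.001, 0.0001
assert a > b and math.log10(a) > math.log10(b)

# Multiplying values corresponds to adding logarithms:
# log2(2*2*2) = 3, log2(2*2*2*2) = 3 + 1
assert math.log2(2 * 2 * 2) == 3.0
assert math.log2(2 * 2 * 2 * 2) == math.log2(2 * 2 * 2) + 1.0
```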

Back to the example: the "document appearance rate in the category P(D|C)" is computed as the sum of the logarithms of the "word appearance rates in the category P(Wn|C)" for "ball" and "sports", and this is added to the logarithm of the "category appearance rate P(C)".

| | Football | baseball | tennis |
|:---|:---:|:---:|:---:|
| Category appearance rate $P(C)$ | $\log{\frac{2}{4}}$ | $\log{\frac{1}{4}}$ | $\log{\frac{1}{4}}$ |
| Word appearance rate in the category (ball) | $\log{\frac{4}{14}}$ | $\log{\frac{2}{12}}$ | $\log{\frac{2}{11}}$ |
| Word appearance rate in the category (sports) | $\log{\frac{3}{14}}$ | $\log{\frac{2}{12}}$ | $\log{\frac{1}{11}}$ |
| Category appearance rate in the document | $\log{\frac{2}{4}} + \log{\frac{4}{14}} + \log{\frac{3}{14}}$ | $\log{\frac{1}{4}} + \log{\frac{2}{12}} + \log{\frac{2}{12}}$ | $\log{\frac{1}{4}} + \log{\frac{2}{11}} + \log{\frac{1}{11}}$ |
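The final comparison in the table above can be sketched as follows (floating-point this time, since the logarithms are irrational; what matters is the ordering, not the absolute values):

```python
import math

# Smoothed values from the tables above: P(C) and P(Wn|C) for ball, sports
p_c = {'Football': 2/4, 'baseball': 1/4, 'tennis': 1/4}
word_rates = {'Football': [4/14, 3/14],
              'baseball': [2/12, 2/12],
              'tennis':   [2/11, 1/11]}

# Score = log P(C) + sum of log P(Wn|C); ordering matches the raw products
scores = {c: math.log(p_c[c]) + sum(math.log(p) for p in word_rates[c])
          for c in p_c}
best = max(scores, key=scores.get)
print(best)  # Football
```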

This completes the naive Bayes classifier!

I wrote it in Python

I wrote it in Python with reference to the following site: http://yaju3d.hatenablog.jp/entry/2016/10/31/222307

In the comments I tried to indicate, as much as possible, which values from the explanation above each part corresponds to. Janome is used for morphological analysis.

naivebayes.py


import math
import sys

# MorphologicalAnalysis is a thin wrapper around the Janome tokenizer
# (defined elsewhere); get_word_list(document) returns the document's words.

class NaiveBayes:

    #constructor
    def __init__(self):

        # Set of all words in the training data (for additive smoothing)
        self.vocabularies = set()

        # Word occurrence counts for each category of the training data
        self.word_count = {}

        # Document counts for each category of the training data
        self.category_count = {}

    #Learning
    def train(self, document, category):

        #Morphological analysis of learning documents
        ma = MorphologicalAnalysis()
        word_list = ma.get_word_list(document)

        for word in word_list:

            #Increase the number of word occurrences in the category
            self.__word_count_up(word, category)

        #Increase the number of documents in the category
        self.__category_count_up(category)

    #Increase the number of word occurrences in the training data category
    def __word_count_up(self, word, category):

        #Add new categories
        self.word_count.setdefault(category, {})

        #Add new words in the category
        self.word_count[category].setdefault(word, 0)

        #Increase the number of word occurrences in the category
        self.word_count[category][word] += 1

        # Add to the set of all training-data words (deduplicated)
        self.vocabularies.add(word)

    #Increase the number of documents in the training data category
    def __category_count_up(self, category):

        #Add new categories
        self.category_count.setdefault(category, 0)

        #Increase the number of documents in the category
        self.category_count[category] += 1

    #Classification
    def classifier(self, document):

        #Closest category
        best_category = None

        # Initialize with the smallest integer value
        max_prob = -sys.maxsize

        #Morphological analysis of the target document
        ma = MorphologicalAnalysis()
        word_list = ma.get_word_list(document)

        # Find the category appearance rate in the document P(C|D) for each category
        for category in self.category_count.keys():

            # Find the category appearance rate in the document P(C|D)
            prob = self.__score(word_list, category)

            if prob > max_prob:
                max_prob = prob
                best_category = category

        return best_category

    # Calculate the category appearance rate in the document P(C|D)
    def __score(self, word_list, category):

        # Get the category appearance rate P(C) (take the logarithm as an underflow countermeasure, then add)
        score = math.log(self.__prior_prob(category))

        #Find the word occurrence rate in a category for all words in a document
        for word in word_list:

            # Calculate the word appearance rate in the category P(Wn|C) (take the logarithm as an underflow countermeasure, then add)
            score += math.log(self.__word_prob(word, category))

        return score

    # Calculate the category appearance rate P(C)
    def __prior_prob(self, category):

        # Number of documents in the target category / total number of training documents
        return float(self.category_count[category] / sum(self.category_count.values()))

    # Calculate the word appearance rate in the category P(Wn|C)
    def __word_prob(self, word, category):

        # (Occurrences of the word in the category + 1) / (number of words in the category + total number of distinct training words): additive smoothing
        prob = (self.__in_category(word, category) + 1.0) / (sum(self.word_count[category].values())
                                                             + len(self.vocabularies) * 1.0)
        return prob

    #Returns the number of occurrences of a word in a category
    def __in_category(self, word, category):

        if word in self.word_count[category]:
            # Occurrences of the word in the category
            return float(self.word_count[category][word])
        return 0.0

The training and classification methods of the class above are called from a Django view.

view.py



from django.shortcuts import render

# Import the classifier defined in naivebayes.py above
from naivebayes import NaiveBayes


def matching(request):

    if request.method == 'POST':

        #Acquisition of target document
        words = request.POST['words']

        nb = NaiveBayes()

        #Training dataset
        nb.train('Ball sports world cup ball', 'Football')
        nb.train('Ball sports glove bat', 'baseball')
        nb.train('Ball racket court', 'tennis')
        nb.train('Ball sports', 'Football')

        #Get the result of classification
        category = nb.classifier(words)

        dictionary = {'category': category}

        return render(request, 'matching.html', dictionary)

    elif request.method == 'GET':

        return render(request, 'matching.html')

In closing

I have only just started studying machine learning, and there is still much I don't understand. Also, I have only been using Python for a few days, so my coding style may be strange.

If you find any mistakes in the explanation above, I would appreciate it if you could point them out.

Also, Janome did not split apart words written in consecutive katakana. I intended this as a note for myself, but I hope it helps someone :blush:
