Calculate similarity from the co-occurrence frequency of words in sentences and strings, using the naive Bayes classifier implemented in Python 3.3

**This post is the day-3 article of Escapism Advent Calendar 2013.**

A three-line summary of this post

- I built a similarity calculator using the classifier I made last time.
- It can now calculate the similarity between a category and an input string.
- The similarity can be computed with either cosine similarity or the Simpson coefficient.

I put the code together on GitHub: https://github.com/katryo/bing_search_naive_bayes_sim

The story so far

- Search the web using the Bing API
- Train the classifier with the query as the category name
- Classify which category an input string falls into

That is the system I had built.

Features added this time

- Calculation of the similarity between an input string and the documents in each category

More details

- Calculation of the cosine similarity of vectors (I like to call them vectors, as in Math Girls)
- Similarity calculation via the Simpson coefficient of two sets

Two types of similarity calculation functions have been added.

Theory

Two types of similarity calculation methods

Cosine similarity

Compute the cosine of the angle between two vectors and use that value as the similarity. The vectors may have 2, 3, or 100 dimensions (concrete examples follow later). Since the computational cost naturally grows with the number of dimensions, it is better to find ways to reduce the dimensionality (for example, using tf-idf to skip words that appear too often). I didn't do any of that this time, though.
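For reference, here is the usual definition (my addition; the post itself describes it only in prose). For word-frequency vectors $v_1$ and $v_2$:

$$
\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert} = \frac{\sum_w v_1[w]\,v_2[w]}{\sqrt{\sum_w v_1[w]^2}\,\sqrt{\sum_w v_2[w]^2}}
$$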

Simpson coefficient

Unlike cosine similarity, this one compares two bags of words as plain sets rather than vectors, and derives the similarity from the number of words they share. Frequency does not matter.

As you can see from the code, a word that appears 100 times in a category (or in the input string) is treated exactly the same as a word that appears once (that is what "frequency does not matter" means). Repeated occurrences of the same word do not raise the score.
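Written as a formula (again my addition), for the two word sets $X$ and $Y$:

$$
\mathrm{Simpson}(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}
$$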

Naive Bayes Occurrence Probability and Similarity

Probability of occurrence

Suppose that, with naive Bayes, you want to calculate the probability that a word (e.g. "examination") falls into a category (e.g. "hay fever").

As explained in the first post, that probability is quite small. Computed in the usual way, it comes out below 0.01.

Furthermore, this time the input to be classified is a sentence, not a word. The sentence is morphologically analyzed with MeCab, turned into a bag of words, and handled as a set of words, so the probability becomes even smaller.
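As a rough illustration of that step (a sketch of mine, not the post's actual code; it assumes the mecab-python3 bindings and a Japanese dictionary are installed):

```python
# Hypothetical sketch: turn a sentence into a bag of words with MeCab.
import MeCab
from collections import Counter

tagger = MeCab.Tagger('-Owakati')  # output space-separated surface forms

def bag_of_words(sentence):
    # parse() returns the tokens joined by spaces; split and count them
    return Counter(tagger.parse(sentence).split())
```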

With naive Bayes, the probability that the sentence "If you have hay fever, you should first see an otolaryngologist" falls into the "hay fever" category is about 0.0000...1.

But that is still much higher than the probability of it falling into any other category (such as "fracture" or "upset stomach"). It is only relatively high, but it is by far the highest. So the "hay fever" category seems the most plausible fit for "If you have hay fever, you should first see an otolaryngologist". In other words, its likelihood is the highest.

Similarity

Similarity is a completely different idea from the probability of occurrence. How the similarity between sets of words is defined and calculated depends on the method.

For details, I recommend reading Sucrose's blog article, the [similarity and distance page of the "Data analysis / mining world by SAS" wiki](http://wikiwiki.jp/cattail/?%CE%E0%BB%F7%C5%D9%A4%C8%B5%F7%CE%A5), and the paper "Summary of Similarity Scales" (naid/110006440287).

This time I calculated similarity with cosine similarity and the Simpson coefficient, but there are various other methods, such as the Jaccard coefficient and the Dice coefficient (sketched below). Choose one to suit your purpose and computational budget.
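For comparison, here are minimal sketches of those two coefficients (my own code, written in the same dict-of-counts style as the post; like the Simpson coefficient, both ignore frequency and look only at the sets of words):

```python
# Hypothetical sketches of the Jaccard and Dice coefficients over
# two bags of words, using only the dict keys (the word sets).
def sim_jaccard(v1, v2):
    union = len(set(v1) | set(v2))
    if union == 0:
        return 0
    return len(set(v1) & set(v2)) / union

def sim_dice(v1, v2):
    total = len(v1) + len(v2)
    if total == 0:
        return 0
    return 2 * len(set(v1) & set(v2)) / total
```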

Code

  1. Similarity calculator
  2. Calculate the similarity by incorporating it into the created system

The code divides into these two parts.

1. Similarity calculator

First, I created the following SimCalculator class.

sim_calculator.py

```python
import math


class SimCalculator():
    def _absolute(self, vector):
        # Return the length (magnitude, absolute value) of the vector
        squared_distance = sum([vector[word] ** 2 for word in vector])
        distance = math.sqrt(squared_distance)
        return distance

    def sim_cos(self, v1, v2):
        numerator = 0
        # For each key shared by v1 and v2, add the product of the values: the dot product of the two vectors
        for word in v1:
            if word in v2:
                numerator += v1[word] * v2[word]
        
        denominator = self._absolute(v1) * self._absolute(v2)

        if denominator == 0:
            return 0
        return numerator / denominator

    def sim_simpson(self, v1, v2):
        intersection = 0
        # Count the number of keys common to v1 and v2
        for word in v2:
            if word in v1:
                intersection += 1
        denominator = min(len(v1), len(v2))

        # When v1 or v2 is empty
        if denominator == 0:
            return 0
        return intersection / denominator

if __name__ == '__main__':
    sc = SimCalculator()
    print('Cosine similarity is ' + str(sc.sim_cos({'Life hack': 1, 'fracture': 2}, {'Life hack': 2, 'jobs': 1, 'hobby': 1})))
    print('The similarity calculated by the Simpson coefficient is ' + str(sc.sim_simpson({'Life hack': 1, 'fracture': 2}, {'Life hack': 2, 'jobs': 1, 'hobby': 1})))
```
When executed, the result is as follows.

```
Cosine similarity is 0.3651483716701107
The similarity calculated by the Simpson coefficient is 0.5
```

The _absolute method, used by the sim_cos method that computes the cosine similarity, returns the length (magnitude, absolute value) of a vector. The vectors here have words such as "Life hack" and "fracture" as their dimensions. For example, in the code above,

```python
{'Life hack': 1, 'fracture': 2}
```

is a two-dimensional vector.
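Working through the numbers in that example (my own arithmetic, for illustration): the only key the two vectors share is 'Life hack', so

$$
\cos = \frac{1 \times 2}{\sqrt{1^2 + 2^2}\,\sqrt{2^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{5}\,\sqrt{6}} \approx 0.3651
$$

and the Simpson coefficient is $1 / \min(2, 3) = 0.5$, matching the output above.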

Reference http://www.rd.mmtr.or.jp/~bunryu/pithagokakutyou.shtml

2. Calculate the similarity by incorporating it into the created system

Incorporate the similarity calculator from part 1 into the naive Bayes classifier created last time.

In other words, this adds a feature that calculates the similarity between a string read from standard input and each category (the bag of words of the training data that was fed in).

Note that no naive Bayes classification is performed here; only the word counts that the NaiveBayes object has learned are used.

Run the following code and you get a terminal tool that can perform classification and similarity calculation together.

calc_similarity.py

```python
from sim_calculator import SimCalculator
from naive_bayes import NaiveBayes
import constants
import pickle
import sys
import pdb
from collections import OrderedDict


if __name__ == '__main__':
    sc = SimCalculator()
    with open(constants.NB_PKL_FILENAME, 'rb') as f:
        nb_classifier = pickle.load(f)

    # Use a NaiveBayes object so that its train method and word_count can shape
    # the standard input into the form {'input': {'Cedar pollen': 4, 'medicine': 3}}.
    # It is not used as a classifier, so a separate class would really be better, but that's a hassle.
    nb_input = NaiveBayes()

    for query in sys.stdin:
        nb_input.word_count = {}  # Reset for the second and subsequent inputs
        nb_input.train(query, 'input')  # Learn the string from standard input under the category 'input'
        results = OrderedDict()
        for category in nb_classifier.word_count:
            # You can also use sim_simpson instead of sim_cos
            sim_cos = sc.sim_cos(nb_input.word_count['input'], nb_classifier.word_count[category])
            results[category] = sim_cos

        for result in results:
            print('Similarity to category "%s" is %f' % (result, results[result]))

        # Even following http://cointoss.hatenablog.com/entry/2013/10/16/123129 I couldn't get the key with the max value (´・ω・`)
        best_score_before = 0.0
        best_category = ''
        for i, category in enumerate(results):
            if results[category] > best_score_before:
                best_category = category
                best_score_before = results[category]
        try:
            print('The category with the highest similarity is "%s" and the similarity is %f' % (best_category, results[best_category]))
        except KeyError:  # When the input is blank
            continue
```

Run it and feed in a suitable string.

In order to relieve pollinosis even a little, it is important to take proper countermeasures against it. This introduces basic countermeasures against pollinosis that you can take yourself, a pollinosis diary, and mistaken countermeasures.

Here is the result of entering that string, which was taken from an appropriate web page.

```
Similarity to category "Upset stomach" is 0.362058
Similarity to category "cavities" is 0.381352
Similarity to category "Countermeasures against pollinosis" is 0.646641
Similarity to category "depression" is 0.250696
Similarity to category "Machine" is 0.300861
Similarity to category "fracture" is 0.238733
Similarity to category "stiff shoulders" is 0.326560
Similarity to category "Documents" is 0.333795
The category with the highest similarity is "Countermeasures against pollinosis" and the similarity is 0.646641
```

Good, that's the result we wanted.

Next, enter some text taken from another page.

When I was in my teens and twenties, yakiniku meant ribs to me, along with pork cutlet and ramen. I loved them, but little by little I have been drifting away from them.

A sentence that sounds like a recipe for an upset stomach. Entering it gives:

```
Similarity to category "Upset stomach" is 0.398943
Similarity to category "cavities" is 0.425513
Similarity to category "Countermeasures against pollinosis" is 0.457718
Similarity to category "depression" is 0.300388
Similarity to category "Machine" is 0.340718
Similarity to category "fracture" is 0.256197
Similarity to category "stiff shoulders" is 0.339602
Similarity to category "Documents" is 0.322423
The category with the highest similarity is "Countermeasures against pollinosis" and the similarity is 0.457718
```

Huh... it doesn't pick "Upset stomach"...?

Finally, calculate the similarity for a tooth decay text, also taken from a web page.

Detailed explanations of the causes of tooth decay, treatments, prevention methods, children's (baby) teeth, treatment costs, and so on. Also images (photos) of tooth decay and the treatment of early caries.

What will we get?

```
Similarity to category "Upset stomach" is 0.404070
Similarity to category "cavities" is 0.445692
Similarity to category "Countermeasures against pollinosis" is 0.427097
Similarity to category "depression" is 0.306610
Similarity to category "Machine" is 0.381016
Similarity to category "fracture" is 0.241813
Similarity to category "stiff shoulders" is 0.346461
Similarity to category "Documents" is 0.394373
The category with the highest similarity is "cavities" and the similarity is 0.445692
```

That one worked.

The upset-stomach text was probably judged to be about pollinosis countermeasures because of noise.

Words like the particles "wo" (を) and "ha" (は) appear in every category. From a human point of view they are useless for categorization, but since the method above naively computes similarity from the frequencies of all the words that appear, they get swept into the calculation.

Performance might improve if frequent words were given less weight and infrequent words more, for example by using tf-idf.
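Something along these lines could do that reweighting before the cosine computation (a sketch of my own, not the implementation this series actually uses later):

```python
# Hypothetical sketch: reweight word counts with tf-idf so that words
# appearing in many categories contribute less to the cosine similarity.
import math

def tfidf_weight(word_count):
    # word_count: {category: {word: frequency}}, as in the NaiveBayes object
    n_categories = len(word_count)
    df = {}  # document frequency: in how many categories each word appears
    for counts in word_count.values():
        for word in counts:
            df[word] = df.get(word, 0) + 1
    return {
        category: {word: tf * math.log(n_categories / df[word])
                   for word, tf in counts.items()}
        for category, counts in word_count.items()
    }
```

A word that occurs in every category gets weight $\log(1) = 0$ and drops out of the similarity entirely.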

Supplement

As I wrote in the code comments, I put the standard input into a NaiveBayes object, nb_input, because I wanted its train method and its word_count attribute. Since it is not actually used as a naive Bayes classifier, it would be better to pull those out into a separate class and have NaiveBayes inherit from it.

By the way, for the final result output I wanted to "get the key and value of the entry with the maximum value from the dict" and searched for a slick way to write it. One article described exactly what I wanted, but it didn't work for me. I suspect the behavior changed in Python 3.
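For what it's worth, in Python 3 the following should do it (my suggestion, not something from the original post):

```python
# Get the key with the maximum value from a dict in Python 3.
best_category = max(results, key=results.get)
print(best_category, results[best_category])
```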

Also, when using machine learning for real work, a library such as scikit-learn is fast and well tested. Implementing things yourself, as in this post, is purely for study; for practical use you should not hesitate to use a library of assured quality.

GitHub

I posted the code on GitHub: https://github.com/katryo/bing_search_naive_bayes_sim

References

Sucrose's personal blog: http://sucrose.hatenablog.com/entry/2012/11/30/132803

Preview of next time

Next, I will implement tf-idf and aim to improve performance. It would also be interesting to change the similarity measure and try the Dice and Jaccard coefficients.

...or so I thought, but I decided to compute tf-idf with scikit-learn.

Continued here
