**This post is the day-2 article of the Escapism Advent Calendar 2013.**
Last time, I implemented Naive Bayes. Since I went to the trouble of writing it, I combined it with the code I had written so far to build a practical system.
I have uploaded all of the code except the API key to GitHub, so feel free to use it as is: https://github.com/katryo/bing_search_naive_bayes
The system combines three functions: acquiring training data with Bing search, training the classifier, and classifying new text.
Based on @o-tomox's http://o-tomox.hatenablog.com/entry/2013/09/25/191506, I wrote a Bing API wrapper, updated to work with Python 3.3.
Write the Bing API key in a file called my_api_keys.py in advance.
my_api_keys.py
BING_API_KEY = 'abcdefafdafsafdafasfaafdaf'
Make sure my_api_keys.py is listed in .gitignore; otherwise your API key will be published.
Below is a wrapper for the Bing API.
bing_api.py
# -*- coding: utf-8 -*-
import urllib.parse
import requests
import sys

import my_api_keys


class Bing(object):
    # Create a file called my_api_keys.py at the same level and define BING_API_KEY in it.
    # Keep my_api_keys.py in .gitignore.
    def __init__(self, api_key=my_api_keys.BING_API_KEY):
        self.api_key = api_key

    def web_search(self, query, num_of_results, keys=["Url"], skip=0):
        """
        keys can contain 'ID', 'Title', 'Description', 'DisplayUrl', 'Url'
        """
        # Base URL
        url = 'https://api.datamarket.azure.com/Bing/Search/Web?'
        # Maximum number of results returned per request
        max_num = 50
        params = {
            "Query": "'{0}'".format(query),
            "Market": "'ja-JP'"
        }
        # Receive the response in JSON format
        request_url = url + urllib.parse.urlencode(params) + "&$format=json"
        results = []

        # Number of times to hit the API with the maximum page size
        repeat = int((num_of_results - skip) / max_num)
        remainder = (num_of_results - skip) % max_num

        # Hit the API repeatedly with the maximum page size
        for i in range(repeat):
            result = self._hit_api(request_url, max_num, max_num * i, keys)
            results.extend(result)
        # Fetch the remaining results
        if remainder:
            result = self._hit_api(request_url, remainder, max_num * repeat, keys)
            results.extend(result)

        return results

    def related_queries(self, query, keys=["Title"]):
        """
        keys can contain 'ID', 'Title', 'BaseUrl'
        """
        # Base URL
        url = 'https://api.datamarket.azure.com/Bing/Search/RelatedSearch?'
        params = {
            "Query": "'{0}'".format(query),
            "Market": "'ja-JP'"
        }
        # Receive the response in JSON format
        request_url = url + urllib.parse.urlencode(params) + "&$format=json"
        results = self._hit_api(request_url, 50, 0, keys)
        return results

    def _hit_api(self, request_url, top, skip, keys):
        # Final URL used to hit the API
        final_url = "{0}&$top={1}&$skip={2}".format(request_url, top, skip)
        response = requests.get(final_url,
                                auth=(self.api_key, self.api_key),
                                headers={'User-Agent': 'My API Robot'}).json()
        results = []
        # Extract the requested fields from the returned items
        for item in response["d"]["results"]:
            result = {}
            for key in keys:
                result[key] = item[key]
            results.append(result)
        return results


if __name__ == '__main__':
    # When bing_api.py is used on its own, it becomes a tool that searches
    # for the entered word and prints 50 results.
    for query in sys.stdin:
        bing = Bing()
        results = bing.web_search(query=query, num_of_results=50, keys=["Title", "Url"])
        print(results)
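To make the pagination in web_search concrete, here is a tiny worked example of the repeat/remainder arithmetic (the numbers are illustrative, not from the post):

# Worked example of the pagination arithmetic in web_search (illustrative numbers)
num_of_results, skip, max_num = 120, 0, 50
repeat = int((num_of_results - skip) / max_num)  # => 2 full pages of 50 results
remainder = (num_of_results - skip) % max_num    # => 20 results left over
print(repeat, remainder)  # => 2 20
# _hit_api would therefore be called at skip offsets 0, 50, and 100.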
Using this Bing API wrapper, I wrote a script that saves 50 search result pages locally.
fetch_web_pages.py
from bing_api import Bing
import os
import constants
from web_page import WebPage
if __name__ == '__main__':
    bing = Bing()
    if not os.path.exists(constants.FETCHED_PAGES_DIR_NAME):
        os.mkdir(constants.FETCHED_PAGES_DIR_NAME)
    os.chdir(constants.FETCHED_PAGES_DIR_NAME)
    results = bing.web_search(query=constants.QUERY, num_of_results=constants.NUM_OF_FETCHED_PAGES, keys=['Url'])
    for i, result in enumerate(results):
        page = WebPage(result['Url'])
        page.fetch_html()
        f = open('%s_%s.html' % (constants.QUERY, str(i)), 'w')
        f.write(page.html_body)
        f.close()
In addition, create a file called constants.py that holds the query and the name of the directory where the fetched HTML is stored. This time, I will start with the query "fracture".
constants.py
FETCHED_PAGES_DIR_NAME = 'fetched_pages'
QUERY = 'fracture'
NUM_OF_FETCHED_PAGES = 50
NB_PKL_FILENAME = 'naive_bayes_classifier.pkl'
I created a class called WebPage to make the fetched web pages easier to handle. Given a URL obtained from the Bing API, it fetches the HTML, detects the character encoding with cChardet, and strips the unwanted HTML tags with a regular expression.
web_page.py
import requests
import cchardet
import re
class WebPage():
    def __init__(self, url=''):
        self.url = url

    def fetch_html(self):
        try:
            response = requests.get(self.url)
            self.set_html_body_with_cchardet(response)
        except requests.exceptions.ConnectionError:
            self.html_body = ''

    def set_html_body_with_cchardet(self, response):
        encoding_detected_by_cchardet = cchardet.detect(response.content)['encoding']
        response.encoding = encoding_detected_by_cchardet
        self.html_body = response.text

    def remove_html_tags(self):
        html_tag_pattern = re.compile('<.*?>')
        self.html_body = html_tag_pattern.sub('', self.html_body)
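A minimal usage sketch of WebPage (the URL is just an example, not one of the fetched pages):

from web_page import WebPage

page = WebPage('http://example.com/')
page.fetch_html()        # download the HTML; becomes '' on a connection error
page.remove_html_tags()  # strip the tags, leaving roughly plain text
print(page.html_body[:200])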
Now, put the above files in the same directory and run
$ python fetch_web_pages.py
Getting the 50 URLs from the Bing API is quick, but sending an HTTP request to each of the 50 URLs and downloading the HTML takes a little while; in my case it was about 30 seconds.
When it finishes, take a look at the fetched_pages directory. You should find HTML files from fracture_0.html to fracture_49.html.
Now, finally, the Naive Bayes implementation from last time comes into play.
naive_bayes.py
# coding: utf-8
# Bayesian filter implementation from http://gihyo.jp/dev/serial/01/machine-learning/0003,
# rewritten to be readable and to run on Python 3.3
import math
import sys
import MeCab


class NaiveBayes:
    def __init__(self):
        self.vocabularies = set()
        self.word_count = {}      # {'hay fever countermeasures': {'cedar pollen': 4, 'medicine': 2, ...}}
        self.category_count = {}  # {'hay fever countermeasures': 16, ...}

    def to_words(self, sentence):
        """
        input:  a sentence string
        output: a tuple of the words (base forms) extracted with MeCab
        """
        tagger = MeCab.Tagger('mecabrc')  # another Tagger can be used
        mecab_result = tagger.parse(sentence)
        info_of_words = mecab_result.split('\n')
        words = []
        for info in info_of_words:
            # MeCab output ends with an empty string, preceded by 'EOS'
            if info == 'EOS' or info == '':
                break
            # info => '<surface>\t<POS>,<POS detail 1>,<POS detail 2>,<POS detail 3>,<conjugation type>,<conjugation form>,<base form>,<reading>,<pronunciation>'
            info_elems = info.split(',')
            # The 7th element (index 6) is the base form. If it is '*', use the surface form (index 0) instead.
            if info_elems[6] == '*':
                # info_elems[0] => e.g. 'ヴァンロッサム\t名詞'; drop the last 3 characters ('\t' plus the 2-character POS)
                words.append(info_elems[0][:-3])
                continue
            words.append(info_elems[6])
        return tuple(words)

    def word_count_up(self, word, category):
        self.word_count.setdefault(category, {})
        self.word_count[category].setdefault(word, 0)
        self.word_count[category][word] += 1
        self.vocabularies.add(word)

    def category_count_up(self, category):
        self.category_count.setdefault(category, 0)
        self.category_count[category] += 1

    def train(self, doc, category):
        words = self.to_words(doc)
        for word in words:
            self.word_count_up(word, category)
        self.category_count_up(category)

    def prior_prob(self, category):
        num_of_categories = sum(self.category_count.values())
        num_of_docs_of_the_category = self.category_count[category]
        return num_of_docs_of_the_category / num_of_categories

    def num_of_appearance(self, word, category):
        if word in self.word_count[category]:
            return self.word_count[category][word]
        return 0

    def word_prob(self, word, category):
        # Bayes' rule calculation
        numerator = self.num_of_appearance(word, category) + 1  # +1 for additive (Laplace) smoothing
        denominator = sum(self.word_count[category].values()) + len(self.vocabularies)
        # In Python 3, division automatically returns a float
        prob = numerator / denominator
        return prob

    def score(self, words, category):
        score = math.log(self.prior_prob(category))
        for word in words:
            score += math.log(self.word_prob(word, category))
        return score

    def classify(self, doc):
        best_guessed_category = None
        max_prob_before = -sys.maxsize
        words = self.to_words(doc)

        for category in self.category_count.keys():
            prob = self.score(words, category)
            if prob > max_prob_before:
                max_prob_before = prob
                best_guessed_category = category
        return best_guessed_category


if __name__ == '__main__':
    nb = NaiveBayes()
    nb.train('''Python is an open source programming language created by the Dutchman Guido van Rossum.
It is a kind of object-oriented scripting language and is widely used in Europe and the United States along with Perl. It is named after the comedy show "Monty Python's Flying Circus" produced by the British broadcaster BBC.
Python also means the reptile python in English, which is sometimes used as a mascot or icon of the Python language. Python is a general-purpose high-level language. Designed with programmer productivity and code reliability in mind, it has a large, convenient standard library while keeping the core syntax and semantics to a minimum.
It supports string operations using Unicode, so Japanese text can be processed out of the box. It runs on many platforms, and thanks to its abundant documentation and libraries, its use is growing in industry.
''',
             'Python')

    nb.train('''Ruby is an object-oriented scripting language developed by Yukihiro Matsumoto (commonly known as Matz).
It brings object-oriented programming to the areas where scripting languages such as Perl have traditionally been used.
Ruby was originally born on February 24, 1993, and was announced on fj in December 1995.
The name Ruby comes from the fact that the programming language Perl is pronounced the same as Pearl, the birthstone for June,
so it was named after the ruby, the birthstone (July) of a colleague of Matsumoto.
''',
             'Ruby')

    doc = 'Open source made by Guido van Rossum'
    print('%s => Estimated category: %s' % (doc, nb.classify(doc)))  # Estimated category should be Python

    doc = "It's a pure object-oriented language."
    print('%s => Estimated category: %s' % (doc, nb.classify(doc)))  # Estimated category should be Ruby
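To see what word_prob computes concretely, here is a tiny worked example of the Laplace-smoothed probability with made-up counts (not data from the post):

# Suppose the 'cavities' category has seen the word 'plaque' 3 times,
# 9 words in total, and the whole vocabulary holds 100 distinct words.
count_in_category = 3
total_words_in_category = 9
vocabulary_size = 100

prob = (count_in_category + 1) / (total_words_in_category + vocabulary_size)
print(prob)  # => 4 / 109 ≈ 0.0367; an unseen word would get 1 / 109 instead of 0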
Using this Naive Bayes implementation and the downloaded HTML files, we train the classifier, which is a NaiveBayes object.
It would be a waste to throw the trained NaiveBayes object away every time, so it is saved with the pickle library.
Here is the script that trains the classifier and saves it.
train_with_fetched_web_pages.py
import os
import pickle
import constants
from web_page import WebPage
from naive_bayes import NaiveBayes
def load_html_files():
    """
    Assumes the HTML files are already in the current directory
    """
    pages = []
    for i in range(constants.NUM_OF_FETCHED_PAGES):
        with open('%s_%s.html' % (constants.QUERY, str(i)), 'r') as f:
            page = WebPage()
            page.html_body = f.read()
        page.remove_html_tags()
        pages.append(page)
    return pages


if __name__ == '__main__':
    # Turn this into a function if you want to use it somewhere else
    if not os.path.exists(constants.FETCHED_PAGES_DIR_NAME):
        os.mkdir(constants.FETCHED_PAGES_DIR_NAME)
    os.chdir(constants.FETCHED_PAGES_DIR_NAME)

    pages = load_html_files()
    pkl_nb_path = os.path.join('..', constants.NB_PKL_FILENAME)

    # If a pickled NaiveBayes object has already been saved, load it and train it further
    if os.path.exists(pkl_nb_path):
        with open(pkl_nb_path, 'rb') as f:
            nb = pickle.load(f)
    else:
        nb = NaiveBayes()

    for page in pages:
        nb.train(page.html_body, constants.QUERY)

    # The classifier has learned this much, so save it
    with open(pkl_nb_path, 'wb') as f:
        pickle.dump(nb, f)
Put the above source code in the same directory and, as before, run
$ python train_with_fetched_web_pages.py
to train the classifier and save it. This time there is no HTTP communication with the outside, so it does not take long; in my case it took less than 5 seconds.
With the above procedure, one query, that is, one category, "fracture", has been learned. But a classifier with a single category cannot classify anything, so I repeat the procedure with several different queries.
First, rewrite QUERY in constants.py.
constants.py
QUERY = 'upset stomach'  # rewritten from 'fracture'
Then fetch the HTML with the Bing API.
$ python fetch_web_pages.py
Train the Naive Bayes classifier saved under the name naive_bayes_classifier.pkl with 50 fetched HTML files.
$ python train_with_fetched_web_pages.py
Repeat the above steps a few more times, rewriting constants.QUERY to "hay fever countermeasures" and "cavities" (a sketch that automates this loop follows below).
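For reference, here is a rough sketch, not part of the repository, of how the fetch-and-train loop could be automated for several queries in one go. It reuses the Bing, WebPage, and NaiveBayes classes shown above, skips saving the HTML to disk, and the query strings are only examples:

from bing_api import Bing
from web_page import WebPage
from naive_bayes import NaiveBayes
import pickle

QUERIES = ['fracture', 'upset stomach', 'hay fever countermeasures', 'cavities']

nb = NaiveBayes()
bing = Bing()
for query in QUERIES:
    # 50 result URLs per query, as in fetch_web_pages.py
    for result in bing.web_search(query=query, num_of_results=50, keys=['Url']):
        page = WebPage(result['Url'])
        page.fetch_html()
        page.remove_html_tags()
        nb.train(page.html_body, query)  # the query doubles as the category label

# Save the trained classifier, just like train_with_fetched_web_pages.py does
with open('naive_bayes_classifier.pkl', 'wb') as f:
    pickle.dump(nb, f)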
Now training is done, and at last the real work begins: take a string from standard input and have the classifier decide which category it belongs to.
The classification script is simple, as shown below. First, load the pickled NaiveBayes object, that is, unpickle it. Then, for each line read from sys.stdin, call the classify() method of the NaiveBayes object and print the result.
classify_inputs.py
# -*- coding: utf-8 -*-
import pickle
import constants
import sys
if __name__ == '__main__':
    with open(constants.NB_PKL_FILENAME, 'rb') as f:
        nb = pickle.load(f)
    for query in sys.stdin:
        result = nb.classify(query)
        print('The inferred category is %s' % result)
Let's try it.
$ python classify_inputs.py
It works as a terminal tool, so just type in a string. To start, I entered text from Wikipedia's allergy article: "Allergy (German: Allergie) means that an immune response occurs excessively against a specific antigen. The immune response works to eliminate foreign substances (antigens) and is an indispensable physiological function for the living body."
Success! It was classified into the "hay fever countermeasures" category.
Next, text from Lion's Clinica page: "After eating a meal or a snack, the bacteria in dental plaque metabolize sugar and produce acid, so the plaque-covered tooth surface becomes acidic."
This is also a success: it was classified into the "cavities" category.
This time, I used the Naive Bayes classifier I implemented myself to do supervised learning on web search results. I have tried it a number of times, and it performs reasonably well.
This implementation counts the occurrences of every word that appears, including extremely frequent words such as the particles "wa" (は) and "wo" (を). Ideally, tf-idf should be used to down-weight words that appear too often, reduce the number of features, and cut the computation cost, but I did not do that this time because the training data was small and the computation did not take long. As future work, I would like to use methods such as tf-idf and LSA (latent semantic analysis) to reduce the computation cost or improve accuracy.
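As a quick illustration of why tf-idf helps, here is a toy calculation with made-up counts; a word that occurs in every document gets an idf of zero, so its weight vanishes no matter how often it appears:

import math

num_docs = 200                                # total documents (made-up)
docs_containing = {'wa': 200, 'plaque': 12}   # document frequency (made-up)
tf = {'wa': 30, 'plaque': 5}                  # term counts within one document

for word in ['wa', 'plaque']:
    idf = math.log(num_docs / docs_containing[word])
    print(word, tf[word] * idf)
# 'wa' gets weight 0.0 despite 30 occurrences, while 'plaque' keeps a positive weight.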
With a library like scikit-learn, well-tuned smoothing and tf-idf should be easy and convenient to use, so I would like to try that next time.
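For example, here is a rough sketch, not part of this repository, of what the same pipeline might look like with scikit-learn's tf-idf and multinomial Naive Bayes; the documents and labels are placeholders, and for Japanese text the vectorizer would need a MeCab-based tokenizer:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ['text of a fetched page about fractures', 'text of a page about plaque and cavities']
labels = ['fracture', 'cavities']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB(alpha=1.0)  # alpha=1.0 corresponds to the Laplace smoothing used above
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(['a page mentioning plaque and teeth'])))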
The code is pushed to GitHub, so please take a look if you are interested. Stars are also appreciated.
Update: stepping away from Naive Bayes for a bit, I added a feature that calculates similarity from word co-occurrence frequencies. It is covered in the next post: http://qiita.com/katryo/items/b6962facf744e93735bb