**This post is the day-2 article of the Escapism Advent Calendar 2013.**
Last time, I implemented Naive Bayes. Since I went to the trouble of writing it, I combined it with the code I had written so far to build a practical system.
I have uploaded all of the code except the API key to GitHub, so feel free to use it as is: https://github.com/katryo/bing_search_naive_bayes
The system combines three functions: acquiring training data with Bing search, training the classifier, and classifying new text.
Based on @o-tomox's http://o-tomox.hatenablog.com/entry/2013/09/25/191506, I wrote a Bing API wrapper, updated to work with Python 3.3.
Write the Bing API key in a file called my_api_keys.py in advance.
my_api_keys.py
BING_API_KEY = 'abcdefafdafsafdafasfaafdaf'
Make sure my_api_keys.py is listed in .gitignore; otherwise your API key will be published.
Below is a wrapper for the Bing API.
bing_api.py
# -*- coding: utf-8 -*-
import urllib.parse
import requests
import sys

import my_api_keys


class Bing(object):
    # Create a file called my_api_keys.py at the same level and define BING_API_KEY in it.
    # Keep my_api_keys.py in .gitignore.
    def __init__(self, api_key=my_api_keys.BING_API_KEY):
        self.api_key = api_key

    def web_search(self, query, num_of_results, keys=["Url"], skip=0):
        """
        keys can contain 'ID', 'Title', 'Description', 'DisplayUrl', 'Url'
        """
        # Base URL
        url = 'https://api.datamarket.azure.com/Bing/Search/Web?'
        # Maximum number of results returned per request
        max_num = 50
        params = {
            "Query": "'{0}'".format(query),
            "Market": "'ja-JP'"
        }
        # Receive the response in JSON format
        request_url = url + urllib.parse.urlencode(params) + "&$format=json"
        results = []

        # Number of times to hit the API with the maximum page size
        repeat = int((num_of_results - skip) / max_num)
        remainder = (num_of_results - skip) % max_num

        # Hit the API repeatedly with the maximum page size
        for i in range(repeat):
            result = self._hit_api(request_url, max_num, max_num * i, keys)
            results.extend(result)
        # Fetch the remaining results
        if remainder:
            result = self._hit_api(request_url, remainder, max_num * repeat, keys)
            results.extend(result)

        return results

    def related_queries(self, query, keys=["Title"]):
        """
        keys can contain 'ID', 'Title', 'BaseUrl'
        """
        # Base URL
        url = 'https://api.datamarket.azure.com/Bing/Search/RelatedSearch?'
        params = {
            "Query": "'{0}'".format(query),
            "Market": "'ja-JP'"
        }
        # Receive the response in JSON format
        request_url = url + urllib.parse.urlencode(params) + "&$format=json"
        results = self._hit_api(request_url, 50, 0, keys)
        return results

    def _hit_api(self, request_url, top, skip, keys):
        # Final URL used to hit the API
        final_url = "{0}&$top={1}&$skip={2}".format(request_url, top, skip)
        response = requests.get(final_url,
                                auth=(self.api_key, self.api_key),
                                headers={'User-Agent': 'My API Robot'}).json()
        results = []
        # Extract the requested fields from the returned items
        for item in response["d"]["results"]:
            result = {}
            for key in keys:
                result[key] = item[key]
            results.append(result)
        return results


if __name__ == '__main__':
    # When bing_api.py is used on its own, it becomes a tool that searches
    # for the entered word and prints 50 results.
    for query in sys.stdin:
        bing = Bing()
        results = bing.web_search(query=query, num_of_results=50, keys=["Title", "Url"])
        print(results)
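To make the pagination in web_search concrete, here is a tiny worked example of the repeat/remainder arithmetic (the numbers are illustrative, not from the post):

# Worked example of the pagination arithmetic in web_search (illustrative numbers)
num_of_results, skip, max_num = 120, 0, 50
repeat = int((num_of_results - skip) / max_num)  # => 2 full pages of 50 results
remainder = (num_of_results - skip) % max_num    # => 20 results left over
print(repeat, remainder)  # => 2 20
# _hit_api would therefore be called at skip offsets 0, 50, and 100.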
Using this Bing API wrapper, I wrote a script that saves 50 search result pages locally.
fetch_web_pages.py
from bing_api import Bing
import os
import constants
from web_page import WebPage
if __name__ == '__main__':
    bing = Bing()
    if not os.path.exists(constants.FETCHED_PAGES_DIR_NAME):
        os.mkdir(constants.FETCHED_PAGES_DIR_NAME)
    os.chdir(constants.FETCHED_PAGES_DIR_NAME)
    results = bing.web_search(query=constants.QUERY, num_of_results=constants.NUM_OF_FETCHED_PAGES, keys=['Url'])
    for i, result in enumerate(results):
        page = WebPage(result['Url'])
        page.fetch_html()
        f = open('%s_%s.html' % (constants.QUERY, str(i)), 'w')
        f.write(page.html_body)
        f.close()
In addition, create a file called constants.py that holds the query and the name of the directory where the fetched HTML is stored. This time, I will start with the query "fracture".
constants.py
FETCHED_PAGES_DIR_NAME = 'fetched_pages'
QUERY = 'fracture'
NUM_OF_FETCHED_PAGES = 50
NB_PKL_FILENAME = 'naive_bayes_classifier.pkl'
I created a class called WebPage to make the fetched web pages easier to handle. Given a URL obtained from the Bing API, it fetches the HTML, detects the character encoding with cChardet, and strips the unwanted HTML tags with a regular expression.
web_page.py
import requests
import cchardet
import re
class WebPage():
    def __init__(self, url=''):
        self.url = url

    def fetch_html(self):
        try:
            response = requests.get(self.url)
            self.set_html_body_with_cchardet(response)
        except requests.exceptions.ConnectionError:
            self.html_body = ''

    def set_html_body_with_cchardet(self, response):
        encoding_detected_by_cchardet = cchardet.detect(response.content)['encoding']
        response.encoding = encoding_detected_by_cchardet
        self.html_body = response.text

    def remove_html_tags(self):
        html_tag_pattern = re.compile('<.*?>')
        self.html_body = html_tag_pattern.sub('', self.html_body)
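A minimal usage sketch of WebPage (the URL is just an example, not one of the fetched pages):

from web_page import WebPage

page = WebPage('http://example.com/')
page.fetch_html()        # download the HTML; becomes '' on a connection error
page.remove_html_tags()  # strip the tags, leaving roughly plain text
print(page.html_body[:200])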
Now, put the above files in the same directory and run
$ python fetch_web_pages.py
Getting the 50 URLs from the Bing API is quick, but sending an HTTP request to each of the 50 URLs and downloading the HTML takes a little while; in my case it was about 30 seconds.
When it finishes, take a look at the fetched_pages directory. You should find HTML files from fracture_0.html to fracture_49.html.
Now, finally, the Naive Bayes implementation from last time comes into play.
naive_bayes.py
# coding: utf-8
# Bayesian filter implementation from http://gihyo.jp/dev/serial/01/machine-learning/0003,
# rewritten to be readable and to run on Python 3.3
import math
import sys
import MeCab


class NaiveBayes:
    def __init__(self):
        self.vocabularies = set()
        self.word_count = {}      # {'hay fever countermeasures': {'cedar pollen': 4, 'medicine': 2, ...}}
        self.category_count = {}  # {'hay fever countermeasures': 16, ...}

    def to_words(self, sentence):
        """
        input:  a sentence string
        output: a tuple of the words (base forms) extracted with MeCab
        """
        tagger = MeCab.Tagger('mecabrc')  # another Tagger can be used
        mecab_result = tagger.parse(sentence)
        info_of_words = mecab_result.split('\n')
        words = []
        for info in info_of_words:
            # MeCab output ends with an empty string, preceded by 'EOS'
            if info == 'EOS' or info == '':
                break
            # info => '<surface>\t<POS>,<POS detail 1>,<POS detail 2>,<POS detail 3>,<conjugation type>,<conjugation form>,<base form>,<reading>,<pronunciation>'
            info_elems = info.split(',')
            # The 7th element (index 6) is the base form. If it is '*', use the surface form (index 0) instead.
            if info_elems[6] == '*':
                # info_elems[0] => e.g. 'ヴァンロッサム\t名詞'; drop the last 3 characters ('\t' plus the 2-character POS)
                words.append(info_elems[0][:-3])
                continue
            words.append(info_elems[6])
        return tuple(words)

    def word_count_up(self, word, category):
        self.word_count.setdefault(category, {})
        self.word_count[category].setdefault(word, 0)
        self.word_count[category][word] += 1
        self.vocabularies.add(word)

    def category_count_up(self, category):
        self.category_count.setdefault(category, 0)
        self.category_count[category] += 1

    def train(self, doc, category):
        words = self.to_words(doc)
        for word in words:
            self.word_count_up(word, category)
        self.category_count_up(category)

    def prior_prob(self, category):
        num_of_categories = sum(self.category_count.values())
        num_of_docs_of_the_category = self.category_count[category]
        return num_of_docs_of_the_category / num_of_categories

    def num_of_appearance(self, word, category):
        if word in self.word_count[category]:
            return self.word_count[category][word]
        return 0

    def word_prob(self, word, category):
        # Bayes' rule calculation
        numerator = self.num_of_appearance(word, category) + 1  # +1 for additive (Laplace) smoothing
        denominator = sum(self.word_count[category].values()) + len(self.vocabularies)
        # In Python 3, division automatically returns a float
        prob = numerator / denominator
        return prob

    def score(self, words, category):
        score = math.log(self.prior_prob(category))
        for word in words:
            score += math.log(self.word_prob(word, category))
        return score

    def classify(self, doc):
        best_guessed_category = None
        max_prob_before = -sys.maxsize
        words = self.to_words(doc)

        for category in self.category_count.keys():
            prob = self.score(words, category)
            if prob > max_prob_before:
                max_prob_before = prob
                best_guessed_category = category
        return best_guessed_category


if __name__ == '__main__':
    nb = NaiveBayes()
    nb.train('''Python is an open source programming language created by the Dutchman Guido van Rossum.
It is a kind of object-oriented scripting language and is widely used in Europe and the United States along with Perl. It is named after the comedy show "Monty Python's Flying Circus" produced by the British broadcaster BBC.
Python also means the reptile python in English, which is sometimes used as a mascot or icon of the Python language. Python is a general-purpose high-level language. Designed with programmer productivity and code reliability in mind, it has a large, convenient standard library while keeping the core syntax and semantics to a minimum.
It supports string operations using Unicode, so Japanese text can be processed out of the box. It runs on many platforms, and thanks to its abundant documentation and libraries, its use is growing in industry.
''',
             'Python')

    nb.train('''Ruby is an object-oriented scripting language developed by Yukihiro Matsumoto (commonly known as Matz).
It brings object-oriented programming to the areas where scripting languages such as Perl have traditionally been used.
Ruby was originally born on February 24, 1993, and was announced on fj in December 1995.
The name Ruby comes from the fact that the programming language Perl is pronounced the same as Pearl, the birthstone for June,
so it was named after the ruby, the birthstone (July) of a colleague of Matsumoto.
''',
             'Ruby')

    doc = 'Open source made by Guido van Rossum'
    print('%s => Estimated category: %s' % (doc, nb.classify(doc)))  # Estimated category should be Python

    doc = "It's a pure object-oriented language."
    print('%s => Estimated category: %s' % (doc, nb.classify(doc)))  # Estimated category should be Ruby
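To see what word_prob computes concretely, here is a tiny worked example of the Laplace-smoothed probability with made-up counts (not data from the post):

# Suppose the 'cavities' category has seen the word 'plaque' 3 times,
# 9 words in total, and the whole vocabulary holds 100 distinct words.
count_in_category = 3
total_words_in_category = 9
vocabulary_size = 100

prob = (count_in_category + 1) / (total_words_in_category + vocabulary_size)
print(prob)  # => 4 / 109 ≈ 0.0367; an unseen word would get 1 / 109 instead of 0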
Using this Naive Bayes implementation and the downloaded HTML files, we train the classifier, which is a NaiveBayes object.
It would be a waste to throw the trained NaiveBayes object away every time, so it is saved with the pickle library.
Here is the script that trains the classifier and saves it.
train_with_fetched_web_pages.py
import os
import pickle
import constants
from web_page import WebPage
from naive_bayes import NaiveBayes
def load_html_files():
    """
    Assumes the HTML files are already in the current directory
    """
    pages = []
    for i in range(constants.NUM_OF_FETCHED_PAGES):
        with open('%s_%s.html' % (constants.QUERY, str(i)), 'r') as f:
            page = WebPage()
            page.html_body = f.read()
        page.remove_html_tags()
        pages.append(page)
    return pages


if __name__ == '__main__':
    # Turn this into a function if you want to use it somewhere else
    if not os.path.exists(constants.FETCHED_PAGES_DIR_NAME):
        os.mkdir(constants.FETCHED_PAGES_DIR_NAME)
    os.chdir(constants.FETCHED_PAGES_DIR_NAME)

    pages = load_html_files()
    pkl_nb_path = os.path.join('..', constants.NB_PKL_FILENAME)

    # If a pickled NaiveBayes object has already been saved, load it and train it further
    if os.path.exists(pkl_nb_path):
        with open(pkl_nb_path, 'rb') as f:
            nb = pickle.load(f)
    else:
        nb = NaiveBayes()

    for page in pages:
        nb.train(page.html_body, constants.QUERY)

    # The classifier has learned this much, so save it
    with open(pkl_nb_path, 'wb') as f:
        pickle.dump(nb, f)
Put the above source code in the same directory and, as before, run
$ python train_with_fetched_web_pages.py
to train the classifier and save it. This time there is no HTTP communication with the outside, so it does not take long; in my case it took less than 5 seconds.
With the above procedure, one query, that is, one category, "fracture", has been learned. But a classifier with a single category cannot classify anything, so I repeat the procedure with several different queries.
First, rewrite QUERY in constants.py.
constants.py
QUERY = 'upset stomach'  # rewritten from 'fracture'
Then fetch the HTML with the Bing API.
$ python fetch_web_pages.py
Train the Naive Bayes classifier saved under the name naive_bayes_classifier.pkl with 50 fetched HTML files.
$ python train_with_fetched_web_pages.py
Repeat the above steps a few more times, rewriting constants.QUERY to "hay fever countermeasures" and "cavities" (a sketch that automates this loop follows below).
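For reference, here is a rough sketch, not part of the repository, of how the fetch-and-train loop could be automated for several queries in one go. It reuses the Bing, WebPage, and NaiveBayes classes shown above, skips saving the HTML to disk, and the query strings are only examples:

from bing_api import Bing
from web_page import WebPage
from naive_bayes import NaiveBayes
import pickle

QUERIES = ['fracture', 'upset stomach', 'hay fever countermeasures', 'cavities']

nb = NaiveBayes()
bing = Bing()
for query in QUERIES:
    # 50 result URLs per query, as in fetch_web_pages.py
    for result in bing.web_search(query=query, num_of_results=50, keys=['Url']):
        page = WebPage(result['Url'])
        page.fetch_html()
        page.remove_html_tags()
        nb.train(page.html_body, query)  # the query doubles as the category label

# Save the trained classifier, just like train_with_fetched_web_pages.py does
with open('naive_bayes_classifier.pkl', 'wb') as f:
    pickle.dump(nb, f)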
Now training is done, and at last the real work begins: take a string from standard input and have the classifier decide which category it belongs to.
The classification script is simple, as shown below. First, load the pickled NaiveBayes object, that is, unpickle it. Then, for each line read from sys.stdin, call the classify() method of the NaiveBayes object and print the result.
classify_inputs.py
# -*- coding: utf-8 -*-
import pickle
import constants
import sys
if __name__ == '__main__':
    with open(constants.NB_PKL_FILENAME, 'rb') as f:
        nb = pickle.load(f)
    for query in sys.stdin:
        result = nb.classify(query)
        print('The inferred category is %s' % result)
Let's try it.
$ python classify_inputs.py
It works as a terminal tool, so just type in a string. To start, I entered text from Wikipedia's allergy article: "Allergy (German: Allergie) means that an immune response occurs excessively against a specific antigen. The immune response works to eliminate foreign substances (antigens) and is an indispensable physiological function for the living body."
Success! It was classified into the "hay fever countermeasures" category.
Next, text from Lion's Clinica page: "After eating a meal or a snack, the bacteria in dental plaque metabolize sugar and produce acid, so the plaque-covered tooth surface becomes acidic."
This is also a success: it was classified into the "cavities" category.
This time, I used the Naive Bayes classifier I implemented myself to do supervised learning on web search results. I have tried it a number of times, and it performs reasonably well.
This implementation counts the occurrences of every word that appears, including extremely frequent words such as the particles "wa" (は) and "wo" (を). Ideally, tf-idf should be used to down-weight words that appear too often, reduce the number of features, and cut the computation cost, but I did not do that this time because the training data was small and the computation did not take long. As future work, I would like to use methods such as tf-idf and LSA (latent semantic analysis) to reduce the computation cost or improve accuracy.
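As a quick illustration of why tf-idf helps, here is a toy calculation with made-up counts; a word that occurs in every document gets an idf of zero, so its weight vanishes no matter how often it appears:

import math

num_docs = 200                                # total documents (made-up)
docs_containing = {'wa': 200, 'plaque': 12}   # document frequency (made-up)
tf = {'wa': 30, 'plaque': 5}                  # term counts within one document

for word in ['wa', 'plaque']:
    idf = math.log(num_docs / docs_containing[word])
    print(word, tf[word] * idf)
# 'wa' gets weight 0.0 despite 30 occurrences, while 'plaque' keeps a positive weight.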
With a library like scikit-learn, well-tuned smoothing and tf-idf should be easy and convenient to use, so I would like to try that next time.
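For example, here is a rough sketch, not part of this repository, of what the same pipeline might look like with scikit-learn's tf-idf and multinomial Naive Bayes; the documents and labels are placeholders, and for Japanese text the vectorizer would need a MeCab-based tokenizer:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ['text of a fetched page about fractures', 'text of a page about plaque and cavities']
labels = ['fracture', 'cavities']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB(alpha=1.0)  # alpha=1.0 corresponds to the Laplace smoothing used above
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(['a page mentioning plaque and teeth'])))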
The code is pushed to GitHub, so please take a look if you are interested. Stars are also appreciated.
Update: stepping away from Naive Bayes for a bit, I added a feature that calculates similarity from word co-occurrence frequencies. It is covered in the next post: http://qiita.com/katryo/items/b6962facf744e93735bb