I tried to get started with Jubatus.
Install it from the packages, following the instructions on the official website.
$ sudo rpm -Uvh http://download.jubat.us/yum/rhel/6/stable/x86_64/jubatus-release-6-1.el6.x86_64.rpm
$ sudo yum install jubatus jubatus-client
There is a sample repository called jubatus-example, so clone it.
$ git clone https://github.com/jubatus/jubatus-example.git
It comes with a fair amount of explanation, including a Japanese README, so I think it is an easy place to start.
The `twitter_streaming_location` sample is a good starting point for what I want to do here.
Copy the whole `twitter_streaming_location` directory under a suitable name and modify it.
In the training step, the classifier learns the correspondence between blog categories and body text. After that, give the classifier some text and have it guess the category.
Write a suitable SQL query and dump the list of blog categories and body text to a text file. With the mysql CLI, you can get the data in tab-delimited form as follows.
$ mysql -uuser -p -N db < blog.sql > blog.txt
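Before training, it can help to check that the dump has the expected shape. The following is a minimal sketch, assuming each line of blog.txt holds one category and one body separated by the first tab. The function name `parse_rows` and the sample rows are made up for illustration.

```python
import io


def parse_rows(lines):
    """Yield (category, body) pairs from tab-delimited lines, skipping malformed ones."""
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # no tab: not a "category<TAB>body" row
        category, body = line.split("\t", 1)
        if category and body:
            yield category, body


# In-memory stand-in for blog.txt (hypothetical sample rows)
sample = io.StringIO(
    "diary\tWent for a walk today.\n"
    "broken line\n"
    "Notice\tSite maintenance on Friday.\n"
)
rows = list(parse_rows(sample))
print(rows)
```

To check the real dump, pass an open file object for blog.txt instead of `sample`.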
The original train.py parses the geotags of tweets, among other things, so it is fairly involved. I rewrote it a little so that it learns from tab-delimited data fed via standard input instead of tweets fetched from the network.
train.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

from jubatus.classifier import client
from jubatus.common import Datum

# Jubatus configuration
host = "127.0.0.1"
port = 9199
instance_name = ""  # required only when using distributed mode


def print_color(color, msg, end):
    sys.stdout.write('\033[' + str(color) + 'm' + str(msg) + '\033[0m' + str(end))


def print_red(msg, end="\n"):
    print_color(31, msg, end)


def print_green(msg, end="\n"):
    print_color(32, msg, end)


def train():
    classifier = client.Classifier(host, port, instance_name)

    for line in sys.stdin:
        # Each line is "category<TAB>body": drop the trailing newline
        # and split on the first tab only
        category_name, body = line.rstrip("\n").split("\t", 1)
        d = Datum({'text': body})
        classifier.train([(category_name, d)])

        # Print trained entry
        print_green(category_name, ' ')
        print(body)

    # If you want to back up the model after training, enable the following
    # classifier.save("foo")


if __name__ == '__main__':
    try:
        train()
    except KeyboardInterrupt:
        print("Stopped.")
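The script above sends one training example per call. Since `train` takes a list of (label, datum) pairs, several examples could be grouped into a single call. The following is only a sketch of that idea: `chunked` and `train_in_batches` are hypothetical helpers, and `train_fn` stands in for `classifier.train` so that the example runs without a Jubatus server.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def train_in_batches(pairs, train_fn, size=100):
    """Call train_fn once per batch and return the number of pairs sent."""
    sent = 0
    for batch in chunked(pairs, size):
        train_fn(batch)  # with Jubatus this would be classifier.train(batch)
        sent += len(batch)
    return sent


# Plain strings stand in for Datum objects here
calls = []
total = train_in_batches(
    [("diary", "a"), ("Notice", "b"), ("diary", "c")], calls.append, size=2
)
print(total, [len(c) for c in calls])
```

Whether batching helps in practice depends on the data size, so treat the batch size of 100 as an arbitrary placeholder.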
classify.py needs almost no changes, but I changed it to display only the top three estimated categories.
classify.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

from jubatus.classifier import client
from jubatus.common import Datum

# Jubatus configuration
host = "127.0.0.1"
port = 9199
instance_name = ""  # required only when using distributed mode


def estimate_blog_category_for(text):
    classifier = client.Classifier(host, port, instance_name)

    # Create datum for Jubatus
    d = Datum({'text': text})

    # Send estimation query to Jubatus
    result = classifier.classify([d])

    if len(result[0]) > 0:
        # Sort results by score, highest first
        est = sorted(result[0], key=lambda e: e.score, reverse=True)

        # Print only the top three results
        print("Estimated Category for %s:" % text)
        for e in est[:3]:
            print(" " + e.label + " (" + str(e.score) + ")")
    else:
        # No estimation results; maybe we haven't trained enough
        print("No estimation results available.")
        print("Train more data or try using another text.")


if __name__ == '__main__':
    if len(sys.argv) == 2:
        estimate_blog_category_for(sys.argv[1])
    else:
        print("Usage: %s data" % sys.argv[0])
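The top-three selection can be tried out without a running server. The sketch below uses a namedtuple called `Estimate` as a stand-in for the result objects, assuming only the `label` and `score` attributes that classify.py reads. The scores are made up, and `top_n` is a hypothetical helper.

```python
import heapq
from collections import namedtuple

# Stand-in for a classification result (assumption: only label and score are needed)
Estimate = namedtuple("Estimate", ["label", "score"])


def top_n(estimates, n=3):
    """Return the n highest-scoring estimates, best first."""
    return heapq.nlargest(n, estimates, key=lambda e: e.score)


fake = [
    Estimate("diary", 0.08),
    Estimate("Notice", 0.07),
    Estimate("Self-introduction", 0.23),
    Estimate("Recipe", -0.10),
]
for e in top_n(fake):
    print(" %s (%s)" % (e.label, e.score))
```

`heapq.nlargest` avoids sorting the whole list, although with only a handful of categories the difference from `sorted(...)[:3]` is negligible.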
I wanted the text to be tokenized with MeCab instead of being split into bigrams, so I rewrote the configuration a little.
blog_category.json
{
  "method": "NHERD",
  "parameter": {
    "regularization_weight": 0.001
  },
  "converter": {
    "num_filter_types": {
    },
    "num_filter_rules": [
    ],
    "string_filter_types": {
    },
    "string_filter_rules": [
    ],
    "num_types": {
    },
    "num_rules": [
    ],
    "string_types": {
      "bigram": { "method": "ngram", "char_num": "2" },
      "mecab": {
        "method": "dynamic",
        "path": "libmecab_splitter.so",
        "function": "create"
      }
    },
    "string_rules": [
      { "key": "*", "type": "mecab", "sample_weight": "bin", "global_weight": "idf" }
    ]
  }
}
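An easy mistake when editing this file is to point a rule at a type that is not defined under `string_types`. The following is a rough sketch of a check for that, not part of Jubatus itself. `undefined_string_types` is a hypothetical helper, and the set of predefined type names is an assumption that should be verified against the Jubatus documentation.

```python
import json

# Assumption: these string type names are predefined by Jubatus
# and need no entry under string_types
BUILTIN_STRING_TYPES = {"str", "space"}


def undefined_string_types(config):
    """Return the types used in string_rules that are not defined anywhere."""
    converter = config["converter"]
    defined = set(converter.get("string_types", {})) | BUILTIN_STRING_TYPES
    return [
        rule["type"]
        for rule in converter.get("string_rules", [])
        if rule["type"] not in defined
    ]


# Trimmed copy of the relevant part of the configuration
config = json.loads("""
{
  "converter": {
    "string_types": {
      "mecab": {"method": "dynamic", "path": "libmecab_splitter.so", "function": "create"}
    },
    "string_rules": [
      {"key": "*", "type": "mecab", "sample_weight": "bin", "global_weight": "idf"}
    ]
  }
}
""")
print(undefined_string_types(config))
```

To check the real file, load it with `json.load` and pass the result to the same function.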
Start the server, specifying this JSON file.
$ jubaclassifier -f blog_category.json -t 0
Feed the prepared training data to train.py.
$ cat blog.txt | ./train.py
Now try guessing the category by passing in some text.
$ ./classify.py "Nice to meet you. My name is Tanaka."
Estimated Category for Nice to meet you. My name is Tanaka.:
 Self-introduction (0.231856495142)
 diary (0.0823381990194)
 Notice (0.0661180838943)