A pure Python version of Rakuten MA, the online-learning morphological analyzer

I wrote a pure Python port of Rakuten MA, so here is an introduction to it.

What is Rakuten MA?

Rakuten MA is a morphological analyzer written in JavaScript by the Rakuten NLP Project. Its main features are that the model can be updated incrementally through online learning, and that morphological analysis can run on the client side in a browser.

For details, the following articles explain it clearly.

- Original Rakuten MA documentation (Japanese)
- Part 2 of playing on a PC from the PC studio! Morphological analysis with Rakuten MA, with Anchibe (Hatena News)
- Introduction to morphological analysis with RakutenMA (Anchibe!)

Rakuten MA for Python

I wanted to use Rakuten MA from Python, so I wrote a Python version!

You can install it with `$ pip install rakutenma`.

https://pypi.python.org/pypi/rakutenma

from rakutenma import RakutenMA

# Create an instance with SCW hyperparameters and load a pre-trained model
rma = RakutenMA(phi=1024, c=0.007812)
rma.load("model_ja.json")
# Use a 15-bit feature hash function
rma.hash_func = rma.create_hash_func(15)

# Tokenize a sentence, then train on its correct segmentation and tags
print(rma.tokenize("うらにわにはにわにわとりがいる"))
print(rma.train_one(
       [["うらにわ","N-nc"],
        ["に","P-k"],
        ["は","P-rj"],
        ["にわ","N-n"],
        ["にわとり","N-nc"],
        ["が","P-k"],
        ["いる","V-c"]]))

As you can see, the API can be used in much the same way as the original JS version. See the PyPI page above for details.
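As a small illustration of working with the analyzer's data, here is a sketch that filters tokens by part-of-speech tag. It assumes that tokenize returns a list of [surface, tag] pairs in the same format that train_one accepts; the helper name `extract_by_tag_prefix` is hypothetical and not part of the package. The token list is hardcoded so the example runs without a model file.

```python
def extract_by_tag_prefix(tokens, prefix):
    """Return the surface forms whose tag starts with `prefix`.

    `tokens` is assumed to be a list of [surface, tag] pairs,
    the same format passed to train_one.
    """
    return [surface for surface, tag in tokens if tag.startswith(prefix)]


# Hardcoded sample in the [surface, tag] format (no model needed)
tokens = [["うらにわ", "N-nc"],
          ["に", "P-k"],
          ["は", "P-rj"],
          ["にわ", "N-n"],
          ["にわとり", "N-nc"],
          ["が", "P-k"],
          ["いる", "V-c"]]

nouns = extract_by_tag_prefix(tokens, "N")
print(nouns)  # ['うらにわ', 'にわ', 'にわとり']
```

With a loaded model, the same helper could presumably be applied to the result of rma.tokenize(...) directly.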

~~Python 3 only~~

~~I really wanted to support Python 2.7 as well, but I ran out of energy, so for the time being only Python 3 is supported.~~

Starting with version 0.2, it works with Python 2.6 and 2.7 in addition to Python 3.3 and 3.4.

Models are compatible

A Rakuten MA model does not store feature strings directly; it stores the numeric values that a hash function produces from them. Because this hash function behaves the same as in the JS version, model files made for the JS version can be reused as-is in the Python version.
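To make the idea of hashing feature strings concrete, here is a minimal sketch of feature hashing. It does not reproduce Rakuten MA's actual hash function: `make_hash_func` is a hypothetical helper, and `zlib.crc32` is swapped in only because it is a deterministic hash available in the standard library. The point it illustrates is that when two implementations compute the same deterministic hash, a given feature always lands at the same index, so weights stored under those indices stay valid across implementations.

```python
import zlib


def make_hash_func(bits):
    """Return a function mapping a feature (list of strings) to an index
    in the range [0, 2**bits). Illustrative only, not Rakuten MA's hash."""
    num_buckets = 2 ** bits

    def hash_feature(feature):
        # Join the feature parts into one string and hash its UTF-8 bytes
        key = "_".join(feature).encode("utf-8")
        return zlib.crc32(key) % num_buckets

    return hash_feature


hash15 = make_hash_func(15)          # 15 bits gives 32768 possible indices
index = hash15(["w0", "に"])          # hypothetical feature name and value
same_index = make_hash_func(15)(["w0", "に"])
print(index == same_index)           # a deterministic hash gives the same index
```

Different features can collide on the same index; feature hashing accepts this in exchange for a fixed, bounded model size.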

This package does not include model files, so please obtain one separately from the original Rakuten MA repository.

Comparison of processing time

To compare processing time with the original JS version, I ran tokenize and train_one 1000 times each.

Execution environment

CPU: Core i7 2GHz
Memory: 8GB
OS: Mac OS X 10.8.5
Python: 3.4.2
Node.js: 0.10.33
PyPy: 2.4.0 (Python 3.2.5)
Rakuten MA Python: 0.2
Rakuten MA (JS): 1.0.0

Comparison code

`rakutenma_benchmark.py`


# -*- coding: utf-8 -*-
from rakutenma import RakutenMA

# Start from an empty model (no pre-trained model is loaded)
rma = RakutenMA()
# Tokenize and train on the same sentence 1000 times
for i in range(1000):
    rma.tokenize("もう何も怖くない")
    rma.train_one(
        [["もう","F"],
         ["何","D"],
         ["も","P-rj"],
         ["怖く","A-c"],
         ["ない","X"]])

`rakutenma_benchmark.js`


var RakutenMA = require('./rakutenma');

// Start from an empty model with the default Japanese feature set
var rma = new RakutenMA();
rma.featset = RakutenMA.default_featset_ja;

// Tokenize and train on the same sentence 1000 times
for (var i = 0; i < 1000; i++) {
    rma.tokenize("もう何も怖くない");
    rma.train_one(
        [["もう","F"],
         ["何","D"],
         ["も","P-rj"],
         ["怖く","A-c"],
         ["ない","X"]]);
}

Results

I re-measured while the machine had spare capacity. (2015/01/15)

$ time python rakutenma_benchmark.py 

real	0m3.583s
user	0m3.573s
sys	0m0.009s

$ time node rakutenma_benchmark.js

real	0m1.852s
user	0m1.831s
sys	0m0.027s

It takes almost twice as long as the original JS version (; ´Д`)

I also tried it on PyPy.

$ time pypy3 test.py 

real	0m1.908s
user	0m1.859s
sys	0m0.042s

On PyPy, the performance is on par with the original JS version.
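The `time` command above measures the whole process, including interpreter startup. As a complementary sketch, the calls could also be timed inside the process with the standard `timeit` module. The helper `time_calls` and the stand-in `workload` function are hypothetical; the stand-in is used so the example runs without the rakutenma package installed.

```python
import timeit


def time_calls(func, number=1000):
    """Return the total seconds taken by `number` calls of `func`."""
    return timeit.timeit(func, number=number)


def workload():
    # Stand-in for the real call; with rakutenma installed this could be
    # replaced by something like: lambda: rma.tokenize("もう何も怖くない")
    sum(range(100))


elapsed = time_calls(workload, number=1000)
print("%.4f s for 1000 calls" % elapsed)
```

Timing tokenize and train_one separately in this way would show which of the two accounts for most of the difference between interpreters.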

Conclusion

I wrote a Python port of Rakuten MA. On CPython it is almost twice as slow as the original, and it gives up the big advantage of running in a browser, so it is not strictly better. Its one selling point is that you can use Rakuten MA from Python without writing any glue code.