I wrote a pure-Python port of Rakuten MA, so this article introduces it.
Rakuten MA is a JavaScript morphological analyzer from the Rakuten NLP Project. Its main features, as I see them, are that the model can be trained online, one sentence at a time, so it is easy to update, and that morphological analysis can run on the client side in a browser.
For details, the following articles explain it clearly:
- Original Rakuten MA documentation (Japanese)
- The second game to play on the PC of the PC studio! Morphological analysis with Rakuten MA, with Anchibe (Hatena News)
- Introduction to morphological analysis with RakutenMA (Anchibe!)
I wanted to use Rakuten MA with Python, so I wrote the Python version!
You can install it with pip:
$ pip install rakutenma
https://pypi.python.org/pypi/rakutenma
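from rakutenma import RakutenMA

# Create an instance with explicit hyperparameters (phi and c)
rma = RakutenMA(phi=1024, c=0.007812)
# Load a pre-trained model file (obtained separately, see below)
rma.load("model_ja.json")
# Set a 15-bit feature hash function
rma.hash_func = rma.create_hash_func(15)

# Tokenize a sentence
print(rma.tokenize("There is a chicken in the back"))

# Update the model with one segmented and tagged sentence
print(rma.train_one(
    [["Backyard", "N-nc"],
     ["To", "P-k"],
     ["Is", "P-rj"],
     ["garden", "N-n"],
     ["Chicken", "N-nc"],
     ["But", "P-k"],
     ["Is", "V-c"]]))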
As shown here, the API can be used in the same way as the original JS version. See the PyPI page above for details.
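As a small illustration of working with the result, the sketch below assumes that `tokenize` returns a list of `[surface, tag]` pairs, the same shape that `train_one` accepts. The helper name `format_tokens` and the sample data are made up for this example, and no model is loaded.

```python
def format_tokens(tokens):
    """Join [surface, tag] pairs into a single 'surface/tag' line."""
    return " ".join("{0}/{1}".format(surface, tag) for surface, tag in tokens)


# Hand-written sample in the assumed [surface, tag] shape (not real analyzer output).
sample = [["garden", "N-n"], ["Chicken", "N-nc"], ["But", "P-k"]]
print(format_tokens(sample))
```

If the real return value has a different shape, only the unpacking in the generator expression would need to change.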
At first only Python 3 was supported: I really wanted to support Python 2.7 as well, but I ran out of energy.
Update: starting with version 0.2, it works with Python 2.6 and 2.7 in addition to Python 3.3 and 3.4.
The Rakuten MA model does not store feature strings directly; it stores the numeric values obtained by passing them through a hash function. Since this hash function behaves the same as in the JS version, the model files of the JS version can be reused as-is in the Python version.
This package does not include model files, so please obtain them separately from the original Rakuten MA repository.
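To show why the two implementations need identical hashing, here is a minimal sketch of feature hashing in general. It is not Rakuten MA's actual hash function: `make_hash_func`, the `31 * h` mixing step, and the feature strings are all assumptions chosen for illustration.

```python
def make_hash_func(bits):
    """Return a function mapping a list of feature strings to an int in [0, 2**bits)."""
    mask = (1 << bits) - 1

    def hash_func(feature):
        h = 0
        for ch in "_".join(feature):
            # Simple polynomial hash, kept within 32 bits on every step.
            h = (h * 31 + ord(ch)) & 0xFFFFFFFF
        return h & mask

    return hash_func


hash15 = make_hash_func(15)

# A model keyed by hashed features: weights can only be found again
# if the same hash function is used when reading the model.
weights = {hash15(["w0", "a"]): 0.5}
print(weights.get(hash15(["w0", "a"])))
```

If a port hashed the same feature to a different number, every lookup into a model trained by the other implementation would miss, which is why matching the hash behavior makes the model files interchangeable.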
I compared the processing time with the original JS version by running tokenize and train_one 1000 times each.
CPU: Core i7 2GHz
Memory: 8GB
OS: Mac OS X 10.8.5
Python: 3.4.2
Node.js: 0.10.33
PyPy: 2.4.0 (Python 3.2.5)
Rakuten MA Python: 0.2
Rakuten MA (JS): 1.0.0
rakutenma_benchmark.py
# -*- coding: utf-8 -*-
from rakutenma import RakutenMA
rma = RakutenMA()
# Run tokenize and train_one 1000 times each
for i in range(1000):
    rma.tokenize("I am not afraid of anything anymore")
    rma.train_one(
        [["Already", "F"],
         ["what", "D"],
         ["Also", "P-rj"],
         ["Scared", "A-c"],
         ["Absent", "X"]])
rakutenma_benchmark.js
var RakutenMA = require('./rakutenma');
var rma = new RakutenMA();
rma.featset = RakutenMA.default_featset_ja;
for (var i = 0; i < 1000; i++) {
    rma.tokenize("I am not afraid of anything anymore");
    rma.train_one(
        [["Already", "F"],
         ["what", "D"],
         ["Also", "P-rj"],
         ["Scared", "A-c"],
         ["Absent", "X"]]);
}
Update (2015/01/15): I re-ran the measurements while the machine had resources to spare.
$ time python rakutenma_benchmark.py
real 0m3.583s
user 0m3.573s
sys 0m0.009s
$ time node rakutenma_benchmark.js
real 0m1.852s
user 0m1.831s
sys 0m0.027s
It takes almost twice as long as the original JS version (; ´Д`)
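The `time` command above measures the whole process, including interpreter startup and the import of the package. As a complement, here is a minimal sketch for timing only the loop from inside Python. `time_calls` is a hypothetical helper, and `dummy_workload` is a stand-in for the real `rma.tokenize(...)` and `rma.train_one(...)` calls so that the sketch runs without a model.

```python
import time


def time_calls(func, repeat=1000):
    """Call func() `repeat` times and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(repeat):
        func()
    return time.perf_counter() - start


def dummy_workload():
    # Stand-in for rma.tokenize(...) / rma.train_one(...); replace as needed.
    sum(range(100))


elapsed = time_calls(dummy_workload, repeat=1000)
print("1000 calls took {0:.4f}s".format(elapsed))
```

`time.perf_counter` requires Python 3.3 or later, so this sketch does not cover the Python 2 versions the package supports.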
I also tried it on PyPy.
$ time pypy3 test.py
real 0m1.908s
user 0m1.859s
sys 0m0.042s
With PyPy, the performance is on par with the original JS version.
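To make the comparison concrete, the short sketch below computes each runtime relative to Node.js from the "real" times reported above. The dictionary keys are just labels chosen for this example.

```python
# "real" times in seconds, copied from the measurements above.
real_seconds = {"python": 3.583, "node": 1.852, "pypy3": 1.908}

# Express every runtime as a multiple of the Node.js time.
baseline = real_seconds["node"]
ratios = {name: round(t / baseline, 2) for name, t in real_seconds.items()}
print(ratios)
```

This gives roughly 1.93 for CPython and 1.03 for PyPy, which matches "almost twice as long" and "on par" respectively.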
That is my Python port of Rakuten MA. On CPython it is almost twice as slow as the original JS version, and it gives up the big advantage of running in a browser. Its one selling point is that you can use Rakuten MA from Python without writing any glue code.