[PYTHON] Use Juman++ in server mode

Content of this article

- The story of using Juman++ in server mode
- A Python package for easy morphological analysis now supports Juman++

What is Juman++?

Juman++ is a morphological analyzer developed at the Kurohashi Laboratory at Kyoto University. The obvious question is "How does it differ from MeCab?" The difference is that Juman++ uses an RNN language model (the so-called deep learning family).

Introductory articles are gradually appearing on Qiita, and I look forward to it being widely used in the future.

- I tried the new morphological analyzer JUMAN++ and found it accurate enough that I am considering switching from MeCab
- Comparing multiple morphological analyzers

Some concerns about Juman++

  1. You have to update the dependent libraries, especially those around gcc
  2. Slow

Updating gcc to satisfy the dependencies can break other code that relies on the existing toolchain. In that case, the neat solution is to prepare a Docker environment.

The real problem is point 2, speed. This article reports:

MeCab took about 10 seconds, while JUMAN++ took more than 10 hours

So concerns about speed are clearly justified.

I also made a measurement comparison in my environment.

time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | mecab

echo   0.00s user 0.00s system 26% cpu 0.005 total
mecab  0.00s user 0.00s system 49% cpu 0.007 total
time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | jumanpp

echo   0.00s user 0.00s system 31% cpu 0.004 total
jumanpp  0.14s user 0.35s system 53% cpu 0.931 total

Compared to MeCab, Juman++ is more than 100 times slower (0.007 s versus 0.931 s total).

The cause is not the design itself but the time it takes to load the model (or so I heard from a certain source). In other words, it is an unavoidable cost of using an RNN language model.
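If you want to repeat this comparison from Python rather than with the shell's time, a rough sketch follows. The helper name time_command and its parameters are my own for illustration; it only assumes that the command reads text from standard input, as mecab and jumanpp do in the measurements above.

```python
import subprocess
import time


def time_command(cmd, text, repeats=3):
    """Return the best wall-clock time (seconds) of piping `text` into `cmd`.

    `cmd` is an argument list such as ["mecab"] or ["jumanpp"].
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Every run spawns a fresh process, so any start-up cost
        # (for example, loading a model) is paid on every call.
        subprocess.run(
            cmd,
            input=text.encode("utf-8"),
            stdout=subprocess.DEVNULL,
            check=True,
        )
        best = min(best, time.perf_counter() - start)
    return best
```

With both analyzers installed, calling time_command(["mecab"], sentence) and time_command(["jumanpp"], sentence) should show a gap of the same kind as the shell measurement, although the absolute numbers depend on the machine.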

Then what should I do?

Let's use server mode!

The solution is simple: use the server script!

Actually, this is properly written in the ver.1.0.1 Manual. See page 5.

Use the Ruby script included in the Juman++ tarball and leave it running in server mode.

According to the manual, start the server with:

$ ruby script/server.rb --cmd "jumanpp -B 5" --host host.name --port 1234

Then call it as a client with:

$ echo "Eat cake" | ruby script/client.rb --host host.name --port 1234

So how much time can you save by using server mode?

time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | ruby client.rb --host localhost

echo   0.00s user 0.00s system 21% cpu 0.006 total
ruby client.rb --host localhost  0.04s user 0.01s system 47% cpu 0.092 total

That is about a tenth of the time. Impressive! So what happens when Juman++ is called over the network? I started the Juman++ server on a machine in the local network and measured it.

time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | ruby client.rb --host sever.hogehoge

echo   0.00s user 0.00s system 22% cpu 0.005 total
ruby client.rb --host sever.hogehoge 0.03s user 0.01s system 26% cpu 0.167 total

Well, allowing for network latency, that seems about right. Either way, we found that server mode removes the bottleneck.
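To get a feel for what this means on a larger job, here is a back-of-the-envelope estimate. It simply multiplies the single-call times measured above by the number of sentences, so treat it as a rough linear extrapolation from one measurement, not a benchmark. The function names are my own.

```python
# Per-call wall-clock times taken from the measurements above (seconds).
SUBPROCESS_SECONDS = 0.931  # jumanpp started fresh for every sentence
SERVER_SECONDS = 0.092      # client.rb talking to a server on localhost


def estimated_total_seconds(n_sentences, seconds_per_call):
    """Linear estimate: every call is assumed to cost the same measured time."""
    return n_sentences * seconds_per_call


def format_duration(seconds):
    """Format a number of seconds as 'Hh MMm SSs'."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return "%dh %02dm %02ds" % (hours, minutes, secs)
```

For 10,000 sentences this gives roughly 2 hours 35 minutes with one process per sentence, against about 15 minutes through the local server.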

Everyone, let's use Juman++ in server mode!

Use Juman++ server mode from Python

The client script above is written in Ruby, so Ruby users can simply use it as is.

However, I'm a Python addict, so I want to call it from Python. (If you want a Python version of client.rb, see the code attached at the bottom.) There is an official Python package called pyknp, but as of pyknp-0.3 it only supports subprocess calls for Juman++, so it cannot benefit from server mode.
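For reference, talking to the server from Python needs little more than a TCP socket. The sketch below is not the code from pyknp or from my package. It rests on an assumption about the wire format that you should check against server.rb and client.rb: that the server reads one newline-terminated UTF-8 line and answers with analysis lines ending in a line that contains only EOS. The function name is hypothetical, and the default port is the one used in the examples later in this article.

```python
import socket


def analyze_via_server(sentence, host="localhost", port=12000, timeout=10.0):
    """Send one sentence to a running Juman++ server and return its raw reply.

    ASSUMPTION (not verified against server.rb): the server reads a single
    newline-terminated UTF-8 line and replies with text that ends in "EOS\n".
    """
    payload = sentence.strip().encode("utf-8") + b"\n"
    received = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        # Keep reading until the assumed end-of-sentence marker arrives.
        while not received.endswith(b"EOS\n"):
            chunk = sock.recv(4096)
            if not chunk:  # the server closed the connection
                break
            received += chunk
    # Decode only after everything has arrived, so a multi-byte character
    # split across two recv() calls cannot break decoding.
    return received.decode("utf-8")
```

The point is that the model stays loaded in the server process, so each call pays only for the socket round trip and the analysis itself.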

I have published a Python package called JapaneseTokenizer, and I have added Juman++ server-mode support to it.

It works with both Python 2.x and Python 3.x.

What you can do

- Get a list of morpheme segmentation results in one line
- Run morpheme segmentation -> part-of-speech filtering -> stop word removal -> conversion to a list in one line
- Call MeCab, Juman, Juman++, and KyTea with the same notation

Installation method

  1. Install MeCab, Juman, and Juman++. See this README.
  2. Start Juman++ in server mode, using the server.rb included with Juman++.
  3. pip install JapaneseTokenizer

That's it.
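Since step 2 is easy to forget, it can help to check that something is actually listening before you create the wrapper. This is a small standard-library sketch, not part of the package. The function name is my own, and the default port 12000 is just the one used in the examples in this article.

```python
import socket


def server_is_reachable(host="localhost", port=12000, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

It only tells you that the port accepts connections, not that the process behind it is really Juman++.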

How to use

It only takes one line to call Juman++ in server mode.

>>> from JapaneseTokenizer import JumanppWrapper
>>> sentence = 'Tehran (Persian): تهران  ; Tehrān Tehran.ogg pronunciation[help/File]/teɦˈrɔːn/,English:Tehran) is the capital of Iran, West Asia, and the capital of Tehran Province. Population 12,223,598 people. Metropolitan population is 13,413,Reach 348 people.'
>>> list_result = JumanppWrapper(server='localhost', port=12000).tokenize(sentence, return_list=True)
>>> print(list_result)
['Tehran', 'Persia', 'word', 'pronunciation', 'help', 'File', '英word', 'Tehran', 'West', 'Asia', 'Iran', 'capital', 'Tehran', 'State capital', 'population', '12,223,598', 'city', 'Category', 'population', '13,413,348']

To select morphemes by part of speech, pass the parts of speech you want as a List[Tuple[str]]. See this page for the part-of-speech system of Juman++.

>>> from JapaneseTokenizer import JumanppWrapper
>>> sentence = 'Tehran (Persian): تهران  ; Tehrān Tehran.ogg pronunciation[help/File]/teɦˈrɔːn/,English:Tehran) is the capital of Iran, West Asia, and the capital of Tehran Province. Population 12,223,598 people. Metropolitan population is 13,413,Reach 348 people.'
>>> pos_condition = [('noun', 'Place name')]
>>> JumanppWrapper(server='localhost', port=12000).tokenize(sentence, return_list=False).filter(pos_condition=pos_condition).convert_list_object()
['Tehran', 'Asia', 'Iran', 'Tehran']
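To make the shape of pos_condition concrete, here is a minimal plain-Python sketch of this kind of filtering. It does not use the package and is not how the package is implemented. The token representation, a list of (surface, pos_tuple) pairs, and the rule that a condition matches when it is a prefix of the token's part-of-speech tuple are both assumptions made for illustration.

```python
def filter_tokens(tokens, pos_condition, stopwords=()):
    """Keep the surfaces whose part of speech matches any condition.

    tokens: list of (surface, pos) pairs, where pos is a tuple such as
            ('noun', 'Place name')  (hypothetical representation).
    pos_condition: list of tuples; a condition matches when it is a prefix
            of the token's pos tuple (an assumed rule, for illustration).
    stopwords: surfaces to drop even if their part of speech matches.
    """
    kept = []
    for surface, pos in tokens:
        matches = any(pos[:len(cond)] == tuple(cond) for cond in pos_condition)
        if matches and surface not in stopwords:
            kept.append(surface)
    return kept
```

Under these assumptions, a condition such as ('noun',) keeps every noun, while ('noun', 'Place name') narrows the result to place names only.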

You can also retrieve the part-of-speech information, surface forms, and other information that Juman++ outputs.

See examples.py for more information.

Improvements since the previous article

- Added Juman++
- Fixed a bug that occurred in Juman server mode
- Introduced syntactic sugar that completes part-of-speech filtering in one line
