[PYTHON] Pack Japanese processing software into a Docker image

I packed software that is useful for Japanese natural language processing, including MeCab, CaboCha, and Japanese WordNet, into a single Docker image.


Where it is published

The image is published under the name ototadana/nlp-jp, which the commands below pull and run.


How to use

Launch Bash

To start a container and get a shell inside it, run:

$ docker run --rm -it ototadana/nlp-jp bash

Start Python (REPL)

To start the Python REPL, run:

$ docker run --rm -it ototadana/nlp-jp python

MeCab

An example of running the mecab command:

$ echo "The UFO crash that became a hot topic last year is now just a tourism resource. City specialties" | docker run --rm -i ototadana/nlp-jp mecab
Last year noun,Adverbs possible,*,*,*,*,last year,Sakunen,Sakunen
Topic noun,General,*,*,*,*,topic,Wadai,Wadai
Particles,Case particles,General,*,*,*,To,D,D
Verb,Independence,*,*,Five steps, La line,Continuous connection,Become,Nat,Nat
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
UFO noun,Proper noun,General,*,*,*,UFO,UFO,UFO
Crash noun,Change connection,*,*,*,*,Crash,Twirac,Twirac
Case noun,General,*,*,*,*,Incident,Jiken,Jiken
, Symbol,Comma,*,*,*,*,、,、,、
Now noun,Adverbs possible,*,*,*,*,now,Ima,Ima
Is a particle,Particle,*,*,*,*,Is,C,Wow
Just a noun,General,*,*,*,*,However,free,free
Particles,Attributive,*,*,*,*,of,No,No
Tourism resource noun,Proper noun,General,*,*,*,Tourism resources,Kankoushigen,Kanko Shigen
.. symbol,Kuten,*,*,*,*,。,。,。
City noun,General,*,*,*,*,City,Machi,Machi
Particles,Attributive,*,*,*,*,of,No,No
Famous noun,General,*,*,*,*,Specialty,Mabutsu,Mabutsu
EOS
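
If you capture this output from a script, each token line is the surface form, a tab, and then a comma-separated feature string, and each sentence ends with an EOS line. The following is a minimal sketch of splitting that text into tuples, assuming MeCab's default output format. `parse_mecab_output` is a hypothetical helper name (not part of MeCab), and the sample line is hand-written for illustration.

```python
def parse_mecab_output(text):
    """Split default-format MeCab output into (surface, features) pairs."""
    tokens = []
    for line in text.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == 'EOS':
            continue
        # The surface form and the feature string are separated by one tab.
        surface, _, feature = line.partition('\t')
        tokens.append((surface, feature.split(',')))
    return tokens


# Hand-written sample line (an assumption, not captured from the image).
sample = '話題\t名詞,一般,*,*,*,*,話題,ワダイ,ワダイ\nEOS\n'
for surface, features in parse_mecab_output(sample):
    # features[0] is the part of speech, features[6] the base form.
    print(surface, features[0], features[6])
```

This naive comma split is enough for a quick look, but it would break on a feature field that itself contains a comma.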

CaboCha

An example of running the cabocha command:

$ echo "The UFO crash that became a hot topic last year is now just a tourism resource. City specialties" | docker run --rm -i ototadana/nlp-jp cabocha
last year---D
To the topic-D
became-D
UFO crash,-----D
now---D
Just-D
Tourism resources.---D
the town's-D
Specialty

Japanese WordNet

Japanese WordNet is stored in the image as a SQLite database. You can query it from Python code like this:

example-wordnet.py:

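import sqlite3
from contextlib import closing

# Fetch the Japanese definitions of every synset that contains the given lemma.
query = """
    select c.def from sense a, word b, synset_def c
      where b.lemma = ? and c.lang = 'jpn'
      and a.wordid = b.wordid and a.synset = c.synset
    """

# sqlite3's own context manager only commits or rolls back, so wrap the
# connection in closing() to make sure it is actually closed.
with closing(sqlite3.connect('/dictionary/wnjpn.db')) as conn:
    print([row[0] for row in conn.execute(query, ['topic'])])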

If you save this code on the host, mount the host's current directory into the container with the -v option and run it as follows:

$ docker run --rm -i -v $PWD:/app ototadana/nlp-jp python /app/example-wordnet.py
['Subject of conversation or discussion']
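
To experiment with the query without the image at hand, here is a sketch that builds a tiny in-memory SQLite database containing only the tables and columns the query above touches. The rows and the synset ID are made up for illustration, and the real wnjpn.db has more tables and columns than this. `build_toy_db` and `lookup` are hypothetical helper names.

```python
import sqlite3
from contextlib import closing

QUERY = """
    select c.def from sense a, word b, synset_def c
      where b.lemma = ? and c.lang = 'jpn'
      and a.wordid = b.wordid and a.synset = c.synset
    """


def build_toy_db():
    """Create an in-memory database with made-up rows (not real WordNet data)."""
    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        create table word (wordid integer, lemma text);
        create table sense (synset text, wordid integer);
        create table synset_def (synset text, lang text, def text);
        insert into word values (1, 'topic');
        insert into sense values ('s-0001', 1);
        insert into synset_def values ('s-0001', 'jpn', 'toy Japanese definition');
        insert into synset_def values ('s-0001', 'eng', 'toy English definition');
    """)
    return conn


def lookup(conn, lemma):
    """Run the definition query for one lemma on an open connection."""
    return [row[0] for row in conn.execute(QUERY, [lemma])]


with closing(build_toy_db()) as conn:
    print(lookup(conn, 'topic'))    # only the row with lang = 'jpn' matches
    print(lookup(conn, 'missing'))  # unknown lemmas give an empty list
```

The join goes from word (lemma) through sense (which synsets the word belongs to) to synset_def (the definition text), which is why a word with many senses returns many definitions.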

MeCab + Japanese WordNet

MeCab can also be called from Python code, so you can combine it with Japanese WordNet as follows:

example-mecab+wordnet.py:

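import sqlite3
from contextlib import closing

import MeCab


def get_definition(word):
    """Return the Japanese WordNet definitions for a lemma."""
    query = """
        select c.def from sense a, word b, synset_def c
          where b.lemma = ? and c.lang = 'jpn'
          and a.wordid = b.wordid and a.synset = c.synset
        """
    # closing() makes sure the connection is closed after each lookup.
    with closing(sqlite3.connect('/dictionary/wnjpn.db')) as conn:
        return [row[0] for row in conn.execute(query, [word])]


tagger = MeCab.Tagger()
# Parse an empty string first: a commonly used workaround for node.surface
# coming back empty or garbled in some mecab-python3 versions.
tagger.parse('')

# parseToNode() returns the BOS node, so .next moves to the first real token.
node = tagger.parseToNode('The UFO crash that became a hot topic last year is now just a tourism resource. City specialties').next

while node:
    print('%s:' % node.surface)
    print('  - %s' % node.feature)
    # Field 6 of the feature string is the base (dictionary) form.
    for definition in get_definition(node.feature.split(',')[6]):
        print('  - %s' % definition)
    print()
    node = node.next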

Running it produces the following output:

$ docker run --rm -i -v $PWD:/app ototadana/nlp-jp python /app/example-mecab+wordnet.py
last year:
  -noun,Adverbs possible,*,*,*,*,last year,Sakunen,Sakunen

topic:
  -noun,General,*,*,*,*,topic,Wadai,Wadai
  -Subject of conversation or discussion

To:
  -Particle,Case particles,General,*,*,*,To,D,D

Became:
  -verb,Independence,*,*,Five steps, La line,Continuous connection,Become,Nat,Nat
  -Accept change or development
  -Loud and cheerful
  -Get sick and be a victim of illness
  -Officially take a year
  -Appropriate
  -To make or represent:
  -To exist
  -Number or quantity calculation fits
  -Develop and reach maturity
  -To mature
  -Happens in a particular way
  -Reach or enter a state, relationship, condition, use or status
  -Can, change, be made, or they are possible
  -Gradually shifts to a state and exhibits a particular property or attribute
  -Become
  -To be in or to be in a particular state or state
  -Deformed or subject to changes in position or behavior
  -Direct or distract a person's attention, interests, thoughts, or interests from something
  -Develop

Ta:
  -Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta

UFO:
  -noun,Proper noun,General,*,*,*,UFO,UFO,UFO

Crash:
  -noun,Change connection,*,*,*,*,Crash,Twirac,Twirac
  -Rapid free fall due to gravity
  -When under the influence of gravity, it falls without stopping
  -Fall or fall sharply

Incident:
  -noun,General,*,*,*,*,Incident,Jiken,Jiken
  -Public uproar
  -Issues that need to be investigated
  -A single notable event
  -Something happened

、:
  -symbol,Comma,*,*,*,*,、,、,、

now:
  -noun,Adverbs possible,*,*,*,*,now,Ima,Ima
  -Current or modern
  -Momentary present
  -Time currently happening
  -A series of hours including the moment of speech
  -Just a little bit before
  -Historical present
  -At this point in the narration of a series of past events
  -In the current time, the time pattern
  -Just now
  -At the moment
  -Current
  -At the moment

Is:
  -Particle,Particle,*,*,*,*,Is,C,Wow

However:
  -noun,General,*,*,*,*,However,free,free
  -Without anything else included or related
  -And many are nothing

of:
  -Particle,Attributive,*,*,*,*,of,No,No

Tourism resources:
  -noun,Proper noun,General,*,*,*,Tourism resources,Kankoushigen,Kanko Shigen

。:
  -symbol,Kuten,*,*,*,*,。,。,。

City:
  -noun,General,*,*,*,*,City,Machi,Machi
  -Situations that give opportunities
  -A region of the town with distinctive characteristics

of:
  -Particle,Attributive,*,*,*,*,of,No,No

Specialty:
  -noun,General,*,*,*,*,Specialty,Mabutsu,Mabutsu
  -Entertainment offered to the masses

:
  - BOS/EOS,*,*,*,*,*,*,*,*
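
The output above looks up every token, including particles, symbols, and the trailing BOS/EOS node, none of which have WordNet entries. One possible refinement is to pick the base form only for content words before querying. The sketch below assumes the feature layout of the IPA dictionary (part of speech in field 0, base form in field 6, and `*` where no base form is known). `base_form_for_lookup` and `CONTENT_POS` are hypothetical names, not part of MeCab.

```python
# Part-of-speech labels as written in the feature string
# (assumed IPA dictionary labels: noun, verb, adjective, adverb).
CONTENT_POS = {'名詞', '動詞', '形容詞', '副詞'}


def base_form_for_lookup(feature, content_pos=CONTENT_POS):
    """Return the base form to look up, or None if the token should be skipped."""
    fields = feature.split(',')
    # Skip particles, symbols, BOS/EOS, and anything else outside the chosen set.
    if fields[0] not in content_pos:
        return None
    # Skip tokens with no usable base form.
    if len(fields) <= 6 or fields[6] == '*':
        return None
    return fields[6]


print(base_form_for_lookup('名詞,一般,*,*,*,*,話題,ワダイ,ワダイ'))
print(base_form_for_lookup('BOS/EOS,*,*,*,*,*,*,*,*'))
```

In the loop, you would call this on node.feature and query the database only when the result is not None, which also saves one database connection per skipped token.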

Acknowledgments

The sample sentence used in the examples above is the opening of the lyrics to "Oedo Controller" by Yunomi feat. TORIENA.

**Yunomi is the best!** (In short, I wrote this entry just to say that ...)
