[PYTHON] Dockerfile for RESTful MeCab server with mecab-ipadic-neologd

Installing mecab-python turned out to be a hassle last time, so this time I packaged the whole setup in a Dockerfile.

The REST API part is implemented in Flask, based on a reference article. The source is available on GitHub. I also used docker-compose, which is convenient, although it offers little benefit with only a single container as in this case.

[2016-10-07 postscript] Added a section on updating the dictionary file.
[2018-03-11 postscript] Added a front end.

Dockerfile

This is the end result.

FROM ubuntu:16.04

RUN apt-get update \
  && apt-get install python3 python3-pip curl git sudo cron -y \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /opt
RUN git clone https://github.com/taku910/mecab.git
WORKDIR /opt/mecab/mecab
RUN ./configure  --enable-utf8-only \
  && make \
  && make check \
  && make install \
  && ldconfig

WORKDIR /opt/mecab/mecab-ipadic
RUN ./configure --with-charset=utf8 \
  && make \
  && make install

WORKDIR /opt
RUN git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
WORKDIR /opt/mecab-ipadic-neologd
RUN ./bin/install-mecab-ipadic-neologd -n -y

COPY . /opt/api
WORKDIR /opt/api
RUN pip3 install -r requirements.txt

CMD ["python3", "server.py"]

[2016-10-07 postscript]

Other

Directory structure

.
├── README.md
├── docker-compose.yml
└── flask-mecab
    ├── Dockerfile
    ├── requirements.txt
    └── server.py

docker-compose.yml

api:
  build: ./flask-mecab
  volumes:
    - "./flask-mecab:/opt/api"
  ports:
    - "5000:5000"
  restart: always

Convert MeCab output result to JSON format

MeCab's output format can be customized, but the default is

Surface type\tPart of speech,Part of speech subclassification 1,Part of speech subclassification 2,Part of speech subclassification 3,Inflected form,Utilization type,prototype,reading,pronunciation

so each line of the output string is split on '\t' and ',', and zip() pairs each value with its field name.

server.py


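import MeCab


def mecab_parse(sentence, dic='ipadic'):
    # Directory where the Dockerfile above installs the dictionaries
    dic_dir = "/usr/local/lib/mecab/dic/"
    if dic == 'neologd':
        dic_name = 'mecab-ipadic-neologd'
    else:
        dic_name = dic

    m = MeCab.Tagger('-d ' + dic_dir + dic_name)

    # Output format (default)
    fields = ['Surface type', 'Part of speech', 'Part of speech subclassification 1', 'Part of speech subclassification 2', 'Part of speech subclassification 3', 'Inflected form', 'Utilization type', 'prototype', 'reading', 'pronunciation']

    results = []
    # parse() ends with 'EOS\n', so the last two split items ('EOS' and '') are dropped.
    for line in m.parse(sentence).split('\n')[:-2]:
        surface, features = line.split('\t')
        results.append(dict(zip(fields, [surface] + features.split(','))))
    return results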
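
The split-and-zip step can be tried without MeCab installed. The following is a minimal sketch, assuming a hand-written sample line in the default layout; `parse_mecab_output` and `SAMPLE` are hypothetical names that are not part of server.py, and the sample is not real tagger output.

```python
# Field names follow the default output format described in this article.
FIELDS = [
    'Surface type', 'Part of speech',
    'Part of speech subclassification 1',
    'Part of speech subclassification 2',
    'Part of speech subclassification 3',
    'Inflected form', 'Utilization type',
    'prototype', 'reading', 'pronunciation',
]


def parse_mecab_output(raw):
    """Convert raw MeCab text output into a list of dicts (hypothetical helper)."""
    tokens = []
    for line in raw.splitlines():
        # Skip the 'EOS' terminator line and any blank line.
        if not line or line == 'EOS':
            continue
        # The surface form comes before the tab, the features after it.
        surface, features = line.split('\t', 1)
        tokens.append(dict(zip(FIELDS, [surface] + features.split(','))))
    return tokens


# Hand-written sample in MeCab's default layout (an assumption, not real output).
SAMPLE = "関数\t名詞,一般,*,*,*,*,関数,カンスウ,カンスー\nEOS\n"

for token in parse_mecab_output(SAMPLE):
    print(token['Surface type'], token['Part of speech'], token['reading'])
```

Filtering on the 'EOS' line instead of slicing off the last two items is a small design choice that keeps the helper working even if the input has no trailing newline.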

See also the execution example below.

How to start

Because port forwarding and other settings are written in docker-compose.yml, this is all it takes to start the service.

$ git clone https://github.com/matsulib/mecab-service.git
$ cd mecab-service/
$ sudo docker-compose up -d   

How to use

HTTP request

POST /mecab/v1/parse-ipadic
POST /mecab/v1/parse-neologd

Request header

Content-Type: application/json

Request body

{
  "sentence":String
}

Example: ipadic

$ curl -X POST http://localhost:5000/mecab/v1/parse-ipadic \
       -H "Content-type: application/json" \
       -d '{"sentence": "Functional programming"}'  | jq .
{
  "dict": "ipadic",
  "message": "Success",
  "results": [
    {
      "prototype": "function",
      "Part of speech": "noun",
      "Part of speech subclassification 1": "General",
      "Part of speech subclassification 2": "*",
      "Part of speech subclassification 3": "*",
      "Utilization type": "*",
      "Inflected form": "*",
      "pronunciation": "Kansu",
      "Surface type": "function",
      "reading": "Kansu"
    },
    {
      "prototype": "Mold",
      "Part of speech": "noun",
      "Part of speech subclassification 1": "suffix",
      "Part of speech subclassification 2": "General",
      "Part of speech subclassification 3": "*",
      "Utilization type": "*",
      "Inflected form": "*",
      "pronunciation": "Play",
      "Surface type": "Mold",
      "reading": "Play"
    },
    {
      "prototype": "programming",
      "Part of speech": "noun",
      "Part of speech subclassification 1": "Change connection",
      "Part of speech subclassification 2": "*",
      "Part of speech subclassification 3": "*",
      "Utilization type": "*",
      "Inflected form": "*",
      "pronunciation": "programming",
      "Surface type": "programming",
      "reading": "programming"
    }
  ],
  "status": 200
}
 

Example: mecab-ipadic-neologd

mecab-ipadic-neologd is a dictionary that is good at recognizing proper nouns.

$ curl -X POST http://localhost:5000/mecab/v1/parse-neologd \
       -H "Content-type: application/json" \
       -d '{"sentence": "Functional programming"}'  | jq .
{
  "dict": "neologd",
  "message": "Success",
  "results": [
    {
      "prototype": "Functional programming",
      "Part of speech": "noun",
      "Part of speech subclassification 1": "Proper noun",
      "Part of speech subclassification 2": "General",
      "Part of speech subclassification 3": "*",
      "Utilization type": "*",
      "Inflected form": "*",
      "pronunciation": "Kansugata programming",
      "Surface type": "Functional programming",
      "reading": "Kansugata programming"
    }
  ],
  "status": 200
}
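
The same requests can be issued from Python instead of curl. This is a minimal sketch using only the standard library; `BASE_URL`, `build_parse_request`, and `parse` are hypothetical names, and the base URL assumes the port mapping from docker-compose.yml.

```python
import json
import urllib.request

# Assumed default; matches the port published in docker-compose.yml.
BASE_URL = "http://localhost:5000/mecab/v1"


def build_parse_request(sentence, dic="ipadic", base_url=BASE_URL):
    """Build (but do not send) the POST request for one of the two endpoints."""
    body = json.dumps({"sentence": sentence}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/parse-{dic}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse(sentence, dic="ipadic", base_url=BASE_URL):
    """Send the request and return the decoded JSON response."""
    request = build_parse_request(sentence, dic, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `parse(sentence, dic="neologd")["results"]` should give the same list of token dictionaries that jq prints above. Building the request separately from sending it makes the URL, header, and body easy to inspect without a live server.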

About image size

The image built from this Dockerfile is mecabservice_api in the listing below; it is huge compared with the base ubuntu image.

$ sudo docker images
REPOSITORY                TAG                 IMAGE ID            CREATED             SIZE
mecabservice_api          latest              785bc1295e46        About an hour ago   3.375 GB
ubuntu                    16.04               c73a085dc378        4 days ago          127.1 MB

Let's use the history subcommand to see where the image bloated. The dictionary files appear to be very large; perhaps that cannot be helped.

$ sudo docker history mecabservice_api
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
785bc1295e46        About an hour ago   /bin/sh -c #(nop) CMD ["python3" "server.py"]   0 B
00339a58f77e        About an hour ago   /bin/sh -c pip3 install -r requirements.txt     5.852 MB
a397235a8e30        About an hour ago   /bin/sh -c #(nop) WORKDIR /opt/api              0 B
436ef40f928d        About an hour ago   /bin/sh -c #(nop) COPY dir:a4cb3a57cc1f117b07   2.445 kB
12456a11160d        3 hours ago         /bin/sh -c ./bin/install-mecab-ipadic-neologd   2.307 GB
18478e2f8d71        3 hours ago         /bin/sh -c #(nop) WORKDIR /opt/mecab/mecab-ip   0 B
74e6a0bffa98        3 hours ago         /bin/sh -c git clone --depth 1 https://github   114.2 MB
db6a7c716f21        3 hours ago         /bin/sh -c #(nop) WORKDIR /opt/mecab            0 B
6980af0a5afd        3 hours ago         /bin/sh -c ./configure --with-charset=utf8      106 MB
db32f32c58e6        3 hours ago         /bin/sh -c #(nop) WORKDIR /opt/mecab/mecab-ip   0 B
5745385f2342        3 hours ago         /bin/sh -c ./configure  --enable-utf8-only      11.16 MB
8c601b1fac00        3 hours ago         /bin/sh -c #(nop) WORKDIR /opt/mecab/mecab      0 B
d730397e47eb        3 hours ago         /bin/sh -c git clone https://github.com/taku9   378.8 MB
2abd825af064        3 hours ago         /bin/sh -c apt-get update   && apt-get instal   325 MB
d5abec2370fb        3 hours ago         /bin/sh -c #(nop) WORKDIR /opt                  0 B
c73a085dc378        4 days ago          /bin/sh -c #(nop)  CMD ["/bin/bash"]            0 B
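
To make the breakdown easier to read, the non-zero layer sizes can be ranked with a few lines of Python. This is a rough sketch: the sizes are copied from the listing above, the short layer labels are my own shorthand for the truncated commands, and decimal units (1 GB = 1000 MB) are an assumption.

```python
# Non-zero layer sizes copied from the `docker history` output above.
# The keys are shorthand labels, not the exact commands.
LAYERS = {
    "pip3 install -r requirements.txt": "5.852 MB",
    "COPY . /opt/api": "2.445 kB",
    "install-mecab-ipadic-neologd": "2.307 GB",
    "git clone mecab-ipadic-neologd": "114.2 MB",
    "mecab-ipadic configure/make/install": "106 MB",
    "mecab configure/make/install": "11.16 MB",
    "git clone mecab": "378.8 MB",
    "apt-get install": "325 MB",
}

# Assumption: decimal units (1 GB = 1000 MB).
UNIT_TO_MB = {"B": 1e-6, "kB": 1e-3, "MB": 1.0, "GB": 1e3}


def size_to_mb(text):
    """Convert a size string such as '2.307 GB' to megabytes."""
    value, unit = text.split()
    return float(value) * UNIT_TO_MB[unit]


def rank_layers(layers):
    """Return (label, size in MB) pairs, largest first."""
    sized = [(name, size_to_mb(size)) for name, size in layers.items()]
    return sorted(sized, key=lambda item: item[1], reverse=True)


ranked = rank_layers(LAYERS)
total_mb = sum(size for _, size in ranked)
for name, size in ranked:
    print(f"{size:10.1f} MB  {size / total_mb:6.1%}  {name}")
```

Under these assumptions the listed layers add up to about 3.25 GB, which together with the 127.1 MB base image is consistent with the reported 3.375 GB. The neologd install step alone accounts for roughly 71% of the added size, and the two git clone layers are the next candidates for trimming.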

[2016/10/07 postscript] Automatically updating the mecab-ipadic-neologd dictionary

I noticed that the mecab-ipadic-neologd dictionary is updated regularly on the development server.

- This dictionary is updated automatically on the development server
  - It is updated at least twice a week
    - On Mondays and Thursdays

The official documentation also explains how to apply updates to an already-installed dictionary.

If you want to update automatically with cron, convenient options are already available (explanation omitted). For example, if you put the following two lines in a crontab, the dictionary files are updated at 3:00 a.m. every Tuesday and Friday, without checking the analysis results (-y), with user privileges (-u), in the specified directory (-p [/path/to/user/directory]).

    00 03 * * 2 ./bin/install-mecab-ipadic-neologd -n -y -u -p /path/to/user/directory > /path/to/log/file
    00 03 * * 5 ./bin/install-mecab-ipadic-neologd -n -y -u -p /path/to/user/directory > /path/to/log/file

Assuming the running container is named mecabservice_api_1 and the update log should go to /opt/log/mecab, you can connect to the container with bash and configure it as follows.

$ sudo docker exec -it mecabservice_api_1 /bin/bash
# mkdir -p /opt/log/mecab
# (crontab -l 2>/dev/null; echo "PATH=$PATH") | crontab -
# (crontab -l 2>/dev/null; echo '00 18 * * 1 /opt/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y >/opt/log/mecab/neologd_`date -I`.log 2>/opt/log/mecab/neologd_`date -I`.err') | crontab -
# (crontab -l 2>/dev/null; echo '00 18 * * 4 /opt/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y >/opt/log/mecab/neologd_`date -I`.log 2>/opt/log/mecab/neologd_`date -I`.err') | crontab -
# /etc/init.d/cron restart
# exit

The schedule here is written in UTC so that the update runs at 3:00 a.m. JST on Tuesdays and Fridays, which is 18:00 UTC on Mondays and Thursdays. I was stuck for a moment because mecab is not on the default crontab PATH. Also, the settings above did not work at first because of a small bug in the shell script, but I filed an issue and a pull request, and as soon as it was merged everything started working (*^^)v
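
The JST-to-UTC shift, including the change of weekday, can be checked with a short script. This is a minimal sketch; `jst_to_utc_cron` is a hypothetical helper, and it assumes crontab's day-of-week numbering (0 = Sunday) and a fixed +09:00 offset for JST.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))  # Japan Standard Time has no DST


def jst_to_utc_cron(dow, hour, minute=0):
    """Convert a weekly JST time to UTC crontab time fields.

    dow uses crontab numbering (0 = Sunday ... 6 = Saturday).
    """
    # 2016-10-02 was a Sunday; any Sunday works as the reference date.
    local = datetime(2016, 10, 2, hour, minute, tzinfo=JST) + timedelta(days=dow)
    utc = local.astimezone(timezone.utc)
    # Python numbers Monday as 0, crontab numbers Sunday as 0.
    utc_dow = (utc.weekday() + 1) % 7
    return f"{utc.minute:02d} {utc.hour:02d} * * {utc_dow}"


for dow in (2, 5):  # Tuesday and Friday in JST
    print(jst_to_utc_cron(dow, 3))
```

For Tuesday and Friday at 3:00 a.m. JST this prints `00 18 * * 1` and `00 18 * * 4`, the same time fields used in the crontab entries above.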

[2018/03/11 postscript] Adding a front end

Using docker-compose for a single service felt a bit lonely, so I added a front-end web application that calls the REST API above as a second docker-compose service. See GitHub for the front-end source. I got stuck on communication between containers and on the CORS settings.

docker-compose.yml

api:
  build: ./flask-mecab
  volumes:
    - "./flask-mecab:/opt/api"
  ports:
    - "5000:5000"
  restart: always
front:
  build: ./flask-mecab-front
  volumes:
    - "./flask-mecab-front:/opt/front"
  ports:
    - "5001:5001"
  restart: always
  links:
    - api
  environment:
    FLASK_MECAB_URI: "http://api:5000/mecab/v1"
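
The front-end source is not shown here, so the following is only a guess at how a front end might consume FLASK_MECAB_URI. `endpoint_url` and `DEFAULT_API_URI` are hypothetical names; the only parts taken from this article are the variable name, its value, and the endpoint paths.

```python
import os

# Fallback for running the front end outside docker-compose (an assumption).
DEFAULT_API_URI = "http://localhost:5000/mecab/v1"


def endpoint_url(dic, env=os.environ):
    """Return the parse endpoint for a dictionary name such as 'ipadic'."""
    base = env.get("FLASK_MECAB_URI", DEFAULT_API_URI).rstrip("/")
    return f"{base}/parse-{dic}"


# With the compose file above, the front container would see the variable set:
compose_env = {"FLASK_MECAB_URI": "http://api:5000/mecab/v1"}
print(endpoint_url("neologd", compose_env))
```

Inside the front container the host name `api` resolves to the API container because of the `links` entry, which is why the URI uses `api` rather than `localhost`.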

After startup, open http://localhost:5001/ in your browser.

Screenshot: mecab.PNG
