[Python] Making preprocessing, training, and deployment feel good for a Godzilla image recognition model

This article is day 20 of the Chura Data Advent Calendar.

It's already late and the article is still only half finished. I'll complete it tomorrow.

Overview

Lately I've been interested in MLOps, so I've been gathering information and reading books.

This is about the Godzilla image classification API I showed at the Chura Data Anniversary Festival. By fine-tuning a VGG16 model on images casually collected via Google Image Search, it became able to classify images reasonably well.

Something like this:

image.png

However, when I showed it around, people in black were classified as Godzilla, Gamera was classified as Godzilla, and so on; the training and the data are clearly still insufficient.

① Collect data → ② Label it → ③ Place the data → ④ Run the training Jupyter notebook from top to bottom → ⑤ Check the results

I've been running this cycle by hand, and doing it over and over is a pain. That's why I wanted to turn it into a workflow.

Step ① is already implemented in code, and ② is hard to avoid doing by hand at first, so I'll handle ② together with ③ manually. This time I'd be happy if just ④ and ⑤ could be turned into a workflow.

At this scale, scikit-learn's Pipeline would probably be enough, but re-implementing the data download step myself is a pain, so I looked for a good framework.

Then I found this article.

MLOps2020 to start small and grow big

The article above says Kedro + MLflow is a good combination, so I decided to try it out.

So I set things up, and in this article I'll write about how.

I wanted to set it up quickly, but it didn't quite work out that way.

Environment

Local environment

$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.15.7
BuildVersion:	19H15

$ docker -v
Docker version 19.03.13, build 4484c46d9d

$ docker-compose -v
docker-compose version 1.27.4, build 40524192

Preparation

AWS

We'll play in the Oregon region (us-west-2).

Preparing IAM users

- Note down the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; they will be used later.
- Give the user S3 read/write permission.

S3

- Prepare a bucket for now.
- This time I created godzilla-hogehoge-1234 (I'll delete it afterwards anyway).

EC2

- Use p2.xlarge (tears)
- Spot instances are cheap!!!

Preparing the Dockerfiles

Everything will run on Docker.

.env

Add the AWS information you wrote down earlier.


# Postgres information
POSTGRES_USER=user
POSTGRES_PASSWORD=password

# Name of the S3 bucket created above
S3_BUCKET=s3://godzilla-hogehoge-1234/

# AWS information
AWS_ACCESS_KEY_ID=hogehoge
AWS_SECRET_ACCESS_KEY=fugafuga
AWS_DEFAULT_REGION=us-west-2
AWS_DEFAULT_OUTPUT=json

Dockerfile for kedro

Create it as Dockerfile_kedro.


FROM hoto17296/anaconda3-ja

RUN pip install kedro==0.17.0 \
    fsspec==0.6.3 \
    s3fs==0.4.0 \
    botocore==1.19.36 \
    mlflow==1.12.1 \
    tensorflow-gpu

Dockerfile for mlflow server

Create it as Dockerfile_mlflow.

By the way, I shamelessly borrowed most of the code below from this repository:

https://github.com/ymym3412/mlflow-docker-compose

FROM conda/miniconda3:latest

RUN mkdir -p /mlflow/mlruns

WORKDIR /mlflow

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

RUN echo "export LC_ALL=$LC_ALL" >> /etc/profile.d/locale.sh
RUN echo "export LANG=$LANG" >> /etc/profile.d/locale.sh

RUN apt-get update && apt-get install -y \
    build-essential \
    python3-dev \
    libpq-dev

RUN pip install -U pip && \
    pip install --ignore-installed google-cloud-storage && \
    pip install psycopg2 mlflow boto3

COPY ./mlflow_start.sh ./mlflow_start.sh
RUN chmod +x ./mlflow_start.sh

EXPOSE 80
EXPOSE 443

CMD ["./mlflow_start.sh"]

./mlflow_start.sh

#!/bin/bash

set -o errexit
set -o nounset
set -o pipefail

mlflow server \
    --backend-store-uri $DB_URI \
    --host 0.0.0.0 \
    --port 80 \
    --default-artifact-root ${S3_BUCKET}

docker-compose.yml

Uncomment the commented-out lines if you are running in a GPU environment.

version: '3'
services:
  waitfordb:
    image: dadarek/wait-for-dependencies
    depends_on:
      - postgresql
    command: postgresql:5432

  postgresql:
    image: postgres:10.5
    container_name: postgresql
    ports:
      - 5432:5432
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: mlflow-db
      POSTGRES_INITDB_ARGS: "--encoding=UTF-8"
    hostname: postgresql
    restart: always

  mlflow:
    build:
      context: .
      dockerfile: Dockerfile_mlflow
    container_name: mlflow
    expose:
      - 80
      - 443
    ports:
      - "10006:80"
    depends_on:
      - postgresql
      - waitfordb
    environment:
      DB_URI: postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgresql:5432/mlflow-db
      VIRTUAL_PORT: 80
      S3_BUCKET: ${S3_BUCKET}
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}

  kedro:
    # runtime: nvidia
    build:
      context: .
      dockerfile: Dockerfile_kedro
    container_name: kedro
    environment:
      MLFLOW_TRACKING_URI: http://mlflow/
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
      # NVIDIA_VISIBLE_DEVICES: all
    depends_on:
      - mlflow
    volumes:
      - ./:/app

Build

Now that it's ready, let's build it!

$ docker-compose up

Workflow assembly

From here on, we work inside the kedro container.

After the containers have started, run the following command to get a shell inside the container.

$ docker-compose exec kedro /bin/bash
(base) root@hogehoge:/app#

Ok

What is Kedro

I'll just post the links here. It doesn't really matter, but I had assumed it was read "Kedoro", which apparently means something completely different.

- Repository: https://github.com/quantumblacklabs/kedro
- Documentation: https://kedro.readthedocs.io/en/stable/
- Pipeline comparison article: https://qiita.com/Minyus86/items/70622a1502b92ac6b29c

There's also a community forum, so ask there if you run into trouble.

https://discourse.kedro.community/

kedro new

Create the project that the workflow will live in.

(base) root@hogehoge:/app# kedro new
/opt/conda/lib/python3.7/site-packages/jinja2/utils.py:485: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import MutableMapping
/opt/conda/lib/python3.7/site-packages/jinja2/runtime.py:318: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping
/opt/conda/lib/python3.7/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.2) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Project Name:
=============
Please enter a human readable name for your new project.
Spaces and punctuation are allowed.
 [New Kedro Project]: workflow

Repository Name:
================
Please enter a directory name for your new project repository.
Alphanumeric characters, hyphens and underscores are allowed.
Lowercase is recommended.
 [workflow]:

Python Package Name:
====================
Please enter a valid Python package name for your project package.
Alphanumeric characters and underscores are allowed.
Lowercase is recommended. Package name must start with a letter or underscore.
 [workflow]:

Change directory to the project generated in /app/workflow

A best-practice setup includes initialising git and creating a virtual environment before running `kedro install` to install project-specific dependencies. Refer to the Kedro documentation: https://kedro.readthedocs.io/

(base) root@hogehoge:/app# ls -l workflow/
total 12
-rw-r--r--  1 root root 4033 Dec 23 15:52 README.md
drwxr-xr-x  5 root root  160 Dec 23 15:52 conf
drwxr-xr-x 10 root root  320 Dec 23 15:52 data
drwxr-xr-x  3 root root   96 Dec 23 15:52 docs
drwxr-xr-x  4 root root  128 Dec 23 15:52 logs
drwxr-xr-x  3 root root   96 Dec 23 15:52 notebooks
-rw-r--r--  1 root root  341 Dec 23 15:52 pyproject.toml
-rw-r--r--  1 root root   47 Dec 23 15:52 setup.cfg
drwxr-xr-x  6 root root  192 Dec 23 15:52 src

OK (the Python version is 3.7, which is a bit of a hassle).

kedro implementation overview

To give a really rough picture, these are the five files (and directories) you'll be touching.

- ~/workflow/conf/base/parameters.yml
  - Defines the parameters (arguments) passed to the processing handled by nodes (described later)
- ~/workflow/conf/base/catalog.yml
  - Defines the input data, generated intermediate data, evaluation data, and so on
- ~/workflow/src/workflow/nodes/
  - Holds the functions and classes you want to run as nodes on the workflow
  - You can implement the processing directly here, or import modules and keep the logic elsewhere
  - This time all of the original processing was packed into a notebook, so the code salvaged from the notebook goes into nodes
- ~/workflow/src/workflow/pipelines/
  - Scripts that define how the workflow is assembled, e.g. which processing is fed to which node and how the result of nodeA is fed to nodeB
- ~/workflow/src/workflow/hooks.py
  - The prepared pipelines are combined and registered here
  - Note that the execution order is assembled at run time by matching input and output names, so if the names don't line up, nodes may not run in the intended order. Make sure the output name of the previous pipeline matches the input name of the next one.

parameters.yml

This defines the parameters passed to the functions (nodes) executed in the workflow.

my_train_test_split:
  test_size: 0.2
  random_state: 71

catalog.yml

This defines the data handled in the workflow.

This time I've uploaded the zipped image data to S3, so I point the entries there.

You can also define data generated along the way, such as intermediate datasets.

Normally text.TextDataSet is meant for text files, but since there doesn't seem to be a dataset type for zip files, I used fs_args so that the data is read and written as binary.

10_84_godzilla.zip:
  type: text.TextDataSet
  filepath: s3://godzilla-hogehoge-1234/workflow/data/01_raw/10_84.zip
  fs_args:
    open_args_load:
      mode: 'rb'
    open_args_save:
      mode: 'wb'

10_GMK_godizlla.zip:
  type: text.TextDataSet
  filepath: s3://godzilla-hogehoge-1234/workflow/data/01_raw/10_GMK.zip
  fs_args:
    open_args_load:
      mode: 'rb'
    open_args_save:
      mode: 'wb'

10_SOS_godzilla.zip:
  type: text.TextDataSet
  filepath: s3://godzilla-hogehoge-1234/workflow/data/01_raw/10_SOS.zip
  fs_args:
    open_args_load:
      mode: 'rb'
    open_args_save:
      mode: 'wb'

10_first_godzilla.zip:
  type: text.TextDataSet
  filepath: s3://godzilla-hogehoge-1234/workflow/data/01_raw/10_first.zip
  fs_args:
    open_args_load:
      mode: 'rb'
    open_args_save:
      mode: 'wb'

99_other.zip:
  type: text.TextDataSet
  filepath: s3://godzilla-hogehoge-1234/workflow/data/01_raw/99_other.zip
  fs_args:
    open_args_load:
      mode: 'rb'
    open_args_save:
      mode: 'wb'

classes_text:
  type: text.TextDataSet
  filepath: s3://godzilla-hogehoge-1234/workflow/data/01_raw/classes.txt

X_train:
  type: pickle.PickleDataSet
  filepath: s3://godzilla-hogehoge-1234/workflow/data/02_intermediate/X_train.pkl

X_test:
  type: pickle.PickleDataSet
  filepath: s3://godzilla-hogehoge-1234/workflow/data/02_intermediate/X_test.pkl

y_train:
  type: pickle.PickleDataSet
  filepath: s3://godzilla-hogehoge-1234/workflow/data/02_intermediate/y_train.pkl

y_test:
  type: pickle.PickleDataSet
  filepath: s3://godzilla-hogehoge-1234/workflow/data/02_intermediate/y_test.pkl

There are various types of datasets, so please check them.

https://kedro.readthedocs.io/en/stable/05_data/02_kedro_io.html

Also, if you put credentials in ~/workflow/conf/local/credentials.yml, you need to reference the credential key name in catalog.yml as well, but since the credentials are already provided via .env this time, that isn't necessary.

node

The functions fed to nodes are prepared under ~/workflow/src/workflow/nodes/.

However, the nodes themselves are written in ~/workflow/src/workflow/pipelines/, so I'll explain that in the next section.

See GitHub for the functions that do the actual processing.

https://github.com/Aipakazuma/recognize-godzilla
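
To give a feel for what a node function looks like, here's a minimal sketch of my_train_test_split, which is used in the pipeline below. The real implementation is in the repository above; the signature here is my assumption based on parameters.yml and how the node is wired up.

# Minimal sketch (assumption, not the exact code from the repository):
# the third argument receives the dictionary defined under
# my_train_test_split in parameters.yml.
from typing import Dict, Tuple

import numpy as np
from sklearn.model_selection import train_test_split


def my_train_test_split(
    X: np.ndarray, Y: np.ndarray, params: Dict
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Split the image arrays into train and test sets."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y,
        test_size=params['test_size'],
        random_state=params['random_state'],
    )
    return X_train, X_test, y_train, y_test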

pipeline

Yes, the pipeline. Now that the functions to feed to nodes are ready, let's actually assemble the pipeline.

The preprocessing assumes the order ① unzip the image archives → ② load the images → ③ split into training and test sets.

So it ends up looking like this.

~/workflow/src/workflow/pipelines/preprocess_pipeline.py

from kedro.pipeline import node, Pipeline

from workflow.nodes.preprocess import unzip_image_data, load_image_data, my_train_test_split


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=unzip_image_data,
                inputs=[
                    '10_84_godzilla.zip',
                    '10_GMK_godizlla.zip',
                    '10_SOS_godzilla.zip',
                    '10_first_godzilla.zip',
                    '99_other.zip'
                ],
                outputs='unzip_path'
            ),
            node(
                func=load_image_data,
                inputs=[
                    'unzip_path',
                    'classes_text'
                ],
                outputs=[
                    'X',
                    'Y',
                    'classes'
                ]
            ),
            node(
                func=my_train_test_split,
                inputs=[
                    'X',
                    'Y',
                    'params:my_train_test_split'
                ],
                outputs=[
                    'X_train',
                    'X_test',
                    'y_train',
                    'y_test'
                ]
            ),
        ],
        tags=['preprocess'],
    )

Assemble it like this.

Let's strip away some detail and look at the structure.

def create_pipeline(**kwargs):
    return Pipeline([
        node(),  # ①
        node(),  # ②
        node()   # ③
    ])

Something like this: the nodes you want to execute are passed to the Pipeline class as a list.

Now, let's check the node of the process of ①.

            node(
                func=unzip_image_data,
                inputs=[
                    '10_84_godzilla.zip',
                    '10_GMK_godizlla.zip',
                    '10_SOS_godzilla.zip',
                    '10_first_godzilla.zip',
                    '99_other.zip'
                ],
                outputs='unzip_path'
            ),

Specify the function you want to execute with func. The entries in inputs become the arguments of that function; in this case, unzip_image_data needs to accept five arguments. The names in inputs correspond to the data defined in catalog.yml, and Kedro automatically binds (?) those names to the data while the workflow is running.

outputs names the return value(s) of the function. As with inputs, if a name has a corresponding entry in catalog.yml, it is saved as intermediate data (locally for a local filepath, to S3 for an s3:// filepath). If there is no corresponding catalog.yml entry, Kedro keeps it in memory as a kedro.io.MemoryDataSet for the duration of the workflow run.

Some processing has no return value, though, and honestly I'm not sure yet what to do in that case (if anyone knows, please tell me).
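
For reference, here's a rough sketch of what a multi-input node like unzip_image_data might look like. This is my assumption, not the code from the repository: I'm assuming each archive contains a top-level directory named after its class, and the extraction path is a made-up example.

# Rough sketch (assumption, not the repository's code): thanks to the
# fs_args settings in catalog.yml, each zip entry is loaded as raw bytes,
# so the node receives five byte strings and returns the extraction path.
import io
import os
import zipfile


def unzip_image_data(*zip_payloads: bytes) -> str:
    unzip_path = '/tmp/godzilla_images'  # hypothetical working directory
    os.makedirs(unzip_path, exist_ok=True)
    for payload in zip_payloads:
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            archive.extractall(unzip_path)
    return unzip_path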

Next, let's check the node of ②.

            node(
                func=load_image_data,
                inputs=[
                    'unzip_path',
                    'classes_text'
                ],
                outputs=[
                    'X',
                    'Y',
                    'classes'
                ]
            ),

You can see that the output of ① appears in the inputs of ②. It isn't mandatory to feed the output of one node into the next, but Kedro determines the execution order by matching the names in inputs and outputs, so if you want ① to run before ②, implement it as above.

As a reminder, classes_text is defined in catalog.yml, so Kedro loads and prepares it for you during workflow execution.
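
As a rough illustration (again an assumption rather than the repository's actual code), load_image_data could look something like this: classes_text arrives as the contents of classes.txt, and the images are read from the directory returned by ①.

# Sketch under the assumption that classes.txt lists one class directory
# name per line and the images live under <unzip_path>/<class_name>/.
import os
from typing import List, Tuple

import numpy as np
from tensorflow.keras.preprocessing.image import img_to_array, load_img


def load_image_data(unzip_path: str, classes_text: str) -> Tuple[np.ndarray, np.ndarray, List[str]]:
    classes = [line.strip() for line in classes_text.splitlines() if line.strip()]
    X, Y = [], []
    for label, class_name in enumerate(classes):
        class_dir = os.path.join(unzip_path, class_name)
        for file_name in os.listdir(class_dir):
            # 224x224 matches the VGG16 input size used for fine-tuning
            image = load_img(os.path.join(class_dir, file_name), target_size=(224, 224))
            X.append(img_to_array(image) / 255.0)
            Y.append(label)
    return np.array(X), np.array(Y), classes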

Everything explained so far is the preprocessing pipeline; model training is prepared in a separate pipeline file (roughly sketched below). When pipelines are split across multiple files, how they are combined is defined in ~/workflow/src/workflow/hooks.py.
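
For completeness, the training pipeline file referenced from hooks.py could be sketched roughly like this. This is an assumption: the real train_pipeline.py is in the repository, and train_model is a hypothetical node name.

# ~/workflow/src/workflow/pipelines/train_pipeline.py (sketch with assumed names)
from kedro.pipeline import node, Pipeline

from workflow.nodes.train import train_model  # hypothetical module and function


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=train_model,
                inputs=['X_train', 'X_test', 'y_train', 'y_test'],
                outputs='model',
            ),
        ],
        tags=['train'],
    )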

hooks.py

Here is the definition that ties multiple pipelines together.

Note that, as explained above, the order in which nodes run is determined by how inputs and outputs line up, so the order cannot be defined here.

~/workflow/src/workflow/hooks.py

from typing import Dict

from kedro.framework.hooks import hook_impl
from kedro.pipeline import Pipeline

from workflow.pipelines import preprocess_pipeline, train_pipeline


class ProjectHooks:
    @hook_impl
    def register_pipelines(self) -> Dict[str, Pipeline]:
        """Register the project's pipelines.

        Returns:
            A mapping from a pipeline name to a ``Pipeline`` object.

        """
        pp = preprocess_pipeline.create_pipeline()  # preprocessing
        tp = train_pipeline.create_pipeline()  # training

        return {
            'preprocess': pp,
            'train': tp,
            '__default__': pp + tp
        }

The hook returns a dictionary; its keys are what you use when specifying a pipeline to run on the command line (for example, kedro run --pipeline preprocess).

As for __default__, as the name suggests, it is the pipeline that runs when you don't specify a pipeline name.

Try to run

$ docker-compose exec kedro /bin/bash -c 'cd /app/workflow; kedro run'

It ran. Let's check the artifacts in S3.

TBD

Integration with MLflow

TBD

What are hooks?

Trying the MLflow integration using hooks

Check the result

Impressions from trying Kedro

Things that seem good

- Data hand-off between steps is easy.
- If you're used to developing Python modules, you can build a workflow easily.
- Don't worry, you can still run notebooks!
- It looks easy to introduce version control for the data catalog (?)

Things I managed to implement, though I don't know if it's the right way

- How to branch a node based on a parameter: TBD
- How to read catalog.yml and retrieve its keys: TBD

Things I'd like to change, though it seems hard without touching the internals

- I'd like to change the versioning directory structure.
- I'd like to simplify how data is specified for --load-version.
  - If there is more than one dataset, you have to specify each one.
  - I can't tell from the logs whether the intended version was actually loaded.
- Is there a catalog type for binary data???
- There doesn't seem to be a catalog type for PyTorch (probably).

Miscellaneous thoughts

MLOps is fun!! I'll keep at... it... (˘ω˘)
