[PYTHON] Make a Notebook Pipeline with Kedro + Papermill

The other day, I received a request to "embed the notebooks used for data analysis into a pipeline as they are", but I couldn't find a package that could do that, so I decided to make it myself.

That said, "making it myself" only amounted to adding a small feature on top of an existing pipeline framework. This time, I adopted Kedro for the pipeline and Papermill to incorporate the Notebooks as they are.

I chose Kedro because its pipeline configuration is simple (just write functions and their I/O), and the documentation is extensive, so the learning cost seems low. Learning cost matters a lot when recommending a tool for someone else to use. Also, as Mr. Kinui says, the logo is cool.

What I made

There are three main features.

- Run Notebook from Kedro using Papermill
- Define Pipeline in YAML
- Version control of Notebook output by Papermill

Run Notebook from Kedro using Papermill

The figure below shows Kedro's Hello World project visualized with Kedro-Viz. A rectangle represents a function and a rounded rectangle represents data. The idea is that each of these rectangles becomes a Notebook.

(figure: Kedro-Viz visualization of the Hello World pipeline)

Define Pipeline in YAML

The pipeline YAML is written as follows. For example, the output `example_train_x` of `split_data` is an input of `train_model`, which represents the flow (an arrow) of the pipeline.

conf/base/pipelines.yml


# data_engineering pipeline
data_engineering:
  # split_data node
  split_data:
    nb:
      input_path: notebooks/data_engineering/split_data.ipynb
      parameters:
        test_data_ratio: 0.2
    inputs:
      - example_iris_data
    outputs:
      - example_train_x
      - example_train_y
      - example_test_x
      - example_test_y

# data_science pipeline
data_science:
  # train_model node
  train_model:
    nb:
      input_path: notebooks/data_science/train_model.ipynb
      parameters:
        num_iter: 10000
        lr: 0.01
      versioned: True
    inputs:
      - example_train_x
      - example_train_y
    outputs:
      - example_model
  # predict node
  predict:
    nb:
      input_path: notebooks/data_science/predict.ipynb
      versioned: True
    inputs:
      - example_model
      - example_test_x
    outputs:
      - example_predictions
  # report_accuracy node
  report_accuracy:
    nb:
      input_path: notebooks/data_science/report_accuracy.ipynb
      versioned: True
    inputs:
      - example_predictions
      - example_test_y

Version control of Notebook output by Papermill

For example, if you write pipelines.yml as follows, the Notebook output goes to `data/08_reporting/train_model#num_iter=10000&lr=0.01.ipynb/<version>/train_model#num_iter=10000&lr=0.01.ipynb`, where `<version>` is a date-time string in the format `YYYY-MM-DDThh.mm.ss.sssZ`.

conf/base/pipelines.yml


# data_science pipeline
data_science:
  # train_model node
  train_model:
    nb:
      input_path: notebooks/data_science/train_model.ipynb
      parameters:
        num_iter: 10000
        lr: 0.01
      versioned: True
    inputs:
      - example_train_x
      - example_train_y
    outputs:
      - example_model
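The file name is simply the notebook name with its parameters appended as a query string. A quick illustration using Python's standard library, mirroring the implementation shown later in "How it works":

from urllib.parse import urlencode

params = {'num_iter': 10000, 'lr': 0.01}
print('train_model#' + urlencode(params) + '.ipynb')
# => train_model#num_iter=10000&lr=0.01.ipynb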

How to use

I haven't been able to maintain it properly yet, but the general flow is as follows.

  1. Create an environment
  2. Create a Data Catalog
  3. Make a Notebook
  4. Make a Pipeline
  5. Execute

Create an environment

Create an environment from the template project with the following commands.

$ git clone https://github.com/hrappuccino/kedro-notebook-project.git
$ cd kedro-notebook-project
$ pipenv install
$ pipenv shell

Create a Data Catalog

Register all the data that appears in the pipeline (including intermediate artifacts) in the Data Catalog.

conf/base/catalog.yml


example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv

example_train_x:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_train_x.pkl

example_train_y:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_train_y.pkl

example_test_x:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_test_x.pkl

example_test_y:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_test_y.pkl

example_model:
  type: pickle.PickleDataSet
  filepath: data/06_models/example_model.pkl

example_predictions:
  type: pickle.PickleDataSet
  filepath: data/07_model_output/example_predictions.pkl

Please refer to the Kedro documentation for how to write the Data Catalog.
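For reference, the same catalog can also be constructed programmatically; a minimal sketch, assuming Kedro's `DataCatalog.from_config` API:

import yaml
from kedro.io import DataCatalog

# Build a catalog from the YAML config above
with open('conf/base/catalog.yml') as f:
    catalog = DataCatalog.from_config(yaml.safe_load(f))

df = catalog.load('example_iris_data')  # pandas DataFrame read from data/01_raw/iris.csv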

Make a Notebook

Basically, you can write the Notebook as usual; only the following two points differ.

- Use Kedro's Data Catalog for data input / output
- Parameterize for Papermill

Use Kedro's Data Catalog for data input and output

Launch Jupyter Notebook / Lab from Kedro.

$ kedro jupyter notebook
$ kedro jupyter lab

Execute the following magic command in the Notebook. This gives you a global variable called `catalog`.

%reload_kedro

To read / save data, write as follows.

data = catalog.load('example_iris_data')   # load a dataset registered in the Data Catalog
catalog.save('example_train_x', train_x)   # save an object under a registered dataset name

For how to operate Kedro from Jupyter, please refer to the Kedro documentation.

Parameterize for Papermill

To parameterize a Notebook, tag the cell that defines the parameters with `parameters`.

(screenshot: a Notebook cell tagged `parameters`)

Please refer to the Papermill documentation for how to do this.
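For example, the parameters cell of the split_data Notebook could look like the sketch below (the actual cell contents are an assumption; Papermill injects the values from pipelines.yml in a new cell right below it):

# This cell is tagged `parameters`
test_data_ratio = 0.2  # default, overridden by pipelines.yml at run time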

Make a Pipeline

Write the pipeline in YAML as follows (reposted from above):

conf/base/pipelines.yml


# data_engineering pipeline
data_engineering:
  # split_data node
  split_data:
    nb:
      input_path: notebooks/data_engineering/split_data.ipynb
      parameters:
        test_data_ratio: 0.2
    inputs:
      - example_iris_data
    outputs:
      - example_train_x
      - example_train_y
      - example_test_x
      - example_test_y

# data_science pipeline
data_science:
  # train_model node
  train_model:
    nb:
      input_path: notebooks/data_science/train_model.ipynb
      parameters:
        num_iter: 10000
        lr: 0.01
      versioned: True
    inputs:
      - example_train_x
      - example_train_y
    outputs:
      - example_model
  # predict node
  predict:
    nb:
      input_path: notebooks/data_science/predict.ipynb
      versioned: True
    inputs:
      - example_model
      - example_test_x
    outputs:
      - example_predictions
  # report_accuracy node
  report_accuracy:
    nb:
      input_path: notebooks/data_science/report_accuracy.ipynb
      versioned: True
    inputs:
      - example_predictions
      - example_test_y

Execute

Run all or part of the pipeline:

$ kedro run
$ kedro run --pipeline=data_engineering

If you specify the `--parallel` option, nodes are executed in parallel where parallelization is possible.

$ kedro run --parallel

For more information on how to run Pipeline, please refer to Kedro's documentation.

(Bonus) Visualize Pipeline with Kedro-Viz

Execute the following command and access http://127.0.0.1:4141/; the page shown below will be displayed.

$ kedro viz

(screenshot: Kedro-Viz pipeline visualization)

(Bonus) Track metrics with MLflow

Execute the following command and access http://127.0.0.1:5000/; the page shown below will be displayed.

$ mlflow ui

(screenshot: MLflow tracking UI)

Note: since I ran it from a Notebook, a single experiment is recorded as two rows.
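For reference, metrics can be logged from inside a Notebook with the standard MLflow API; a minimal sketch (what the notebooks in this project actually log is an assumption):

import mlflow

with mlflow.start_run():
    mlflow.log_param('num_iter', 10000)   # hyperparameter of train_model
    mlflow.log_metric('accuracy', 0.97)   # hypothetical metric from report_accuracy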

Kedro + MLflow is also introduced in Kedro's blog.

How it works

I will briefly explain how it works.

Run Notebook from Kedro using Papermill

To be precise, a function runs the Notebook using Papermill. At the extreme, all it has to do is call `pm.execute_notebook`, but to separate the Notebook arguments from the Pipeline arguments, I made it a class that receives them via `__init__` and `__call__`. At first I implemented it with a closure, but got an error that it could not be serialized for parallel execution, so I turned it into a class. `__get_default_output_path` handles the version control of the Notebook output by Papermill, described in detail later.

src/kedro_local/nodes/nodes.py


import papermill as pm
from pathlib import Path
import os, re, urllib, datetime

DEFAULT_VERSION = datetime.datetime.now().isoformat(timespec='milliseconds').replace(':', '.') + 'Z'

def _extract_dataset_name_from_log(output_text):
    # Parse Kedro's log line "Saving data to `<dataset>`" emitted by catalog.save()
    m = re.search('kedro.io.data_catalog - INFO - Saving data to `(\\w+)`', output_text)
    return m.group(1) if m else None

class NotebookExecuter:
    def __init__(self, catalog, input_path, output_path=None, parameters=None, versioned=False, version=DEFAULT_VERSION):
        self.__catalog = catalog
        self.__input_path = input_path
        self.__parameters = parameters
        self.__versioned = versioned
        self.__version = version
        self.__output_path = output_path or self.__get_default_output_path()

    def __call__(self, *args):
        # Execute the notebook, then scan its cell outputs for Kedro's
        # "Saving data to `...`" log lines to discover which datasets were saved
        nb = pm.execute_notebook(self.__input_path, self.__output_path, self.__parameters)
        dataset_names = [
            _extract_dataset_name_from_log(output['text'])
            for cell in nb['cells'] if 'outputs' in cell
            for output in cell['outputs'] if 'text' in output
        ]
        return {dataset_name: self.__catalog.load(dataset_name) for dataset_name in dataset_names if dataset_name}

    def __get_default_output_path(self):
        ...  # defined later, in "Version control of Notebook output by Papermill"
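For illustration, a node function built this way would be used like the hypothetical sketch below; in practice Kedro instantiates it from pipelines.yml and calls it with the node's inputs:

executer = NotebookExecuter(
    catalog,
    input_path='notebooks/data_science/train_model.ipynb',
    parameters={'num_iter': 10000, 'lr': 0.01},
    versioned=True,
)
outputs = executer()  # {'example_model': <object loaded back from the Data Catalog>}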

Define Pipeline in YAML

It reads the YAML shown above and creates the pipelines. Basically it just converts the YAML into objects with dictionary comprehensions. The final `__default__` pipeline, the sum of all the others, is what runs when the `--pipeline` option is omitted from `kedro run`.

src/kedro_notebook_project/pipeline.py


from kedro.pipeline import Pipeline, node
from kedro_local.nodes import *
import yaml

def create_pipelines(catalog, **kwargs):
    with open('conf/base/pipelines.yml') as f:
        pipelines_ = yaml.safe_load(f)

    pipelines = {
        pipeline_name: Pipeline([
            node(
                NotebookExecuter(catalog, **node_['nb']),
                node_['inputs'] if 'inputs' in node_ else None,
                {output: output for output in node_['outputs']} if 'outputs' in node_ else None,
                name=node_name,
            ) for node_name, node_ in nodes_.items()
        ]) for pipeline_name, nodes_ in pipelines_.items()
    }

    # `__default__` is the sum of all pipelines; it runs when --pipeline is omitted
    for pipeline_ in list(pipelines.values()):
        if '__default__' not in pipelines:
            pipelines['__default__'] = pipeline_
        else:
            pipelines['__default__'] += pipeline_

    return pipelines

Version control of Notebook output by Papermill

It just rewrites the output destination according to the definition in pipelines.yml. Note that if `self.__parameters` is large, the file name becomes too long. It used to be hashed, but since that was not human-friendly, for now the parameters are encoded as a query string.

src/kedro_local/nodes/nodes.py


class NotebookExecuter:
    # (rest of the class omitted)

    def __get_default_output_path(self):
        # Build "<notebook name>#<urlencoded parameters><ext>", e.g.
        # train_model#num_iter=10000&lr=0.01.ipynb
        name, ext = os.path.splitext(os.path.basename(self.__input_path))
        if self.__parameters:
            name += '#' + urllib.parse.urlencode(self.__parameters)
        name += ext
        output_dir = Path(os.getenv('PAPERMILL_OUTPUT_DIR', ''))
        if self.__versioned:
            # Nest the output under "<name>/<version>/", like Kedro's versioned datasets
            output_dir = output_dir / name / self.__version
            output_dir.mkdir(parents=True, exist_ok=True)
        return str(output_dir / name)

Conclusion

Thank you for reading this far. All the source code presented in this article is available on my GitHub. If you are interested, please try it out and give me your feedback.
