[PYTHON] A proposal for versioning of features in Kedro

(I revised this post on 12/28 because it had an unusually large number of omissions.)

Recently, I have been using Kedro to manage workflows while developing ML models for evaluation. Kedro is a workflow management tool. It builds a Pipeline by wrapping Python functions in a class called node and connecting them. Every input and output of a node is managed through the Data Catalog, which abstracts memory and storage behind file-system-like classes. The I/O itself is handled by separate modules called DataSets.

When using Kedro, I wanted to save the features generated for training and the features generated for inference separately, including all intermediate products, but I couldn't find a straightforward way to do so. I'll record the method I tried here.

Premise

This post assumes that the file structure follows the project template that Kedro generates automatically. The Kedro version is 0.17.0.

Data management in Data Catalog

There are two ways to prepare the Data Catalog. One is to define it in Python code. The other is to describe it in YAML and have a hook generate it from that YAML when Kedro runs. This post is based on the latter, so only the latter is shown.

At runtime, Kedro automatically finds the YAML files matching conf/*/catalog* under the project root and reads them. The Data Catalog is created from the settings written in that YAML.

titanic_train:
  type: pandas.CSVDataSet
  filepath: s3://competitions/titanic/train.csv
  credentials: minio

outputs:
  type: pandas.CSVDataSet
  filepath: s3://competitions/titanic/outputs.csv
  credentials: minio
  versioned: true

Current kedro versioning

DataSet versioning

DataSets provide a versioning mechanism. Versioning is enabled by implementing the DataSet class as a subclass of AbstractVersionedDataSet and setting versioned to true in catalog.yml. Unfortunately, this method only allows timestamps as version names. This is probably not a limitation when you build the DataCatalog in code, but at least from the YAML description there seems to be no way to change the version format. Furthermore, there seems to be no way to inject a code-generated DataCatalog when running from the CLI.
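
To make the timestamp-only limitation concrete, here is a minimal standard-library sketch of how a versioned path is laid out, as I understand it from the Kedro documentation: the version is a UTC timestamp and the file is stored as <filepath>/<version>/<filename>. The helper names timestamp_version and versioned_path are hypothetical and are not part of Kedro's API.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def timestamp_version(now: datetime) -> str:
    """Format a time as a Kedro-style version name (UTC, millisecond precision)."""
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%f")
    # %f yields six microsecond digits; keep the first three (milliseconds).
    return stamp[:-3] + "Z"


def versioned_path(filepath: str, version: str) -> PurePosixPath:
    """Build <filepath>/<version>/<filename> for a versioned dataset."""
    path = PurePosixPath(filepath)
    return path / version / path.name


# A fixed time is used here so the result is reproducible.
version = timestamp_version(datetime(2020, 12, 28, 1, 2, 3, 456000, tzinfo=timezone.utc))
print(versioned_path("data/outputs.csv", version))
```

The only part of the path that varies between runs is the timestamp, which is why a label such as train or inference cannot be expressed through versioned alone.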

Motivation for versioning by something other than time

I want to version the generated features by a name other than a timestamp (for example, to separate them into train and test). Doing this the straightforward way means writing a new dataset entry for test in catalog.yml and also building a separate pipeline for the test data. Defining a pipeline means defining nodes, and each node definition would have to read the version and switch the dataset it references. That is how unreadable code begins. Meanwhile, the feature generation logic must of course be identical at training time and at inference time, so I want that code to be shared. To avoid this problem, I want to version (or rather, tag) the data by some value other than time.

Injecting values into settings with TemplatedConfigLoader

This feature is available in Kedro 0.17.0. You can put a placeholder in the YAML and inject its value from code. You can also inject values by preparing a separate YAML file that maps names to placeholder values. With this, you can change the filepath in catalog.yml directly to achieve versioning. The idea is to realize versioning by changing the save path for each version. In addition, whereas all datasets were previously forced to share a uniform version name, this approach lets you change the version name per dataset.
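
As a rough picture of what this substitution amounts to, the following standard-library sketch replaces ${name} placeholders in a nested dictionary shaped like a parsed catalog.yml. It is a simplified stand-in and not Kedro's implementation; the function fill_placeholders, the model entry, and the model_version placeholder are hypothetical names introduced only for illustration.

```python
import re
from typing import Any, Dict

PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def fill_placeholders(config: Any, values: Dict[str, str]) -> Any:
    """Recursively replace ${name} in every string of a nested config."""
    if isinstance(config, dict):
        return {key: fill_placeholders(item, values) for key, item in config.items()}
    if isinstance(config, list):
        return [fill_placeholders(item, values) for item in config]
    if isinstance(config, str):
        # In this sketch an unknown placeholder name raises KeyError.
        return PLACEHOLDER.sub(lambda match: values[match.group(1)], config)
    return config


catalog = {
    "outputs": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://competitions/titanic/${mode}/outputs.csv",
    },
    # Hypothetical second dataset that uses its own placeholder name.
    "model": {
        "type": "pickle.PickleDataSet",
        "filepath": "s3://competitions/titanic/${model_version}/model.pkl",
    },
}

resolved = fill_placeholders(catalog, {"mode": "train", "model_version": "v1"})
print(resolved["outputs"]["filepath"])
print(resolved["model"]["filepath"])
```

Because each dataset can refer to a different placeholder, the version name no longer has to be the same across the whole catalog.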

TemplatedConfigLoader is roughly a simplified version of Hydra. Hydra is a configuration management tool for YAML. It is OSS that lets you inject values from the command line and structure your YAML. Hydra requires its own notation, and its style is somewhat incompatible with Kedro, so it is convenient that Kedro provides a templating feature of its own.

Data Catalog preparation flow

This section describes how the DataCatalog is prepared in a project generated from the project template. It will help you understand where to apply TemplatedConfigLoader.

If you use the CLI and other files from the project template, the Kedro execution session is created by a class called KedroSession. I will skip the details because I cannot explain them accurately, but at this point the session refers to the ProjectHooks instance in hooks.py under <project_name>/src/ and executes each hook defined there.

Two of these hooks are related to the Data Catalog: register_config_loader and register_catalog. The former prepares the ConfigLoader and the latter prepares the DataCatalog. In register_catalog, the DataCatalog is generated from the catalog.yml that was loaded with the loader returned by register_config_loader.

From the above, you can see that replacing ConfigLoader with TemplatedConfigLoader in register_config_loader lets you change the save path dynamically.

Implementation

I found that using TemplatedConfigLoader in register_config_loader makes it possible to change catalog.yml dynamically. In practice I want to control the version name from the CLI, so the implementation also accepts it as input. The steps are:

- Set the variable corresponding to the version in the ProjectHooks class
- In register_config_loader, load the settings with TemplatedConfigLoader, and inject the value into catalog.yml at that point
- Receive the version name from the command line in cli.py and pass it to ProjectHooks before creating a session
- Prepare a placeholder in catalog.yml to receive the value from TemplatedConfigLoader

Set the variable corresponding to the version in the ProjectHooks class

class ProjectHooks:
    _mode: str = ''

    @classmethod
    def set_mode(cls, mode: str):
        cls._mode = mode

By design, the session calls the hooks instance created here. Because the singleton pattern is not used, there is no guarantee that the instance being called is the same one. So I store the version name in a class variable. (The variable is named mode because what I really want is to switch the data according to the purpose of the run, not versioning in the strict sense.)
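
The reason a class variable is enough can be checked with plain Python, without Kedro. The sketch below repeats the ProjectHooks class from above and shows that a value set through the classmethod is visible from every instance, including instances created before the call. The variable names first, second, and third are only for illustration.

```python
class ProjectHooks:
    _mode: str = ""

    @classmethod
    def set_mode(cls, mode: str) -> None:
        # Stored on the class itself, so it is shared by all instances.
        cls._mode = mode


first = ProjectHooks()
second = ProjectHooks()

# Calling the classmethod through an instance still updates the class attribute.
first.set_mode("inference")

third = ProjectHooks()
print(first._mode, second._mode, third._mode)
```

This holds as long as no instance assigns its own _mode attribute, which would shadow the class-level value for that instance only.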

Change register_config_loader

Implement the following instance method in ProjectHooks.

    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> TemplatedConfigLoader:
        return TemplatedConfigLoader(conf_paths, globals_dict=dict(mode=self._mode))

With this, if catalog.yml contains a placeholder named mode, the value of self._mode is substituted for it.

Receive the version name from the command line with cli.py and pass the version name to ProjectHooks before creating the session

@click.option(
    "--run-mode", type=click.Choice(["train", "inference"], case_sensitive=True), default="train"
)
def run(
    ...,
    run_mode
):
    ...
    # Pass the selected mode to ProjectHooks before the session is created.
    from .hooks import project_hooks
    project_hooks.set_mode(run_mode)

(The import is placed here only so that it is easy to see in this post. Put it wherever you like.)

Prepare a placeholder in catalog.yml to receive the value from TemplatedConfigLoader

outputs:
  type: pandas.CSVDataSet
  filepath: s3://competitions/titanic/${mode}/outputs.csv
  credentials: minio

Run

With the above changes, running kedro run --run-mode=train saves the outputs under the train directory.

I can no longer control versions with versioned, but I think that is acceptable because the policy for this implementation is not to use it.

Summary

I explained how to implement versioning by templating catalog.yml directly. If you want to change only the model, you can prepare a separate placeholder for the model and give it some value. As another approach, I considered implementing a VersionedDataSet myself so that the version could be changed by a value in the YAML, but in the end I chose this method because the value cannot be changed dynamically without a placeholder anyway. It may stop working because of future changes, but I will go with it for now. ~~Having used it, I feel its features don't quite cover what I need, so I think more will be added in the future. (Personal impression)~~ If you look at the issues and PRs, similar topics are in progress as WIP, so it seems they will be added soon. I only checked this roughly, so please let me know if it doesn't work.
