[PYTHON] Use decorators to prevent re-execution of data processing

Overview

Processing data, saving the result to disk once, and reusing it (skipping the processing from the second run onward) is a very common pattern, but once you account for things like the parameters the result depends on, it tends to become unexpectedly complicated. This article considers an implementation that avoids repeating the same processing by making the skip decision with a Python decorator.

A library implementing this idea is available at github.com/sotetsuk/memozo.

Motivation

For example, suppose you have a huge amount of sentence data (one sentence per line):

1. I have a pen.
2. I have an apple.
3. ah! Apple pen!

...

9999...

# PPAP (copyright belongs to Pikotaro)

Now suppose you want to filter out only the sentences that contain a specific keyword (for example, the keyword `pen`).

One naive implementation of the filter would be a generator that yields each line that meets the criteria:

import codecs

def filter_data(keyword):
    path_to_raw_data = './data/sentences.txt'
    with codecs.open(path_to_raw_data, 'r', 'utf-8') as f:
        for line in f:
            if keyword in line:
                yield line

gen = filter_data('pen')
for line in gen:
    print(line, end='')

If you want to reuse this processed (filtered) data many times, scanning all of the raw data on every run is not a good idea. You may want to cache the filtered data to disk once and use the cache from then on. Note also that this processing depends on a parameter (`keyword`), so if the function is executed with a different `keyword`, all of the data has to be scanned again, and you would want that new result cached as well. Finally, we would like to achieve all of this simply by wrapping the function with a decorator.

In summary, the goal is a decorator `awesome_decorator` that caches the output of the generator and, when the function is executed again with the same parameters, returns the output from the cache:

@awesome_decorator
def filter_data(keyword):
    path_to_raw_data = './data/sentences.txt'
    with codecs.open(path_to_raw_data, 'r', 'utf-8') as f:
        for line in f:
            if keyword in line:
                yield line


# The first time, all the data is scanned and the result returned.
# At this point the filtered sentences are cached in './data/pen.txt'.
gen_pen_sentences1 = filter_data('pen')
for line in gen_pen_sentences1:
    print(line, end='')

# Executed with the same parameter, so the data is returned from the cache './data/pen.txt'.
gen_pen_sentences2 = filter_data('pen')
for line in gen_pen_sentences2:
    print(line, end='')

# A new parameter, so the raw data is filtered again.
gen_apple_sentences = filter_data('apple')
for line in gen_apple_sentences:
    print(line, end='')

This example is a function that returns a generator, but there are other situations where you want to cache to disk the result of a function that returns an object serializable with `pickle` (for example, a preprocessed `ndarray`, or a trained machine learning model that depends on its parameters).
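As a rough sketch of that case, the same skip-if-cached logic can be applied to a pickleable return value (the decorator name `pickle_cache` and the single-argument signature here are assumptions for illustration, not part of any library):

import functools
import os
import pickle


def pickle_cache(path_template):
    def decorator(func):
        @functools.wraps(func)
        def _wrapper(param):
            path = path_template.format(param)
            # If a cached result exists, load and return it instead of recomputing.
            if os.path.exists(path):
                with open(path, 'rb') as f:
                    return pickle.load(f)
            # Otherwise compute the result, cache it to disk, and return it.
            result = func(param)
            with open(path, 'wb') as f:
                pickle.dump(result, f)
            return result
        return _wrapper
    return decorator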

Implementation

The implementation of `awesome_decorator` is simple: it checks whether a cached file already exists, and then

  1. if there is a cache, creates a new generator that yields values from the cache, returning it in place of the original generator;
  2. if there is no cache, wraps the original generator in a generator that writes each value to the cache as it is yielded, and returns that.

That is all (the same works with `pickle` etc. instead of a text file):

import codecs
import functools
import os


def awesome_decorator(func):

    @functools.wraps(func)
    def _wrapper(keyword):
        # For simplicity, we assume the function takes a single `keyword` argument.
        # To handle a general (*args, **kwargs) signature, extract the arguments
        # and their values with `inspect` etc. (see the sketch after this block).
        file_path = './data/{}.txt'.format(keyword)

        # If there is cached data, return a generator that reads lines from the cache.
        if os.path.exists(file_path):
            def gen_cached_data():
                with codecs.open(file_path, 'r', 'utf-8') as f:
                    for line in f:
                        yield line
            return gen_cached_data()

        # If there is no cached data, create the original generator over the raw data as usual.
        gen = func(keyword)

        # Wrap the generator above so that each value it yields is also written to the cache.
        def generator_with_cache(gen, file_path):
            with codecs.open(file_path, 'w', 'utf-8') as f:
                for e in gen:
                    f.write(e)
                    yield e

        return generator_with_cache(gen, file_path)

    return _wrapper
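As noted in the comment above, to generalize `_wrapper` to an arbitrary `(*args, **kwargs)` signature, one option is to normalize the call into a parameter dict with `inspect` and derive the cache path from that. A minimal sketch (the helper name `bound_arguments` is hypothetical):

import inspect


def bound_arguments(func, *args, **kwargs):
    # Map positional and keyword arguments onto the function's signature so that
    # filter_data('pen') and filter_data(keyword='pen') both normalize to
    # {'keyword': 'pen'} and therefore share the same cache file.
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    return dict(bound.arguments)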

The article 12 Steps to Understand Python Decorators gives an easy-to-understand explanation of decorators themselves.

Putting it all together, it looks like this (this runs as-is, given `./data/sentences.txt`):

awesome_generator.py


# -*- coding: utf-8 -*-

import os
import functools
import codecs


def awesome_decorator(func):

    @functools.wraps(func)
    def _wrapper(keyword):
        # For simplicity, we assume the function takes a single `keyword` argument.
        # To handle a general (*args, **kwargs) signature, extract the arguments
        # and their values with `inspect` etc.
        file_path = './data/{}.txt'.format(keyword)

        # If there is cached data, return a generator that reads lines from the cache.
        if os.path.exists(file_path):
            def gen_cached_data():
                with codecs.open(file_path, 'r', 'utf-8') as f:
                    for line in f:
                        yield line
            return gen_cached_data()

        # If there is no cached data, create the original generator over the raw data as usual.
        gen = func(keyword)

        # Wrap the generator above so that each value it yields is also written to the cache.
        def generator_with_cache(gen, file_path):
            with codecs.open(file_path, 'w', 'utf-8') as f:
                for e in gen:
                    f.write(e)
                    yield e

        return generator_with_cache(gen, file_path)

    return _wrapper


@awesome_decorator
def filter_data(keyword):
    path_to_raw_data = './data/sentences.txt'
    with codecs.open(path_to_raw_data, 'r', 'utf-8') as f:
        for line in f:
            if keyword in line:
                yield line


if __name__ == '__main__':
    # The first time, all the data is scanned and the result returned.
    # At this point the filtered sentences are cached in './data/pen.txt'.
    gen_pen_sentences1 = filter_data('pen')
    for line in gen_pen_sentences1:
        print(line, end='')

    # Executed with the same parameter, so the data is returned from the cache './data/pen.txt'.
    gen_pen_sentences2 = filter_data('pen')
    for line in gen_pen_sentences2:
        print(line, end='')

    # A new parameter, so the raw data is filtered again.
    gen_apple_sentences = filter_data('apple')
    for line in gen_apple_sentences:
        print(line, end='')

memozo

The implementation above handled the shape of the parameters, the file name, and so on in a fixed form. A version generalized to handle them in arbitrary form is packaged at github.com/sotetsuk/memozo. With it, this process can be written like this:

from memozo import Memozo

m = Memozo('./data')

@m.generator(file_name='filtered_sentences', ext='txt')
def filter_data(keyword):
    path_to_raw_data = './data/sentences.txt'
    with codecs.open(path_to_raw_data, 'r', 'utf-8') as f:
        for line in f:
            if keyword in line:
                yield line

The cache file is saved as './data/filtered_sentences_1fec01f.txt', and a history of the parameters used is written to './data/.memozo'. A hash is computed from (file name, function name, parameters); if both a history entry and a cache file with the same hash already exist, execution of the function is skipped. In other words, running with the same (file name, function name, parameters) returns the value from the cache, and changing any one of them computes a fresh result.
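Conceptually, the key derivation works something like the following simplified sketch (an illustration of the idea only; memozo's actual hashing scheme may differ):

import hashlib


def cache_key(file_name, func_name, params):
    # Derive a short hash such as '1fec01f' from (file name, function name,
    # parameters); changing any one of them yields a different key.
    raw = '{}/{}/{}'.format(file_name, func_name, sorted(params.items()))
    return hashlib.sha1(raw.encode('utf-8')).hexdigest()[:7]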

In addition to the generator version, there are variants of the decorator corresponding to `pickle`, `codecs`, and ordinary `open`.

The implementation is still incomplete, so Issues / PRs would be greatly appreciated.

Related

If there are complex dependencies between tasks, it is better to use a DAG-based workflow tool. One example is github.com/spotify/luigi.
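For illustration, the filtering step from this article written as a luigi task might look like the sketch below; luigi skips `run()` whenever the output target already exists, which is the same skip-if-cached idea at the granularity of tasks:

import luigi


class FilterSentences(luigi.Task):
    keyword = luigi.Parameter()

    def output(self):
        # luigi re-runs the task only if this target does not exist yet.
        return luigi.LocalTarget('./data/{}.txt'.format(self.keyword))

    def run(self):
        with open('./data/sentences.txt') as fin, self.output().open('w') as fout:
            for line in fin:
                if self.keyword in line:
                    fout.write(line)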

References

- github.com/sotetsuk/memozo: Packaged version of this implementation
- github.com/spotify/luigi: A DAG-based workflow tool, suited to complex dependencies between tasks
- github.com/petered/plato/pulls/56: An implementation with the same motivation
- lru_cache: Caching to memory with a decorator
- 12 Steps to Understand Python Decorators: An explanation of decorators themselves
