[PYTHON] Extension of Luigi's Task to make workflows reproducible

I wrote an article, [To improve the reusability and maintainability of workflows created with Luigi](http://qiita.com/ngr_t/items/b928bc13457571e25519). In it, I wrote about the importance of [checking the input/output time stamps in the complete method](http://qiita.com/ngr_t/items/b928bc13457571e25519#complete-%E3%83%A1%E3%82%BD%E3%83%83%E3%83%89%E3%81%A7%E3%81%AF%E5%85%A5%E5%87%BA%E5%8A%9B%E3%81%AE%E3%82%BF%E3%82%A4%E3%83%A0%E3%82%B9%E3%82%BF%E3%83%B3%E3%83%97%E3%82%92%E3%83%81%E3%82%A7%E3%83%83%E3%82%AF%E3%81%99%E3%82%8B) and of [checking in complete whether the complete methods of all dependent tasks return True](http://qiita.com/ngr_t/items/b928bc13457571e25519#complete-%E3%81%A7%E3%81%AF%E4%BE%9D%E5%AD%98%E3%82%BF%E3%82%B9%E3%82%AF%E3%81%AE-complete-%E3%81%8C%E3%81%99%E3%81%B9%E3%81%A6-true-%E3%82%92%E8%BF%94%E3%81%99%E3%81%8B%E3%82%92%E3%83%81%E3%82%A7%E3%83%83%E3%82%AF%E3%81%99%E3%82%8B) in order to keep results consistent.

By default, Luigi decides whether a task is finished only by checking whether its output exists. That is inconvenient when you want consistent calculation results, so you have to take care of the points above yourself. For that reason, I tried writing an extension of Task "for data science" that puts more emphasis on the reproducibility of results. It is in the following repository.

https://github.com/ngr-t/luigi_for_data_science

I haven't written a setup script yet, but it currently depends only on the Portalocker module (and of course Luigi), so just install those with pip.

How to use

Usage is as follows:

  1. Create your tasks by inheriting from hash_checking_tasks.TaskWithCheckingInputHash.
  2. Make every input and output of every task a target that properly implements hash_checking_tasks.HashableTarget.
  3. Run the workflow.

How does it work?

TaskWithCheckingInputHash extends Task as follows:

- It checks whether the complete() methods of all dependent tasks return True.
- If the output target exists, it checks whether the hash value of the current input targets matches the hash value that was recorded when the target was generated.
- If the task runs successfully, it saves the hash value of its own class and of its input targets.
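The three checks above can be sketched without Luigi at all. The following is only an illustration of the idea, not the code in the repository: `InMemoryTarget` and `SketchTask` are hypothetical stand-ins, and using the task name as the task's identity is an assumption made to keep the example short.

```python
import hashlib


class InMemoryTarget:
    """Hypothetical stand-in for a hashable target; not the repository's class."""

    def __init__(self):
        self.content = None            # None means "the target does not exist yet"
        self.stored_input_hash = None  # hash recorded when the target was generated

    def exists(self):
        return self.content is not None

    def hash_content(self):
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


class SketchTask:
    """Hypothetical stand-in for a task whose complete() checks input hashes."""

    def __init__(self, name, requires=(), func=None):
        self.name = name
        self.requires = list(requires)
        self.func = func
        self.target = InMemoryTarget()

    def input_hash(self):
        # Assumption: the task name stands in for "the hash of your own class".
        digest = hashlib.sha256(self.name.encode("utf-8"))
        for dep in self.requires:
            digest.update(dep.target.hash_content().encode("utf-8"))
        return digest.hexdigest()

    def complete(self):
        # 1. Every dependency must itself be complete.
        if not all(dep.complete() for dep in self.requires):
            return False
        # 2. The output must exist.
        if not self.target.exists():
            return False
        # 3. The hash stored at generation time must match the current inputs.
        return self.target.stored_input_hash == self.input_hash()

    def run(self):
        inputs = [dep.target.content for dep in self.requires]
        self.target.content = self.func(*inputs)
        # Record the input hash only after the task body has succeeded.
        self.target.stored_input_hash = self.input_hash()


source = SketchTask("source", func=lambda: "raw data v1")
upper = SketchTask("upper", requires=[source], func=lambda text: text.upper())

source.run()
upper.run()
fresh = upper.complete()   # inputs are unchanged since upper ran

source.func = lambda: "raw data v2"
source.run()               # the upstream output changes
stale = upper.complete()   # output exists, but it was built from the old input
```

Here `fresh` is True and `stale` is False: the downstream output still exists after the upstream data changes, but it is no longer treated as complete, which is exactly the case that an existence-only check misses.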

The concrete work, such as computing the hash values, is left to HashableTarget. A HashableTarget is implemented so that the following operations are available:

- You can check whether two targets are equivalent by comparing the values of hash_content().
- You can get the hash value of the task that created the target (if the target exists) with get_current_input_hash().
- You can save the hash value of the task that generated the target with store_input_hash().
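To make these three operations concrete, here is a minimal sketch of a local-file target that provides them. It is not the repository's implementation: the class name `SketchLocalTarget`, the choice of SHA-256, and the exact sidecar file name are assumptions for illustration, and the file locking that the real code presumably does with Portalocker is left out.

```python
import hashlib
import os
import pickle
import tempfile


class SketchLocalTarget:
    """Hypothetical local-file target offering the three operations above."""

    def __init__(self, path):
        self.path = path
        # Assumption: the input hash lives in a sidecar file next to the target.
        self.hash_path = path + ".input.pickle"

    def exists(self):
        return os.path.exists(self.path)

    def hash_content(self):
        # Hash the file in chunks so large files need not fit in memory.
        digest = hashlib.sha256()
        with open(self.path, "rb") as handle:
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def get_current_input_hash(self):
        # Returns None when no hash has been stored yet.
        if not os.path.exists(self.hash_path):
            return None
        with open(self.hash_path, "rb") as handle:
            return pickle.load(handle)

    def store_input_hash(self, input_hash):
        with open(self.hash_path, "wb") as handle:
            pickle.dump(input_hash, handle)


workdir = tempfile.mkdtemp()
first = SketchLocalTarget(os.path.join(workdir, "a.txt"))
second = SketchLocalTarget(os.path.join(workdir, "b.txt"))
for target in (first, second):
    with open(target.path, "w") as handle:
        handle.write("same content")

same = first.hash_content() == second.hash_content()  # equal content, equal hash
before = first.get_current_input_hash()               # nothing stored yet
first.store_input_hash("abc123")
after = first.get_current_input_hash()                # the value just stored
```

Two files with the same bytes compare as equivalent regardless of their paths, and the stored input hash survives between processes because it is written to disk.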

In the previous article I wrote that you should "check the time stamp", but what matters is not the date but whether the inputs and outputs are equivalent. After writing that article I reconsidered: isn't it better to check the hash value? However, checking hash values means the hash values have to be stored somewhere, which makes the problem slightly more complicated. HashableLocalTarget saves the hash at the target file's path with the suffix "input.pickle" appended, but it does nothing to handle conflicts with that name.
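The difference between the two checks is easy to see with a file that is rewritten with identical content. The following sketch uses only the standard library; the file name and the 60-second shift are arbitrary choices for the example.

```python
import hashlib
import os
import tempfile


def file_hash(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as handle:
    handle.write("x,y\n1,2\n")

mtime_before = os.path.getmtime(path)
hash_before = file_hash(path)

# Rewrite identical content and push the modification time forward,
# as re-exporting unchanged data would.
with open(path, "w") as handle:
    handle.write("x,y\n1,2\n")
os.utime(path, (mtime_before + 60, mtime_before + 60))

timestamp_changed = os.path.getmtime(path) != mtime_before
hash_changed = file_hash(path) != hash_before
```

The time stamp changes while the hash does not, so a time-stamp check would rerun every downstream task here, and a hash check would correctly leave them alone.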

I think there is still a lot left to do, so please do send me blunt criticism and corrections.
