# [PYTHON] Build a data analysis environment with Kedro + MLflow + GitHub Actions
# Introduction
- I built a data analysis environment with Kedro + MLflow + GitHub Actions, and this article summarizes my impressions.
# Background
** = "Issues when creating a notebook that is all in one file in a local environment for each experiment (lightgbm_02_YYYYMMDD.ipynb, etc.)" **
- You end up with a huge and complex notebook
- Preprocessing, model learning, model evaluation ...
- Difficult to divide in charge (although in many cases all will be done alone)
- Maintenance is spicy
- → If you divide by processing, you will not understand the dependency well this time
- Code review is painful
- Notebooks are difficult to diff
- Notebooks can't be code formatter or checkered
- Experiment management is difficult
- I want to list them (it's hard to open and remember each notebook)
- → It is troublesome to maintain the list manually (the more trials there are)
- It doesn't work or the result changes in another person's environment (when recreating from a clean environment)
- If you take over the matter from a person and clone the master, it will not work
- Results depend on local uncommitted data
# What I did
## Introduced Kedro as a pipeline tool
### What is Kedro and how to introduce it
* (Reference) [Introduction to Machine Learning Pipeline with Kedro](https://qiita.com/noko_qii/items/2395d3a3dbcd9410e5e7)
### Good points
- By defining the nodes (and their data inputs/outputs) and pipelines up front, it was easy to split the work among people and easy to maintain (a minimal sketch follows this list).
    - It helped to agree on the inputs/outputs of each processing step first.
    - The naming conventions were decided up front and shared with everyone.
- It is easy to work together with notebooks:
```bash
$ kedro jupyter notebook --allow-root --port=8888 --ip=0.0.0.0 &
```

```python
from kedro.framework.context import load_context

# Load the Kedro project context from inside the notebook
proj_path = '../../../'
context = load_context(proj_path)

# Data sets registered in the catalog can be loaded by name
# df = catalog.load("XXX")

# Parameters from parameters.yaml are available via the context
parameters = context.params
```
- I was able to visualize the pipeline using kedro-viz
- I was able to manage credential information in credentials.yaml
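To make the "define nodes and In/Out first" point concrete, here is a minimal sketch of what node and pipeline definitions look like in Kedro. This is not the project's actual code; the function names and dataset names (`preprocess`, `train_model`, `raw_data`, `model`, ...) are hypothetical.

```python
# nodes.py: plain functions whose arguments and return values are the data In/Out
import pandas as pd

def preprocess(raw_data: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical preprocessing step
    return raw_data.dropna()

def train_model(preprocessed: pd.DataFrame, parameters: dict) -> dict:
    # Hypothetical "model"; in practice this would fit e.g. a LightGBM model
    return {"n_rows_used": len(preprocessed), "params": parameters}

# pipeline.py: wire the functions together purely by dataset name
from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(preprocess, inputs="raw_data", outputs="preprocessed_data"),
            node(
                train_model,
                inputs=["preprocessed_data", "parameters"],
                outputs="model",
            ),
        ]
    )
```

Because the wiring is done only through dataset names, each person can own a function as long as the agreed In/Out names are respected.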
### Remaining challenges
- It is hard to decide when to turn notebook code into pipeline scripts
    - Data scientists kept doing trial and error in notebooks → data engineers periodically turned that into pipelines, but the amount of rework was large and became a burden
- I want to re-run the pipeline from the middle (it seems possible, but I haven't investigated it yet; see the sketch after this list)
- kedro-viz is not updated automatically (it only reloads after being restarted)
- I want to extract common library code somewhere (assuming it is imported from the notebook side as well as from the src side)
- I want to run jobs in parallel
- For parameter search, whether it can be integrated with Optuna
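For the partial re-execution and parallel execution points above, Kedro's CLI already has options that look relevant. I have not verified these in this project, and the exact flags depend on the Kedro version (`kedro run --help` is the authoritative list); the node names below are hypothetical.

```bash
# Re-run only a slice of the pipeline (node names are hypothetical)
kedro run --from-nodes=train_model_node
kedro run --to-nodes=preprocess_node

# Run independent nodes in parallel
kedro run --parallel                  # older Kedro releases
kedro run --runner=ParallelRunner     # newer Kedro releases
```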
## Introduced MLflow as an experiment management tool
### What is MLflow and how to introduce it
* (Reference) [Introduction to experiment management with MLflow](https://future-architect.github.io/articles/20200626/)
### Good points
- See the link above
- Even when a notebook was written to log its results to MLflow, the whole notebook could still end up being thrown away
    - → During the initial chaotic phase it may be better to manage experiments with a spreadsheet (or Excel), and move to MLflow once things have solidified to some extent
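For reference, logging to MLflow from a notebook or node only takes a few lines. This is a minimal sketch, not the project's code; the experiment name, parameters, and metric values are made up.

```python
import mlflow

# Minimal tracking sketch; names and values are hypothetical
mlflow.set_experiment("lgbm_experiments")

with mlflow.start_run():
    mlflow.log_param("num_leaves", 31)
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_metric("val_auc", 0.78)
```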
### Remaining challenges
- Integration with Kedro
    - I want the information in Kedro's parameters.yaml and pipeline.py to be logged to MLflow automatically (a sketch of a hand-rolled version follows)
    - Should we use journal versioning?
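One hand-rolled way to close that gap is to push the Kedro parameters into MLflow at the start of a run. Plugins such as kedro-mlflow aim to automate exactly this, so the snippet below is only a sketch of the idea; the project path is assumed to match the notebook example above.

```python
import mlflow
from kedro.framework.context import load_context

# Load the Kedro context and mirror its parameters into an MLflow run
context = load_context("../../../")

with mlflow.start_run():
    # context.params is the merged content of parameters.yaml
    mlflow.log_params(context.params)
```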
## Introduced GitHub Actions as a CI tool
### What is GitHub Actions and how to introduce it
* (Reference) [CI / CD to try with the new function "GitHub Actions" on GitHub](https://knowledge.sakura.ad.jp/23478/)
### Good points
- The master (main) branch is guaranteed to work
- I was able to create reproducible models (every run starts from a clean environment at a clearly identified commit id)
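A workflow along the following lines is what gives that guarantee: every push to master runs the tests and the pipeline from a clean environment. This is a sketch, not the project's actual workflow; the file paths, Python version, and action versions are assumptions.

```yaml
# .github/workflows/ci.yml -- a minimal sketch; paths and versions are assumptions
name: CI
on:
  push:
    branches: [master, main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: pip install -r src/requirements.txt
      - name: Run tests and the pipeline
        run: |
          kedro test
          kedro run
```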
### Remaining challenges
- Every build is slow; we should probably make better use of caching
- Need to reconsider the setup for heavy training jobs and for when a GPU is required
# Other
- Introduced a code formatter and checker
    - (Reference) Run the formatter from a pre-commit hook
    - Since the checks run at commit time, it is important to introduce them from the very beginning (a sketch of the configuration follows this list)
    - If you develop inside a container, install them in the container together with Git (since Python is required)
    - Need to decide how to split responsibilities between these local checks and the checks on the CI side
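For concreteness, a pre-commit setup usually boils down to a small config file plus `pre-commit install`. The hooks and pinned versions below are only an example of what such a file might contain, not what this project uses.

```yaml
# .pre-commit-config.yaml -- example hooks; the tools and versions are assumptions
repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
```

After running `pre-commit install` once, the hooks run automatically on every `git commit`.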
- Turned it into a web application with AWS Elastic Beanstalk
    - I originally wanted to build it serverless with S3 + API Gateway + Lambda, but gave up because of the package size limit
    - EFS could work around that, but then you more or less have to set up a VPC environment
    - If what you really want is a serverless container (or just something that sleeps when there are no requests), Cloud Run or GAE on GCP might be the better fit
# In conclusion
- There are still many challenges, but I would like to keep trying things out so that we can run the machine learning cycle quickly and reliably while working alongside the data scientists.
- By the way, the content of this article comes from something I tried with friends while building a model to predict horse racing results. I will probably publish the code once it is a little cleaner. (The directory structure is slightly changed from Kedro's default.)