Overview

An engineer at Mercari has published an excellent guide covering machine learning from development through operation: https://mercari.github.io/ml-system-design-pattern/README_ja.html This article expands a little on the model management part.

First, the big picture: https://github.com/arc279/model-in-package-sample That repository is really all there is to it.

The rest of this article explains the main points. I will not explain setuptools and the like, so please look those up as needed.

The sample was run in the following environment.

(.venv) $ python -V
Python 3.8.1

Non-source data can be included in python packages

Until a while ago, packaging data files with setuptools was confusing, with overlapping options such as package_data and data_files. Recently it seems to have converged on MANIFEST.in for including the files and importlib.resources (see the importlib page of the Python standard library documentation) for reading them.

Note that importlib.resources was added in Python 3.7, and older versions require something like pkg_resources instead. To be honest, pkg_resources is not easy to use, so if possible, use importlib.resources on 3.7 or later.

See the following files for how to use them.

https://github.com/arc279/model-in-package-sample/blob/master/MANIFEST.in
https://github.com/arc279/model-in-package-sample/blob/master/src/mymodel/__init__.py#L5
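To make the reading side concrete, here is a minimal sketch of loading a pickled object that sits next to a package's __init__.py. It is not taken from the sample repository: the package name demo_model_pkg, the helper read_package_bytes, and the dictionary standing in for a model are all made up for illustration. The helper prefers importlib.resources.files, which as far as I know exists from Python 3.9, and falls back to read_binary on 3.7 and 3.8.

```python
import importlib
import importlib.resources
import pickle
import sys
import tempfile
from pathlib import Path


def read_package_bytes(package, resource_name):
    """Return the bytes of a data file stored inside an importable package."""
    if hasattr(importlib.resources, "files"):
        # Python 3.9+: traversable API.
        return importlib.resources.files(package).joinpath(resource_name).read_bytes()
    # Python 3.7 / 3.8: functional API.
    return importlib.resources.read_binary(package, resource_name)


def demo():
    """Build a throwaway package on disk, import it, and load its pickle."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical package layout: demo_model_pkg/__init__.py + model.pkl
        pkg_dir = Path(tmp) / "demo_model_pkg"
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")
        (pkg_dir / "model.pkl").write_bytes(pickle.dumps({"coef": [0.1, 0.2]}))

        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            package = importlib.import_module("demo_model_pkg")
            # Only unpickle data you trust; pickle can execute arbitrary code.
            return pickle.loads(read_package_bytes(package, "model.pkl"))
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(tmp)
            sys.modules.pop("demo_model_pkg", None)


if __name__ == "__main__":
    print(demo())
```

With a real installed package you would skip the temporary directory and pass the imported package (or its dotted name) straight to the helper.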

Built into a wheel, it looks like this

The *.pkl files are included.

$ python setup.py bdist_wheel

(..snip..)

$ zipinfo -1 dist/mymodel-1.1.1_titanic.from_kaggle-py3-none-any.whl
mymodel/__init__.py
mymodel/version.py
mymodel/titanic_sample/__init__.py
mymodel/titanic_sample/models/__init__.py
mymodel/titanic_sample/models/LogisticRegression/__init__.py
mymodel/titanic_sample/models/LogisticRegression/model.pkl
mymodel/titanic_sample/models/RandomForestClassifier/__init__.py
mymodel/titanic_sample/models/RandomForestClassifier/model.pkl
mymodel/titanic_sample/models/SVC/__init__.py
mymodel/titanic_sample/models/SVC/model.pkl
mymodel/titanic_sample/models/SVC/__pycache__/__init__.cpython-38.pyc
mymodel/titanic_sample/models/__pycache__/__init__.cpython-38.pyc
mymodel-1.1.1_titanic.from_kaggle.dist-info/METADATA
mymodel-1.1.1_titanic.from_kaggle.dist-info/WHEEL
mymodel-1.1.1_titanic.from_kaggle.dist-info/top_level.txt
mymodel-1.1.1_titanic.from_kaggle.dist-info/RECORD
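A wheel is a zip archive, so the same check can be scripted with the standard zipfile module, for example as a sanity check before publishing. The following is only a sketch: list_data_files and build_demo_wheel are hypothetical helpers, and the demo archive is a stand-in so the snippet runs without a real build.

```python
import tempfile
import zipfile
from pathlib import Path


def list_data_files(wheel_path, suffix=".pkl"):
    """Return the archive members of a wheel whose names end with `suffix`."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(name for name in wheel.namelist() if name.endswith(suffix))


def build_demo_wheel(directory):
    """Create a stand-in archive (not a valid installable wheel) for the demo."""
    path = Path(directory) / "demo-0.0.0-py3-none-any.whl"
    with zipfile.ZipFile(path, "w") as wheel:
        wheel.writestr("demo/__init__.py", "")
        wheel.writestr("demo/models/SVC/__init__.py", "")
        wheel.writestr("demo/models/SVC/model.pkl", b"placeholder")
    return path


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        return list_data_files(build_demo_wheel(tmp))


if __name__ == "__main__":
    for name in demo():
        print(name)
```

Pointing list_data_files at the wheel under dist/ should list the model.pkl entries shown above; an empty result would suggest the data files were not picked up by the build.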

User side

Once it is built into a wheel, you can install it with pip.

(.venv) $ pip install dist/mymodel-1.1.1_titanic.from_kaggle-py3-none-any.whl

(..snip..)

(.venv) $ pip list
Package         Version
--------------- -------------------------
joblib          0.15.1
mymodel         1.1.1-titanic.from-kaggle
numpy           1.18.5
pandas          1.0.4
pip             19.2.3
python-dateutil 2.8.1
pytz            2020.1
scikit-learn    0.23.1
scipy           1.4.1
setuptools      41.2.0
six             1.15.0
threadpoolctl   2.1.0
wheel           0.34.2

Import it

(.venv) $ ipython
In [1]: import mymodel

In [2]: mymodel.__version__
Out[2]: '1.1.1-titanic.from-kaggle'

Read the data in the package

This continues the ipython session above.

In [3]: import importlib.resources

In [4]: import pickle

In [5]: import mymodel.titanic_sample.models.LogisticRegression

In [6]: b = importlib.resources.read_binary(mymodel.titanic_sample.models.LogisticRegression, "model.pkl")

In [9]: len(b)
Out[9]: 739

In [10]: c = pickle.loads(b)

In [11]: c.__class__
Out[11]: sklearn.linear_model._logistic.LogisticRegression

As you can see, it works. See the references linked above for details.

By the way

Once you have grasped the points above, it follows that a Python package containing only data is also possible. How much to include depends on the project, so there is room to consider various approaches.

Finally, about versioning

The version conventions for Python packages are fairly loose, so semantic versioning can be adopted. That means you can use this example from Mercari as is. https://mercari.github.io/ml-system-design-pattern/Operation-patterns/Data-model-versioning-pattern/design_ja.html

The sample does it like this. https://github.com/arc279/model-in-package-sample/blob/master/setup.cfg#L3 https://github.com/arc279/model-in-package-sample/blob/master/src/mymodel/version.py
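On the consuming side, a version string such as '1.1.1-titanic.from-kaggle' can be split back into its parts. The sketch below assumes, without confirmation from the Mercari document or the sample repository, that everything before the first hyphen is major.minor.patch and the rest is a free-form data label; ModelVersion, parse_model_version and is_compatible are hypothetical names. Also note that, as far as I know, recent setuptools and pip releases validate versions against PEP 440, where a suffix like this would be written as a local version label after a plus sign, so the exact format may need adjusting on a newer toolchain.

```python
from typing import NamedTuple


class ModelVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    label: str  # free-form suffix such as a dataset name; empty if absent


def parse_model_version(version: str) -> ModelVersion:
    """Split '<major>.<minor>.<patch>[-<label>]' into its components."""
    core, _, label = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    return ModelVersion(major, minor, patch, label)


def is_compatible(installed: str, required: str) -> bool:
    """True when the major versions match and `installed` is not older."""
    have = parse_model_version(installed)
    need = parse_model_version(required)
    # Compare only the numeric triple; the label is ignored here.
    return have.major == need.major and have[:3] >= need[:3]


if __name__ == "__main__":
    print(parse_model_version("1.1.1-titanic.from-kaggle"))
    print(is_compatible("1.1.1-titanic.from-kaggle", "1.0.0"))
```

Ignoring the label in is_compatible is a design choice for the sketch; a project that treats a change of training data as a breaking change would want to compare the label as well.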

That is all. See the sample GitHub repository for the full picture.

Attempt to include machine learning model in python package