Install Python's C-dependent modules as wheels with a multi-stage build

Background

Running pip install for C-dependent modules on Alpine is heavy going: pip has to compile them from source, which takes a long time.

Procedure

  1. Build the required modules on an Alpine image
  2. Collect the build products (wheel files) in one place for easy reuse
  3. Push the result to Docker Hub as a prebuilt image
  4. In a multi-stage build, copy only the wheel directory into the Alpine image for the execution environment and install from it
  5. Confirm that it works and clean up

Implementation

First, list the required modules. This time I will install the following modules.

requirements.txt


cycler==0.10.0
Cython==0.29.17
h5py==2.10.0
joblib==0.14.1
kiwisolver==1.2.0
matplotlib==3.2.1
numpy==1.18.4
pandas==1.0.3
Pillow==7.1.2
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.1
scikit-learn==0.22.2.post1
scipy==1.3.3
six==1.14.0

Collect the required modules in one place

On Alpine, C-dependent modules are downloaded as source archives such as tar and zip. These need to be converted to whl format.

Preparation

Building the wheels requires the libraries the modules depend on, so install them via apk.

apk update \
  && apk add --virtual .build --no-cache openblas-dev lapack-dev freetype-dev 
...
  && apk add --virtual .community_build --no-cache -X http://dl-cdn.alpinelinux.org/alpine/edge/community hdf5-dev

Prepare the necessary whl files

You can also fetch the modules with pip download, but use the pip wheel command instead, because it downloads the tar / zip archive, extracts it, and builds it automatically. Since pip wheel also accepts the -r option, pass it a file of pinned versions, for example one generated with pip freeze > requirements.txt.

pip wheel --no-cache-dir --wheel-dir=./whl -r requirements.txt

Option supplement:

  --no-cache-dir: Do not use or create a cache. If not specified, downloads are cached under ~/.cache/pip, and wheels built along the way are cached as well.
  --wheel-dir: The output destination for the wheel files.
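If the build is driven from a script rather than typed by hand, the same command can be assembled in Python. This is only a minimal sketch: the names `pip_wheel_command` and `build_wheels` are made up for illustration, and pip is invoked as `python -m pip` so that the wheels are built by the interpreter running the script.

```python
import subprocess
import sys


def pip_wheel_command(requirements="requirements.txt", wheel_dir="./whl"):
    """Return the argv list for the pip wheel call (hypothetical helper)."""
    return [
        sys.executable, "-m", "pip", "wheel",
        "--no-cache-dir",            # do not read or write pip's cache
        "--wheel-dir", wheel_dir,    # where the built .whl files are written
        "-r", requirements,          # file of pinned versions to build
    ]


def build_wheels(requirements="requirements.txt", wheel_dir="./whl"):
    """Run pip wheel; raises CalledProcessError if any build fails."""
    subprocess.run(pip_wheel_command(requirements, wheel_dir), check=True)
```

Keeping the command as a list avoids shell quoting problems, and `check=True` makes a failed build stop the script instead of passing silently.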

Notes on using pip wheel

Unfortunately, in this case **it fails partway through**. I am using a requirements.txt generated with pip freeze in another Alpine environment. Since numpy and scipy are not yet installed in the build environment, the scikit-learn build fails.

With pip install -r requirements.txt, pip sorts this out on its own,[^1] but this time there is no choice but to install the dependent modules first.[^2]

[^1]: pip runs the installation in one pass without considering dependencies or priorities, so the same failure occurs with pip install. However, modules that fail midway because of a "circular dependency" are built again once all the other modules are installed, which avoids the problem.

[^2]: With scipy~=1.4 the build failed with an error in my environment, so I pinned the 1.3 series, which installed without trouble.

pip install cython numpy==1.18.4 scipy==1.3.3
pip wheel --no-cache-dir --wheel-dir=./whl -r requirements.txt
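Typing the versions by hand in the first command risks drifting away from requirements.txt. As a rough sketch, the pins for the modules that must be installed first can be read out of the same file. The helper name `pins_for` is hypothetical, and the parsing assumes simple `name==version` lines like the ones in this article, not the full requirements syntax.

```python
def pins_for(requirements_text, names):
    """Return the 'name==version' lines whose name is in `names` (case-insensitive)."""
    wanted = {name.lower() for name in names}
    pins = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and anything that is not a plain pin
        name = line.split("==", 1)[0].strip()
        if name.lower() in wanted:
            pins.append(line)
    return pins


requirements = """\
Cython==0.29.17
numpy==1.18.4
scikit-learn==0.22.2.post1
scipy==1.3.3
"""
first = pins_for(requirements, ["cython", "numpy", "scipy"])
print("pip install " + " ".join(first))
```

Note that this also pins Cython to the version in the file, whereas the command above installs it unpinned.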

I set out to create a separate image precisely to avoid building numpy and scipy, so I feel like I'm doing something pointless here...?

Push to Docker Hub after the build completes

Tag the image appropriately and push it.

docker tag 123456789a hoge/builder-image:latest
docker push hoge/builder-image:latest

Bring the build products into the execution environment

From here, we will work on the Dockerfile for the execution environment.

Install wheels from a local directory

To install multiple modules with pip install, either list them explicitly or pass a text file with --requirement. pip has no option that installs every whl collected in a directory in one go.

This time, in the multi-stage build, COPY the directory containing the wheels and run the following command to install from the local wheels.

pip install --no-index --no-deps --no-cache-dir -f ./whl -r requirements.txt

Option supplement:

  --no-index: Do not use index sites such as PyPI. Use it when you do not want to go online.
  --no-deps: Do not install dependent modules. However, this does not seem to apply when the dependency is explicitly specified on the module side.
  -f, --find-links: Specify where to search for modules. Use it when you want to point at a local path.
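Because --no-index removes the fallback to PyPI, a pin without a matching wheel in ./whl makes the install fail. A quick pre-check can be sketched as below. The helper `missing_wheels` is hypothetical; it assumes plain `name==version` pins, compares names after lowercasing and treating runs of `-`, `_` and `.` alike, and ignores the Python, ABI, and platform tags in the wheel filename.

```python
import re
from pathlib import Path


def normalize(name):
    """Lowercase and collapse runs of '-', '_' and '.' so names compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def missing_wheels(requirements_text, wheel_dir):
    """Return the pins that have no '<name>-<version>-...whl' file in wheel_dir."""
    available = set()
    for wheel in Path(wheel_dir).glob("*.whl"):
        parts = wheel.name.split("-")
        # name, version, optional build tag, python tag, abi tag, platform tag
        if len(parts) >= 5:
            available.add((normalize(parts[0]), parts[1]))

    missing = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # only plain pins are checked in this sketch
        name, version = (part.strip() for part in line.split("==", 1))
        if (normalize(name), version) not in available:
            missing.append(line)
    return missing
```

An empty list means every pinned module has a wheel with the same name and version; it does not prove the wheel was built for the right platform.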

Modules installed with --upgrade

Modules such as pip and setuptools that you want to install with the --upgrade option go into a separate text file for upgrades. Entries in a text file referenced by the -r option can be listed without a version.

upgrade.txt


pip
setuptools
wheel

Upgrade the modules from the wheels in a specific directory with the following command.

pip install -U --no-index --no-deps --no-cache-dir -f ./upgrade  -r upgrade.txt

However, since this increases the number of files to manage, it is better to write the upgrade directly in the Dockerfile unless you are in an offline environment.

Execution confirmation

Check that the modules can be imported. Create a shell script and run it directly with the RUN instruction.

import_test.sh


#!/bin/sh
python -c "import numpy"
python -c "import scipy"
python -c "import h5py"
python -c "import pandas"
python -c "import matplotlib"
python -c "import sklearn"
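The same check can also be written as a single Python function that collects every failing module, so the caller can report them all and exit with a non-zero status if any import fails (the shell script, without `set -e`, exits with the status of its last command only). This is a minimal sketch; the function name `failed_imports` is made up for illustration.

```python
import importlib


def failed_imports(module_names):
    """Try to import each module and return {name: error message} for the failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # a broken C extension can raise more than ImportError
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```

Calling it with `["numpy", "scipy", "h5py", "pandas", "matplotlib", "sklearn"]` mirrors the script above; an empty dictionary means every import succeeded.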

Clean up

Delete extra files to make the Docker image lighter. The image used to build the whl files only needs to hold the build products, so erase everything else.

builder-image


apk del --purge .build .testing_build
pip freeze | xargs pip uninstall -y
pip cache purge

Check how much lighter the built image became after deleting the extra files. It slimmed down by about **360MB**, so the diet seems to have worked.

# docker images
REPOSITORY              TAG                 IMAGE ID            CREATED             SIZE
naka345/wheel_build     latest              b6c9df898334        9 minutes ago       1.04GB 
naka345/wheel_build     latest              3236cf2f87de        2 days ago          639MB

Next, tidy up the execution environment side. The official Python Docker image handles this very neatly, so I delete files in the same way.[^3]

[^3]: In an execution environment where the modules were installed with --no-cache-dir, running pip cache purge finds no cache files and returns an error code. It is a minor thing, but it makes the command awkward to use.

execution-image


# Register only the shared libraries the modules need at runtime as a new virtual package
find /usr/local -type f -executable -not \( -name '*tkinter*' \) -exec scanelf --needed --nobanner --format '%n#p' '{}' ';' \
    | tr ',' '\n' \
    | sort -u \
    | awk 'system("[ -e /usr/local/lib/" $1 " ]") == 0 { next } { print "so:" $1 }' \
    | xargs -rt apk add --no-cache --virtual .module-rundeps && \
  # Remove all packages used at build time
  apk del --purge .build .community_build
# Delete unneeded files on the Python side (test directories and bytecode)
find /usr/local -depth \
		\( \
			\( -type d -a \( -name test -o -name tests -o -name idle_test \) \) \
			-o \
			\( -type f -a \( -name '*.pyc' -o -name '*.pyo' \) \) \
		\) -exec rm -rf '{}' + 

# Remove the wheel directory that was only needed for this install
rm -rf /tmp/whl
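To make the second find expression easier to follow, here is roughly what it does, written as a Python sketch: it removes directories named test, tests, or idle_test and files ending in .pyc or .pyo. The function name `prune_python_tree` is hypothetical, and this is an illustration of the logic, not a replacement for the command used in the image.

```python
import os
import shutil

TEST_DIR_NAMES = {"test", "tests", "idle_test"}
BYTECODE_SUFFIXES = (".pyc", ".pyo")


def prune_python_tree(root):
    """Delete test directories and bytecode files under root; return removed paths."""
    removed = []
    # topdown=False visits children before their parents, like find -depth
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if filename.endswith(BYTECODE_SUFFIXES) and not os.path.islink(path):
                os.remove(path)
                removed.append(path)
        for dirname in dirnames:
            path = os.path.join(dirpath, dirname)
            # skip symlinks: find's -type d matches real directories only
            if dirname in TEST_DIR_NAMES and not os.path.islink(path):
                shutil.rmtree(path)
                removed.append(path)
    return removed
```

In the image this would be called with `/usr/local` as the root; the returned list is handy for checking what was actually deleted.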

Let's compare it with the image built without deleting anything on the execution environment side.

# docker images
REPOSITORY              TAG                 IMAGE ID            CREATED             SIZE
naka345/wheel_install   latest              f0df8a9887de        3 hours ago         1.29GB
↓
naka345/wheel_install   latest              27b4805053f2        3 hours ago         968MB

I managed to keep it below 1GB.

Turn it into a Dockerfile

Based on the above, write everything down in a Dockerfile. Since it is long, I have linked to it on GitHub instead of pasting it here.

Summary

Modules that take a long time to build can now be installed via pip safely and relatively quickly. The Docker image has also become slightly lighter.

However, the need to maintain multiple images is left unresolved. Since requirements.txt has to stay consistent between them, would it be easier if there were a mechanism that pushed both images to Docker Hub whenever it is updated?
