[PYTHON] Get data files from elsewhere during pip install

Overview

I'm not sure whether this is acceptable practice, but I wanted to fetch the data a package needs from somewhere other than the PyPI server at install time. The two data-file arguments that can be given in setup.py, package_data and data_files, both appear to support only files that are distributed together with the source code. I therefore hooked the install_data command, which is called during pip install, so that it downloads the necessary files separately.

Command hook

To hook or replace a command, pass the cmdclass keyword argument to the setuptools.setup function. Its value is a dictionary whose keys are the names of the commands to replace and whose values are the classes that implement them. (Reference: Integration of new commands)

The install_data command being replaced here is implemented in the distutils.command.install_data module. Using its source as a reference, I wrote the following custom install_data command.

import distutils.command.install_data
from os import path
import site
import sys
import urllib


class CustomInstallData(distutils.command.install_data.install_data):

    def run(self):
        # self.data_files contains the value passed to the data_files argument of the setup function.
        # It is a list of (destination directory, list of data file paths) tuples, so each one is processed individually.
        for f in self.data_files:
            if not isinstance(f, tuple):
                continue
            
            for i, u in enumerate(f[1]):
                # If the data file path is not a URL, do nothing.
                if not u.startswith("http"):
                    continue

                # If the target data file already exists locally, reuse it.
                base = path.basename(u)
                f[1][i] = path.join(sys.prefix, f[0], base)
                if not path.exists(f[1][i]):
                    f[1][i] = path.join(sys.prefix, "local", f[0], base)
                if not path.exists(f[1][i]):
                    f[1][i] = path.join(site.getuserbase(), f[0], base)
                if not path.exists(f[1][i]):
                    # If it is not found locally, download it.
                    f[1][i] = urllib.urlretrieve(u, base)[0]

        # The rest of the processing is delegated to the original command.
        return distutils.command.install_data.install_data.run(self)

Since the basic processing is the same as the original install_data command, create a class that inherits from distutils.command.install_data.install_data. Note that the parent class is distutils.command.install_data.install_data (the class), not distutils.command.install_data (the module that contains it).

Since the run method is the body of the command, override this method. self.data_files contains the list of [(directory, files) pairs passed to the data_files argument of the setup function](http://docs.python.jp/2.7/distutils/setupscript.html#installing-additional-files). Here, directory is the destination directory for the data files, and files is a list of data file paths to install.

Each element is taken from self.data_files in turn, and the individual data file paths in it are processed one by one. Paths that are not URLs are handled correctly by the normal processing, so nothing is done for them.
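The filtering step can also be tried on its own. The following is a minimal Python 3 sketch, not the code used in this article: the helper names is_url and split_remote are hypothetical, and the scheme is checked with urllib.parse.urlparse rather than with the startswith("http") test used in the command above.

```python
from urllib.parse import urlparse


def is_url(path):
    """Return True if path looks like an http or https URL."""
    return urlparse(path).scheme in ("http", "https")


def split_remote(data_files):
    """Split (directory, files) pairs into remote and local entries.

    Returns two lists of (directory, path) tuples: URLs first,
    ordinary file paths second. Entries that are not tuples are
    skipped, as in the custom command.
    """
    remote, local = [], []
    for entry in data_files:
        if not isinstance(entry, tuple):
            continue
        directory, files = entry
        for f in files:
            (remote if is_url(f) else local).append((directory, f))
    return remote, local
```

Checking the parsed scheme is slightly stricter than a prefix test: a local file whose name merely begins with "http" is not mistaken for a URL.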

Next, if the target data file is already installed, I would like to reuse it to avoid unnecessary downloads. According to the manual, data files are saved as follows:

If directory is a relative path, it is interpreted as a relative path from the installation prefix (sys.prefix for pure Python packages, sys.exec_prefix for packages with extensions).

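So, if the name of the file you are trying to install is filename, you can check whether it already exists under the installation prefix, that is, at os.path.join(sys.prefix, directory, filename) (or with sys.exec_prefix in place of sys.prefix for packages with extensions).

However, on my Ubuntu machine, the file was saved one level deeper, at os.path.join(sys.prefix, "local", directory, filename), in either case. Therefore, the code above also checks the path that includes local.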

It is also possible that the package was installed into the user directory with pip install --user. In that case, the package is saved under site.USER_BASE, so the code above uses site.getuserbase() to obtain the value of site.USER_BASE and checks for the file there as well.
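The three locations discussed above can be collected in one small helper. This is a sketch under the assumption of Python 3; the function names candidate_paths and find_installed are hypothetical and do not appear in the command itself.

```python
import os
import site
import sys


def candidate_paths(directory, filename):
    """List the places where an installed data file might already exist."""
    return [
        # Default location: relative to the installation prefix.
        os.path.join(sys.prefix, directory, filename),
        # Location observed on the author's Ubuntu machine.
        os.path.join(sys.prefix, "local", directory, filename),
        # Location used by pip install --user.
        os.path.join(site.getuserbase(), directory, filename),
    ]


def find_installed(directory, filename):
    """Return the first existing candidate path, or None if there is none."""
    for p in candidate_paths(directory, filename):
        if os.path.exists(p):
            return p
    return None
```

For example, find_installed("rgmining/data", "TripAdvisorJson.tar.bz2") would return the path of an already installed copy, or None if a download is needed.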

Finally, if the target data file is not found locally, it is downloaded using urllib.urlretrieve.
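Note that urllib.urlretrieve is the Python 2 name; in Python 3 the same function lives at urllib.request.urlretrieve. The following is a minimal Python 3 sketch of the download step, where the helper name fetch and the dest_dir parameter are assumptions made for illustration.

```python
import os
import urllib.request


def fetch(url, dest_dir="."):
    """Download url into dest_dir and return the local file path.

    The local file name is the last path component of the URL,
    as in the custom command.
    """
    dest = os.path.join(dest_dir, os.path.basename(url))
    local_path, _headers = urllib.request.urlretrieve(url, dest)
    return local_path
```

Like the command, this derives the file name with basename, so it assumes the URL ends in a plain file name without a query string.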

The rest of the processing is the same as the original install_data command, so the method of the parent class is called. Since distutils.command.install_data.install_data is an old-style class in Python 2, super() cannot be used, and the parent method is called explicitly as distutils.command.install_data.install_data.run(self).

setup function

The setup function call using the above custom command is as follows (unrelated items are omitted).

setup(
    data_files=[(
        "rgmining/data",
        ["http://times.cs.uiuc.edu/~wang296/Data/LARA/TripAdvisor/TripAdvisorJson.tar.bz2"]
    )],
    cmdclass={
        "install_data": CustomInstallData
    },
    # ... other items omitted
)

CustomInstallData is passed to the cmdclass argument as the implementation class of the install_data command. With this in place, URLs can be passed to data_files.

Summary

To avoid registering a huge file on the PyPI server, I described a way to download the necessary file from another server during pip install. However, this approach is weak from a security standpoint because it does not check the hash of the downloaded file. I also suspect there is a simpler way to achieve this without going to such trouble, so please let me know if you know of one.
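As for the missing hash check, one possible approach is to record the expected SHA-256 digest in setup.py and compare it after the download. This is only a sketch, assuming Python 3 and that the package author knows the correct digest in advance; the function names sha256_of and verify are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_sha256):
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError("hash mismatch for %s: got %s" % (path, actual))
```

Calling verify on the downloaded file before handing it to the original install_data command would abort the installation when the file has been altered or truncated.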
