[PYTHON] Conveniently upload to a Kaggle Dataset

Code competitions tied to the Kaggle notebook environment

Recently, Kaggle has been hosting more code competitions in which inference can only run inside Kaggle's notebook environment. In deep-learning competitions, the parameter files of locally trained models are therefore often uploaded to a Kaggle Dataset and used from there.

The Kaggle API commands are still a hassle

The Kaggle API commands let you skip the manual work in the WebUI and automate data download/upload, but editing and creating the metadata JSON file and assembling the API commands is still tedious. So I wrote a wrapper function that can be called from Python, and I am sharing it here. By feeding the function from the YAML file that describes the experimental parameters, the experimental conditions are automatically reflected in the dataset comments, which helps prevent mistakes and saves effort.
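For example, the comment string can be built from such a YAML file. This is only a sketch assuming PyYAML is installed; the file name and keys (config.yaml, exp_name, lr, epochs) are hypothetical:

import yaml

# Hypothetical experiment config; adapt the file name and keys to your own setup.
with open('config.yaml') as f:
    cfg = yaml.safe_load(f)

# Build the upload comment from the experimental parameters.
comments = f"exp={cfg['exp_name']}, lr={cfg['lr']}, epochs={cfg['epochs']}"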

Necessary preparation

- You need to have the Kaggle API installed and an API token generated. See the related articles for details.
- Naturally, the files you want to upload must exist at the given path.
- This function assumes a directory named model containing one subdirectory per experiment, such as model_exp_XX, in which the model parameter files are located (a sketch of the assumed layout follows this list).
- Specify the extension of the model files via the function argument and change it as needed, e.g. .pth or .h5.
- If you pass a logger, the output is also written to the log file.
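A minimal sketch of the assumed directory layout; the experiment and file names are illustrative only:

from pathlib import Path

# Assumed layout (names are hypothetical):
#   model/
#   └── model_exp_01/
#       ├── fold0.pth
#       └── fold1.pth
exp_dir = Path('model') / 'model_exp_01'        # one subdirectory per experiment
weight_files = sorted(exp_dir.glob('*.pth'))    # the files the upload function will pick up
print(weight_files)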


import subprocess
import glob
import json
import os
def upload_to_kaggle(
        title: str,
        k_id: str,
        path: str,
        comments: str,
        update: bool,
        logger=None,
        extension='.pth',
        subtitle='',
        description='',
        isPrivate=True,
        licenses='unknown',
        keywords=[],
        collaborators=[]):
    '''
    >> upload_to_kaggle(title, k_id, path, comments, update)

    Arguments
    =========
     title: the title of your dataset.
     k_id: your kaggle account id.
     path: non-default string argument, the path of the directory holding the files to upload.
     comments: non-default string argument, the comment or version note for this upload.
     update: if True, push a new version of an existing dataset instead of creating a new one.
     logger: logger object if you use logging, default is None.
     extension: the file extension of the model weight files, default is ".pth".
     subtitle: the subtitle of your dataset, default is an empty string.
     description: the dataset description, default is an empty string.
     isPrivate: boolean, whether to keep the dataset private, default is True.
     licenses: the license name, default is "unknown"; must be one of
     ['CC0-1.0', 'CC-BY-SA-4.0', 'GPL-2.0', 'ODbL-1.0', 'CC-BY-NC-SA-4.0', 'unknown', 'DbCL-1.0', 'CC-BY-SA-3.0', 'copyright-authors', 'other', 'reddit-api', 'world-bank'].
     keywords: the list of keywords about the dataset, default is an empty list.
     collaborators: the list of dataset collaborators, default is an empty list.
    '''
    # Reject a trailing slash before building the glob pattern.
    if path[-1] == '/':
        raise ValueError('Please remove the trailing slash at the end of the path')

    model_list = glob.glob(path + f'/*{extension}')
    if len(model_list) == 0:
        raise FileNotFoundError('File does not exist; check that the file extension is correct '
                                'and that the directory exists.')
    
    # Metadata for dataset-metadata.json, as expected by the Kaggle API.
    data_json = {
        "title": title,
        "id": f"{k_id}/{title}",
        "subtitle": subtitle,
        "description": description,
        "isPrivate": isPrivate,
        "licenses": [
            {
                "name": licenses
            }
        ],
        "keywords": keywords,
        "collaborators": collaborators,
        "data": []
    }
    
    # One metadata entry per model file found in the directory.
    data_list = []
    for mdl in model_list:
        mdl_nm = mdl.replace(path + '/', '')
        mdl_size = os.path.getsize(mdl)
        data_dict = {
            "description": comments,
            "name": mdl_nm,
            "totalBytes": mdl_size,
            "columns": []
        }
        data_list.append(data_dict)
    data_json['data'] = data_list

    
    with open(path+'/dataset-metadata.json', 'w') as f:
        json.dump(data_json, f)
    
    # subprocess receives the arguments as a list, so no extra shell quoting is needed around the message.
    script0 = ['kaggle', 'datasets', 'create', '-p', f'{path}', '-m', f'{comments}']
    script1 = ['kaggle', 'datasets', 'version', '-p', f'{path}', '-m', f'{comments}']

    #script0 = ['echo', '1']
    #script1 = ['echo', '2']

    # Run the kaggle CLI: create a new dataset, or push a new version when update=True.
    if logger:
        logger.info(data_json)
        
        if update:
            logger.info(script1)
            logger.info(subprocess.check_output(script1))
        else:
            logger.info(script0)
            logger.info(script1)
            logger.info(subprocess.check_output(script0))
            logger.info(subprocess.check_output(script1))
            
    else:
        print(data_json)
        
        if update:
            print(script1)
            print(subprocess.check_output(script1))
        else:
            print(script0)
            print(script1)
            print(subprocess.check_output(script0))
            print(subprocess.check_output(script1))
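A minimal usage sketch; the title, account id, path, and comment below are placeholders to replace with your own values:

if __name__ == '__main__':
    # Hypothetical values; replace with your own account id, experiment directory, and comment.
    upload_to_kaggle(
        title='my-model-exp-01',
        k_id='your-kaggle-id',
        path='./model/model_exp_01',
        comments='exp01: baseline, lr=1e-3, 10 epochs',
        update=False,   # True when pushing a new version of an existing dataset
        extension='.pth',
    )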

If you know a more efficient way to do this, please leave a comment.

Related articles:

- Download data to GCP easily with the Kaggle API
- Automate using your code on GitHub in Kaggle code competitions with CI
