Use Azure ML Python SDK 2: Use dataset as input - Part 2

What this article covers

In Use Azure ML Python SDK 1: Use dataset as input - Part 1, the input dataset was specified by the caller of script.py. Since Azure Machine Learning Workspace lets you register datasets, it is natural to want to retrieve a registered dataset directly from within script.py. This article shows how to do that.

Components used

The following components appear in this article:

- A CSV file (assumed to be located in Azure Blob Storage)
  - Here, the CSV file has already been registered as a dataset through the Azure Machine Learning Studio UI (a sketch of the equivalent SDK registration follows this list)
- A remote virtual machine (hereinafter "compute cluster", following Azure ML terminology)
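
For reference, the same registration can also be done from the SDK instead of the Studio UI. The following is a minimal sketch, assuming the file sits on the workspace's default blob datastore under a hypothetical path:

```python
from azureml.core import Dataset, Datastore, Workspace

workspace = Workspace.from_config()
datastore = Datastore.get(workspace, 'workspaceblobstore')  # assumption: the default blob datastore

# Create a tabular dataset from the CSV (hypothetical path) and register it as 'hello_ds'
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'demo/HelloWorld.txt'), header=False)
dataset.register(workspace, name='hello_ds', create_new_version=True)
```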

Last time I used a local PC (with Visual Studio Code and the Azure ML Python SDK installed) rather than a Jupyter Notebook, but both work the same way in that the script is executed on a remote compute cluster. That said, launching Jupyter Notebook from a compute instance in Azure Machine Learning Studio is convenient, because you can recreate the compute instance to pick up the latest Azure ML Python SDK version.

[Image: Azureml2.png]

The folder structure of the Notebook is assumed to be as follows. You don't need to prepare config.json, because the notebook runs inside the Azure ML Workspace environment where it is resolved automatically.

[Image: Azureml4.png — Notebook folder structure]

In this example as well, script1.1.py simply reads the CSV file on Blob Storage and writes it to the outputs directory. Likewise, HelloWorld1.1.ipynb's job is to send script1.1.py to the compute cluster for execution.

The procedure in HelloWorld1.1.ipynb is as follows. Unlike last time, there is no step that specifies the path of the CSV file on Blob Storage.

[Image: Azureml5.png — HelloWorld1.1.ipynb procedure]

Procedure

Let's take a look at the steps in order.

  1. Load the packages
    First, load the required packages.

    
    import azureml.core
    from azureml.core import Workspace, Experiment, Dataset, Datastore, ScriptRunConfig
    from azureml.core.compute import ComputeTarget
    from azureml.core.compute_target import ComputeTargetException
    from azureml.core.runconfig import RunConfiguration, DEFAULT_CPU_IMAGE
    from azureml.core.conda_dependencies import CondaDependencies
    
    workspace = Workspace.from_config()
    
  2. Specify the compute cluster
    You can also create remote compute resources with the Python SDK, but here I created a compute cluster in the Azure ML Studio workspace in advance, to keep the overall picture easier to follow. If the lookup fails, the cluster can be provisioned from the SDK as well; see the sketch after this code.

    
    aml_compute_target = "demo-cpucluster"  # <== The name of the cluster being used
    try:
        aml_compute = ComputeTarget(workspace, aml_compute_target)
        print("found existing compute target.")
    except ComputeTargetException:
        print("no compute target with the specified name found")
    
  3. Specify the container environment
    Here, specify the execution environment: set the compute cluster as the target and specify the packages to install in the container image. Only pip_packages is used here, but conda_packages can be specified as well; see the sketch after this code.

    run_config = RunConfiguration()
    run_config.target = aml_compute
    run_config.environment.docker.enabled = True
    run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE
    run_config.environment.python.user_managed_dependencies = False  
    
    run_config.environment.python.conda_dependencies = CondaDependencies.create(
        pip_packages=['azureml-defaults'], 
        pin_sdk_version=False
        )
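
    For example, conda packages can be specified alongside pip packages like this (the package name is just an illustrative assumption):

    run_config.environment.python.conda_dependencies = CondaDependencies.create(
        conda_packages=['pandas'],
        pip_packages=['azureml-defaults'],
        pin_sdk_version=False
        )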
    
  4. Specify the script to execute
    With script_folder, specify the folder that contains the set of scripts to be executed remotely. With script, specify the file name of the script that is the entry point for remote execution. During remote execution, all files and subdirectories in script_folder are sent to the container, so be careful not to place unnecessary files there. The input file is not specified here because script1.1.py fetches it by itself.

    
    src = ScriptRunConfig(source_directory='script_folder', script='script1.1.py',
                          run_config = run_config)
    
  5. Run the experiment
    experiment_name is used as the display name for the experiment.

    
    experiment_name = 'ScriptRunConfig2'
    experiment = Experiment(workspace = workspace, name = experiment_name)
    
    run = experiment.submit(config=src)
    run
    

The submit call returns asynchronously, so if you want to wait for the run to finish, execute the following statement.

```python
%%time
run.wait_for_completion(show_output=True)
```
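
Once the run has completed, the file that script1.1.py writes to the outputs folder can also be downloaded programmatically. A minimal sketch (the local destination path is just an example):

```python
# Download the CSV that the remote run wrote to its outputs folder
run.download_file(name='outputs/HelloWorld.csv', output_file_path='./HelloWorld.csv')
```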
  6. script1.1.py
    These are the contents of the script executed remotely. Run.get_context() retrieves the context of the current run. The experiment is fetched as an attribute of this run, and the workspace is fetched in turn as an attribute of the experiment. Once the workspace is known, the dataset registered in it can be retrieved with get_by_name. This get_by_name call is written in the same format that is shown on the "Consume" tab of the registered dataset in Azure Machine Learning Studio.
    Finally, the script writes a file to the outputs folder. This outputs folder is created by default without any extra steps, and after execution its contents can be viewed under the experiment's "Outputs and logs".

    
    from azureml.core import Run, Dataset, Workspace
    
    # Get the context of the current run, then walk up to the experiment and workspace
    run = Run.get_context()
    exp = run.experiment
    workspace = exp.workspace
    
    # Retrieve the dataset registered in the workspace and load it into a DataFrame
    dataset = Dataset.get_by_name(workspace, name='hello_ds')
    df = dataset.to_pandas_dataframe()
    
    HelloWorld = df.iloc[0, 1]
    print('*******************************')
    print('********* ' + HelloWorld + ' *********')
    print('*******************************')
    
    # Anything written to ./outputs is automatically uploaded with the run
    df.to_csv('./outputs/HelloWorld.csv', mode='w', index=False)
    
  7. [Reference] Contents of HelloWorld.txt
    The CSV file used here is simple:

    0,Hello World
    1,Hello World
    2,Hello World
    

Conclusion

There are several variations of input/output handling in the Azure ML Python SDK. Next time, I would like to introduce the output side.

Reference material

- What is Azure Machine Learning SDK for Python
- azureml.core.experiment.Experiment class - Microsoft Docs
- Use Azure ML Python SDK 1: Use dataset as input - Part 1
- Use Azure ML Python SDK 3: Write output to Blob storage - Part 1
- [Use Azure ML Python SDK 4: Write output to Blob storage - Part 2](https://qiita.com/notanaha/items/655290670a83f2a00fdc)
