[PYTHON] Execute API of Cloud Pak for Data analysis project Job with environment variables

In an analysis project of Cloud Pak for Data (CP4D), Notebooks and Data Refinery Flows can be converted into Jobs and executed in batch. What I want to do this time comes down to two points: starting a Job via the API, and passing environment variables to it at that time.

Strictly speaking, for a Job the expression "set environment variables and start" is more accurate than "pass arguments at run time". I presume the environment variables are treated as an OpenShift ConfigMap, probably because the Job is launched internally as an OpenShift pod.

Let's actually start the Job via the API, supply environment variables at that time, and have them picked up by the processing logic.

Create Notebook

Create a Notebook and turn it into a Job. The environment variables handled this time are "MYENV1", "MYENV2", and "MYENV3"; their values are put into a pandas DataFrame and written out as CSV to the data assets of the analysis project. Of course, these environment variables are not defined by default, so set fallback values with the default argument of os.getenv.

import os
myenv1 = os.getenv('MYENV1', default='no MYENV1')
myenv2 = os.getenv('MYENV2', default='no MYENV2')
myenv3 = os.getenv('MYENV3', default='no MYENV3')

print(myenv1)
print(myenv2)
print(myenv3)
# -output-
# no MYENV1
# no MYENV2
# no MYENV3

Next, put these three values into a pandas DataFrame:

import pandas as pd
df = pd.DataFrame({'myenv1' : [myenv1], 'myenv2' : [myenv2], 'myenv3' : [myenv3]})
df
# -output-
#	myenv1	myenv2	myenv3
# 0	no MYENV1	no MYENV2	no MYENV3

Export the DataFrame as a data asset of the analysis project, adding a timestamp to the file name. Saving data to the data assets of an analysis project is covered in [this article](https://qiita.com/ttsuzuku/items/eac3e4bedc020da93bc1#%E3%83%87%E3%83%BC%E3%82%BF%E8%B3%87%E7%94%A3%E3%81%B8%E3%81%AE%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E4%BF%9D%E5%AD%98-%E5%88%86%E6%9E%90%E3%83%97%E3%83%AD%E3%82%B8%E3%82%A7%E3%82%AF%E3%83%88).

from project_lib import Project
project = Project.access()
import datetime
now = datetime.datetime.now(datetime.timezone(datetime.timedelta(hours=9))).strftime('%Y%m%d_%H%M%S')

project.save_data("jov_env_test_"+now+".csv", df.to_csv(),overwrite=True)

Create Job

From the Notebook menu, select File > Save Versions to save a version; this is required when creating a Job. Then click the Jobs button at the top right of the Notebook screen and select Create Job. Give the Job a name and click Create.

Run Job

Let's execute the created Job on the CP4D screen. First, just click the "Run Job" button and run it without defining any environment variables.

The run is fine once the Job executes and its status becomes "Completed".

Looking at the data assets of the analysis project, a CSV file has been generated.

Clicking the file name to see the preview shows that the default values set in the Notebook are stored.

Next, set the environment variables and run the Job again. Click "Edit" under "Environment Variables" on the Job screen and set the following three lines.

MYENV1=1
MYENV2=hoge
MYENV3=10.5


After submitting the settings, run the Job again and look at the contents of the resulting CSV file. Since these are environment variables, the values are treated as strings even when you enter numbers.
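
If the Notebook needs the values as numbers, it has to cast the strings back explicitly. A minimal sketch (the variable names and fallback defaults here are just for illustration):

import os

# Environment variables always arrive as strings, so cast them where numbers are needed
myenv1_int = int(os.getenv('MYENV1', default='0'))        # e.g. "1" -> 1
myenv3_float = float(os.getenv('MYENV3', default='0.0'))  # e.g. "10.5" -> 10.5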

Run Job with API

Use Python requests to invoke the created Job via the API. Execute the following code from a Python environment outside CP4D.

Authentication

To get a token, perform basic authentication with a user name and password and obtain an accessToken. For authentication, the [CP4D v2.5 product manual has an example using curl](https://www.ibm.com/support/knowledgecenter/ja/SSQNUZ_2.5.0/wsj/analyze-data/ml-authentication-local.html).

url = "https://cp4d.hostname.com"
uid = "username"
pw = "password"

import requests

#Authentication
response = requests.get(url+"/v1/preauth/validateAuth", auth=(uid,pw), verify=False).json()
token = response['accessToken']

The verify=False option in requests skips certificate verification, a workaround for when CP4D uses a self-signed certificate.
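
As an aside, if you would rather keep certificate verification, requests also accepts a path to a CA bundle via verify, and the InsecureRequestWarning triggered by verify=False can be silenced. A small sketch (the CA bundle path is a placeholder):

import urllib3

# Silence the InsecureRequestWarning emitted on every call made with verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Alternative: keep verification by pointing requests at the cluster's CA certificate
# response = requests.get(url+"/v1/preauth/validateAuth", auth=(uid, pw), verify="/path/to/ca.pem").json()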

Get Job List

Get the Job list of the analysis project. As preparation, find the ID of the analysis project to be used on CP4D in advance: display the environment variable PROJECT_ID in a Notebook inside the analysis project and note its value.

Checking the project ID (run in a Notebook on CP4D)


import os
os.environ['PROJECT_ID']
# -output-
# 'f3110316-687e-450a-8f17-57296c907973'

Set the project ID found above and get the Job list via the API. The API used is the Watson Data API; the reference is Jobs / Get list of jobs under a project.

project_id = 'f3110316-687e-450a-8f17-57296c907973'
headers = {
    'Authorization': 'Bearer ' + token,
    'Content-Type': 'application/json'
}

# Job list
response = requests.get(url+"/v2/jobs?project_id="+project_id, headers=headers, verify=False).json()
response
# -output-
#{'total_rows': 1,
# 'results': [{'metadata': {'name': 'job_env_test',
#    'description': '',
#    'asset_id': 'b05d1214-d684-4bd8-b1fa-cc05a8ccee81',
#    'owner_id': '1000331001',
#    'version': 0},
#   'entity': {'job': {'asset_ref': '6e0b450e-2f9e-4605-88bf-d8a5e2bda4a3',
#     'asset_ref_type': 'notebook',
#     'configuration': {'env_id': 'jupconda36-f3110316-687e-450a-8f17-57296c907973',
#      'env_type': 'notebook',
#      'env_variables': ['MYENV1=1', 'MYENV2=hoge', 'MYENV3=10.5']},
#     'last_run_initiator': '1000331001',
#     'last_run_time': '2020-05-31T22:20:18Z',
#     'last_run_status': 'Completed',
#     'last_run_status_timestamp': 1590963640135,
#     'schedule': '',
#     'last_run_id': 'ebd1c2f1-f7e7-40cc-bb45-5e12f4635a14'}}}]}

The above asset_id is the ID of Job "job_env_test". Store it in a variable.

job_id = "b05d1214-d684-4bd8-b1fa-cc05a8ccee81"

Run Job

Execute the above Job via the API. The API reference is Job Runs / Start a run for a job. At run time you need to POST a JSON body containing a job_run value, and this is where the runtime environment variables go.

jobrunpost = {
  "job_run": {
      "configuration" : {
          "env_variables" :  ["MYENV1=100","MYENV2=runbyapi","MYENV3=100.0"] 
      }
  }
}

POST the above job_run as JSON to start the Job. The run ID is stored in the 'asset_id' of the response's 'metadata'.

# Run job
response = requests.post(url+"/v2/jobs/"+job_id+"/runs?project_id="+project_id, headers=headers, json=jobrunpost, verify=False).json()

# Job run id
job_run_id = response['metadata']['asset_id']
job_run_id
# -output-
# 'cedec57a-f9a7-45e9-9412-d7b87a04036a'

Check Job execution status

After starting the run, check its status. The API reference is Job Runs / Get a specific run of a job.

# Job run status
response = requests.get(url+"/v2/jobs/"+job_id+"/runs/"+job_run_id+"?project_id="+project_id, headers=headers, verify=False).json()
response['entity']['job_run']['state']
# -output-
# 'Starting'

If you call this requests.get several times, the result changes from 'Starting' to 'Running' to 'Completed'. When it reaches 'Completed', the run has finished.
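
If you want to wait for completion programmatically rather than re-running the cell by hand, you could poll the same endpoint in a loop; a minimal sketch (the 10-second interval is arbitrary):

import time

# Poll the run status until it leaves the 'Starting' / 'Running' states
while True:
    response = requests.get(url+"/v2/jobs/"+job_id+"/runs/"+job_run_id+"?project_id="+project_id,
                            headers=headers, verify=False).json()
    state = response['entity']['job_run']['state']
    print(state)
    if state not in ('Starting', 'Running'):
        break
    time.sleep(10)  # polling interval is arbitrary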

Execution result

Return to the CP4D screen and check the contents of the CSV file generated in the data assets of the analysis project.

It was confirmed that the environment variables specified in job_run are properly stored in the result data.

(Bonus) Double-byte characters could also be used in the values of the job_run environment variables.

job_run containing double-byte characters


jobrunpost = {
  "job_run": {
      "configuration" : {
          "env_variables" :  ["MYENV1=AIUEO","MYENV2=a-I-U-E-O","MYENV3=Aio"] 
      }
  }
}

The execution result again showed the specified values stored in the output CSV.

After that, you can process the environment variable values (strings) received in the Job's Notebook however you like.

(Reference material) https://github.ibm.com/GREGORM/CPDv3DeployML/blob/master/NotebookJob.ipynb This repository contains useful Notebook samples that can be used with CP4D.
