This is the Day 9 article of the mixi Group Advent Calendar 2019.
(3-line summary)
- Amazon SageMaker provides CloudWatch Metrics-based charts for monitoring training job metrics (https://aws.amazon.com/jp/blogs/news/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/), and they now also appear on the job details page in the management console. They are easy to set up, but personally I find them hard to use for monitoring algorithm metrics, in terms of log granularity (output frequency), scale, and unit notation.
- Using the SageMaker SDK's TrainingJobAnalytics (https://sagemaker.readthedocs.io/en/stable/analytics.html#sagemaker.analytics.TrainingJobAnalytics), you can fetch the data and control the drawing yourself. The data source is still CloudWatch Logs (so the root issue is not resolved), but **readability can be significantly improved**.
- You can draw the charts in a Jupyter Notebook during training, or draw them in the Estimator caller's code periodically or at the end of training and save them somewhere.
analytics.py
import sagemaker

metric_names = ['train:loss', 'validation:loss']
metrics_dataframe = sagemaker.analytics.TrainingJobAnalytics(
    training_job_name=training_job_name,
    metric_names=metric_names,
    period=60,  # 60 seconds (1 minute) is the minimum period
).dataframe()

# Format the dataframe for plotting
...

ax = metrics_dataframe_fixed.plot(
    kind='line',
    figsize=(20, 15),
    fontsize=18,
    x='timestamp',
    y=[metric_names[0], metric_names[1]],
    xlim=[0, 2000],
    ylim=[0.1, 0.5],
    style=['b.-', 'r+-'],
    rot=45,
)
ax.figure.savefig('metrics_training_job_xxx.png')
ax.figure.clf()
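The formatting step is elided above; as one guess at what it could look like (an assumption of mine, not the article's code): `TrainingJobAnalytics.dataframe()` returns a long-format table with `timestamp`, `metric_name`, and `value` columns, so it can be pivoted into one column per metric before being handed to the plot call above.

```python
# Pivot the long-format dataframe (timestamp, metric_name, value)
# into one column per metric so it matches the plot() call above
metrics_dataframe_fixed = metrics_dataframe.pivot_table(
    index='timestamp', columns='metric_name', values='value'
).reset_index()
```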
This method can also be used with SageMaker built-in algorithms.
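As a side note of my own (not something the article sets up): built-in algorithms publish metrics such as `train:loss` automatically, but when training with a framework container or your own container, TrainingJobAnalytics can only pick up metrics that you have registered on the Estimator via `metric_definitions`. A minimal sketch, assuming SDK v2 argument names and hypothetical log-parsing regexes:

```python
from sagemaker.estimator import Estimator

# Hypothetical regexes: they must match the loss lines your training script actually prints
metric_definitions = [
    {'Name': 'train:loss', 'Regex': 'train_loss=([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss=([0-9\\.]+)'},
]
estimator = Estimator(
    image_uri=image_uri,              # your training image (assumed defined elsewhere)
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    metric_definitions=metric_definitions,
)
```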
SageMaker officially offers four ways of using it, but when you train with an Amazon-provided ML framework container or with [your own container](https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/your-algorithms.html), you can periodically graph the training status from your own program code and send the result to S3.
It feels a bit wasteful to use SageMaker without using the monitoring features it provides, but if I can't get output in the format I want, I have to handle it inside the container myself (since I already write the entry point script and my own ML algorithm, the extra effort of adding graph drawing to code I know well is not that big).
"Where to send the graph drawn in the container and how to share the graph placement destination (S3 path) inside and outside the container" is surprisingly difficult, but the following method can be used as an example.
- Put a `conditions` prefix in the same location as the model output, and place a JSON file with information about the model (training job) there.
- Put a `metrics` prefix in the same place, and use it as the location for metrics data and drawn graphs.
- Pass the `conditions` path to the Estimator as `inputs` and start training.
- Inside the container, read `conditions` and assemble the model output destination path and the `metrics` path from it.

train_task.py
import json
import boto3

# Generate conditions and record the training_job_name in it
dict_conditions = {"training_job_name": training_job_name}
conditions_key = 'model/{}/conditions/training_job_config.json'.format(training_job_name)
boto3.resource('s3').Object(bucket, conditions_key).put(Body=json.dumps(dict_conditions))

# Hand conditions over to the SageMaker training job as an input channel
s3_conditions_path = 's3://{}/{}'.format(bucket, conditions_key)
estimator.fit(
    job_name=training_job_name,
    inputs={'train_data': s3_train_data_path, 'conditions': s3_conditions_path},
)
train_entrypoint.py
import json
import boto3
import matplotlib.pyplot as plt

# Get the training job name from the conditions passed by the Estimator.fit caller
# (SageMaker places each inputs channel under the path matching the channel's dict key)
input_conditions = '/opt/ml/input/data/conditions/training_job_config.json'
with open(input_conditions) as f:
    conditions = json.load(f)
training_job_name = conditions['training_job_name']

# Define the graph paths
graph_name = 'training_history_{}.png'.format(metrics)
graph_outpath = '{}/{}'.format(output_path, graph_name)
s3_graph_outpath = 'model/{}/metrics/{}'.format(training_job_name, graph_name)

# Draw and save the graph (Keras example)
history = model.fit(...)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.legend(['training', 'validation'], loc='upper right')
plt.savefig(graph_outpath)
plt.clf()

# Send the graph to the S3 bucket where the training job results are stored (overwrite on update)
boto3.resource('s3').Bucket(bucket).upload_file(graph_outpath, s3_graph_outpath)
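The entry point above draws the graph once after `model.fit` has finished. If you want the "graph periodically during training" behavior mentioned earlier, one option is a Keras callback that redraws and re-uploads the image at the end of every epoch. This is only a sketch of mine (the `GraphUploadCallback` name and the `tf.keras` import are assumptions, not from the article):

```python
import boto3
import matplotlib.pyplot as plt
from tensorflow import keras  # assuming tf.keras; standalone Keras works the same way

class GraphUploadCallback(keras.callbacks.Callback):
    """Redraw the loss graph and re-upload it to S3 at the end of every epoch."""
    def __init__(self, bucket, graph_outpath, s3_graph_outpath):
        super().__init__()
        self.bucket = bucket
        self.graph_outpath = graph_outpath
        self.s3_graph_outpath = s3_graph_outpath
        self.history = {'loss': [], 'val_loss': []}

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.history['loss'].append(logs.get('loss'))
        self.history['val_loss'].append(logs.get('val_loss'))
        plt.plot(self.history['loss'])
        plt.plot(self.history['val_loss'])
        plt.title('Model loss')
        plt.legend(['training', 'validation'], loc='upper right')
        plt.savefig(self.graph_outpath)
        plt.clf()
        boto3.resource('s3').Bucket(self.bucket).upload_file(
            self.graph_outpath, self.s3_graph_outpath)

# model.fit(..., callbacks=[GraphUploadCallback(bucket, graph_outpath, s3_graph_outpath)])
```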
As shown in the code, the part that calls the SageMaker Estimator passes training_job_name via the JSON file, and from that shared information the metric output destination for each training job follows the agreed format (s3://{bucket}/model/{training_job_name}/metrics/{graph_name}.png).
With measure 2, since you can write the code freely, you can also output logs for TensorBoard, sync them to a specified S3 bucket, and view them by pointing a TensorBoard launched on a notebook instance (or elsewhere) at the logs on S3.
train_entrypoint_keras.py
import os
import boto3

tensorboard_log_outpath = '{}/{}'.format(output_path, tensorboard_log_name)
tensorboard_callback = keras.callbacks.TensorBoard(
    log_dir=tensorboard_log_outpath,  # TensorBoard writes a directory of event files here
    histogram_freq=1)
callbacks = [tensorboard_callback]
model.fit(..., callbacks=callbacks)
# TensorBoard logs are a directory, so upload the event files one by one
for root, _, files in os.walk(tensorboard_log_outpath):
    for name in files:
        local_path = os.path.join(root, name)
        relative = os.path.relpath(local_path, tensorboard_log_outpath)
        boto3.resource('s3').Bucket(bucket).upload_file(
            local_path, '{}/{}'.format(s3_tensorboard_log_outpath, relative))
notebook.py
tensorboard --logdir={s3_tensorboard_log_outpath}
You could also draw with other tools of your choice; I think it is best to choose a well-balanced approach with the management cost in mind.
As I mentioned several times along the way, the options available differ depending on how you use SageMaker.
For both measures 1 and 2, I think management is easier if you **upload the metric data (DataFrames and logs) and the drawn graph images to the same S3 location as the model storage area**.
I also want to organize the metrics being compared under the same definitions, so that they can be judged at a glance.
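Since every job writes its graphs under the same layout, comparing runs can be as simple as listing each job's metrics prefix. A minimal sketch (the `list_metric_graphs` helper and the job names are hypothetical, assuming the s3://{bucket}/model/{training_job_name}/metrics/ convention above):

```python
import boto3

def list_metric_graphs(bucket, training_job_name):
    """List the graph images uploaded for one training job under the shared layout."""
    prefix = 'model/{}/metrics/'.format(training_job_name)
    objects = boto3.resource('s3').Bucket(bucket).objects.filter(Prefix=prefix)
    return ['s3://{}/{}'.format(bucket, obj.key) for obj in objects]

# Compare several runs whose graphs were drawn with the same metric definitions
for job in ['training-job-a', 'training-job-b']:  # hypothetical job names
    print(job, list_metric_graphs(bucket, job))
```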