This is a memo from when I was investigating the performance of a Dataflow pipeline. Cloud Monitoring, Cloud Profiler, and heap dumps can all be used with Dataflow.
Cloud Monitoring
You can monitor the JVM's memory usage with Cloud Monitoring.
It is disabled by default, so you enable it by passing --experiments=enable_stackdriver_agent_metrics when you start the pipeline (or when you build the template).
Let's try it with PubSubToText from the templates provided by Google (com/google/cloud/teleport/templates/PubsubToText.java). The second line from the bottom of the command is the part I added.
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToText \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=${PROJECT_ID} \
--stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
--tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
--runner=DataflowRunner \
--windowDuration=2m \
--numShards=1 \
--inputTopic=projects/${PROJECT_ID}/topics/windowed-files \
--outputDirectory=gs://${PROJECT_ID}/temp/ \
--outputFilenamePrefix=windowed-file \
--experiments=enable_stackdriver_agent_metrics \
--outputFilenameSuffix=.txt"
After starting the job, you can check the metrics in Cloud Monitoring. By default they are reported per instance, but you can also group them by job name.
I also tried it with WordCount, but nothing showed up in Monitoring... (I haven't investigated the cause. Perhaps the execution time was too short?)
Stackdriver Profiler
This is the method introduced by a Googler on Medium.
Stackdriver Profiler can apparently capture heap profiles in Java (https://cloud.google.com/profiler/docs/concepts-profiling?hl=ja#types_of_profiling_available), but I could not tell whether that works on Dataflow...
You need to enable the --profilingAgentConfiguration='{\"APICurated\": true}' option. It can be used at the same time as Cloud Monitoring (in the command below, it is specified together with the experiments flag).
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToText \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=${PROJECT_ID} \
--stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
--tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
--runner=DataflowRunner \
--windowDuration=2m \
--numShards=1 \
--inputTopic=projects/${PROJECT_ID}/topics/windowed-files \
--outputDirectory=gs://${PROJECT_ID}/temp/ \
--outputFilenamePrefix=windowed-file \
--experiments=enable_stackdriver_agent_metrics \
--profilingAgentConfiguration='{\"APICurated\": true}' \
--outputFilenameSuffix=.txt"
For how to read the graphs, refer to the Stackdriver Profiler quickstart and the flame graph documentation (https://cloud.google.com/profiler/docs/concepts-flame?hl=ja).
Memory dump
As introduced in the Google Cloud Platform Community tutorial (https://github.com/GoogleCloudPlatform/community/blob/master/tutorials/dataflow-debug-oom-conditions/index.md), you can also use an ordinary memory dump. Unlike the two methods above, you will be looking at each instance rather than the job as a whole.
How to take a dump
Three approaches are introduced there. This time I will try downloading a dump from a worker.
Follow the steps below:
There are no special options or precautions, so I will omit the steps for starting the pipeline.
Enter the Dataflow worker and make a heap dump.
Note that the GCE instance is labeled with the Dataflow job ID and job name, so you can identify the instance with it (example below).
gcloud compute instances list --filter="labels.dataflow_job_name=${job_name}"
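For example, you can capture the name of the first matching worker into the WORKER_NAME variable used in the SSH command below (a hypothetical helper; adjust the filter and the choice of worker to your job):
# Pick the first worker of the job and store its name for the SSH tunnel step
WORKER_NAME=$(gcloud compute instances list \
  --filter="labels.dataflow_job_name=${job_name}" \
  --format="value(name)" | head -n 1)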
Create an SSH tunnel:
gcloud compute ssh --project=$PROJECT --zone=$ZONE \
$WORKER_NAME --ssh-flag "-L 8081:127.0.0.1:8081"
Open http://127.0.0.1:8081/heapz in your browser. It takes quite a while (about 10 minutes with n1-standard-4).
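If you would rather save the dump to a file than open it in the browser, you can fetch it through the same tunnel, for example with curl. This assumes, as the step above implies, that the /heapz endpoint returns the dump itself; the output file name is arbitrary.
# Download the heap dump over the SSH tunnel (can take as long as the browser route)
curl -o worker-heap.hprof http://127.0.0.1:8081/heapz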
View the dump with your favorite tool, such as VisualVM. You can see the status of threads and objects.
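For instance, if VisualVM is installed, the saved file can be opened from the command line (the --openfile option and the worker-heap.hprof name from the curl example above are assumptions):
# Open the downloaded dump in VisualVM
visualvm --openfile worker-heap.hprof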
Various people have already written about how to read a heap dump, so I will leave that part to them (see, for example, Java Performance).
There are a few caveats about the download-from-the-worker approach that I tried this time: