This is a memo from when I was investigating the performance of a Dataflow pipeline. Cloud Monitoring, Cloud Profiler, and heap dumps can all be used with Dataflow.
Cloud Monitoring
You can monitor the JVM's memory usage with Cloud Monitoring.
It is disabled by default, so you enable it by passing --experiments=enable_stackdriver_agent_metrics when you start the pipeline (or when you build the template).
Let's try it with PubSubToText from the templates provided by Google (com/google/cloud/teleport/templates/PubsubToText.java). The second line from the bottom of the command is the part I added.
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToText \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=${PROJECT_ID} \
--stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
--tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
--runner=DataflowRunner \
--windowDuration=2m \
--numShards=1 \
--inputTopic=projects/${PROJECT_ID}/topics/windowed-files \
--outputDirectory=gs://${PROJECT_ID}/temp/ \
--outputFilenamePrefix=windowed-file \
--experiments=enable_stackdriver_agent_metrics \
--outputFilenameSuffix=.txt"
After starting the job, you can check the metrics in Cloud Monitoring. By default they are reported per instance, but you can also group them by job name.
I also tried it with WordCount, but nothing showed up in Monitoring... (I haven't investigated the cause. Perhaps the execution time was too short?)
Stackdriver Profiler
This is the method introduced by a Googler on Medium.
Stackdriver Profiler can apparently capture heap profiles in Java (https://cloud.google.com/profiler/docs/concepts-profiling?hl=ja#types_of_profiling_available), but I could not tell whether that works on Dataflow...
You need to enable the --profilingAgentConfiguration='{\"APICurated\": true}' option. It can be used at the same time as Cloud Monitoring (in the command below, it is specified together with the experiments flag).
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToText \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=${PROJECT_ID} \
--stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
--tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
--runner=DataflowRunner \
--windowDuration=2m \
--numShards=1 \
--inputTopic=projects/${PROJECT_ID}/topics/windowed-files \
--outputDirectory=gs://${PROJECT_ID}/temp/ \
--outputFilenamePrefix=windowed-file \
--experiments=enable_stackdriver_agent_metrics \
--profilingAgentConfiguration='{\"APICurated\": true}' \
--outputFilenameSuffix=.txt"
For how to read the graphs, refer to the Stackdriver Profiler quickstart and the flame graph documentation (https://cloud.google.com/profiler/docs/concepts-flame?hl=ja).
Memory dump
As introduced in the Google Cloud Platform Community tutorial (https://github.com/GoogleCloudPlatform/community/blob/master/tutorials/dataflow-debug-oom-conditions/index.md), you can also use an ordinary memory dump. Unlike the two methods above, you will be looking at each instance rather than the job as a whole.
How to take a dump
Three approaches are introduced there. This time I will try downloading a dump from a worker.
Follow the steps below:
There are no special options or precautions, so I will omit the steps for starting the pipeline.
Enter the Dataflow worker and make a heap dump.
Note that the GCE instance is labeled with the Dataflow job ID and job name, so you can identify the instance with it (example below).
gcloud compute instances list --filter="labels.dataflow_job_name=${job_name}"
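For example, you can capture the name of the first matching worker into the WORKER_NAME variable used in the SSH command below (a hypothetical helper; adjust the filter and the choice of worker to your job):
# Pick the first worker of the job and store its name for the SSH tunnel step
WORKER_NAME=$(gcloud compute instances list \
  --filter="labels.dataflow_job_name=${job_name}" \
  --format="value(name)" | head -n 1)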
Create an SSH tunnel:
gcloud compute ssh --project=$PROJECT --zone=$ZONE \
$WORKER_NAME --ssh-flag "-L 8081:127.0.0.1:8081"
Open http://127.0.0.1:8081/heapz in your browser. It takes quite a while (about 10 minutes with n1-standard-4).
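If you would rather save the dump to a file than open it in the browser, you can fetch it through the same tunnel, for example with curl. This assumes, as the step above implies, that the /heapz endpoint returns the dump itself; the output file name is arbitrary.
# Download the heap dump over the SSH tunnel (can take as long as the browser route)
curl -o worker-heap.hprof http://127.0.0.1:8081/heapz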
View the dump with your favorite tool, such as VisualVM. You can see the status of threads and objects.
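For instance, if VisualVM is installed, the saved file can be opened from the command line (the --openfile option and the worker-heap.hprof name from the curl example above are assumptions):
# Open the downloaded dump in VisualVM
visualvm --openfile worker-heap.hprof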
Various people have already written about how to read a heap dump, so I will leave that part to them (see, for example, Java Performance).
There are a few caveats about the download-from-the-worker approach that I tried this time: