I couldn't find a clear explanation of how to deploy Dataflow, nor a program that would simply run as-is, so I am summarizing what worked here as a memo.
Execute the following command in Cloud Shell.
sudo pip3 install apache_beam[gcp]
The following installation method does not work, because beam.io.ReadFromText will raise an error:
sudo pip install apache_beam
To install apache_beam inside a virtual environment instead, do the following.
#Create folder
mkdir python2
cd python2
#Create virtual environment
python -m virtualenv env
#Activate
source env/bin/activate
# apache-beam installation
pip install apache-beam[gcp]
This time I created something very simple: it just reads the file read.txt directly under the specified bucket and writes it out to the file write.txt. If you want to try it yourself, fill in appropriate values for PROJECTID, JOB_NAME, and BUCKET_NAME.
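For reference, the script below builds all of its GCS URIs from BUCKET_NAME with str.format. A minimal illustration of that pattern, using a hypothetical bucket name:

```python
# Hypothetical bucket name, for illustration only
BUCKET_NAME = 'my-example-bucket'

# The same pattern the script uses to build its GCS paths
staging = 'gs://{}/staging'.format(BUCKET_NAME)
read_path = 'gs://{}/read.txt'.format(BUCKET_NAME)

print(staging)    # gs://my-example-bucket/staging
print(read_path)  # gs://my-example-bucket/read.txt
```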
gcs_readwrite.py
# coding:utf-8
import apache_beam as beam
#Specify job name, project ID, bucket name
PROJECTID = '<PROJECTID>'
JOB_NAME = '<JOB_NAME>' #Enter the DataFlow job name
BUCKET_NAME = '<BUCKET_NAME>'
#Set job name, project ID, temporary file storage
options = beam.options.pipeline_options.PipelineOptions()
gcloud_options = options.view_as(
beam.options.pipeline_options.GoogleCloudOptions)
gcloud_options.job_name = JOB_NAME
gcloud_options.project = PROJECTID
gcloud_options.staging_location = 'gs://{}/staging'.format(BUCKET_NAME)
gcloud_options.temp_location = 'gs://{}/tmp'.format(BUCKET_NAME)
#Specify the maximum number of workers, machine type, etc.
worker_options = options.view_as(beam.options.pipeline_options.WorkerOptions)
# worker_options.disk_size_gb = 100
# worker_options.max_num_workers = 2
# worker_options.num_workers = 2
# worker_options.machine_type = 'n1-standard-8'
# worker_options.zone = 'asia-northeast1-a'
#Switching the execution environment
# options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DirectRunner' #Run on local machine
options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DataflowRunner' #Run on Dataflow
#pipeline
p = beam.Pipeline(options=options)
(p | 'read' >> beam.io.ReadFromText('gs://{}/read.txt'.format(BUCKET_NAME))
| 'write' >> beam.io.WriteToText('gs://{}/write.txt'.format(BUCKET_NAME))
)
p.run().wait_until_finish()
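As a mental model only (this is plain Python, not Beam): the pipeline above is roughly equivalent to reading every line from one file and writing the lines to another. Beam does the same work, but distributed across workers and against GCS instead of the local filesystem.

```python
import os
import tempfile

# Plain-Python analogy of the read -> write pipeline above (not Beam itself)
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'read.txt')
dst = os.path.join(workdir, 'write.txt')

with open(src, 'w') as f:
    f.write('hello\nworld\n')

# 'read' step: load the input file as a collection of lines
with open(src) as f:
    lines = f.read().splitlines()

# 'write' step: emit the lines to the output file
with open(dst, 'w') as f:
    f.write('\n'.join(lines) + '\n')
```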
Create staging and tmp folders directly under the bucket you set in BUCKET_NAME in the above program. Also create a read.txt file locally (any content is fine) and upload it directly under the bucket.

First, to run on your local machine, switch the comments in the "Switching the execution environment" part of the program as follows.
options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DirectRunner' #Run on local machine
# options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DataflowRunner' #Run on Dataflow
Then execute the following command.
python gcs_readwrite.py
This will create a file called write.txt-00000-of-00001 in your bucket.
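The -00000-of-00001 suffix comes from WriteToText's sharding: by default the output is split into N shard files named prefix-SSSSS-of-NNNNN, where SSSSS is the shard index and NNNNN is the shard count. A sketch of the naming scheme (the helper function here is mine, not part of Beam):

```python
def shard_file_name(prefix, shard_index, num_shards):
    # Mirrors the default shard name template: prefix-SSSSS-of-NNNNN
    return '{}-{:05d}-of-{:05d}'.format(prefix, shard_index, num_shards)

print(shard_file_name('write.txt', 0, 1))  # write.txt-00000-of-00001
```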
Next, to run on Dataflow, switch the comments back as follows in the "Switching the execution environment" part of the program.
# options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DirectRunner' #Run on local machine
options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DataflowRunner' #Run on Dataflow
Then execute the following command.
python gcs_readwrite.py
This will create a file called write.txt-00000-of-00001 in your bucket.
If you select the job you created in the Dataflow GUI, you will see that the read and write steps are "completed".
To create a custom template, simply add a line like the one below and run the program again. You can choose the save destination and template name freely.
gcloud_options.template_location = 'gs://{}/template/template_name'.format(BUCKET_NAME)
To use the custom template, in the Dataflow GUI just do: Create Job from Template -> select Custom Template -> specify the GCS path of the template.
Quick start using Python https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python?hl=ja
Specify the execution parameters of the pipeline https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
Cloud Dataflow Super Primer https://qiita.com/hayatoy/items/987658490a69c7d24635