[GCP] Steps to deploy DataFlow on Cloud Shell (using Python)

Introduction

I couldn't find a clear explanation of how to deploy Dataflow, nor a sample program simple enough to just run, so I'm summarizing the steps here as a memorandum.

Procedure

1. Install apache_beam

Execute the following command in Cloud Shell.

sudo pip3 install apache_beam[gcp]
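
To confirm that the installation succeeded before moving on, you can print the installed Beam version (a quick sanity check):

python3 -c "import apache_beam as beam; print(beam.__version__)"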

Do not install it the following way; an error will occur in beam.io.ReadFromText.

sudo pip install apache_beam

To install apache_beam in a virtual environment instead, the steps are as follows.

# Create a working folder
mkdir python2
cd python2

# Create a virtual environment
python -m virtualenv env

# Activate it
source env/bin/activate

# Install apache-beam
pip install apache-beam[gcp]

2. Program creation

What I created this time is as simple as it gets: it just reads the file read.txt directly under the specified bucket and writes its contents to the file write.txt.

If you want to try it yourself, fill in appropriate values for PROJECTID, JOB_NAME, and BUCKET_NAME.

gcs_readwrite.py


# coding:utf-8
import apache_beam as beam

# Specify the job name, project ID, and bucket name
PROJECTID = '<PROJECTID>'
JOB_NAME = '<JOB_NAME>'  # Enter the Dataflow job name
BUCKET_NAME = '<BUCKET_NAME>'

# Set the job name, project ID, and temporary file locations
options = beam.options.pipeline_options.PipelineOptions()
gcloud_options = options.view_as(
    beam.options.pipeline_options.GoogleCloudOptions)
gcloud_options.job_name = JOB_NAME
gcloud_options.project = PROJECTID
gcloud_options.staging_location = 'gs://{}/staging'.format(BUCKET_NAME)
gcloud_options.temp_location = 'gs://{}/tmp'.format(BUCKET_NAME)

# Specify the maximum number of workers, machine type, etc. (all optional)
worker_options = options.view_as(beam.options.pipeline_options.WorkerOptions)
# worker_options.disk_size_gb = 100
# worker_options.max_num_workers = 2
# worker_options.num_workers = 2
# worker_options.machine_type = 'n1-standard-8'
# worker_options.zone = 'asia-northeast1-a'

# Switching the execution environment
# options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DirectRunner'  # Run on the local machine
options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DataflowRunner'  # Run on Dataflow

# Build and run the pipeline
p = beam.Pipeline(options=options)

(p | 'read' >> beam.io.ReadFromText('gs://{}/read.txt'.format(BUCKET_NAME))
    | 'write' >> beam.io.WriteToText('gs://{}/write.txt'.format(BUCKET_NAME))
 )
p.run().wait_until_finish()

3. GCS preparation

  1. Create a bucket with the name specified by BUCKET_NAME in the program above (or use the gsutil commands sketched below).
  2. Create folders called staging and tmp directly under the created bucket.
  3. Create a file called read.txt locally; any content is fine.
  4. Upload read.txt directly under the created bucket.
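
If you prefer to do the preparation from Cloud Shell instead of the console, roughly the same steps can be done with gsutil. This is a minimal sketch; the location asia-northeast1 and the file contents are just examples, and <BUCKET_NAME> must be replaced with your own globally unique bucket name.

# Create the bucket
gsutil mb -l asia-northeast1 gs://<BUCKET_NAME>

# Create read.txt locally with any content and upload it to the bucket root
echo "hello dataflow" > read.txt
gsutil cp read.txt gs://<BUCKET_NAME>/read.txt

GCS has a flat namespace, so the staging and tmp "folders" are really just object prefixes; Dataflow will write objects under them when the job runs, so creating them in the console as in step 2 is mainly for tidiness.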

4. Run locally

First, in the "Switching the execution environment" part of the program above, swap which line is commented out so that DirectRunner is the active one:

options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DirectRunner'  # Run on the local machine
# options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DataflowRunner'  # Run on Dataflow

Then execute the following command.

python gcs_readwrite.py

This will create a file called write.txt-00000-of-00001 in your bucket.
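
To check the result from Cloud Shell, you can print the output file with gsutil (assuming the default single output shard):

gsutil cat gs://<BUCKET_NAME>/write.txt-00000-of-00001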

5. Deploy

First, switch the comments back in the "Switching the execution environment" part so that DataflowRunner is the active one:

# options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DirectRunner'  # Run on the local machine
options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DataflowRunner'  # Run on Dataflow

Then execute the following command.

python gcs_readwrite.py

This will create a file called write.txt-00000-of-00001 in your bucket. If you open the job you created in the Dataflow GUI, you can see that the read and write steps are shown as "completed".
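
You can also check the job from Cloud Shell rather than the GUI, for example by listing the currently running jobs with gcloud (the exact output depends on your project and region):

gcloud dataflow jobs list --status=active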


Bonus (how to create a custom template)

To create a custom template, just add a line like the one below to the program and run it. You can choose the save destination and template name freely.

gcloud_options.template_location = 'gs://{}/template/template_name'.format(BUCKET_NAME)

Using the custom template is just a matter of: Create Job from Template -> Select Template -> Custom Template -> specify the GCS path of the template.
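
Running the program with template_location set stages the template to GCS instead of launching a job. Once it is staged, besides the console flow above, you can also launch it from Cloud Shell with gcloud; a minimal sketch, where the job name and region are placeholders to adjust:

gcloud dataflow jobs run my-template-job \
    --gcs-location gs://<BUCKET_NAME>/template/template_name \
    --region asia-northeast1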

References

Quick start using Python https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python?hl=ja

Specify the execution parameters of the pipeline https://cloud.google.com/dataflow/docs/guides/specifying-exec-params

Cloud Dataflow Super Primer https://qiita.com/hayatoy/items/987658490a69c7d24635
