I checked the Python packages pre-installed in Google Cloud Dataflow

Google Cloud Dataflow doesn't get much attention, but it's quite convenient because you can easily switch the execution environment between local and remote. And in case you assumed only the standard library is available, you can install packages from a pip list, and you can also [install your own](http://qiita.com/orfeon/items/78ff952052c4bde4bcd3). That made me wonder which libraries are pre-installed. A quick search of the documentation turned up nothing, so I checked for myself.

Preparation

First, set the pipeline options. This part is lifted almost wholesale from @orfeon's article.

Option settings


import apache_beam as beam
import apache_beam.transforms.window as window

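# Module path as of the SDK version used here (google-cloud-dataflow 0.5.5);
# later Apache Beam releases expose these classes under apache_beam.options.pipeline_options.
options = beam.utils.pipeline_options.PipelineOptions()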

google_cloud_options = options.view_as(beam.utils.pipeline_options.GoogleCloudOptions)
google_cloud_options.project = '{PROJECTID}'
google_cloud_options.job_name = 'test'
google_cloud_options.staging_location = 'gs://{BUCKET_NAME}/binaries'
google_cloud_options.temp_location = 'gs://{BUCKET_NAME}/temp'

worker_options = options.view_as(beam.utils.pipeline_options.WorkerOptions)
worker_options.max_num_workers = 1

# Switch between local and remote execution by swapping which line is commented out.
# options.view_as(beam.utils.pipeline_options.StandardOptions).runner = 'DirectRunner'
options.view_as(beam.utils.pipeline_options.StandardOptions).runner = 'DataflowRunner'

p = beam.Pipeline(options=options)
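
As a rough sketch of the local/remote switch, the same settings can also be written as command-line style flags. The helper name `build_dataflow_flags` and its parameters are made up for illustration; only the flag names correspond to the options set above.

```python
def build_dataflow_flags(project, bucket, job_name, local=False, max_num_workers=1):
    """Return command-line style flags mirroring the options set above.

    The function and parameter names are illustrative, not part of the SDK.
    """
    # The runner is the only setting that changes between local and remote runs.
    runner = 'DirectRunner' if local else 'DataflowRunner'
    return [
        '--runner={}'.format(runner),
        '--project={}'.format(project),
        '--job_name={}'.format(job_name),
        '--staging_location=gs://{}/binaries'.format(bucket),
        '--temp_location=gs://{}/temp'.format(bucket),
        '--max_num_workers={}'.format(max_num_workers),
    ]


# Placeholder project and bucket names, for illustration only.
flags = build_dataflow_flags('my-project', 'my-bucket', 'test', local=True)
print(flags)
```

A list like this should be usable as the `flags` argument of `PipelineOptions`, which keeps the local/remote decision in a single boolean.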

Run pip freeze to log the Python package list.

Package list output part


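def inspect_df(dat):
    # Import inside the function so the modules are available on the remote workers.
    import subprocess
    import logging
    # Run `pip freeze` on the worker; merge stderr into stdout so no pipe is left unread.
    process = subprocess.Popen('pip freeze', shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT)
    # Log each package line; this runs once per input element.
    for line in process.stdout:
        logging.info(line.rstrip())
    process.wait()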
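
If you would rather work with the result as data than read it from the log, the `pip freeze` lines can be parsed into a dictionary. The following is a minimal sketch; `parse_freeze` is a hypothetical helper, and it assumes the usual `name==version` line format.

```python
def parse_freeze(lines):
    """Turn `pip freeze` output lines into a {name: version} dict."""
    packages = {}
    for raw in lines:
        # Pipes from subprocess yield bytes on Python 3; decode them if needed.
        line = raw.decode('utf-8') if isinstance(raw, bytes) else raw
        line = line.strip()
        # Skip blanks, comments, and entries that are not pinned with '=='.
        if not line or line.startswith('#') or '==' not in line:
            continue
        name, version = line.split('==', 1)
        packages[name] = version
    return packages


# Sample input shaped like the worker output.
sample = [b'numpy==1.12.0\n', b'pandas==0.18.1\n', b'\n']
print(parse_freeze(sample))
```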

Run it on Dataflow. The 'hello' and 'world' inputs are probably unnecessary; they are only there to give the function something to run on.

Pipeline execution


(p | 'init' >> beam.Create(['hello', 'world'])
   | 'inspect' >> beam.Map(inspect_df))

p.run()

When the execution of the pipeline is completed, the package list will be output to the log, so check it in the Cloud Console.

Log check

According to the Dataflow documentation, you can check the logs from the Dataflow job details screen, but as of March 4, 2017, they seem to have moved to Stackdriver -> Logging.

[Screenshot: Logs_Viewer_-_Test_fx_lab.png] The log is output like this.

Package List

This is the list of packages written to the log above, **as of March 4, 2017**.

Package Version
avro 1.8.1
beautifulsoup4 4.5.1
bs4 0.0.1
crcmod 1.7
Cython 0.25.2
dataflow-worker 0.5.5
dill 0.2.5
enum34 1.1.6
funcsigs 1.0.2
futures 3.0.5
google-api-python-client 1.6.2
google-apitools 0.5.7
google-cloud-dataflow 0.5.5
google-python-cloud-debugger 1.9
googledatastore 6.4.1
grpcio 1.1.0
guppy 0.1.10
httplib2 0.9.2
mock 2.0.0
nltk 3.2.1
nose 1.3.7
numpy 1.12.0
oauth2client 2.2.0
pandas 0.18.1
pbr 1.10.0
Pillow 3.4.1
proto-google-datastore-v1 1.3.1
protobuf 3.0.0
protorpc 0.11.1
pyasn1 0.2.2
pyasn1-modules 0.0.8
python-dateutil 2.6.0
python-gflags 3.0.6
python-snappy 0.5
pytz 2016.10
PyYAML 3.11
requests 2.10.0
rsa 3.4.2
scikit-learn 0.17.1
scipy 0.17.1
six 1.10.0
tensorflow 1.0.0
tensorflow-transform 0.1.4
uritemplate 3.0.0

Perhaps because tf.transform has arrived, TensorFlow is 1.0.0 on Dataflow. On Cloud ML, ~~the TensorFlow version is 0.12~~ **(EDIT: the latest versions can be checked in the list [here](https://cloud.google.com/ml-engine/docs/concepts/runtime-version-))**. scikit-learn looks a bit old.
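
Since the point of switching between local and remote is that the same code runs in both places, it can help to see where a local environment differs from the workers. The sketch below compares a few versions copied from the list above against a local set; `diff_versions` and the local versions are hypothetical and only illustrate the idea.

```python
# A few versions copied from the package list above (as of March 4, 2017).
WORKER_VERSIONS = {
    'numpy': '1.12.0',
    'pandas': '0.18.1',
    'scikit-learn': '0.17.1',
    'scipy': '0.17.1',
    'tensorflow': '1.0.0',
}


def diff_versions(local, worker):
    """Return {name: (local_version, worker_version)} for every mismatch.

    A package missing locally is reported with None as its local version.
    """
    mismatches = {}
    for name, worker_version in worker.items():
        local_version = local.get(name)
        if local_version != worker_version:
            mismatches[name] = (local_version, worker_version)
    return mismatches


# Made-up local environment, for illustration only.
local_versions = {'numpy': '1.12.0', 'pandas': '0.19.2', 'scipy': '0.17.1'}
for name, (mine, theirs) in sorted(diff_versions(local_versions, WORKER_VERSIONS).items()):
    print('{}: local={} worker={}'.format(name, mine, theirs))
```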

Staging is a little slow, but Dataflow lets you switch easily between local and remote from a Jupyter Notebook, and as a fully managed service it starts up and shuts down instances on its own. It looks like a powerful tool for applications such as data analysis and machine learning.
