[PYTHON] The nice and regrettable parts of Cloud Datalab

This entry is a continuation of:

- Overview of Cloud Datalab

Cloud Datalab Basics

I wrote something similar there as well, but to restate it, Cloud Datalab is:

- An interactive analysis environment based on Jupyter
- An environment integrated with GCP
- A containerized package of Jupyter and Python libraries
- Containers that can easily be launched and torn down on GCE via the datalab command

Datalab assumptions

Datalab is designed to work closely with GCP projects.

By default, if nothing is specified, the following happens.

- A repository called datalab-notebooks is [created in the project's Cloud Source Repository](https://cloud.google.com/datalab/docs/how-to/datalab-team#use_the_automatically_created_git_repository_for_sharing_notebooks)
- A ${PROJECT_ID}.appspot.com/datalab_backups bucket is created on GCS and [backups are stored in it](https://cloud.google.com/datalab/docs/how-to/working-with-notebooks#cloud_datalab_backup)
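
As a quick check, you can list those backups from a notebook. This is a minimal sketch, assuming the default backup bucket described above already exists in your project; the datalab_backups prefix follows the documented layout.

import google.datalab.storage as storage
from google.datalab import Context

# List the automatic backups in the default backup bucket
# (assumes the ${PROJECT_ID}.appspot.com bucket described above exists)
project_id = Context.default().project_id
bucket = storage.Bucket(project_id + '.appspot.com')
for obj in bucket.objects(prefix='datalab_backups'):
    print(obj.key)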

Start-up

I'll try various things based on these assumptions. First of all, starting Datalab:

$ datalab create --disk-size-gb 10 --no-create-repository datalab-test

- Specify the disk size with --disk-size-gb. The default is 200 GB, so I specified 10 GB to keep it small.
- Skip repository creation with --no-create-repository. After I deleted just the repository, Datalab wouldn't start unless I added --no-create-repository... I wonder why. I'll look into it another time.

Working with BigQuery

Working with BigQuery from Datalab is very nice. As a slight aside, Jupyter has a feature called magic commands, which start with %%. Datalab provides magics for BigQuery and GCS as well.

Run a query as a magic command

This follows the sample, but you can see how nice it is to be able to write the query directly in a cell.

%%bq query
SELECT id, title, num_characters
FROM `publicdata.samples.wikipedia`
WHERE wp_namespace = 0
ORDER BY num_characters DESC
LIMIT 10

Run through google.datalab.bigquery

Since I'm querying BQ from a cell, I'd like to process the result as it is. As in [the sample](https://github.com/googledatalab/notebooks/blob/master/tutorials/BigQuery/SQL%20and%20Pandas%20DataFrames.ipynb), you can pass the query result to pandas as a DataFrame. Wonderful.

%%bq query -n requests
SELECT timestamp, latency, endpoint
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Popular' OR endpoint = 'Recent'

import google.datalab.bigquery as bq
import pandas as pd

# Execute the named query defined above and receive the result as a DataFrame
df = requests.execute(output_options=bq.QueryOutput.dataframe()).result()

Going through the API a bit more directly, it looks like this:

import google.datalab.bigquery as bq
import pandas as pd

# Query to issue
query = """SELECT timestamp, latency, endpoint
           FROM `cloud-datalab-samples.httplogs.logs_20140615`
           WHERE endpoint = 'Popular' OR endpoint = 'Recent'"""
# Create a query object
qobj = bq.Query(query)
# Get the query result as a pandas DataFrame
df2 = qobj.execute(output_options=bq.QueryOutput.dataframe()).result()
# Continue with normal pandas operations
df2.head()

If you think about it, the magic command is presumably built on top of this API. In fact, if you look at the pydatalab source, you can see that %%bq is defined as a magic command.
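
As an illustration, here is a toy sketch of how a cell magic along these lines could be defined on top of the API. This is not Datalab's actual implementation, and the name my_bq is hypothetical.

from IPython.core.magic import register_cell_magic
import google.datalab.bigquery as bq

@register_cell_magic
def my_bq(line, cell):
    # Treat the cell body as SQL and return the result as a DataFrame
    return bq.Query(cell).execute(
        output_options=bq.QueryOutput.dataframe()).result()

After running this once, you could write %%my_bq at the top of a cell and put the SQL below it, just like %%bq query.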

Working with GCS

As with BigQuery, you can manipulate objects on GCS from a cell, as shown in the sample. The point is that you can read and write files. Being able to use BigQuery results as a data source is helpful, but being able to handle data on GCS transparently as a data source is just as attractive.
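
For example, reading an object into a notebook looks roughly like this. A minimal sketch, assuming a bucket and object that already exist; the names are hypothetical placeholders.

import google.datalab.storage as storage

# Fetch an object from GCS (bucket and key are placeholders)
bucket = storage.Bucket('my-sample-bucket')
obj = bucket.object('logs/access.csv')

# read_stream() returns the object's contents as bytes
text = obj.read_stream().decode('utf-8')
print(text[:200])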

Working with Cloud ML

I was able to confirm that things basically work via the API, but there are still many behaviors I don't understand, so I'll skip it this time.

Change instance type

This is where the cloud really shines: unlike on-premises hardware, you can upgrade the specs whenever you need to. You can specify the instance type with the --machine-type option of datalab create. By default, an n1-standard-1 is started.

# Delete the instance with the delete command
# In this case, the attached disk is left as it is
$ datalab delete datalab-test

# Start with the same machine name but a different instance type
# The disk is created with the "<machine name>-pd" naming convention,
# so if the machine name is the same, the existing disk is attached automatically
$ datalab create --no-create-repository \
                 --machine-type n1-standard-4 \
                 datalab-test

Now you can raise or lower the specs of your machine as needed.

GPU analysis environment!

This was supposed to be the highlight.

With this!!! Just specify a GPU instance!!! And you can easily get a GPU machine learning environment!!!

Or so I thought, but the world isn't that easy... As of this writing, GPU instances are not supported by Datalab.

Summary

Datalab has some regrettable parts, such as the areas around the Cloud Source Repository and Cloud ML Engine, but I hold a faint hope that GPU instances will somehow be supported. These days I think this is an important part of building a data analysis environment, so next time I'd like to take a closer look at this area.
