BigQuery integration for Python users

Collaboration between Python and BigQuery

The combination of Python and BigQuery is a good match for data analysis.

Python is not suitable for handling data that is too huge, but if you let BigQuery do only that part and cut out the data into small pieces, you can do whatever you want with Python.

The question is how to integrate Python and BigQuery, but there are two main ways to do this.

  1. Use Python's BigQuery integration library
  2. Use Google Cloud Datalab

2 is especially recommended for those who are accustomed to using Jupyter Notebook.

1. Use Python's BigQuery integration library

Recommended library

There are several libraries for hitting BigQuery from Python. For example, BigQuery-Python, bigquery_py.

However, in fact __the easiest and recommended __ is the one module of Pandas.io pandas.io.gbq .gbq.read_gbq.html). It works well with DataFrame objects and is very easy to authenticate, so it's great that you can use it without worrying about difficult things.

How to authenticate pandas.io.gbq

pandas.io.All you need to use gbq is your BigQuery project ID.


 The notation is as follows.
 In this example, the result of ``` SELECT * FROM tablename``` is stored in` `data_frame```, and then it can be used as a normal DF object.
   

```python
import pandas as pd
query = 'SELECT * FROM tablename'

#str-based queries and projects_id required
data_frame = pd.read_gbq(query, 'project-id')

When executed it will return a brief statistic of the query process (If you want to hide the statistics, specify verbose = False as an argument)

Execution example image

Running pd.read_gbq will open the Google Account authentication screen in your browser. If you log in with an account that can access the project, the linked authentication will be completed and the process will start.

At this time, the json format credential file is spit out to the working folder. As long as you have this file, you can hit the query many times without re-authentication. Conversely, if this file is handed over, BigQuery will be used without permission and you will be bill shocked, so be careful about managing the file. http://qiita.com/itkr/items/745d54c781badc148bb9

Writing from Python to BigQuery side is also possible

You can also easily write a Python DataFrame object as a table on BigQuery. Library official documentation

from pandas.io import gbq
gbq.to_gbq(df, 'Table to write to', 'Project ID')

With this alone, you can return the DF object used in Python to BigQuery.

When you can do so far, for example

(1) Fetch user feature data from BigQuery (read)_bgq)
↓
② Connect and process data with Python(pandas)
↓
③ Label users with a machine learning model(scikielearn)
↓
④ Return the labeled result to BigQuery(gbq.to_gbq)

You will be able to do things like this relatively easily. Hmmm wonderful Pandas is really convenient

2. Use Google Cloud Datalab

image

What is Cloud Datalab?

Datalab is an interactive cloud analysis environment based on jupyter notebook (formerly iPython-Notebook) built on Google Compute Engine. The operation is performed by writing the code in an interface called "Notebook" that can be viewed and described with a browser.

Roughly, it has the following advantages.

--Interactive operation -You can keep queries, results, and charts in Notebook format. --Coexistence of Markdown format sentences and code --Inline display of code execution results (including charts) --Seamless integration with Python (both Python and SQL can be used in the same console) --Because the environment is automatically built on Google Compute Engine --Do not overwhelm the local environment --Easy to share queries, etc.

I don't think it's easy to read the text, so it may be the fastest to watch the video or image.

[Video] https://www.youtube.com/watch?v=RzIjz5HQIx4

[Image of usage example] image

The characters on the first line are Markdown,
The first cell is written in SQL and the second cell is written in Python
Graphs etc. are also displayed inline
(Of course, not only graphs but also tables can be displayed)

How to get started

Beta version (?), So you can't turn on features directly from the GCP console Select a project from this page and deploy it to use the function

In addition, it seems that only the owner with owner authority or editing authority can turn on the function Once deployed to the project, it will be available to all members of the project

Cloud Datalab is deployed as a Google App Engine application module in the selected project. The Google Compute Engine and Google BigQuery APIs must be enabled for the project, and you must be authorized to use the project as an owner or editor.

Start screen image image

What happens when you start

environment

An instance for Datalab is launched on Google Compute Engine, and the Datalab environment is built on it. The Datalab interface can be operated from a browser. (Of course, SSH connection is possible to this environment as well)

Notebook (SQL and Python code) written on the browser will be saved on this instance (= everyone can see it)

The GCP console doesn't have an entry point to turn on the Datalab feature, but if you're using Datalab, you'll see "Datalab" in your instance list.

Fee

There is a charge for GCE instances (~ several thousand yen? Depending on the instance specifications) Of course, you will also be charged for hitting BigQuery.

Datalab official

You only pay for the resources you use to run Cloud Datalab, as follows:

Compute Resources Cloud Datalab uses Google App Engine and Google Compute Engine resources to run within your project. When you have Cloud Datalab instances deployed within your project, you incur compute charges —the charge for one VM per Cloud Datalab instance

Google BigQuery -You incur BigQuery charges when issuing SQL queries within Cloud Datalab

Other Resources —You incur charges for other API requests you make within the Cloud Datalab environment.

Reference-Information on other companies that seem to be used (domestic)

Voyage Group http://www.slideshare.net/hagino_3000/cloud-datalabbigquery Vasily http://tech.vasily.jp/entry/cloud-datalab Wonderplanet http://wonderpla.net/blog/engineer/Try_GoogleCloudDatalab/

Recommended Posts

BigQuery integration for Python users
2016-10-30 else for Python3> for:
python [for myself]
[python] Get Twitter timeline for multiple users
EC2 (Python3)-> BigQuery
R code compatible sheet for Python users
About Python for loops
Vue-Cli and Python integration
About Python, for ~ (range)
python textbook for beginners
Refactoring tools for Python
python for android Toolchain
Use BigQuery from python.
python tag integration test
OpenCV for Python beginners
Install Python (for Windows)
[Python] for statement error
Python environment for projects
[BigQuery] How to use BigQuery API for Python -Table creation-
Connect to BigQuery with Python
Python memo (for myself): Array
About Fabric's support for Python 3
Python list, for statement, dictionary
Python for Data Analysis Chapter 4
Learning flow for Python beginners
Python 3.6 installation procedure [for Windows]
Python learning plan for AI learning
Set Up for Mac (Python)
Search for strings in Python
Python Tkinter notes (for myself)
OpenCV3 installation for Python3 @macOS
Python code memo for yourself
[Python] xmp tag for photos
Python environment construction For Mac
Techniques for sorting in Python
pp4 (python power for anything)
Python3 environment construction (for beginners)
Roadmap for publishing Python packages
Python 3 series installation for mac
Python #function 2 for super beginners
Python template for Codeforces-manual test-
Basic Python grammar for beginners
3 months note for starting Python
Qt for Python app self-update
Python for Data Analysis Chapter 2
100 Pandas knocks for Python beginners
Checkio's recommendation for learning Python
Keyword arguments for Python functions
[For organizing] Python development environment
[Python] Sample code for Python grammar
Python for super beginners Python #functions 1
[Python / PyQ] 4. list, for statement
Simple HTTP Server for python
[Python + Selenium] Tips for scraping
Python #list for super beginners
~ Tips for beginners to Python ③ ~
Extract only Python for preprocessing
Indentation formatting for python scripts
Introduction to Python For, While
About "for _ in range ():" in python
tesseract-OCR for Python [Japanese version]