Try a speed comparison of the BigQuery Storage API

When doing ML-related work and analysis, it is common to download data from BigQuery directly to the local environment and process it there. Lately I mostly use Colaboratory for analysis and visualization, so I usually download data with the magic command since it is quick to use, but as the target data grows larger, the download takes considerably longer.

So, I will check just how fast the **BigQuery Storage API**, introduced in "New features announced at Google Cloud Next '19! (Cloud Run, BigQuery Storage API, Cloud Data Fusion)", actually is.

For details on the BigQuery Storage API itself, please refer to that article.

Installation

To take advantage of the BigQuery Storage API, install BigQuery Client Library version 1.9.0 or later and the BigQuery Storage API Client Library.

pip install --upgrade google-cloud-bigquery[bqstorage,pandas]
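As a quick sanity check (my addition, not in the original post), you can confirm that both libraries import and that the BigQuery client library meets the version requirement:

from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1  # fails if the Storage API client is missing

# The BigQuery client library must be 1.9.0 or later for Storage API support.
print(bigquery.__version__)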

Speed comparison

The comparison measures the time **from issuing a query until the BigQuery data is obtained as a pandas.DataFrame**.

In data analysis tasks, data is usually pulled from a BQ table with a query; it is rare to fetch the BQ table itself. The BigQuery Storage API lets you apply simple row filters, but complex filtering still has to go through a query, so fetching data directly with the BigQuery Storage API client library is excluded from this comparison.

The data to retrieve is the Wikipedia page-view log for 0:00 on January 1, 2019, taken from the BigQuery public datasets.

bigquery-public-data.wikipedia.pageviews_2019

SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2019` 
WHERE datehour = '2019-01-01'

It has only 4 columns, but even so it contains 5,287,235 rows, which is fairly hefty.

(Data on the order of millions of rows is quite common in analysis tasks, so it should make a good test case.)

Comparison

The execution environment is Colaboratory.

Magic command

Data retrieval using the magic command looks like this:

%%bigquery --project <project_id> df_temp

SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2019` 
where datehour = '2019-01-01'

Running the above in a Colaboratory cell stores the query result as a pandas.DataFrame in df_temp.
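As a quick check (my addition, not in the original post), you can confirm that the DataFrame was populated as expected:

# The article's query yields 5,287,235 rows and 4 columns.
print(df_temp.shape)
df_temp.head()  # Colab renders the last expression in the cell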

BigQuery client library

On Colaboratory the client library is already installed, so you can import it as is.

from google.cloud import bigquery

query = """
SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2019` 
where datehour = '2019-01-01'
"""

client = bigquery.Client(project=<project_id>)
df_temp = client.query(query).to_dataframe()

BigQuery → (AVRO) → GCS

This corresponds to the second approach described in the BigQuery Storage API overview (https://cloud.google.com/bigquery/docs/reference/storage/#background).

The approach: use the BQ client library to save the query result to a temporary table, export it to GCS as an Avro file, and load it as a pandas.DataFrame in the execution environment.

For the experiment, we use an in-house library that we developed to fetch data this way (a rough sketch of the approach is shown below).
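The in-house library itself is not shown in the post, so here is a minimal sketch of what this approach can look like. The bucket name, prefix, and use of fastavro to decode the shards are my assumptions for illustration; the actual library may differ.

from google.cloud import bigquery
from google.cloud import storage
import fastavro
import io
import pandas as pd

# Hypothetical placeholders -- not from the original post.
project_id = "<project_id>"
bucket_name = "<temporary-bucket>"
prefix = "tmp_export/pageviews"

bq_client = bigquery.Client(project=project_id)

query = """
SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2019`
WHERE datehour = '2019-01-01'
"""

# 1. Run the query; the result lands in an anonymous temporary table.
query_job = bq_client.query(query)
query_job.result()
temp_table = query_job.destination

# 2. Export the temporary table to GCS as sharded Avro files.
extract_config = bigquery.ExtractJobConfig(destination_format="AVRO")
destination_uri = "gs://{}/{}-*.avro".format(bucket_name, prefix)
bq_client.extract_table(temp_table, destination_uri, job_config=extract_config).result()

# 3. Download the Avro shards and concatenate them into one DataFrame.
gcs_client = storage.Client(project=project_id)
frames = []
for blob in gcs_client.list_blobs(bucket_name, prefix=prefix):
    records = fastavro.reader(io.BytesIO(blob.download_as_bytes()))
    frames.append(pd.DataFrame(list(records)))
df_temp = pd.concat(frames, ignore_index=True)

As the results below show, this route is faster than fetching with the plain client library, though still much slower than the Storage API.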

Magic Command + BigQuery Storage API

To use the BigQuery Storage API with magic commands, set the context.use_bqstorage_api property to True as follows:

import google.cloud.bigquery.magics
google.cloud.bigquery.magics.context.use_bqstorage_api = True

After doing this, run the same magic command as before.

%%bigquery --project <project_id> df_temp

SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2019` 
where datehour = '2019-01-01'

BigQuery client library + BigQuery Storage API

To use the BigQuery Storage API from the client library, simply pass a BigQuery Storage API client object as an argument to the to_dataframe() method. (Easy!)

from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

query = """
SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2019` 
where datehour = '2019-01-01'
"""

bq_client = bigquery.Client(project=<project_id>)
bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient()
df_temp = bq_client.query(query).to_dataframe(bqstorage_client)

Experiment

As noted above, I measure the time **from issuing the query to obtaining the pandas.DataFrame**. Timing is done by attaching the %time magic command to the relevant line. With the client library it looks like this:

from google.cloud import bigquery

query = """
SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2019` 
where datehour = '2019-01-01'
"""

client = bigquery.Client(project=<project_id>)
%time df_temp = client.query(query).to_dataframe() # <--Measure here

For the magic command and magic command + BigQuery Storage API cases, %%time is added as the first line of the cell so that the entire cell is timed.

%%time
%%bigquery --project <project_id> df_temp

SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2019` 
where datehour = '2019-01-01'

All of these are single-shot measurements, so keep in mind that there is some variance.
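If you wanted to reduce that variance, one option (my addition, not done in the original post) would be to repeat a measurement a few times and average, for example for the client library + Storage API case:

import time
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

query = """
SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2019`
WHERE datehour = '2019-01-01'
"""

bq_client = bigquery.Client(project="<project_id>")
bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient()

# Average wall time over a few query -> DataFrame runs.
durations = []
for _ in range(3):
    start = time.perf_counter()
    df_temp = bq_client.query(query).to_dataframe(bqstorage_client)
    durations.append(time.perf_counter() - start)
print(sum(durations) / len(durations))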

Measurement result

| Method | CPU time | Wall time |
| --- | --- | --- |
| Magic command | 2min 9s | 4min 48s |
| Magic command + BigQuery Storage API | 19.6 s | 21.2 s |
| BQ client library | 2min 9s | 5min 1s |
| BQ → (AVRO) → GCS | 57.1 s | 1min 42s |
| BQ client library + BigQuery Storage API | 19.7 s | 36.8 s |

The results show that the methods using the BigQuery Storage API are **overwhelmingly fast**, roughly an order of magnitude faster than the others.

(And we went to the trouble of building a library for [BQ → (AVRO) → GCS](#bigquery--avro--gcs)...)

In conclusion

The conclusion: if speed matters when fetching BQ data from a query as a pandas.DataFrame, use the BigQuery Storage API.

One last note.

As of December 17, 2019, the API is still in beta and appears to be usable only in the US and EU multi-regions. The overview lists the US/EU multi-regions plus some additional locations (`asia-east1`, `asia-northeast1`, `asia-south1`, `asia-southeast1`, `europe-west2`, `northamerica-northeast1`), but the pricing page (https://cloud.google.com/bigquery/pricing#storage-api) shows it as Unavailable everywhere except the US/EU multi-regions. (It may actually be usable and the documentation simply hasn't caught up.)

Be careful about the location of your dataset when using it.

Incidentally, billing is based on the number of bytes read from BigQuery storage by ReadRows calls; in other words, you are charged according to the amount of data read through the BigQuery Storage API's ReadRows (apologies if my understanding is wrong).

(Thank you, @na0! m(_ _)m)
