[PYTHON] [BigQuery] Load a part of BQ data into pandas at high speed

Background

There are several ways to use BigQuery data from Python code, but the recently introduced **BigQuery Storage API** is reported to be very fast at loading data. Official page: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas?hl=ja

It is still a new API, and when I tried to do something slightly unusual I could not find sample code, so I am leaving mine here. I will leave benchmarks of execution speed and the like to other articles.

What I want to do

Read BigQuery data directly from Python and store it in pandas. This time the table contained a large amount of data, so I set conditions and extracted only part of it.

Preparation

First, install the libraries with pip. I recommend **preparing a separate virtual environment** with conda create or similar. (I installed them into the environment I normally use, and untangling the errors caused by broken version dependencies between libraries was quite painful.)
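
For reference, a minimal install covering the packages the code below imports might look like the following (pyarrow is needed because the example uses the Arrow data format; adjust versions to your own environment):

shell


pip install google-cloud-bigquery google-cloud-bigquery-storage pandas pyarrow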

Implementation

First, set the authentication information and create a client.

python


import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

bqclient = bigquery.Client(
    credentials=credentials,
    project=your_project_id,
)
bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=credentials
)

Finally, read the data. Rewrite the project ID and the other identifiers as appropriate.

python


print('Loading data...')

# Table you want to read
table = bigquery_storage_v1beta1.types.TableReference()
table.project_id = "PROJECT_ID"  # Project ID of the project that contains the BQ table you want to load
table.dataset_id = "DATASET_ID"  # Dataset ID you want to read
table.table_id = "TABLE_ID"      # Table ID you want to read

read_options = bigquery_storage_v1beta1.types.TableReadOptions()
# Column names you want to read
cols = ['col1', 'col2', 'col3']
for col in cols:
    read_options.selected_fields.append(col)
# For example, to narrow down the rows to fetch with the condition userid = "hogehoge"
read_options.row_restriction = 'userid = "hogehoge"'

parent = "projects/{}".format(your_project_id)
session = bqstorageclient.create_read_session(
    table,
    parent,
    read_options=read_options,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(
        bigquery_storage_v1beta1.enums.ShardingStrategy.LIQUID
    ),
)

# This example reads from a single stream only
stream = session.streams[0]
position = bigquery_storage_v1beta1.types.StreamPosition(stream=stream)
reader = bqstorageclient.read_rows(position)

dataframe = reader.to_dataframe(session)

# The dataframe now holds the data that was read, so process it however you like from here

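The example above pulls all rows through the first stream. If the session returns multiple streams and you want to consume all of them (for example, to split up the work), a minimal sketch, assuming the same session and bqstorageclient as above, could look like this:

python


import pandas

# Read every stream in the session and concatenate the partial results
frames = []
for stream in session.streams:
    position = bigquery_storage_v1beta1.types.StreamPosition(stream=stream)
    reader = bqstorageclient.read_rows(position)
    frames.append(reader.to_dataframe(session))

dataframe = pandas.concat(frames, ignore_index=True)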