background

There are several ways to use BigQuery data in python code, but the recently introduced ** BigQuery Storage API ** seems to be very good at loading speed. Official page: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas?hl=ja

It's still a new API, and when I tried to do something a little unusual, I didn't have the sample code, so I'll leave it. Verification of execution speed etc. will be left to other articles.

Thing you want to do

Read BigQuery data directly in Python and store it in Pandas. This time, I wanted to process a large number of data, so I set conditions and extracted only some data.

Preparation

First, install the library to be used with pip. It is recommended to ** prepare another virtual environment ** by conda create etc. (I put it in the environment I usually use, but it was very troublesome for me to get an error because the version dependency between libraries was broken.)

Implementation

First, set the authentication information and create a client.

`python`


import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

bqclient = bigquery.Client(
    credentials=credentials,
    project=your_project_id,
)
bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=credentials
)

Finally read the data. Please rewrite the project ID etc. as appropriate.

`python`


        print('Loading data...')
        table = bigquery_storage_v1beta1.types.TableReference()
        table.project_id = "PROJECT_ID"//The project ID with the BQ you want to load
        table.dataset_id = "DATASET_ID"//Dataset ID you want to read
        table.table_id = "TASBLE_ID"//Table ID you want to read
        
        read_options = bigquery_storage_v1beta1.types.TableReadOptions()
        #Column name you want to read
        cols = ['col1', 'col2', 'col3']
        for col in cols:
            read_options.selected_fields.append(col)
        #For example, "userid= "hogehoge"If you want to narrow down the data to be fetched under the condition of
        read_options.row_restriction = 'userid = "hogehoge"'
        parent = "projects/{}".format(your_project_id)
        session = bqstorageclient.create_read_session(
            table,
            parent,
            read_options=read_options,
            format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
            sharding_strategy=(
                bigquery_storage_v1beta1.enums.ShardingStrategy.LIQUID
            ),
        )
        stream = session.streams[0]
        position = bigquery_storage_v1beta1.types.StreamPosition(stream=stream)
        reader = bqstorageclient.read_rows(position)
        
        dataframe = reader.to_dataframe(session)

        #The datafreme contains the read data, so what's the rest?

[PYTHON] [BigQuery] Load a part of BQ data into pandas at high speed

background

Thing you want to do

Preparation

Implementation

python

python

`python`

`python`