Introduction

BigQuery is very convenient, isn't it? For the time being, you can put tables in, aggregate the put tables at high speed, and even machine learning using BigQuery ML, etc., can be done only with BigQuery.

I use it conveniently, and when I notice it, various Datasets and Tables have been created. It seems that the meta information can be seen using INFORMATION_SCHEMA.

This time, we will organize how to use INFORMATION_SCHEMA and other meta information.

How to read meta information

Get a list of datasets for a particular project

The first is to specify the project and get the list of datasets inside.


SELECT * FROM `myproject.INFORMATION_SCHEMA.SCHEMATA`

The following items are output.

--catalog_name: project name --schema_name: dataset name --schema_owner: Owner name? (All were null) --creation_time: Creation date and time --last_modified_time: Modified date and time --location: storage location

It was a pity that the owner name was Null. This would be very easy to manage if you knew who created the dataset. .. .. Is there any way to put it in?

Acquisition of table list of specific project data set 1

Next, project: How to specify the dataset and get the list of tables inside.

SELECT * FROM `myproject.mydataset.INFORMATION_SCHEMA.TABLES`

The following items are output.

--table_catalog: Project name --table_schema: Dataset name --table_name: table name --table_type: table or VIEW or xref table --is_insertable_into: Can you INSERT? --is_typed: Unknown (is there anything other than No?) --creation_time: Creation date and time

I don't understand the meaning of is_insertable_into. There seems to be no element other than Yes for the table and No for VIEW. In that case, table_type is enough. .. ..

Acquisition of table list of specific project data set 2

Similarly, Project: A method to specify a dataset and get a list of tables inside. The items that are output are slightly different

SELECT * FROM `myproject.mydataset.__TABLES__`

--project_id: project name --dataset_id: dataset name --table_id: table name --creation_time: Creation date and time --last_modified_time: Modified date and time --row_count: number of rows --size_bytes: Data size --Type: 1 for Table, 2 for VIEW

It's unpleasant that the name is slightly different depending on the meta information you see. It may be catalog_name, table_catalog, or project_id to refer to the same project name. Is it used properly in BigQuery?

Acquisition of table list of specific project data set 3

You can also see additional information about the table in INFORMATION_SCHEMA.

SELECT * FROM `myproject.mydataset.INFORMATION_SCHEMA.TABLE_OPTIONS`

The following items are output.

--table_catalog: Project name --table_schema: Dataset name --table_name: table name --table_type: table or VIEW or xref table

option_name
option_type
option_value

It is difficult to understand what the option is saying for a moment, but option_name contains "expiration_timestamp", option_type contains "TIMESTAMP", and option_value contains the date and time, and these three are a set.

There were also descriptions and labels.

Get table list + column list of specific project data set

The more tables you have, the more output you have.

SELECT * FROM `myproject.mydataset.INFORMATION_SCHEMA.COLUMNS`

--table_catalog: Project name --table_schema: Dataset name --table_name: table name --column_name: column name --ordinal_position: column number --is_nullable: Is Null OK? --data_type: data type

is_generated
generation_expression
is_stored
is_hidden
is_updatable
is_system_defined --is_partitioning_column: Is it used for table partitioning? --clustering_ordinal_position: Is it used for table clustering?

The ones that don't have a commentary are the columns that only contain NEVER or Null. I didn't understand the meaning of the columns.

Try using a combination of meta information

I would like to calculate the billing amount for each dataset in a project by combining meta information.

import pandas as pd

query="SELECT schema_name FROM `myproject.INFORMATION_SCHEMA.SCHEMATA`"
df = pd.read_gbq(query, project_id='myproject', dialect="standard")

df_output = pd.DataFrame()

query="""
SELECT 
 "{dataset}" AS dataset, 
 SUM(size_bytes) / 1000000000 AS DataSize, 
 0.020 * SUM(size_bytes) / 1000000000 AS Cost 
FROM `myproject.{dataset}.__TABLES__`
GROUP BY dataset"""

for i, dataset in df.iterrows():
  d = pd.read_gbq(query.format(dataset=dataset[0]), project_id='myproject',dialect="standard")
  
  df_output=df_output.append(d)

BQ storage costs are $ 0.02 per GB from "BigQuery Fees". For long-term storage, it costs $ 0.01, so the actual bill may be a little cheaper than the result.

in conclusion

I summarized the meta information of BigQuery. Since there are various meta information, there seems to be other ones, but for the time being, I will put out the ones that I often use. However, there are some mysterious items, so I need to study a little more.

[PYTHON] I looked at the meta information of BigQuery & tried using it

Introduction

How to read meta information

Get a list of datasets for a particular project

Acquisition of table list of specific project data set 1

Acquisition of table list of specific project data set 2

Acquisition of table list of specific project data set 3

Get table list + column list of specific project data set

Try using a combination of meta information

in conclusion