[Python] I looked at BigQuery's metadata & tried using it

Introduction

BigQuery is very convenient, isn't it? You can load tables into it, aggregate those tables at high speed, and even do machine learning with BigQuery ML, all inside BigQuery.

I use it casually, and before I know it all sorts of datasets and tables have piled up. It turns out this metadata can be viewed using INFORMATION_SCHEMA.

In this post, I will organize how to use INFORMATION_SCHEMA and other metadata.

How to read metadata

Get a list of datasets for a particular project

First, specify a project and get the list of datasets it contains.


SELECT * FROM `myproject.INFORMATION_SCHEMA.SCHEMATA`

The following items are output.

- catalog_name: project name
- schema_name: dataset name
- schema_owner: owner name? (all were NULL)
- creation_time: creation date and time
- last_modified_time: last modified date and time
- location: storage location

It's a pity that the owner name was NULL. Management would be much easier if you could tell who created each dataset. Is there any way to set it?
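As a side note, the same dataset list can also be pulled from Python without writing SQL. This is only a minimal sketch, assuming the google-cloud-bigquery client library is installed and that `myproject` is the same placeholder project as above.

from google.cloud import bigquery

# List the datasets in the project without going through INFORMATION_SCHEMA
client = bigquery.Client(project='myproject')
for ds in client.list_datasets():
    print(ds.dataset_id)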

Get the table list of a specific project's dataset (1)

Next, how to specify a project and dataset and get the list of tables inside.

SELECT * FROM `myproject.mydataset.INFORMATION_SCHEMA.TABLES`

The following items are output.

- table_catalog: project name
- table_schema: dataset name
- table_name: table name
- table_type: table, VIEW, or external table
- is_insertable_into: whether you can INSERT into it
- is_typed: unknown (is there anything other than NO?)
- creation_time: creation date and time

I don't quite see the point of is_insertable_into. It seems to be nothing but YES for tables and NO for views, and in that case table_type is already enough.
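If you want to check this for yourself, a quick tally by table_type works. The sketch below assumes the same `myproject.mydataset` placeholders as above and that pandas-gbq is installed.

import pandas as pd

# Count how many objects of each table_type the dataset contains,
# together with their is_insertable_into value
query = """
SELECT table_type, is_insertable_into, COUNT(*) AS n
FROM `myproject.mydataset.INFORMATION_SCHEMA.TABLES`
GROUP BY table_type, is_insertable_into"""
print(pd.read_gbq(query, project_id='myproject', dialect='standard'))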

Get the table list of a specific project's dataset (2)

Similarly, a method to specify a project and dataset and get the list of tables inside. The output items are slightly different.

SELECT * FROM `myproject.mydataset.__TABLES__`

- project_id: project name
- dataset_id: dataset name
- table_id: table name
- creation_time: creation date and time
- last_modified_time: last modified date and time
- row_count: number of rows
- size_bytes: data size
- type: 1 for a table, 2 for a view

It's annoying that the column names differ slightly depending on which metadata you look at. The same project name may appear as catalog_name, table_catalog, or project_id. Is there a consistent convention inside BigQuery?
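One thing __TABLES__ is handy for is spotting heavy tables. A minimal sketch, again using the `myproject.mydataset` placeholders and pandas-gbq:

import pandas as pd

# Show the ten largest tables in the dataset by size_bytes
query = """
SELECT table_id, row_count, size_bytes / 1000000000 AS size_gb
FROM `myproject.mydataset.__TABLES__`
ORDER BY size_bytes DESC
LIMIT 10"""
print(pd.read_gbq(query, project_id='myproject', dialect='standard'))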

Get the table list of a specific project's dataset (3)

You can also see additional information about the table in INFORMATION_SCHEMA.

SELECT * FROM `myproject.mydataset.INFORMATION_SCHEMA.TABLE_OPTIONS`

The following items are output.

- table_catalog: project name
- table_schema: dataset name
- table_name: table name
- option_name: option name
- option_type: data type of the option
- option_value: option value

It can take a moment to understand what an option means, but, for example, option_name contains "expiration_timestamp", option_type contains "TIMESTAMP", and option_value contains the date and time; these three go together as a set.

There were also descriptions and labels.
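For example, to pick out only the tables that have an expiration set, you can filter on option_name. A sketch with the same placeholders:

import pandas as pd

# List tables whose expiration_timestamp option is set
query = """
SELECT table_name, option_value
FROM `myproject.mydataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE option_name = 'expiration_timestamp'"""
print(pd.read_gbq(query, project_id='myproject', dialect='standard'))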

Get the table list + column list of a specific project's dataset

The more tables you have, the more output you have.

SELECT * FROM `myproject.mydataset.INFORMATION_SCHEMA.COLUMNS`

- table_catalog: project name
- table_schema: dataset name
- table_name: table name
- column_name: column name
- ordinal_position: column number
- is_nullable: whether NULL is allowed
- data_type: data type

The columns I haven't annotated are the ones that only ever contained NEVER or NULL; I couldn't figure out what they mean.
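COLUMNS is useful for searching across tables, for example to find every column of a given type. A sketch with the same placeholders:

import pandas as pd

# Find every TIMESTAMP column in the dataset
query = """
SELECT table_name, column_name, data_type
FROM `myproject.mydataset.INFORMATION_SCHEMA.COLUMNS`
WHERE data_type = 'TIMESTAMP'
ORDER BY table_name, ordinal_position"""
print(pd.read_gbq(query, project_id='myproject', dialect='standard'))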

Try using metadata in combination

By combining metadata, I would like to calculate the storage cost for each dataset in a project.

import pandas as pd

# Get the list of datasets in the project
query = "SELECT schema_name FROM `myproject.INFORMATION_SCHEMA.SCHEMATA`"
df = pd.read_gbq(query, project_id='myproject', dialect='standard')

df_output = pd.DataFrame()

# Sum the table sizes per dataset and convert them to a monthly storage cost
query = """
SELECT
  "{dataset}" AS dataset,
  SUM(size_bytes) / 1000000000 AS DataSize,
  0.020 * SUM(size_bytes) / 1000000000 AS Cost
FROM `myproject.{dataset}.__TABLES__`
GROUP BY dataset"""

for _, row in df.iterrows():
    d = pd.read_gbq(query.format(dataset=row['schema_name']),
                    project_id='myproject', dialect='standard')
    df_output = pd.concat([df_output, d], ignore_index=True)

According to the BigQuery pricing page, storage costs $0.02 per GB. Long-term storage costs $0.01, so the actual bill may be a little cheaper than this estimate.
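If you want to account for long-term storage as well, one rough approach is to bill tables that have not been modified for 90 days at the lower rate. This is only a sketch of the idea, using the same placeholders and the simplifying assumption that the 90-day rule can be applied per table via last_modified_time in __TABLES__.

import pandas as pd

# Rough estimate: $0.01/GB for tables untouched for 90+ days, $0.02/GB otherwise
query = """
SELECT
  dataset_id,
  SUM(IF(last_modified_time < UNIX_MILLIS(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)),
         0.010, 0.020) * size_bytes) / 1000000000 AS EstimatedCost
FROM `myproject.mydataset.__TABLES__`
GROUP BY dataset_id"""
print(pd.read_gbq(query, project_id='myproject', dialect='standard'))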

In conclusion

I have summarized BigQuery's metadata. There is plenty of other metadata beyond this, but for now I have covered the kinds I use most often. Some items are still a mystery to me, so I need to study a little more.
