[PYTHON] How to batch insert from CSV file to Tableau hyper file with PostgreSQL-like COPY command

For some reason, when I was looking at the Tableau SDK API Reference, I found the following description. I was surprised.

Insert Data Directly from CSV Files

You can use the PostgreSQL-like COPY command to insert records into a hyper file, which is Tableau's unique data format, at high speed **.

A few years ago, I implemented a process to create a tde file from a CSV file in order to distribute Tableau Reader regularly, but due to the SDK specifications, it is an insert for each record, so processing a large amount of data I have the impression that it took a long time. However, if you can insert it directly with the COPY command, it seems that you can quickly create Tableau's original file (hyper in this case) even with a large amount of data.

This time, I will insert the training data (CSV file) of Kaggle's Titanic dataset into the hyper file using the COPY command.

Click here for the code https://github.com/yolo-kiyoshi/csv2hyper

Premise

The complete _API can be installed with pip install tableau hyperapi. _

Directory structure

It is running in the following directory. The .hyper file is created underdata /.

Directory structure


.
├── data
│   └── train.csv # Titanic train
└── notebook
    └── insert_test.ipynb

Implementation code

First, define the I / O file.

#Insert source CSV file path
src_path = '../data/train.csv'
#Hyper file path you want to create
dist_path = '../data/train.hyper'

Then use the COPY command to insert from CSV to hyper. It's a rough explanation, but it's like "(1) create a session to process hyper files", "(2) create a virtual PostgreSQL-like table", "(3) insert all at once from a CSV file", and "(4) spit out the entire table into a hyper file".

from tableauhyperapi import HyperProcess, Telemetry, Connection, CreateMode, NOT_NULLABLE, NULLABLE, SqlType, TableDefinition, escape_string_literal

#Create a session to work with Hyper files
with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
     with Connection(
         endpoint=hyper.endpoint,
         database=dist_path,
         create_mode=CreateMode.CREATE_AND_REPLACE
     ) as connection:
            #Define table
            table_def = TableDefinition(
                table_name='train',
                columns=[
                TableDefinition.Column("PassengerId", SqlType.big_int(), NOT_NULLABLE),
                TableDefinition.Column("Survived", SqlType.big_int(), NOT_NULLABLE),
                TableDefinition.Column("Pclass", SqlType.big_int(), NOT_NULLABLE),
                TableDefinition.Column("Name", SqlType.text(), NOT_NULLABLE),
                TableDefinition.Column("Sex", SqlType.text(), NOT_NULLABLE),
                TableDefinition.Column("Age", SqlType.double(), NULLABLE),
                TableDefinition.Column("SibSp", SqlType.big_int(), NOT_NULLABLE),
                TableDefinition.Column("Parch", SqlType.big_int(), NOT_NULLABLE),
                TableDefinition.Column("Ticket", SqlType.text(), NOT_NULLABLE),
                TableDefinition.Column("Fare", SqlType.double(), NULLABLE),
                TableDefinition.Column("Cabin", SqlType.text(), NULLABLE),
                TableDefinition.Column("Embarked", SqlType.text(), NULLABLE)
                ]
            )
            #Virtually create a table based on the table definition(default schema is public)
            connection.catalog.create_table(table_def)
            #Execute COPY command like PostgreSQL
            record_count = connection.execute_command(
                command=f'''
                COPY {table_def.table_name} from {escape_string_literal(src_path)} with (format csv, delimiter ',', header)
                '''
            )
            print(f"The number of rows in table {table_def.table_name} is {record_count}.")

Execution result


The number of rows in table "train" is 891.

After processing, if you look at data / train.hyper in Tableau ... スクリーンショット 2020-02-09 14.43.30.png I was able to confirm that the hyper file was created successfully!

Finally

I was impressed that the Tableau SDK has evolved so far. If you mainly use Tableau Desktop or Tableau Server, you will rarely have the opportunity to use hyper files, but if for some reason you have to distribute reports using Tableau Reader etc., it is quite useful. Will do.

reference

https://help.tableau.com/current/api/hyper_api/en-us/docs/hyper_api_insert_csv.html

Recommended Posts

How to batch insert from CSV file to Tableau hyper file with PostgreSQL-like COPY command
How to read a CSV file with Python 2/3
How to convert JSON file to CSV file with Python Pandas
[Python] Write to csv file with Python
Output to csv file with Python
How to make a command to read the configuration file with pyramid
[Python] How to convert db file to csv
Convert from PDF to CSV with pdfplumber
How to use CUT command (with sample)
How to create sample CSV data with hypothesis
[Python] How to read excel file with pandas
Generate an insert statement from CSV with Python.
Draw a graph with matplotlib from a csv file
How to create an article from the command line
How to delete the specified string with the sed command! !! !!
I tried to touch the CSV file with Python
Backtrader How to import an indicator from another file
How to measure mp3 file playback time with python
How to output CSV of multi-line header with pandas
How to convert Json file to CSV format or EXCEL format
Python script to create a JSON file from a CSV file
[Python / Ruby] Understanding with code How to get data from online and write it to CSV
How to put a hyperlink to "file: // hogehoge" with sphinx-> pdf
How to return to the previous directory with the Bash cd command
How to run a Python file at a Windows 10 command prompt
[Linux] Copy data from Linux to Windows with a shell script
How to check ORM behavior in one file with django
How to launch AWS Batch from a python client app
[Python Kivy] How to create an exe file with pyinstaller
How to read an Excel file (.xlsx) with Pandas [Python]
Python hand play (CSV is applied with Pandas DataFrame, but only full-column Insert from CSV to DB?)