[PYTHON] Use single quotes without escaping with the AWS Redshift UNLOAD command

Overview

AWS Redshift has an UNLOAD command that prints the results of a query to S3.

I enclose the query in single quotes like UNLOAD ('{query}'), which is annoying because I have to escape the single quotes in the query.

Also, when handwriting a query, I'm still at a loss if I'm still using placeholders and single quotes are automatically inserted.

As a result of various studies to deal with such a situation, I have found a method that can be used without escaping the single quote, so I will introduce it.

The method is to enclose the query in "$$".

background

We have a big data DB built on AWS Redshift, and data extraction from it is often done.

Since the data to be extracted is large, it is not possible to write it as it is on EC2, and the pattern is usually to write it to S3 and then move it locally for analysis.

So use Redshift's UNLOAD command.

https://docs.aws.amazon.com/ja_jp/redshift/latest/dg/r_UNLOAD.html

UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
authorization
[ option [ ... ] ]

Well, I write a query in this "('select-statement')" part, but in the document

The query must be quoted as shown below:

('select * from venue order by venueid')

There is.

further

If the query contains quotation marks (for example, to enclose a literal value), place the literal between two pairs of single quotation marks. You also need to enclose the query in single quotes.

('select * from venue where venuestate=''NV''')

And that.

Also, in here

If there are quotation marks in the query (for example, to enclose literal values), they must be escaped (') in the query text.

It seems that single quotes need to be escaped by overlapping or "\" in the query.

Countermeasures

Simple remedy

First, follow the documentation and consider escaping single quotes.

As an example, let's submit a query from Python using the psycopg2 package.

unload_test.py


import psycopg2

conn = psycopg2.connect(
    host=host,
    dbname=dbname,
    port=port,
    user=user,
    password=password
)

unload_template = "UNLOAD ('{query}') TO 's3://bucket/key' CREDENTIALS 'aws_iam_role=arn:aws:iam::xxxxxxxxxxxx' delimiter '\t' allowoverwrite"
query = unload_template.format(query="select * from address where name = ''taro''")
(Or)query = unload_template.format("select * from address where name = \\'taro\\'")

cur = conn.cursor()
cur.execute(query)

(In Python's manner, the string is enclosed in single quotes, but it is difficult to escape further, so it is double quotes.)

If you want to escape by overlapping single quotes, you need to escape two, and if you escape with backslash, you also need to escape backslash, so you need to write two backslashes + single quote.

A simple query like this is still okay, but it's hard to remember to escape when conditions get complicated and you need to embed a lot of literals, and above all, it's annoying.

It's even more complicated when you think about using placeholders.

psycopg2 has a query configuration class called psycopg2.sql, let's use it.

If you pass a string to psycopg2.sql.SQL (), you'll get an object that looks like a normal Python string.

This object is a great thing that you can insert a string with format () in the part enclosed in {}, and it acts as a placeholder (that is, it is not sent to the DB after being assembled).

Use psycopg2.sql.Literal () to insert literals.

unload_test.py


import psycopg2
from psycopg2 import sql

conn = psycopg2.connect(
    host=host,
    dbname=dbname,
    port=port,
    user=user,
    password=password
)

unload_template = sql.SQL("UNLOAD ('{query}') TO 's3://bucket/key' CREDENTIALS 'aws_iam_role=arn:aws:iam::xxxxxxxxxxxx' delimiter '\t' allowoverwrite")

query_template = sql.SQL("select * from address where name = {name}")
query = query_template.format(name=sql.Literal("taro"))

unload = unload_template.format(query=query)
    
cur = conn.cursor()
cur.execute(unload)

The string inserted by psycopg2.sql.Literal () is automatically enclosed in single quotes, but if you use it with the UNLOAD command, a syntax error will occur.

psycopg2.errors.SyntaxError: syntax error at or near "taro"
LINE 1: ...OAD ('select * from address where name = 'taro'') TO 's...

The first idea to avoid this is to escape the automatically inserted single quotes, but that doesn't work.

For example, if you try to use a backslash,

unload_test.py


query_template = sql.SQL("select * from address where name = \\{name}")
query = query_template.format(name=sql.Literal("taro\\"))
psycopg2.errors.SyntaxError: syntax error at or near "taro"
LINE 1: ...OAD ('select * from address where name = '\taro\\'') TO 's...

sql.Literal () automatically escapes the backslash, and as a result both are inserted and it doesn't work.

Also, if you put another single quote in sql.Literal (),

unload_test.py


query_template = sql.SQL("select * from address where name = {name}")
query = query_template.format(name=sql.Literal("'taro'"))
psycopg2.errors.SyntaxError: syntax error at or near "taro"
LINE 1: ...OAD ('select * from address where name = '''taro'''') TO 's...

Since sql.Literal () escapes single quotes, one more single quote will be included and it will not work.

It only works if you add single quotes before and after the placeholder.

unload_test.py


query_template = sql.SQL("select * from address where name = '{name}'")
query = query_template.format(name=sql.Literal("taro"))

However, when the query becomes complicated and the number of placeholders increases, it is difficult to enclose everything in single quotes.

Clear remedy

When I was wandering around the ocean on the Internet, there was a person who said something like this.

https://stackoverflow.com/a/55962613

You can also use postgres style :

unload 
($$
select * from table where id='ABC'
$$)
to 's3://bucket/queries_results/20150324/table_dump/'
credentials 'aws_access_key_id=;aws_secret_access_key='
;

Eureka! Apparently, if you write like UNLOAD ($$ {query} $$), you can use single quotes without escaping.

If you try it while wondering if it is true, it will work.

unload_test.py


unload_template = sql.SQL("UNLOAD ($$ {query} $$) TO 's3://bucket/key' CREDENTIALS 'aws_iam_role=arn:aws:iam::xxxxxxxxxxxx' delimiter '\t' allowoverwrite")

query_template = sql.SQL("select * from address where name = {name}")
query = query_template.format(name=sql.Literal("taro"))

unload = unload_template.format(query=query)

In this case, you don't even have to enclose the query in single quotes in UNLOAD ().

By the way, enclosing it will result in a syntax error.

unload_test.py


unload_template = sql.SQL("UNLOAD ('$$ {query} $$') TO 's3://bucket/key' CREDENTIALS 'aws_iam_role=arn:aws:iam::xxxxxxxxxxxx' delimiter '\t' allowoverwrite")

query_template = sql.SQL("select * from address where name = {name}")
query = query_template.format(name=sql.Literal("taro"))

unload = unload_template.format(query=query)
psycopg2.errors.SyntaxError: syntax error at or near "taro"
LINE 1: ...('$$ select * from address where name = 'taro' $$') t...

Conclusion

When UNLOADing the query result from Redshift, it is very convenient to write UNLOAD (\ $ \ $ {query} \ $ \ $) because it is not necessary to escape the single quote in the query.

(Bonus) What is this "$$" in the first place?

Regarding this notation using $$, the respondent of stackoverflow mentioned earlier said "postgres style".

So, if you go to the official PostgreSQL documentation instead of Redshift, you will find the following description.

https://www.postgresql.jp/document/8.0/html/sql-syntax.html

4.1.2.2. String constants quoted with dollar signs

In most cases, the syntax for specifying string constants in SQL is convenient, but I understand that if there are many single quotes or backslashes in the target string, they all have to be duplicated. It will be difficult. To make the query more readable even in these situations, PostgreSQL provides another way to specify a string constant called "dollar quotes". A dollar quoted string constant is a dollar sign ($), a "tag" consisting of zero or more optional characters, a dollar sign, any sequence of characters that make up a string constant, a dollar sign, this It consists of the same tags and dollar symbols specified at the beginning of the quotation marks. For example, there are two ways to specify the string "Dianne's horse" using dollar quotes:

$$Dianne's horse$$
$SomeTag$Dianne's horse$SomeTag$

The dollar quoted string has already been escaped, so you can write all string constants as they are. Neither the backslash nor the dollar sign is special unless the sequence matches the start tag.

The notation using $$ seems to be "dollar quotation".

It's also clear that this $$ is a substitute for quotes, which eliminates the need to enclose the query in single quotes with the UNLOAD command.

Since Redshift is based on PostgreSQL, you can use this grammar as well.

Recommended Posts

Use single quotes without escaping with the AWS Redshift UNLOAD command
Use AWS interpreter with Pycharm
I want to use only the SMTP MAIL FROM command and RCPT TO command without sending mail with Python's smtplib
Use the preview feature with aws-cli
Using cgo with the go command
Hit the top command with htop
I get a Python No module named'encodings' error with the aws command