[PYTHON] SQL format with sqlparse

I was looking for a tool to format SQL locally and found a tool called sqlparse so I tried it.

https://github.com/andialbrecht/sqlparse

Installation

sqlparse is made by Python and can be installed using pip.

$ pip install sqlparse

Use on the web

sqlparse is used in an online service called SQLFormat, which allows you to pass SQL and format it with a browser or API.

SQLFormat - Online SQL Formatter

Used on the command line

A command called sqlformat that becomes available when you install sqlparse If you enter SQL as a file or standard input, the formatted SQL will be output.

-rLine breaks / indents,-k upperYou can capitalize the keyword with. For other options-hYou can check it as an option.

$ SQL='select t1.c1 A, t2.c2 B from t1 join t2 on t1.id = t2.id where t1.c1 = "HOGE";'
$ echo "$SQL" | sqlformat -r -k upper -
SELECT t1.c1 A,
       t2.c2 B
FROM t1
JOIN t2 ON t1.id = t2.id
WHERE t1.c1 = "HOGE";

Use as a library

sqlparse can also be used as a library.

format

sqlparse.format allows you to format like the sqlformat command.

>>> import sqlparse
>>> sql = 'select t1.c1 A, t2.c2 B from t1 join t2 on t1.id = t2.id where t1.c1 = "HOGE";'
>>> print sqlparse.format(sql, reindent=True, keyword_case='upper')
SELECT t1.c1 A,
       t2.c2 B
FROM t1
JOIN t2 ON t1.id = t2.id
WHERE t1.c1 = "HOGE";

Perth

You can get the parsed result with sqlparse.parse. sqlparse.parse returns a tuple of Statements that represent SQL statements.

>>> parsed = sqlparse.parse(sql)
>>> parsed
(<Statement 'select...' at 0x1077c6160>,)

The input to sqlparse.parse can also contain multiple SQL statements as follows: In this case, the tuple has multiple elements.

>>> sqlparse.parse("select 1; select 2;")
(<Statement 'select...' at 0x1077c6958>,
 <Statement 'select...' at 0x1077c6848>)

You can get a list of tokens with Statement.tokens.

>>> stmt = parsed[0]
>>> for t in stmt.tokens:
...     print type(t), t
...     
<class 'sqlparse.sql.Token'> select
<class 'sqlparse.sql.Token'>  
<class 'sqlparse.sql.IdentifierList'> t1.c1 A, t2.c2 B
<class 'sqlparse.sql.Token'>  
<class 'sqlparse.sql.Token'> from
<class 'sqlparse.sql.Token'>  
<class 'sqlparse.sql.Identifier'> t1
<class 'sqlparse.sql.Token'>  
<class 'sqlparse.sql.Token'> join
<class 'sqlparse.sql.Token'>  
<class 'sqlparse.sql.Identifier'> t2
<class 'sqlparse.sql.Token'>  
<class 'sqlparse.sql.Token'> on
<class 'sqlparse.sql.Token'>  
<class 'sqlparse.sql.Comparison'> t1.id = t2.id
<class 'sqlparse.sql.Token'>  
<class 'sqlparse.sql.Where'> where t1.c1 = "HOGE";

Formatting method customization

There are many styles of SQL formatting. For example, the column names in the SELECT clause may be listed with a comma at the end of the line, or with a comma at the beginning of the line, as you will get in the output of sqlformat.

You can change the format method to some extent with the options of the format method, but there are some options such as the position of commas that cannot be handled.

Let's change the filter that sqlparse controls the formatting method so that the column names in the SELECT clause are arranged with the comma at the beginning.

First, the implementation is as follows to understand the processing flow of sqlparse.format.

__init__.py


def format(sql, **options):
    """Format *sql* according to *options*.
    Available options are documented in :ref:`formatting`.
    In addition to the formatting options this function accepts the
    keyword "encoding" which determines the encoding of the statement.
    :returns: The formatted SQL statement as string.
    """
    encoding = options.pop('encoding', None)
    stack = engine.FilterStack()
    options = formatter.validate_options(options)
    stack = formatter.build_filter_stack(stack, options)
    stack.postprocess.append(filters.SerializerUnicode())
    return ''.join(stack.run(sql, encoding))

stack = engine.FilterStack()Create a parser with, stack.run(sql, encoding)You are executing the SQL parsing given in.



 You can add filters to the FilterStack to control how it is formatted. In sqlparse.format, `` `formatter.validate_options``` and `` `formatter.build_filter_stack``` are set based on the value of the option.


#### **`stack.postprocess.append(filters.SerializerUnicode())Is a filter for converting a list of tokens to a string.`**

For the sake of simplicity, let's consider an example that does not take options. Let's start by outputting the input as it is.

>>> from sqlparse import engine
>>> stack = engine.FilterStack()
>>> stack.postprocess.append(filters.SerializerUnicode())
>>> print stack.run(sql).next()
select t1.c1 A, t2.c2 B from t1 join t2 on t1.id = t2.id where t1.c1 = "HOGE";

sqlparse.filters.Add ReindentFilter and the results will be indented.



#### **`If you use a filter that processes SQL, such as ReindentFilter, stack.enable_grouping()Token grouping must be enabled in.`**
```enable_grouping()Token grouping must be enabled in.


```pycon
>>> from sqlparse.filters import ReindentFilter
>>> stack = engine.FilterStack()
>>> stack.enable_grouping()
>>> stack.stmtprocess.append(ReindentFilter())
>>> stack.postprocess.append(filters.SerializerUnicode())
>>> print stack.run(sql).next()
select t1.c1 A,
       t2.c2 B
from t1
join t2 on t1.id = t2.id
where t1.c1 = "HOGE";

To change the behavior so that the comma is at the beginning, create a class that overrides `_process_identifierlist``` of `ReindentFilter```.

from sqlparse.sql import Function                                                  
                                                                                   
class MyReindentFilter(ReindentFilter):                                            
    def _process_identifierlist(self, tlist):                                      
        identifiers = list(tlist.get_identifiers())                                
        if len(identifiers) > 1 and not tlist.within(Function):                    
                                                                                   
            first = identifiers[0]                                                 
            self.indent += 1                                                       
            tlist.insert_before(first, self.nl())                                  
            self.offset -= 1                                                       
            tlist.insert_after(first, self.nl())                                   
                                                                                   
            for token in identifiers[1:len(identifiers)-1]:                        
                prev = tlist.token_prev(tlist.token_index(token), False)           
                if prev and prev.is_whitespace():                                  
                    prev.value = ''                                                
                tlist.insert_after(token, self.nl())                               
                                                                                   
            last = identifiers[-1]                                                 
            prev = tlist.token_prev(tlist.token_index(last), False)                
            if prev and prev.is_whitespace():                                      
                prev.value = ''                                                    
            self.offset += 1                                                       
            self.indent -= 1                                                       
                                                                                   
        self._process_default(tlist) 

reindentfilterInstead ofmyreindentfilterYou can see that it is formatted so that the comma comes first.

>>> stack = engine.FilterStack()
>>> stack.enable_grouping()
>>> stack.stmtprocess.append(MyReindentFilter())
>>> stack.postprocess.append(filters.SerializerUnicode())
>>> print stack.run(sql).next()
select
  t1.c1 A
 ,t2.c2 B
from t1
join t2 on t1.id = t2.id
where t1.c1 = "HOGE"
  and t2.c2 = 1;

Even in the subquery, the column names are correctly arranged according to the indent depth as shown below.

>>> print stack.run('select a, b, c FROM (select a, b, c FROM t1) t2;').next()
select
  a
 ,b
 ,c
FROM
  (select
     a
    ,b
    ,c
   FROM t1) t2;

reindentfilterOther methods and filters also allow you to control where clause indentation, keyword case, and so on.

Recommended Posts

SQL format with sqlparse
Format json with Vim (with python)
String format with Python% operator
Use Azure SQL Database with SQLAlchemy
Parsing CSV format data using SQL
Vector format operation & conversion with arcpy
Merge JSON format data with Ansible
[Python] Format when to_csv with pandas
Execute SQL with Django + Postgresql (including PreparedStatement)