[PYTHON] Pretty print of ML language using parser combinator

This article is the 7th day article of Language processing system Advent Calender.

Today, I will write an example of pretty prints in ML languages ats-beautify [[1]](# 1). ats-beautify is implemented only in python and can be used as a plugin for sublime-text2 or from various editors using command line calls.

motivation

There is a language called ATS2 that has a grammar similar to SML. ATS can certify the program code, so you can build safe code. Moreover, since it outputs C language, you can replace only a part of the C language source with the code output by ATS and replace it with a complete program with proof and use it immediately.

However, the program code written by the author, Hongway, was not easy to read. However, he is the author of the language. You should be writing with a certain degree of commitment. Even if I ask you to rewrite it neatly, I feel that it will not be accepted easily. I thought it would be good to write a solid pretty printer and talk about how it would be beautiful if you used it. For the time being, I thought I should make a pretty printer with reference to OCaml.

However, unlike languages such as C and Ruby that end with end, it was a little difficult to create a beauty for ML languages, and it didn't seem easy to create.

I was able to make some things by modifying the parser, but it doesn't work because I need information that isn't in the syntax tree. OCaml pretty printers seem to have their own parsers!

Umm. I want to make it easy and beautiful. But I can't make it. Such a situation continued.

Around the summer, there was a study session on PEG and parsing Vol.1 [[2]](# 2). I was wondering if there was anything I could do, and suddenly I came up with it.

** "Yes! Let's make a parser combinator and make a pretty printer!" **

So I created a parser combinator that can add nested information to the grammar. You can go with this. I found that I could make it to some extent.

How to use a pretty printer

parse(src)

If you pass the source code to the parse function, the pretty printed string will be returned.

How to use the parser combinator with pretty printer function

Suddenly it's big, but I write an ATS parser as shown below, as long as I can make a pretty print that is not perfect. The point is that if you write-where you want to put a nest, it will be nested. The reason I chose-is that-in Sublime Text2-is beautifully colored. You will know how to make grammar once you get used to it. In the case of python, many operators cannot be used, so I try to write a / b / c as or (a, b, c) with reference to the parser combinator of go, but I feel that I can not understand it unless I am familiar with it. .. Well, the parser combinator is surely such a thing.

keywords = reg(r"^(begin|end|if|else|then|let|in|val|implement|local|typedef|sortdef|datatype|extern|lam|try|with|fnx|fn|fun|case|of|orelse|macrodef|macdef|staload|dynload|open|struct|module|and|while|do|done)\b")
semi = notp(";;") >> p(";")
exp = p(lambda i: p(exp4, rep[semi, exp4], opt(semi))(i))
exps = p(exp, opt(semi))
id = orp(
    notp(keywords) >> reg(r"^[\$\\]?[_a-zA-Z0-9]+"),
    str(":<cloref>"),
    reg(r'^[+\-*/.<>:@=^|~?]+') ^ (lambda i: i if re.search(r'^(=|=>|=<|->|\||)$', i[1]) is None else None),
    reg(r'^[!]'),
    reg(r'^("(\\.|[^"])*"|\'(\\.|[^\'])\')')
)
assign = p(lambda i: p(app, "=", -exp)(i))
sexp = orp(
    p("@(", -opt(exp), ")"),
    p("@{", -p(assign, rep(",", assign)), "}"),
    p("@[", -opt(exp), "]"),
    p("'(", -opt(exp), ")"),
    p(",(", -opt(exp), ")"),
    p("'{", -p(assign, rep(",", assign)), "}"),
    p("'[", -opt(exp), "]"),
    id,
    p("begin", -exp, "end"),
    p("(", -opt(exp), ")"),
    p("{", -p(lambda i: rep(toplevel)(i)), "}"),
    p("[", -opt(exp), "]")
)
app = rep1(sexp)
exp1 = orp(
    p("lam", -p[app], orp("=>", "=<cloptr1>", "=<cloref>"), -p(lambda i: exp2(i))),
    p("let", -p(lambda i: rep(toplevel, opt(";;"))(i)), "in", -opt(exp), "end"),
    p("if", -exps, "then", -p(lambda i: exp4(i)), opt(p("else", exp))),
    p("case", -exps, "of", -p(opt("|"), app, "=>", -exp, rep("|", app, "=>", -exp))),
    p("try", -exps, "with", opt("|"), -app, "=>", exp, rep["|", app, "=>", -exp]),
    p("while", -exps, "do", -exps, "done"),
    app
)

exp2 = p(exp1, opt(orp("=", "orelse"), p(lambda i: exp2(i))))
exp3 = p(exp2, rep(",", exp2))
exp4 = p(exp3, opt("->", exp))
prog = p(lambda i: rep(toplevel)(i))
struct = p(lambda i: orp(
    p("struct", -prog, "end"),
    -struct_exp
)(i))
struct_exp = rep1(orp(
    id,
    p("(", -opt(struct), ")")
))
datatype = p[app, "=", opt("|"), -app, opt("of", -app), rep("|", -app, opt("of", -app))]
toplevel = orp(
    p(orp("fn", "fnx", "fun"), -p[app], opt("=", -exp)),
    p("extern", -p[orp("fn", "fnx", "fun"), app]),
    p(orp("macdef", "macrodef"), -p[app], opt("=", -exp)),
    p("val", -p[app], "=", -exp),
    p("implement", -p[app], "=", -exp),
    p(orp("typedef", "sortdef"), -p[app], "=", -exp),
    p("and", -p[app], "=", -exp),
    p(orp("#include", "staload", "dynload"), -exp),
    p("#define", -sexp, opt("(", -exps, ")"), -sexp),
    p("local", -p[prog], "in", -p[prog], "end"),
    p("datatype", -datatype, rep("and", -datatype)),
    p("exception", -id, "of", -exp),
    p("open", -p[id, rep(".", id)]),
    p(exp, opt(semi)),
    p("module", -p[app], "=", struct)
)

Implementation of parser combinator with pretty printer function

First, create a simple parser combinator library.

The language for implementation is python because I wanted to use it with sublimetext2. Python has a class, and the operators can be changed a little. There is a way to use the method chain, but I decided to use the function by referring to go parser combinator goparsec [[3]](# 3).

import re
import sys

sys.setrecursionlimit(1000*1000)


class Parser(object):
    def __rshift__(self, that):
        return p(self, that) ^ (lambda a: a[1])

    def __lshift__(self, that):
        return p(self, that) ^ (lambda a: a[0])

    def __xor__(self, that):
        return Action(self, that)

    def __neg__(self):
        return Action(self, lambda a: [NestP, a, NestM])

    class static(type):
        def __getitem__(self, i):
            return self(*i) if isinstance(i, tuple) else self(i)
    __metaclass__ = static


class nreg(Parser):
    def __init__(self, param):
        self.param = re.compile(param)

    def __call__(self, i):
        m = self.param.search(i)
        return None if m is None else [m.group(0), i[len(m.group(0)):]]


class orp(Parser):
    def __init__(self, *params):
        self.params = map(lambda v: st(v) if isinstance(v, basestring) else v, params)

    def __call__(self, i):
        for v in self.params:
            r = v(i)
            if r is not None:
                return r
        return None


class nstr(Parser):
    def __init__(self, param):
        self.param = param

    def __call__(self, i):
        return None if not i.startswith(self.param) else [self.param, i[len(self.param):]]


class p(Parser):
    def __init__(self, *params):
        self.params = map(lambda v: st(v) if isinstance(v, basestring) else v, params)

    def __call__(self, i):
        rs = []
        for v in self.params:
            r = v(i)
            if r is None:
                return None
            rs.append(r[0])
            i = r[1]
        return [rs, i]


class Action(Parser):
    def __init__(self, thiz, action):
        self.thiz = thiz
        self.action = action

    def __call__(self, i):
        r = self.thiz(i)
        if r is None:
            return None
        else:
            r2 = self.action(r[0])
            return None if r2 is None else [r2, r[1]]


class opt(Parser):
    def __init__(self, *thiz):
        self.thiz = p(*thiz)

    def __call__(self, i):
        r = self.thiz(i)
        return [[], i] if r is None else r


class rep(Parser):
    def __init__(self, *thiz):
        self.thiz = p(*thiz)

    def __call__(self, i):
        rs = []
        while(True):
            r = self.thiz(i)
            if r is None:
                return [rs, i]
            rs.append(r[0])
            i = r[1]


def rep1(*thiz):
    return rep(*thiz) ^ (lambda p: None if len(p) < 1 else p)


class notp(Parser):
    def __init__(self, *thiz):
        self.thiz = orp(*thiz)

    def __call__(self, i):
        return [[], i] if self.thiz(i) is None else None


class Any1(Parser):
    def __call__(self, i):
        return None if len(i) == 0 else [i[0], i[1:]]
any1 = Any1()

Customize this to some extent.

blockcomment = p(nstr("(*"), rep(orp(notp(nreg(r"^(\(\*|\*\))"), any1), (lambda i: blockcomment(i)))), nstr("*)"))
skip = rep(orp(nreg(r'^(\s|/\*.*?\*/|//[^\r\n]*)+'), blockcomment))

def st(s):
    return p(skip, nstr(s))


def reg(s):
    return p(skip, nreg(s))

Now, create two objects for nesting.

class Nest:
    def __init__(self, name, n):
        self.name = name
        self.n = n

    def __repr__(self):
        return self.name
NestP = Nest("NestP", 1)
NestM = Nest("NestM", -1)

It means that NestP raises the nest by one and NestM lowers it by one. This is done by using the Parser class __neg__ and adding-to allow NestP and NestM to be added before and after the parser.

Pretty print

The pretty print function consists of two functions, flat and cnv. Since the parse result includes all tokens + nesting information, just adjust the blanks with and output the nesting information.

def flat(a, rc):
    if isinstance(a, list):
        for i in a:
            rc = flat(i, rc)
    else:
        rc.append(a)
    return rc


def cnv(e):
    reg2 = re.compile(r'\n')
    whiteSpace = re.compile(r'^(\s|\s*\(\*)')
    reg1 = re.compile(r'\n')

    e = flat(e, [])
    e2 = []
    i = 0
    while(i < len(e)):
        s = [e[i]]
        if isinstance(s[0], basestring) and reg2.search(s[0]) is not None and whiteSpace.search(s[0]) is not None:
            s = []
            while(i < len(e)):
                s2 = e[i]
                if s2 is NestM:
                    e2.append(s2)
                else:
                    s.append(s2)
                    if whiteSpace.search(s2) is None:
                        break
                i += 1
        i += 1
        e2.extend(s)

    nest = 0
    e3 = []
    for s in e2:
        if isinstance(s, Nest):
            nest += s.n
        else:
            m = reg2.search(s)
            if m is not None:
                s = reg1.sub("\n"+("  " * nest), s)
            e3.append(s)
    return "".join(e3)

These functions are called as follows.

regparse = re.compile(r"^[ \t]+", re.M)


def parse(s):
    return cnv(prog(regparse.sub("", s)))

Summary

  1. Create a parser combinator dedicated to creating a pretty printer.
  2. The parser combinator automatically creates a syntax tree and has token information including spaces and comments.
  3. Allow nesting information to be added to the parser combinator.
  4. Create a process to print based on the nest information.
  5. It doesn't have to be perfect, so make a parser for pretty prints.

By doing the above, you can make a pretty print using the parser combinator.

ATS's pretty printers aren't perfect yet, but expanding this parser will make it even better. Also, if you use this mechanism, you should be able to write beautifully on other ML pretty printers.

Link

- [[2]](# r2) Arekore Study Group on PEG and Parsing Vol.1 http://connpass.com/event/16630/

Recommended Posts

Pretty print of ML language using parser combinator
[MacOS] Installation of Go (Go language) using goenv
100 Language Processing Knock-32 (using pandas): Prototype of verb
100 language processing knock-75 (using scikit-learn): weight of features
100 Language Processing Knock-36 (using pandas): Frequency of word occurrence