Compiler in Python: PL / 0 syntax tree

This goal

The general flow of creating a compiler is as follows, and previous was done up to number 1. This time, we will create the second syntax tree (only a part).

  1. Break it down into tokens with a parser.
  2. Parse the token structure to create a parse tree.
  3. Delete unnecessary talks and create an abstract syntax tree (AST).
  4. Follow the AST node to generate the code.

Suppression of unnecessary tokens

With pyparsing, you don't have to keep symbols such as';' in the parsing result because of the structuring feature that comes later. If you use Suppress (), the token will not be included in the result. If you do not suppress it first, it will be difficult to see the result, so let's start with this topic.

I will repost the parse results up to the last time.

['VAR', 'x', ',', 'squ', ';', 'PROCEDURE', 'square', ';', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', ';', 'BEGIN', 'x', ':=', '1', ';', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', ';', 'x', ':=', 'x', '+', '1', ';', 'END', 'END', '.']

Let's suppress (),;. And check the effect.

#Add to the beginning. 
LPAR, RPAR, COMMA, SEMICOLON, DOT = map(Suppress, "(),;.")

#Replace strings with constants
# before
# factor << (ident | number | "(" + expression + ")")
# after
factor << (ident | number | LPAR + expression + RPAR)

#Similarly, ; .Also replace with a constant

Let's parse the sample program as before. The symbol has disappeared.

$ python pl0_parser.py ex1.pl0
['VAR', 'x', 'squ', 'PROCEDURE', 'square', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', 'BEGIN', 'x', ':=', '1', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', 'x', ':=', 'x', '+', '1', 'END', 'END']

Tree structure using Group ()

The perspective results we have seen so far are one-dimensional arrays and have no structure. Let's create a structure by adding group information to the parser.

To give a static structure, use Group () to combine the tokens into one. Let's look at a concrete example. To group the list of variables, make the following changes:

# 11. var
var = VAR + Group(ident + ZeroOrMore(COMMA + ident)) + SEMICOLON

This is the result of parsing again. The first variable list is in parentheses!

['VAR', ['x', 'squ'], 'PROCEDURE', 'square', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', 'BEGIN', 'x', ':=', '1', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', 'x', ':=', 'x', '+', '1', 'END', 'END']

Then, why not enclose it in Group ()? But in pyparsing, there is another conventional way to use .setParseAction.

How to use .setParseAction ()

With pyparsing it is possible to call a callback when the parser succeeds in parsing. A list of tokens is passed to this callback and you can replace the tokens with the return value. Now let's put the variable declaration in [] as before.

# 11. var
def var_list(tokens):
    tokens = tokens.asList()
    return [tokens[0], tokens[1:]]
var = VAR + ident + ZeroOrMore(COMMA + ident) + SEMICOLON
var.setParseAction(var_list)

When I parsed it, I got the same result as before.

['VAR', ['x', 'squ'], 'PROCEDURE', 'square', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', 'BEGIN', 'x', ':=', '1', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', 'x', ':=', 'x', '+', '1', 'END', 'END']

A little advanced usage.

Instead of calling the function, try calling the constructor of the class. If you preprocess with Group (), you can write the class neatly.

# 11. var
class Var(object):
    def __init__(self, tokens):
        tokens = tokens.asList()
        self.variables = tokens[1]
var = VAR + Group(ident + ZeroOrMore(COMMA + ident)) + SEMICOLON
var.setParseAction(Var)

result. Oh! The object has entered. This allows you to ** generate an AST directly when you run the parser **.

[<__main__.Var object at 0x10d418710>, 'PROCEDURE', 'square', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', 'BEGIN', 'x', ':=', '1', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', 'x', ':=', 'x', '+', '1', 'END', 'END']

The callback can be a function or a class as long as it can be called. Be careful as the meaning of the arguments depends on the number of arguments.

original_string = Original string currently being parsed location = location of the matched string tokens = an array of matched tokens. The token is a ParseResults object

Expression tree structure

Expressions have a complicated order of operations, but in the end they are basically terms and operators. Nesting is mandatory considering operator precedence. It's a hassle to write by yourself, but in pyparsing there is a utility called ʻinfixNotation`, which automatically creates a parser just by defining the order of the operators. Delete the parser based on the original BNF.

The sign that precedes a term is the unary operator, and the usual four-rule operation is the binary operator. The comparison operator (<<= >> = = #) is also a companion to the binary operator. Let's actually define it.

Using oneOf means that the operators are the same. Note that if you inadvertently write this in two lines, the result will violate the rule that "same operators are calculated from the left".

opAssoc.RIGHT and LEFT indicate whether the operator is tied to the right or left side. The operator # means! = In PL / 0.

# term = Forward()
# factor = Forward()
# expression = Optional(oneOf("+ -")) + term + ZeroOrMore(oneOf("+ -") + term)
# term << (factor + ZeroOrMore(oneOf("* /") + factor))
# factor << (ident | number | LPAR + expression + RPAR)

#infixNotation defines operator precedence.
#Write the same operator on one line.
UNARY, BINARY, TERNARY = 1, 2, 3
factor = ident | number
expression = infixNotation(
    factor,
    [
        (oneOf("+ -"), UNARY, opAssoc.RIGHT),  #The code has the highest priority.
        (oneOf("* /"), BINARY, opAssoc.LEFT),  #Multiplication and division take precedence over addition and subtraction
        (oneOf("+ -"), BINARY, opAssoc.LEFT),
    ]
)

Rewrite the condition in the same way.

# 4. condition
#condition = ODD + expression | expression + oneOf('= # < <= > >=') + expression
condition = infixNotation(
    expression,
    [
        (ODD, UNARY, opAssoc.RIGHT),
        (oneOf("< <= > >="), BINARY, opAssoc.LEFT),
        (oneOf("= #"), BINARY, opAssoc.LEFT),
    ]
)

Execution result. You can see that the expression is enclosed in [].

['VAR', ['x', 'squ'], 'PROCEDURE', 'square', 'BEGIN', 'squ', ':=', ['x', '*', 'x'], 'END', 'BEGIN', 'x', ':=', '1', 'WHILE', ['x', '<=', '10'], 'DO', 'BEGIN', 'CALL', 'square', 'x', ':=', ['x', '+', '1'], 'END', 'END']

By the way, the four arithmetic operations can be parsed correctly.

>>> print expression.parseString('1 + 2 * 3 + 4')
[['1', '+', ['2', '*', '3'], '+', '4']]
>>> print expression.parseString('1 + 2 / 3 * 4 - -5')
[['1', '+', ['2', '/', '3', '*', '4'], '-', ['-', '5']]]

Summary

I made a syntax tree from a list of tokens. We also generated a syntax tree for expressions that takes into account some of the grammars defined in BNF and operator precedence. I implemented only a part because there is a method to directly generate AST in pyparsing, and even if it is implemented, it will be discarded. Next time, generate an AST that supports all grammars.

Source

pl0_parser.py


# -*- coding: utf-8 -*-
from pyparsing import *

LPAR, RPAR, COMMA, SEMICOLON, DOT = map(Suppress, "(),;.")

# 1. reserved keyword
(CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD) = map(CaselessKeyword,
"CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD".replace(",", "").split())
keyword = MatchFirst((CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD))

# 2. identifier
ident = ~keyword + Word(alphas, alphanums + "_")

# 3. expression
number = Regex(r"\d+(\.\d*)?([eE][+-]?\d+)?")
UNARY, BINARY, TERNARY = 1, 2, 3
factor = ident | number
expression = infixNotation(
    factor,
    [
        (oneOf("+ -"), UNARY, opAssoc.RIGHT),  #Code has the highest priority
        (oneOf("* /"), BINARY, opAssoc.LEFT),  #Multiplication and division take precedence over addition and subtraction
        (oneOf("+ -"), BINARY, opAssoc.LEFT),
    ]
)


# 4. condition
#condition = ODD + expression | expression + oneOf('= # < <= > >=') + expression
condition = infixNotation(
    expression,
    [
     	(ODD, UNARY, opAssoc.RIGHT),
	(oneOf("< <= > >="), BINARY, opAssoc.LEFT),
	(oneOf("= #"), BINARY, opAssoc.LEFT),
    ]
)

# 5. assignment
assign_statement = ident + ":=" + expression

# 6. call
call_statement = CALL + ident

# 7. if-then
statement = Forward()
if_statement = IF + condition + THEN + statement

# 8. while-do
while_statement = WHILE + condition + DO + statement

# 9. statement
statement << Optional(assign_statement
                      | call_statement
                      | BEGIN + statement + ZeroOrMore(SEMICOLON + statement) + END
                      | if_statement
                      | while_statement
)

# 10. const
const = CONST + Group(Group(ident + "=" + number) + ZeroOrMore(COMMA + ident + "=" + number)) + SEMICOLON

# 11. var
var = VAR + Group(ident + ZeroOrMore(COMMA + ident)) + SEMICOLON

# 12. procedure
block = Forward()
procedure = PROCEDURE + ident + SEMICOLON + block + SEMICOLON

# 13. block
block << Optional(const) + Optional(var) + ZeroOrMore(procedure) + statement

# 14. program
program = block + DOT

if __name__ == '__main__':
    import sys
    with open(sys.argv[1], 'r') as fp:
        txt = fp.read()
        print program.parseString(txt)



Recommended Posts

Compiler in Python: PL / 0 syntax tree
Compiler in Python: PL / 0 Abstract Syntax Tree (AST)
Compiler in Python: PL / 0 parser
Algorithm (segment tree) in Python (practice)
Output tree structure of files in Python
Differences in syntax between Python and Java
Draw a tree in Python 3 using graphviz
Delayed segment tree in Python (debug request)
Manipulate namespaced XML in Python (Element Tree)
Quadtree in Python --2
Python in optimization
Metaprogramming in Python
Python 3.3 in Anaconda
Geocoding in python
SendKeys in Python
Meta-analysis in Python
Unittest in python
Epoch in Python
Discord in Python
Sudoku in Python
nCr in python
N-Gram in Python
Programming in python
Plink in Python
Constant in python
with syntax (Python)
Lifegame in Python.
FizzBuzz in Python
Sqlite in python
StepAIC in Python
N-gram in python
LINE-Bot [0] in Python
Csv in python
Disassemble in Python
Reflection in Python
Constant in python
nCr in Python.
format in python
Scons in Python3
Puyo Puyo in python
python in virtualenv
PPAP in Python
Quad-tree in Python
Python syntax-control syntax
Reflection in Python
Chemistry in Python
Hashable in python
DirectLiNGAM in Python
LiNGAM in Python
Flatten in python
flatten in python
Sorted list in Python
Daily AtCoder # 36 in Python
Daily AtCoder # 2 in Python
Implement Enigma in python
Daily AtCoder # 32 in Python
Daily AtCoder # 18 in Python
Singleton pattern in Python
File operations in Python
Key input in Python
Daily AtCoder # 33 in Python