The general flow of creating a compiler is as follows, and previous was done up to number 1. This time, we will create the second syntax tree (only a part).
With pyparsing, you don't have to keep symbols such as';' in the parsing result because of the structuring feature that comes later. If you use Suppress (), the token will not be included in the result. If you do not suppress it first, it will be difficult to see the result, so let's start with this topic.
I will repost the parse results up to the last time.
['VAR', 'x', ',', 'squ', ';', 'PROCEDURE', 'square', ';', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', ';', 'BEGIN', 'x', ':=', '1', ';', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', ';', 'x', ':=', 'x', '+', '1', ';', 'END', 'END', '.']
Let's suppress (),;. And check the effect.
#Add to the beginning.
LPAR, RPAR, COMMA, SEMICOLON, DOT = map(Suppress, "(),;.")
#Replace strings with constants
# before
# factor << (ident | number | "(" + expression + ")")
# after
factor << (ident | number | LPAR + expression + RPAR)
#Similarly, ; .Also replace with a constant
Let's parse the sample program as before. The symbol has disappeared.
$ python pl0_parser.py ex1.pl0
['VAR', 'x', 'squ', 'PROCEDURE', 'square', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', 'BEGIN', 'x', ':=', '1', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', 'x', ':=', 'x', '+', '1', 'END', 'END']
The perspective results we have seen so far are one-dimensional arrays and have no structure. Let's create a structure by adding group information to the parser.
To give a static structure, use Group () to combine the tokens into one. Let's look at a concrete example. To group the list of variables, make the following changes:
# 11. var
var = VAR + Group(ident + ZeroOrMore(COMMA + ident)) + SEMICOLON
This is the result of parsing again. The first variable list is in parentheses!
['VAR', ['x', 'squ'], 'PROCEDURE', 'square', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', 'BEGIN', 'x', ':=', '1', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', 'x', ':=', 'x', '+', '1', 'END', 'END']
Then, why not enclose it in Group ()? But in pyparsing, there is another conventional way to use .setParseAction.
With pyparsing it is possible to call a callback when the parser succeeds in parsing. A list of tokens is passed to this callback and you can replace the tokens with the return value. Now let's put the variable declaration in [] as before.
# 11. var
def var_list(tokens):
tokens = tokens.asList()
return [tokens[0], tokens[1:]]
var = VAR + ident + ZeroOrMore(COMMA + ident) + SEMICOLON
var.setParseAction(var_list)
When I parsed it, I got the same result as before.
['VAR', ['x', 'squ'], 'PROCEDURE', 'square', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', 'BEGIN', 'x', ':=', '1', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', 'x', ':=', 'x', '+', '1', 'END', 'END']
Instead of calling the function, try calling the constructor of the class. If you preprocess with Group (), you can write the class neatly.
# 11. var
class Var(object):
def __init__(self, tokens):
tokens = tokens.asList()
self.variables = tokens[1]
var = VAR + Group(ident + ZeroOrMore(COMMA + ident)) + SEMICOLON
var.setParseAction(Var)
result. Oh! The object has entered. This allows you to ** generate an AST directly when you run the parser **.
[<__main__.Var object at 0x10d418710>, 'PROCEDURE', 'square', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', 'BEGIN', 'x', ':=', '1', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', 'x', ':=', 'x', '+', '1', 'END', 'END']
The callback can be a function or a class as long as it can be called. Be careful as the meaning of the arguments depends on the number of arguments.
original_string = Original string currently being parsed location = location of the matched string tokens = an array of matched tokens. The token is a ParseResults object
Expressions have a complicated order of operations, but in the end they are basically terms and operators. Nesting is mandatory considering operator precedence. It's a hassle to write by yourself, but in pyparsing there is a utility called ʻinfixNotation`, which automatically creates a parser just by defining the order of the operators. Delete the parser based on the original BNF.
The sign that precedes a term is the unary operator, and the usual four-rule operation is the binary operator. The comparison operator (<<= >> = = #) is also a companion to the binary operator. Let's actually define it.
Using oneOf means that the operators are the same. Note that if you inadvertently write this in two lines, the result will violate the rule that "same operators are calculated from the left".
opAssoc.RIGHT and LEFT indicate whether the operator is tied to the right or left side. The operator # means! = In PL / 0.
# term = Forward()
# factor = Forward()
# expression = Optional(oneOf("+ -")) + term + ZeroOrMore(oneOf("+ -") + term)
# term << (factor + ZeroOrMore(oneOf("* /") + factor))
# factor << (ident | number | LPAR + expression + RPAR)
#infixNotation defines operator precedence.
#Write the same operator on one line.
UNARY, BINARY, TERNARY = 1, 2, 3
factor = ident | number
expression = infixNotation(
factor,
[
(oneOf("+ -"), UNARY, opAssoc.RIGHT), #The code has the highest priority.
(oneOf("* /"), BINARY, opAssoc.LEFT), #Multiplication and division take precedence over addition and subtraction
(oneOf("+ -"), BINARY, opAssoc.LEFT),
]
)
Rewrite the condition in the same way.
# 4. condition
#condition = ODD + expression | expression + oneOf('= # < <= > >=') + expression
condition = infixNotation(
expression,
[
(ODD, UNARY, opAssoc.RIGHT),
(oneOf("< <= > >="), BINARY, opAssoc.LEFT),
(oneOf("= #"), BINARY, opAssoc.LEFT),
]
)
Execution result. You can see that the expression is enclosed in [].
['VAR', ['x', 'squ'], 'PROCEDURE', 'square', 'BEGIN', 'squ', ':=', ['x', '*', 'x'], 'END', 'BEGIN', 'x', ':=', '1', 'WHILE', ['x', '<=', '10'], 'DO', 'BEGIN', 'CALL', 'square', 'x', ':=', ['x', '+', '1'], 'END', 'END']
By the way, the four arithmetic operations can be parsed correctly.
>>> print expression.parseString('1 + 2 * 3 + 4')
[['1', '+', ['2', '*', '3'], '+', '4']]
>>> print expression.parseString('1 + 2 / 3 * 4 - -5')
[['1', '+', ['2', '/', '3', '*', '4'], '-', ['-', '5']]]
I made a syntax tree from a list of tokens. We also generated a syntax tree for expressions that takes into account some of the grammars defined in BNF and operator precedence. I implemented only a part because there is a method to directly generate AST in pyparsing, and even if it is implemented, it will be discarded. Next time, generate an AST that supports all grammars.
pl0_parser.py
# -*- coding: utf-8 -*-
from pyparsing import *
LPAR, RPAR, COMMA, SEMICOLON, DOT = map(Suppress, "(),;.")
# 1. reserved keyword
(CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD) = map(CaselessKeyword,
"CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD".replace(",", "").split())
keyword = MatchFirst((CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD))
# 2. identifier
ident = ~keyword + Word(alphas, alphanums + "_")
# 3. expression
number = Regex(r"\d+(\.\d*)?([eE][+-]?\d+)?")
UNARY, BINARY, TERNARY = 1, 2, 3
factor = ident | number
expression = infixNotation(
factor,
[
(oneOf("+ -"), UNARY, opAssoc.RIGHT), #Code has the highest priority
(oneOf("* /"), BINARY, opAssoc.LEFT), #Multiplication and division take precedence over addition and subtraction
(oneOf("+ -"), BINARY, opAssoc.LEFT),
]
)
# 4. condition
#condition = ODD + expression | expression + oneOf('= # < <= > >=') + expression
condition = infixNotation(
expression,
[
(ODD, UNARY, opAssoc.RIGHT),
(oneOf("< <= > >="), BINARY, opAssoc.LEFT),
(oneOf("= #"), BINARY, opAssoc.LEFT),
]
)
# 5. assignment
assign_statement = ident + ":=" + expression
# 6. call
call_statement = CALL + ident
# 7. if-then
statement = Forward()
if_statement = IF + condition + THEN + statement
# 8. while-do
while_statement = WHILE + condition + DO + statement
# 9. statement
statement << Optional(assign_statement
| call_statement
| BEGIN + statement + ZeroOrMore(SEMICOLON + statement) + END
| if_statement
| while_statement
)
# 10. const
const = CONST + Group(Group(ident + "=" + number) + ZeroOrMore(COMMA + ident + "=" + number)) + SEMICOLON
# 11. var
var = VAR + Group(ident + ZeroOrMore(COMMA + ident)) + SEMICOLON
# 12. procedure
block = Forward()
procedure = PROCEDURE + ident + SEMICOLON + block + SEMICOLON
# 13. block
block << Optional(const) + Optional(var) + ZeroOrMore(procedure) + statement
# 14. program
program = block + DOT
if __name__ == '__main__':
import sys
with open(sys.argv[1], 'r') as fp:
txt = fp.read()
print program.parseString(txt)
Recommended Posts