Compiler in Python: PL / 0 parser

Motivation

At work I needed to convert from one DSL to another, so I decided to write a compiler using pyparsing. First, in order to understand pyparsing, let's implement a parser using PL / 0 as an example.

What language is PL / 0?

PL / 0 is a language with a grammar similar to Pascal. It is a very small language specification for educational purposes. Actually, I also used it as a teaching material for compiler training when I was in my third year of college.

Below is a sample from Wikipedia.

ex1.pl0


VAR x, squ;

PROCEDURE square;
BEGIN
   squ := x * x
END;

BEGIN
   x := 1;
   WHILE x <= 10 DO
   BEGIN
      CALL square;
      ! squ;
      x := x + 1;
   END
END.

grammar

To write a parser, you need a grammar definition. One of the grammar notations is [BNF](http://en.wikipedia.org/wiki/%E3%83%90%E3%83%83%E3%82%AB%E3%82%B9%E3 % 83% BB% E3% 83% 8A% E3% 82% A6% E3% 82% A2% E8% A8% 98% E6% B3% 95). The PL / 0 (E) BNF below was obtained from the linked Wikipedia. BNF consists of terminal symbols and production rules.

The PL / 0 BNF on Wikipedia did not contain an ident, so I added it.

Reserved words are in lowercase, but you'll need to capitalize to parse the sample above.

pl0_bnf.txt


ident = alpha { alpha | number | '_' } .

program = block "." .

block = [ "const" ident "=" number {"," ident "=" number} ";"]
        [ "var" ident {"," ident} ";"]
        { "procedure" ident ";" block ";" } statement .
statement = [ ident ":=" expression | "call" ident |
            "begin" statement {";" statement } "end" |
            "if" condition "then" statement |
            "while" condition "do" statement ].
condition = "odd" expression |
            expression ("="|"#"|"<"|"<="|">"|">=") expression .
expression = [ "+"|"-"] term { ("+"|"-") term}.
term = factor {("*"|"/") factor}.
factor = ident | number | "(" expression ")".

Parser

It's hard to suddenly write a parser that completely satisfies the grammar. What should i do? It is a good idea to start near the terminal symbol, implement and test each minimum building block, and then combine and stack them.

Specifically, it is a reserved keyword, an identifier, an expression, a statement, a block, a declaration, and a program. Surprisingly, the basics such as variable declaration will be implemented near the end.

** It is very important to make a policy first. ** Implement a production rule If you find an unimplemented production rule while you're doing it, you're likely to get stuck in the depths of nested and recursive references if you implement it and go back and forth.

Let's implement it in the following order.

  1. keyword
  2. ident
  3. term, factor, expression
  4. condition
  5. assignment (ident: = expression)
  6. call
  7. if, then
  8. while, do
  9. statement, begin-end
  10. const
  11. var
  12. procedure
  13. block
  14. program

reserved keyword --reserved word

The reserved words for PL / 0 are const, var, procedure, call, begin, end, if, then, while, do, odd as far as they are read from BNF. pyparsing allows you to declare reserved words with Keyword (). Declare caseless reserved words with CaselessKeyword ().

CONST = CaselessKeyword('CONST')
VAR = CaselessKeyword('VAR')
 :

Instead of writing, it seems that it is recommended to write as follows. Isn't there a more DRY way of writing ...

from pyparsing import CaselessKeyword, MatchFirst
(CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD) = map(CaselessKeyword,
"CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD".replace(",","").split())

keyword = MatchFirst((CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD))

The result was recognized as CONST regardless of case.

>>> print keyword.parseString('CONST')
['CONST']
>>> print keyword.parseString('const')
['CONST']

identifier --identifier

The definition of an identifier generally starts with an alphabet, followed by an alphabet or number and _, so we will follow that. Also, reserved words cannot be identifiers, so write ~ keyword at the beginning.

from pyparsing import Word, alphas, alphanums
ident = ~keyword + Word(alphas, alphanums+"_")

As a result of parsing valid identifiers, invalid identifiers, and reserved words, only valid identifiers succeeded.

>>> print repr(ident.parseString('valid_id'))
(['valid_id'], {})
>>> print repr(ident.parseString('0123bad_id'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pyparsing.py", line 1041, in parseString
    raise exc
pyparsing.ParseException: Expected W:(abcd...,abcd...) (at char 0), (line:1, col:1)
>>> print repr(ident.parseString('CONST'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pyparsing.py", line 1041, in parseString
    raise exc
pyparsing.ParseException: Found unwanted token, {"CONST" | "VAR" | "PROCEDURE" | "CALL" | "BEGIN" | "END" | "IF" | "THEN" | "WHILE" | "DO" | "ODD"} (at char 0), (line:1, col:1)

expression --expression

The PL / 0 formula is complete with the following BNF:

expression = [ "+"|"-"] term { ("+"|"-") term}.
term = factor {("*"|"/") factor}.
factor = ident | number | "(" expression ")"

The implementation of pyparsing looks like this: When the production rule is recursively referenced, use Forward to prepare the box first. Use << to set the production rule later. Using oneOf means that the operators have the same precedence.

<<Inadvertently with the usual glue=If you use, you will get an error of unknown cause and you will be troubled with debugging. Also<<When using, the right side()Let's wrap it up with.x << y | zWhenx << (y | z)は演算子の優先順位に起因して計算結果がこWhenななるため。当然利用したいのは後者です。これもパース時に原因不明のエラーWhenなり、デバッグに悩まされます。pyparsingの仕様なので、そういうものだWhen思って気をつけるしかないです。

number = Regex(r"\d+(\.\d*)?([eE][+-]?\d+)?")
term = Forward()
factor = Forward()
expression = Optional(oneOf("+ -")) + term + ZeroOrMore( oneOf("+ -") + term)
term << factor + ZeroOrMore(oneOf("* /") + factor)
factor << ident | number | "(" + expression + ")"

Let's test it. It seems that it is parsed correctly.

>>> expression.parseString('123')
(['123'], {})
>>> expression.parseString('123+456')
(['123', '+', '456'], {})
>>> expression.parseString('(x+y)*z')
(['(', 'x', '+', 'y', ')', '*', 'z'], {})

condition --conditional expression

PL / 0 has a unary operator called "odd" for some reason. Other than that, it's just ordinary binary operators. pyparsing has a mechanism for generating a conditional expression parser called infixNotation, but here we will implement it faithfully to BNF.

condition = "odd" expression |
            expression ("="|"#"|"<"|"<="|">"|">=") expression .

The implementation is as follows:

condition = ODD + expression | expression + oneOf('= # < <= > >=') + expression

test

>>> condition.parseString('odd 1')
(['ODD', '1'], {})
>>> condition.parseString('3 <= 1')
(['3', '<=', '1'], {})

assign --assignment statement

The BNF for assignment is ʻident ":=" expression`. It's easy because both ident and expression are already implemented.

assign_statement = ident + ":=" + expression

call --Procedure call statement

Since the same thing is repeated, we will only implement it from here.

call_statement = CALL + ident

if-then --IF statement

Define an IF statement. The statement that represents the whole sentence has not appeared yet, so I will declare it in Forward.

statement = Forward()
if_statement = IF + condition + THEN + statement

while-do --WHILE statement

while_statement = WHILE + condition + DO + statement

statement --statement

Now that we have all the parts, let's finally define the sentence generation rules. BNF is as follows.

statement = [ ident ":=" expression | "call" ident |
            "begin" statement {";" statement } "end" |
            "if" condition "then" statement |
            "while" condition "do" statement ].

Implement with pyparsing as it is.

statement = Optional(assign_statement
                   | call_statement
                   | BEGIN + statement + ZeroOrMore(";" + statement) + END
                   | if_statement
                   | while_statement
                   )

const --constant declaration

I'm getting tired, so I just implemented it.

const = CONST + ident + "=" + number + ZeroOrMore("," + ident + "=" + number) + ";"

var --Variable declaration

var = VAR + ident + ZeroOrMore("," + ident) + ";"

procedure --Procedure declaration

Focus only on this part of BNF " procedure "ident"; "block"; ". The outer repetition etc. is performed when the block is implemented.

block = Forward()
procedure = PROCEDURE + ident + ";" + block + ";"

block

block << Optional(const) + Optional(var) + ZeroOrMore(procedure) + statement

program This is the top-level production rule. Thank you for your hard work.

program = block + "."

test

Let's parse the sample program posted at the beginning.

$ python pl0_parser.py ex1.pl0
Traceback (most recent call last):
  File "pl0_parser.py", line 64, in <module>
    print program.parseString(txt)
  File "/usr/lib/python2.7/dist-packages/pyparsing.py", line 1041, in parseString
    raise exc
pyparsing.ParseException: Expected "." (at char 59), (line:8, col:1)

It doesn't pass. Why? Since it says line: 8, col: 1, BEGIN-END cannot be parsed, and it seems that he is expecting a "." At the end of the program. Can't you get a little more kind error around here? I will test it only with the statement parser for confirmation.

>>> statement.parseString('''\
... BEGIN
...    x := 1;
...    WHILE x <= 10 DO
...    BEGIN
...       CALL square;
...       ! squ;
...       x := x + 1;
...    END
... END
... ''')
([], {})

The result is empty and certainly fails. Failures that do not crash are cumbersome to debug. If you look closely, you will find an unfamiliar sentence, ! Squ;. This is an extended PL / 0 syntax and is not defined in the BNF of this implementation. Delete it and run it again.

$ python pl0_parser.py ex1.pl0
['VAR', 'x', ',', 'squ', ';', 'PROCEDURE', 'square', ';', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', ';', 'BEGIN', 'x', ':=', '1', ';', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', ';', 'x', ':=', 'x', '+', '1', ';', 'END', 'END', '.']

It went well!

Next time, I would like to make a syntax tree (AST) in earnest and do code generation.

Source code

Finally, I will post the source of the parser that I have implemented so far.

pl0_parser.py


from pyparsing import CaselessKeyword, MatchFirst, Word, alphas, alphanums, Forward, Optional, oneOf, ZeroOrMore, Regex


# 1. reserved keyword
(CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD) = map(CaselessKeyword,
"CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD".replace(",", "").split())
keyword = MatchFirst((CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD))

# 2. identifier
ident = ~keyword + Word(alphas, alphanums + "_")

# 3. expression
number = Regex(r"\d+(\.\d*)?([eE][+-]?\d+)?")
term = Forward()
factor = Forward()
expression = Optional(oneOf("+ -")) + term + ZeroOrMore(oneOf("+ -") + term)
term << (factor + ZeroOrMore(oneOf("* /") + factor))
factor << (ident | number | "(" + expression + ")")

# 4. condition
condition = ODD + expression | expression + oneOf('= # < <= > >=') + expression

# 5. assignment
assign_statement = ident + ":=" + expression

# 6. call
call_statement = CALL + ident

# 7. if-then
statement = Forward()
if_statement = IF + condition + THEN + statement

# 8. while-do
while_statement = WHILE + condition + DO + statement

# 9. statement
statement << Optional(assign_statement
                      | call_statement
                      | BEGIN + statement + ZeroOrMore(";" + statement) + END
                      | if_statement
                      | while_statement
)

# 10. const
const = CONST + ident + "=" + number + ZeroOrMore("," + ident + "=" + number) + ";"

# 11. var
var = VAR + ident + ZeroOrMore("," + ident) + ";"

# 12. procedure
block = Forward()
procedure = PROCEDURE + ident + ";" + block + ";"

# 13. block
block << Optional(const) + Optional(var) + ZeroOrMore(procedure) + statement

# 14. program
program = block + "."

if __name__ == '__main__':
    import sys
    with open(sys.argv[1], 'r') as fp:
        txt = fp.read()
        print program.parseString(txt)

Recommended Posts

Compiler in Python: PL / 0 parser
Compiler in Python: PL / 0 syntax tree
Compiler in Python: PL / 0 Abstract Syntax Tree (AST)
Quadtree in Python --2
Python in optimization
CURL in python
Metaprogramming in Python
python radius parser
Python 3.3 in Anaconda
Geocoding in python
SendKeys in Python
Meta-analysis in Python
Unittest in python
Epoch in Python
Sudoku in Python
DCI in Python
quicksort in python
nCr in python
N-Gram in Python
Programming in python
Plink in Python
Constant in python
Lifegame in Python.
FizzBuzz in Python
Sqlite in python
StepAIC in Python
N-gram in python
LINE-Bot [0] in Python
Csv in python
Disassemble in Python
Reflection in Python
Constant in python
nCr in Python.
format in python
PPAP in Python
Quad-tree in Python
Reflection in Python
Chemistry in Python
Hashable in python
DirectLiNGAM in Python
LiNGAM in Python
Flatten in python
flatten in python
Sorted list in Python
Daily AtCoder # 36 in Python
Clustering text in Python
Daily AtCoder # 2 in Python
Implement Enigma in python
Daily AtCoder # 32 in Python
Daily AtCoder # 6 in Python
Daily AtCoder # 18 in Python
Edit fonts in Python
File operations in Python
Read DXF in python
Daily AtCoder # 53 in Python
Key input in Python
Use config.ini in Python
Daily AtCoder # 33 in Python
Solve ABC168D in Python
Logistic distribution in Python
Daily AtCoder # 7 in Python