[PYTHON] [Translation] SLY (Sly Lex Yacc)

The documentation for SLY, a library for generating lexical analyzers and parsers in Python, has been translated into Japanese. The original is here: https://sly.readthedocs.io/en/latest/sly.html

For terminology choices, I referred to the Japanese translations of Compilers: Principles, Techniques, and Tools I & II (first edition) and of the bison and flex manuals. Thank you very much.

SLY (Sly Lex Yacc)

This document provides an overview of lexical analysis and parsing with SLY. Given the complexity of parsing, it is strongly recommended that you read (or at least skim) this entire document before starting any major development with SLY.

SLY requires Python 3.6 or higher. If you're using an older version, you're out of luck. Sorry.

Preface

SLY is a library for writing parsers and compilers. It is modeled after the traditional compiler construction tools lex and yacc and implements the same LALR(1) parsing algorithm that they use. Most of the features available in lex and yacc are also available in SLY. Note that SLY does not provide many extra conveniences (e.g., automatic construction of abstract syntax trees or tree traversal), and it should not be viewed as a parsing framework. Instead, you will find a bare-bones yet fully capable library for writing parsers in Python.

The rest of this document assumes that you are somewhat familiar with parsing theory, syntax-directed translation, and the use of lex/yacc-style compiler construction tools in other languages. If you are unfamiliar with these subjects, you should read an introductory text such as "Compilers: Principles, Techniques, and Tools" by Aho, Sethi, and Ullman. John Levine's "Lex and Yacc" from O'Reilly is also inexpensive, and since it covers substantially the same concepts, it can serve as a reference for SLY as well.

Overview of SLY

SLY provides two independent classes, Lexer and Parser. The Lexer class is used to split input text into a stream of tokens specified by regular expression rules. The Parser class is used to recognize language syntax written in the form of a context-free grammar. The two classes are usually used together to build a parser, but this is not a strict requirement, and there is a great deal of flexibility. The basics are covered in the next two parts.

Writing a lexical analyzer

Suppose you are writing a programming language and want to parse the following string:

x = 3 + 42 * (s - t)

The first step in parsing is to split the text into tokens, each of which has a type and a value. The text above can be represented by the following list of token tuples:

[ ('ID','x'), ('EQUALS','='), ('NUMBER','3'),
  ('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
  ('LPAREN','('), ('ID','s'), ('MINUS','-'),
  ('ID','t'), ('RPAREN',')') ]

The Lexer class in SLY does this work. Here is a simple lexer that tokenizes the text above:

# calclex.py

from sly import Lexer

class CalcLexer(Lexer):
    # Set of token names.   This is always required
    tokens = { ID, NUMBER, PLUS, MINUS, TIMES,
               DIVIDE, ASSIGN, LPAREN, RPAREN }

    # String containing ignored characters between tokens
    ignore = ' \t'

    # Regular expression rules for tokens
    ID      = r'[a-zA-Z_][a-zA-Z0-9_]*'
    NUMBER  = r'\d+'
    PLUS    = r'\+'
    MINUS   = r'-'
    TIMES   = r'\*'
    DIVIDE  = r'/'
    ASSIGN  = r'='
    LPAREN  = r'\('
    RPAREN  = r'\)'

if __name__ == '__main__':
    data = 'x = 3 + 42 * (s - t)'
    lexer = CalcLexer()
    for tok in lexer.tokenize(data):
        print('type=%r, value=%r' % (tok.type, tok.value))

Doing this produces the following output:

type='ID', value='x'
type='ASSIGN', value='='
type='NUMBER', value='3'
type='PLUS', value='+'
type='NUMBER', value='42'
type='TIMES', value='*'
type='LPAREN', value='('
type='ID', value='s'
type='MINUS', value='-'
type='ID', value='t'
type='RPAREN', value=')'

A lexer has only one public method, tokenize(). This is a generator function that produces a stream of Token instances. The type and value attributes of a Token hold the token type name and the token value, respectively.

Set of tokens

A lexer must specify, in the tokens set, all of the token type names that it may produce. This is always required and is used in various validation checks.

Here is an example of how token names are specified:

class CalcLexer(Lexer):
    ...
    # Set of token names.   This is always required
    tokens = { ID, NUMBER, PLUS, MINUS, TIMES,
               DIVIDE, ASSIGN, LPAREN, RPAREN }
    ...

Token names should be specified using all uppercase letters.

Token matching patterns

Tokens are specified by writing regular expression rules compatible with the re module. The name of each rule must match one of the names in the tokens set. For example:

PLUS = r'\+'
MINUS = r'-'

Regular expression patterns are compiled with the re.VERBOSE flag, which can be used to improve readability. In this mode, unescaped whitespace is ignored and comments are allowed. To include whitespace in a pattern, use \s. To match the # character, use [#] or \#.
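
As a minimal sketch of what verbose mode allows (this FLOAT rule is hypothetical, is not part of the calculator lexer above, and would also need to be listed in the tokens set), a pattern can be spread over several lines with comments:

# Hypothetical FLOAT rule written in verbose style: whitespace inside the
# pattern is ignored and comments are permitted, so \s would be needed to
# match a literal space character.
FLOAT = r'''
    \d+ \. \d+      # digits, a dot, then more digits
'''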

Tokens are matched in the order that the patterns are listed in the Lexer class. Longer tokens must always be specified before shorter ones. For example, to distinguish between = and ==, == must be listed first:

class MyLexer(Lexer):
    tokens = { ASSIGN, EQ, ...}
    ...
    EQ     = r'=='       # MUST APPEAR FIRST! (LONGER)
    ASSIGN = r'='

Discarded text

The special ignore specification holds a set of single characters that should be ignored in the input stream. It is usually used to skip whitespace and other non-essential characters. Characters listed in ignore are not ignored when they appear within a regular expression pattern; for example, if you had a rule for quoted text, that pattern may include characters from the ignore set. The purpose of ignore is mainly to skip whitespace and other padding between the tokens that you actually want to parse.

Other text patterns can be discarded by writing regular expression rules whose names are prefixed with ignore_. For example, the following lexer has rules that ignore comments and newlines:

# calclex.py

from sly import Lexer

class CalcLexer(Lexer):
    ...
    # String containing ignored characters (between tokens)
    ignore = ' \t'

    # Other ignored patterns
    ignore_comment = r'\#.*'
    ignore_newline = r'\n+'
    ...

if __name__ == '__main__':
    data = '''x = 3 + 42
                * (s    # This is a comment
                    - t)'''
    lexer = CalcLexer()
    for tok in lexer.tokenize(data):
        print('type=%r, value=%r' % (tok.type, tok.value))

Adding match actions

When certain tokens are matched, you may want to perform an additional action beyond the match itself, such as converting a numeric literal or looking up a reserved word. One way to do this is to write the action as a method and attach the matching regular expression to it with the @_() decorator:

@_(r'\d+')
def NUMBER(self, t):
    t.value = int(t.value)   # Convert to a numeric value
    return t

The method takes a single argument, which is an instance of type Token. By default, t.type is set to the token name (e.g., 'NUMBER'). If necessary, the function can change the token type and value. Finally, the processed token must be returned; if the function returns nothing, the token is discarded and the next token is read.

The @_() decorator is defined automatically within the Lexer class, so no import is needed. A rule may also have more than one regular expression. For example:

@_(r'0x[0-9a-fA-F]+',
   r'\d+')
def NUMBER(self, t):
    if t.value.startswith('0x'):
        t.value = int(t.value[2:], 16)
    else:
        t.value = int(t.value)
    return t

Instead of using the @_() decorator, you can also write a method with the same name as a token that was previously specified as a string. For example:

NUMBER = r'\d+'
...
def NUMBER(self, t):
    t.value = int(t.value)
    return t

This technique can be useful for debugging lexical analyzers. You can temporarily associate a method with a token and have it executed when the token appears. When you're done, you can remove the method and restore the behavior of the lexical analyzer.
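
For instance, a throwaway debugging method might be attached to an existing PLUS string rule purely to print a message (a sketch only; the printed format is arbitrary):

PLUS = r'\+'
...
def PLUS(self, t):
    # Temporary debugging hook: report the match, then return the token
    # unchanged so lexing proceeds normally.
    print('Matched PLUS at index', t.index)
    return t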

Token remapping

Occasionally, you may need to remap tokens based on special cases. Consider matching identifiers such as "abc", "python", or "guido". Certain identifiers such as "if", "else", and "while" should probably be treated as special keywords. This can be handled by including token remapping rules in the lexer:

# calclex.py

from sly import Lexer

class CalcLexer(Lexer):
    tokens = { ID, IF, ELSE, WHILE }
    # String containing ignored characters (between tokens)
    ignore = ' \t'

    # Base ID rule
    ID = r'[a-zA-Z_][a-zA-Z0-9_]*'

    # Special cases
    ID['if'] = IF
    ID['else'] = ELSE
    ID['while'] = WHILE

When lexing an identifier, these special cases change the token type of the matched value to a new one. In the example above, an IF token is generated when the value of the identifier is "if".

Tracking line numbers and positions

By default, the lexer knows nothing about line numbers, because it is not told what constitutes a "line" of input (for example, the newline character, or even whether the input is text at all). To supply this information, you can add a special rule for newlines. Let's do this with an ignore_newline special case:

# Define a rule so we can track line numbers
@_(r'\n+')
def ignore_newline(self, t):
    self.lineno += len(t.value)

Within the special case, the lineno attribute of the lexer is updated. After the line number is updated, the token is discarded, since nothing is returned.

The lexer performs no automatic column tracking. Instead, it records the position of each token in the token's index attribute. From this, the column can usually be computed as a separate step, for example, by searching backwards until the most recent newline is found:

# Compute column.
#     input is the input text string
#     token is a token instance
def find_column(text, token):
    last_cr = text.rfind('\n', 0, token.index)
    if last_cr < 0:
        last_cr = 0
    column = (token.index - last_cr) + 1
    return column

Since column information is usually only needed in the context of error handling, it can be computed on demand rather than for every token.
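
For example, a hedged sketch of computing the column only inside the lexer's error() handler might look like the following. It relies on the find_column() helper above and assumes the caller stores the input text on the lexer as self.text before calling tokenize(); that attribute is our own convention, not part of SLY.

class MyLexer(Lexer):
    ...
    def error(self, t):
        # The column is computed only when an error is actually reported.
        col = find_column(self.text, t)
        print("Illegal character %r at line %d, column %d" %
              (t.value[0], self.lineno, col))
        self.index += 1

The caller would then set lexer.text = data before iterating over lexer.tokenize(data).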

Literal characters

Literal characters can be defined in your class by declaring a literals set. For example:

class MyLexer(Lexer):
    ...
    literals = { '+','-','*','/' }
    ...

A literal character is simply a single character that is returned "as is" when encountered by the lexer. Literals are checked after all of the defined regular expression rules, so rules whose patterns begin with one of the literal characters take precedence.

When a literal token is returned, both its type and value attributes are set to the character itself, e.g., '+'.

It is possible to write token methods that perform additional actions when literals are matched. However, those methods must set the token type appropriately. For example:

class MyLexer(Lexer):

     literals = { '{', '}' }

     def __init__(self):
         self.nesting_level = 0

     @_(r'\{')
     def lbrace(self, t):
         t.type = '{'      # Set token type to the expected literal
         self.nesting_level += 1
         return t

     @_(r'\}')
     def rbrace(self, t):
         t.type = '}'      # Set token type to the expected literal
         self.nesting_level -=1
         return t

Error handling

If a bad character is encountered during lexing, tokenizing stops. However, you can add an error() method to handle lexing errors. The error method receives a Token whose value attribute contains all of the remaining untokenized text. A typical handler looks at this text and skips ahead in some manner. For example:

class MyLexer(Lexer):
    ...
    # Error handling rule
    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

In this case, the offending character is printed and the lexing position is advanced by one character to skip it. Error handling in a lexer is often a difficult problem; an error handler may need to skip ahead to a logically identifiable synchronization point such as a semicolon, a blank line, or some similar landmark.

If the error() method returns the passed token, an ERROR token will appear in the token stream. This may be useful if the parser wants to see error tokens, for example, to improve error messages or to perform some other kind of error handling.
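
A minimal sketch of that idea, based on the error handler shown above, just returns the token after skipping the offending character:

class MyLexer(Lexer):
    ...
    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1
        return t      # An ERROR token now appears in the token stream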

A more complete example

For reference, here is a more complete example that puts many of these concepts into practice:

# calclex.py

from sly import Lexer

class CalcLexer(Lexer):
    # Set of token names.   This is always required
    tokens = { NUMBER, ID, WHILE, IF, ELSE, PRINT,
               PLUS, MINUS, TIMES, DIVIDE, ASSIGN,
               EQ, LT, LE, GT, GE, NE }


    literals = { '(', ')', '{', '}', ';' }

    # String containing ignored characters
    ignore = ' \t'

    # Regular expression rules for tokens
    PLUS    = r'\+'
    MINUS   = r'-'
    TIMES   = r'\*'
    DIVIDE  = r'/'
    EQ      = r'=='
    ASSIGN  = r'='
    LE      = r'<='
    LT      = r'<'
    GE      = r'>='
    GT      = r'>'
    NE      = r'!='

    @_(r'\d+')
    def NUMBER(self, t):
        t.value = int(t.value)
        return t

    # Identifiers and keywords
    ID = r'[a-zA-Z_][a-zA-Z0-9_]*'
    ID['if'] = IF
    ID['else'] = ELSE
    ID['while'] = WHILE
    ID['print'] = PRINT

    ignore_comment = r'\#.*'

    # Line number tracking
    @_(r'\n+')
    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')

    def error(self, t):
        print('Line %d: Bad character %r' % (self.lineno, t.value[0]))
        self.index += 1

if __name__ == '__main__':
    data = '''
# Counting
x = 0;
while (x < 10) {
    print x:
    x = x + 1;
}
'''
    lexer = CalcLexer()
    for tok in lexer.tokenize(data):
        print(tok)

When I run this code, I get the following output:

Token(type='ID', value='x', lineno=3, index=20)
Token(type='ASSIGN', value='=', lineno=3, index=22)
Token(type='NUMBER', value=0, lineno=3, index=24)
Token(type=';', value=';', lineno=3, index=25)
Token(type='WHILE', value='while', lineno=4, index=31)
Token(type='(', value='(', lineno=4, index=37)
Token(type='ID', value='x', lineno=4, index=38)
Token(type='LT', value='<', lineno=4, index=40)
Token(type='NUMBER', value=10, lineno=4, index=42)
Token(type=')', value=')', lineno=4, index=44)
Token(type='{', value='{', lineno=4, index=46)
Token(type='PRINT', value='print', lineno=5, index=56)
Token(type='ID', value='x', lineno=5, index=62)
Line 5: Bad character ':'
Token(type='ID', value='x', lineno=6, index=73)
Token(type='ASSIGN', value='=', lineno=6, index=75)
Token(type='ID', value='x', lineno=6, index=77)
Token(type='PLUS', value='+', lineno=6, index=79)
Token(type='NUMBER', value=1, lineno=6, index=81)
Token(type=';', value=';', lineno=6, index=82)
Token(type='}', value='}', lineno=7, index=88)

Study this example closely; it may take a bit of time to absorb, but it contains the essentials of writing a lexer. Tokens must be specified with regular expression rules, which can optionally carry actions that execute when a pattern is matched. Some features, such as literal characters, save you from writing separate regular expression rules, and error handling can be added as well.

Writing a parser

The Parser class is used to parse language syntax. Before showing an example, there is some important background to cover.

Parsing background

When writing a parser, the syntax is usually specified in terms of a BNF grammar. For example, to parse simple arithmetic expressions, you might first write an unambiguous grammar specification like this:

expr       : expr + term
           | expr - term
           | term

term       : term * factor
           | term / factor
           | factor

factor     : NUMBER
           | ( expr )

In the grammar, symbols such as NUMBER, +, -, *, and / are known as terminals and correspond to raw input tokens. Identifiers such as term and factor refer to grammar rules composed of collections of terminals and other rules; these identifiers are known as non-terminals. Splitting the grammar into multiple levels (expr, term, and so on) encodes the operator precedence rules; in this example, multiplication and division take precedence over addition and subtraction.

The semantics of parsing are often specified with a technique known as syntax-directed translation. In syntax-directed translation, each symbol in the grammar becomes a kind of object: values are assigned to the symbols as the various grammar rules are recognized, and actions are carried out on those values. Given the expression grammar above, the computation performed by a simple calculator can be described as follows:

Grammar                   Action
------------------------  --------------------------------
expr0   : expr1 + term    expr0.val = expr1.val + term.val
        | expr1 - term    expr0.val = expr1.val - term.val
        | term            expr0.val = term.val

term0   : term1 * factor  term0.val = term1.val * factor.val
        | term1 / factor  term0.val = term1.val / factor.val
        | factor          term0.val = factor.val

factor  : NUMBER          factor.val = int(NUMBER.val)
        | ( expr )        factor.val = expr.val

In this grammar, new values enter through the NUMBER token and are propagated by the actions shown above. For example, factor.val = int(NUMBER.val) propagates the value of NUMBER to factor, and term0.val = factor.val propagates the value of factor to term. Rules such as expr0.val = expr1.val + term.val combine values and propagate them further. The following shows how the values propagate through the expression 2 + 3 * 4:

NUMBER.val=2 + NUMBER.val=3 * NUMBER.val=4    # NUMBER -> factor
factor.val=2 + NUMBER.val=3 * NUMBER.val=4    # factor -> term
term.val=2 + NUMBER.val=3 * NUMBER.val=4      # term -> expr
expr.val=2 + NUMBER.val=3 * NUMBER.val=4      # NUMBER -> factor
expr.val=2 + factor.val=3 * NUMBER.val=4      # factor -> term
expr.val=2 + term.val=3 * NUMBER.val=4        # NUMBER -> factor
expr.val=2 + term.val=3 * factor.val=4        # term * factor -> term
expr.val=2 + term.val=12                      # expr + term -> expr
expr.val=14

SLY uses a parsing technique known as LR parsing or shift-reduce parsing. LR parsing is a bottom-up technique that tries to recognize the right-hand sides of the various grammar rules. Whenever a valid right-hand side is found in the input, the corresponding action method is executed and the grammar symbols on the right-hand side are replaced by the grammar symbol on the left-hand side.

LR parsing is commonly implemented by shifting grammar symbols onto a stack and checking whether the stack and the next input token match one of the grammar rules. The details of the algorithm can be found in a compiler textbook. The following example shows the steps taken when parsing the expression 3 + 5 * (10 - 20) with the grammar defined above. The special symbol $ represents the end of the input.

Step Symbol Stack           Input Tokens            Action
---- ---------------------  ---------------------   -------------------------------
1                           3 + 5 * ( 10 - 20 )$    Shift 3
2    3                        + 5 * ( 10 - 20 )$    Reduce factor : NUMBER
3    factor                   + 5 * ( 10 - 20 )$    Reduce term   : factor
4    term                     + 5 * ( 10 - 20 )$    Reduce expr : term
5    expr                     + 5 * ( 10 - 20 )$    Shift +
6    expr +                     5 * ( 10 - 20 )$    Shift 5
7    expr + 5                     * ( 10 - 20 )$    Reduce factor : NUMBER
8    expr + factor                * ( 10 - 20 )$    Reduce term   : factor
9    expr + term                  * ( 10 - 20 )$    Shift *
10   expr + term *                  ( 10 - 20 )$    Shift (
11   expr + term * (                  10 - 20 )$    Shift 10
12   expr + term * ( 10                  - 20 )$    Reduce factor : NUMBER
13   expr + term * ( factor              - 20 )$    Reduce term : factor
14   expr + term * ( term                - 20 )$    Reduce expr : term
15   expr + term * ( expr                - 20 )$    Shift -
16   expr + term * ( expr -                20 )$    Shift 20
17   expr + term * ( expr - 20                )$    Reduce factor : NUMBER
18   expr + term * ( expr - factor            )$    Reduce term : factor
19   expr + term * ( expr - term              )$    Reduce expr : expr - term
20   expr + term * ( expr                     )$    Shift )
21   expr + term * ( expr )                    $    Reduce factor : (expr)
22   expr + term * factor                      $    Reduce term : term * factor
23   expr + term                               $    Reduce expr : expr + term
24   expr                                      $    Reduce expr
25                                             $    Success!

When parsing an expression, the underlying state machine and the current input token determine the next action. If the next token looks like part of a valid grammar rule (given the elements already on the stack), it is shifted onto the stack. If the top of the stack matches the right-hand side of a grammar rule, it is "reduced" and those symbols are replaced by the symbol on the left-hand side; when this reduction happens, the corresponding action (if any) is executed. If the input token cannot be shifted and the top of the stack does not match any grammar rule, a syntax error has occurred, and the parser must take some kind of recovery or remedial action. A parse only succeeds when the parsing stack is empty and no input tokens remain.

It is important to note that the underlying state machine is implemented as a large collection of tables. The construction of these tables is far from simple and is beyond the scope of this discussion. However, subtle details of that construction explain why, in step 9 of the example above, the parser chooses to shift the token onto the stack rather than reduce the rule expr : expr + term.

Parsing example

Suppose you want to write a parser that evaluates the simple arithmetic expressions described earlier. Here is how to do it with SLY:

from sly import Parser
from calclex import CalcLexer

class CalcParser(Parser):
    # Get the token list from the lexer (required)
    tokens = CalcLexer.tokens

    # Grammar rules and actions
    @_('expr PLUS term')
    def expr(self, p):
        return p.expr + p.term

    @_('expr MINUS term')
    def expr(self, p):
        return p.expr - p.term

    @_('term')
    def expr(self, p):
        return p.term

    @_('term TIMES factor')
    def term(self, p):
        return p.term * p.factor

    @_('term DIVIDE factor')
    def term(self, p):
        return p.term / p.factor

    @_('factor')
    def term(self, p):
        return p.factor

    @_('NUMBER')
    def factor(self, p):
        return p.NUMBER

    @_('LPAREN expr RPAREN')
    def factor(self, p):
        return p.expr

if __name__ == '__main__':
    lexer = CalcLexer()
    parser = CalcParser()

    while True:
        try:
            text = input('calc > ')
            result = parser.parse(lexer.tokenize(text))
            print(result)
        except EOFError:
            break

In this example, each grammar rule is defined by a method decorated with @_(rule). The very first grammar rule (the first rule listed in the BNF) defines the top of the parse. The name of each method must match the name of the grammar rule being parsed, and the argument to the @_() decorator is a string describing the right-hand side of the rule. Thus, a grammar rule such as:

expr : expr PLUS term

becomes a method like this:

@_('expr PLUS term')
def expr(self, p):
    ...

The method is invoked when that grammar rule is recognized in the input. As its argument p, the method receives a sequence of grammar symbol values. There are two ways to refer to these symbols. The first is to use the symbol names, as follows:

@_('expr PLUS term')
def expr(self, p):
    return p.expr + p.term

Alternatively, p can be indexed like an array:

@_('expr PLUS term')
def expr(self, p):
    return p[0] + p[2]

For tokens, the value of p.symbol or p[i] is the same as the value attribute assigned to the token by the lexer. For non-terminals, the value is whatever was returned by the method defined for that rule.

If a grammar rule contains the same symbol name more than once, a numeric suffix must be appended to distinguish the occurrences. For example:

@_('expr PLUS expr')
def expr(self, p):
    return p.expr0 + p.expr1

Finally, each rule needs to return a value that becomes associated with its grammar symbol. This is how values propagate through the grammar.

Rules can perform other kinds of actions as well. For example, a grammar rule might construct part of a parse tree instead:

@_('expr PLUS term')
def expr(self, p):
    return ('+', p.expr, p.term)

Or it might create an instance of an abstract-syntax-tree node:

class BinOp(object):
    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

@_('expr PLUS term')
def expr(self, p):
    return BinOp('+', p.expr, p.term)

The important part is that the method returns the value that becomes associated with the symbol (here, "expr"). This is the value propagation shown in the previous section.

Combining grammar rule functions

When grammar rules are similar, they can be combined into a single method. For example, suppose you have two rules that each construct a parse-tree node:

@_('expr PLUS term')
def expr(self, p):
    return ('+', p.expr, p.term)

@_('expr MINUS term')
def expr(self, p):
    return ('-', p.expr, p.term)

Instead of writing two functions, you can write a single function like this:

@_('expr PLUS term',
   'expr MINUS term')
def expr(self, p):
    return (p[1], p.expr, p.term)

In this example, the operator could be either PLUS or MINUS. Since a single symbolic name cannot refer to both tokens, the value is accessed by index with p[1] instead.

In general, the @_() decorator on a method may list any number of grammar rules. When combining multiple grammar rules into a single function, all of the rules should have a similar structure (for example, the same number of terms and compatible symbol names). Otherwise, the corresponding action code may become unnecessarily complicated.

Character literals

If desired, a grammar may contain tokens defined as single-character literals. For example:

@_('expr "+" term')
def expr(self, p):
    return p.expr + p.term

@_('expr "-" term')
def expr(self, p):
    return p.expr - p.term

A character literal must always be enclosed in quotes, such as "+". In addition, it must be declared in the literals set of the corresponding lexer class:

class CalcLexer(Lexer):
    ...
    literals = { '+','-','*','/' }
    ...

Character literals are limited to a single character. It is therefore not legal to specify literals such as <= or ==; for those, use the normal lexing rules (for example, define a rule such as LE = r'<='), as in the sketch below.
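
As a hedged sketch of how the two can coexist (the token choices here are illustrative only), multi-character operators get normal rules while single characters are declared as literals; because literals are checked after the regular expression rules, <= still matches as LE rather than as '<' followed by '='.

from sly import Lexer

class MyLexer(Lexer):
    tokens = { ID, LE, GE, EQ, NE }
    ignore = ' \t'

    ID = r'[a-zA-Z_][a-zA-Z0-9_]*'

    # Multi-character operators need ordinary token rules ...
    LE = r'<='
    GE = r'>='
    EQ = r'=='
    NE = r'!='

    # ... while single characters can simply be literals.
    literals = { '<', '>', '=', '+', '-' }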

Empty productions

If you need a production that matches nothing (an empty production), define a rule like this:

@_('')
def empty(self, p):
    pass

To use the empty production elsewhere, refer to it by the name "empty". For example, suppose you need to encode a rule with an optional item:

spam : optitem grok

optitem : item
        | empty

In SLY, this is written as follows:

@_('optitem grok')
def spam(self, p):
    ...

@_('item')
def optitem(self, p):
    ...

@_('empty')
def optitem(self, p):
    ...

Note: you could write an empty rule anywhere by simply specifying an empty string. However, writing a dedicated "empty" rule and using it to denote an empty production is easier to read and states the intent more clearly. A sketch of how the optional value might be used follows.
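
As a sketch of one way the optional value could be propagated (spam, optitem, item, and grok are the hypothetical symbols from the fragment above; the return values chosen here are an assumption, not part of the original example):

@_('item')
def optitem(self, p):
    return p.item            # the optional item was supplied

@_('empty')
def optitem(self, p):
    return None              # nothing was supplied

@_('optitem grok')
def spam(self, p):
    if p.optitem is None:
        # The empty production matched, so no optional item was given.
        ...
    return p.optitem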

Dealing with ambiguous grammars

The expression grammar shown earlier was written in a special format to eliminate ambiguity. In many situations, however, writing a grammar in that format is awkward and cumbersome. A more natural way to express the grammar is the following compact form:

expr : expr PLUS expr
     | expr MINUS expr
     | expr TIMES expr
     | expr DIVIDE expr
     | LPAREN expr RPAREN
     | NUMBER

Unfortunately, this grammar specification is ambiguous. For example, when parsing the string "3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped. Does the expression mean "(3 * 4) + 5" or "3 * (4 + 5)"?

When an ambiguous grammar is given, you will see messages about "shift/reduce conflicts" or "reduce/reduce conflicts". A shift/reduce conflict occurs when the parser generator cannot decide whether to reduce a rule or to shift a symbol onto the parsing stack. For example, consider the internal stack while parsing the string "3 * 4 + 5":

Step Symbol Stack  Input Tokens       Action
---- ------------- ----------------   -------------------------------
1    $                   3 * 4 + 5$   Shift 3
2    $ 3                   * 4 + 5$   Reduce : expr : NUMBER
3    $ expr                * 4 + 5$   Shift *
4    $ expr *                4 + 5$   Shift 4
5    $ expr * 4                + 5$   Reduce: expr : NUMBER
6    $ expr * expr             + 5$   SHIFT/REDUCE CONFLICT ????

When the parser reaches step 6, it has two options. One is to reduce the rule expr : expr * expr on the stack. The other is to shift the token + onto the stack. Both options are perfectly legal under the rules of a context-free grammar.

By default, all shift/reduce conflicts are resolved in favor of shifting, so the parser in the example above would shift + rather than reduce. Although this strategy works in many cases (for example, "if-then" versus "if-then-else"), it is wrong for arithmetic expressions. In fact, shifting + in the example above is entirely incorrect: multiplication has higher precedence than addition, so expr * expr should be reduced instead.

To resolve ambiguity, especially in expression grammars, SLY allows individual tokens to be assigned a precedence level and associativity. To do this, add a precedence variable to the parser class:

class CalcParser(Parser):
    ...
    precedence = (
       ('left', PLUS, MINUS),
       ('left', TIMES, DIVIDE),
    )

    # Rules where precedence is applied
    @_('expr PLUS expr')
    def expr(self, p):
        return p.expr0 + p.expr1

    @_('expr MINUS expr')
    def expr(self, p):
        return p.expr0 - p.expr1

    @_('expr TIMES expr')
    def expr(self, p):
        return p.expr0 * p.expr1

    @_('expr DIVIDE expr')
    def expr(self, p):
        return p.expr0 / p.expr1
    ...

This precedence declaration states that PLUS/MINUS have the same precedence level and are left-associative, and that TIMES/DIVIDE have the same precedence level and are also left-associative. Within the precedence declaration, tokens are listed in order from lowest to highest precedence, so this declaration gives TIMES/DIVIDE higher precedence than PLUS/MINUS because they appear later.

The precedence specification works by associating a numeric precedence level and an associativity with each token. For example, in the declaration above:

PLUS      : level = 1,  assoc = 'left'
MINUS     : level = 1,  assoc = 'left'
TIMES     : level = 2,  assoc = 'left'
DIVIDE    : level = 2,  assoc = 'left'

These values are then used to attach a precedence level and associativity to each individual grammar rule, always determined by the precedence of the right-most terminal symbol. For example:

expr : expr PLUS expr           # level = 1, left
     | expr MINUS expr          # level = 1, left
     | expr TIMES expr          # level = 2, left
     | expr DIVIDE expr         # level = 2, left
     | LPAREN expr RPAREN       # level = None (not specified)
     | NUMBER                   # level = None (not specified)

When shift/reduce conflicts occur, the parser generator resolves them by consulting the precedence levels and associativity, as follows:

  1. If the current token has higher precedence than the rule on the stack, it is shifted.
  2. If the grammar rule on the stack has higher precedence, the rule is reduced.
  3. If the current token and the grammar rule have the same precedence, the rule is reduced for left associativity and the token is shifted for right associativity.
  4. If nothing is known about the precedence, the shift/reduce conflict is resolved in favor of shifting (the default).

For example, suppose expr PLUS expr has been parsed and the next token is TIMES. Because TIMES has higher precedence than PLUS, the token is shifted. Conversely, suppose expr TIMES expr has been parsed and the next token is PLUS. Because PLUS has lower precedence than TIMES, the rule is reduced.

When a shift/reduce conflict is resolved with the help of the precedence rules in the first three techniques, SLY reports no errors or conflicts.

One problem with the precedence technique is that you sometimes need the precedence to change in a particular context. For example, consider the unary minus operator in 3 + 4 * -5. Mathematically, unary minus normally has very high precedence and is evaluated before the multiplication. However, in our precedence declaration, MINUS has lower precedence than TIMES. To deal with this, a precedence rule can be given for a so-called "fictitious token":

class CalcParser(Parser):
    ...
    precedence = (
        ('left', PLUS, MINUS),
        ('left', TIMES, DIVIDE),
        ('right', UMINUS),            # Unary minus operator
    )

Then, in the grammar, the unary minus rule is written like this:

@_('MINUS expr %prec UMINUS')
def expr(self, p):
    return -p.expr

In this example, %prec UMINUS overrides the default rule precedence, setting it to that of UMINUS in the precedence specification.

At first, the use of UMINUS in this example may appear confusing: UMINUS is neither an input token nor a grammar rule. Think of it instead as the name of a special marker in the precedence table. When the %prec qualifier is used, it tells SLY that the precedence of the expression should be that of the special marker rather than the usual precedence.

It is also possible to specify non-associativity in the precedence table. This is used when you do not want operators to chain together. For example, suppose you want to support the comparison operators < and >, but not combinations such as a < b < c. To do this, specify the precedence as follows:

     ...
     precedence = (
          ('nonassoc', LESSTHAN, GREATERTHAN),  # Nonassociative operators
          ('left', PLUS, MINUS),
          ('left', TIMES, DIVIDE),
          ('right', UMINUS),            # Unary minus operator
     )

This generates a syntax error for input text such as a < b < c. It still works for simple expressions such as a < b, of course.

Reduce/reduce conflicts are triggered when more than one grammar rule could apply to a given set of symbols. Conflicts of this kind are almost always bad and are resolved by picking whichever rule appears first in the grammar file. They occur when different grammar rules somehow end up generating the same set of symbols. For example:

assignment :  ID EQUALS NUMBER
           |  ID EQUALS expr

expr       : expr PLUS expr
           | expr MINUS expr
           | expr TIMES expr
           | expr DIVIDE expr
           | LPAREN expr RPAREN
           | NUMBER

In this example, there is a reduce/reduce conflict between the following two rules:

assignment  : ID EQUALS NUMBER
expr        : NUMBER

For example, when parsing a = 5, the parser cannot tell whether to reduce assignment : ID EQUALS NUMBER, or to first reduce 5 as an expr and then reduce assignment : ID EQUALS expr.

Reduce/reduce conflicts are notoriously difficult to spot just by looking at the grammar. When one occurs, SLY issues warning messages like the following, which may help:

WARNING: 1 reduce/reduce conflict
WARNING: reduce/reduce conflict in state 15 resolved using rule (assignment -> ID EQUALS NUMBER)
WARNING: rejected rule (expression -> NUMBER)

This message identifies the two rules that are in conflict. However, it says nothing about why the parser arrived at that conclusion. To figure that out, you will probably need to study the grammar and the contents of the parser debugging file with a suitably high dose of caffeine.

Parser debugging

Tracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR parsing algorithm. To assist with debugging, SLY can write a debugging file when it builds the parsing tables. To enable this, add a debugfile attribute to the class:

class CalcParser(Parser):
    debugfile = 'parser.out'
    ...

With this in place, the entire grammar, together with the parsing states, is written to the named file. Each parser state is output in the following format:

state 2

    (7) factor -> LPAREN . expr RPAREN
    (1) expr -> . term
    (2) expr -> . expr MINUS term
    (3) expr -> . expr PLUS term
    (4) term -> . factor
    (5) term -> . term DIVIDE factor
    (6) term -> . term TIMES factor
    (7) factor -> . LPAREN expr RPAREN
    (8) factor -> . NUMBER
    LPAREN          shift and go to state 2
    NUMBER          shift and go to state 3

    factor                         shift and go to state 1
    term                           shift and go to state 4
    expr                           shift and go to state 6

Each state keeps track of the grammar rules that might be in the process of being matched at that point. Within each rule, the "." character indicates the current position of the parse within that rule, and the actions for the valid input tokens are listed. By studying these rules (with a little practice), you can usually track down the source of parsing conflicts. It should be emphasized that not all shift/reduce conflicts are bad; examining the debugging file is the only way to be sure that they have been resolved correctly.

Syntax error handling

If you are creating a parser for production use, handling syntax errors should not be neglected. Nobody wants a parser that simply gives up at the first sign of trouble. Instead, it should report the error, recover if possible, and continue parsing so that all of the errors in the input can be reported to the user at once. This is the standard behavior found in compilers for languages such as C, C++, and Java.

When a syntax error occurs during parsing, SLY detects it immediately (that is, the parser does not read any tokens beyond the point of the error). At that point, the parser enters a recovery mode that can be used to try to continue parsing. Generally speaking, error recovery in an LR parser is a delicate topic involving ancient lore and black magic. The recovery mechanism provided by SLY is comparable to that of Unix yacc; see O'Reilly's "Lex and Yacc" for more details.

When a syntax error occurs, SLY does the following:

  1. When an error occurs, the error() method is first called with the offending token as an argument. If the syntax error is due to reaching end-of-file, None is passed instead. The parser then enters "error recovery" mode, in which it will not call error() again until at least three tokens have been successfully shifted onto the parsing stack.
  2. If error() takes no recovery action, the offending lookahead token is replaced with a special error token.
  3. If the offending lookahead token is already set to error, the top element of the parsing stack is removed.
  4. If the entire parsing stack is unwound, the parser resets itself and attempts to start parsing again from its initial state.
  5. If a grammar rule accepts error as a token, it is shifted onto the parsing stack.
  6. If the top of the parsing stack is error, lookahead tokens are discarded until the parser can successfully shift a new symbol or reduce a rule involving error.

Recovery and resynchronization with error rules

The most well-behaved way to handle syntax errors is to include the error token in your grammar rules. For example, consider a language with a grammar rule for print statements:

@_('PRINT expr SEMI')
def statement(self, p):
    ...

To account for the possibility of a bad expression, you might add an additional grammar rule like this:

@_('PRINT error SEMI')
def statement(self, p):
    print("Syntax error in print statement. Bad expression")

In this case, the error token matches any sequence of tokens up to the semicolon. When the semicolon is reached, the rule is invoked and the error token goes away.

This type of recovery is sometimes known as parser resynchronization. The error token acts as a wildcard for any bad input text, and the token immediately following error acts as a synchronization token.

Note that you usually cannot place error as the right-most token in an error rule. For example:

@_('PRINT error')
def statement(self, p):
    print("Syntax error in print statement. Bad expression")

The reason is that the first bad token encountered causes the rule to be reduced, which can make recovery harder if more bad tokens follow immediately. It is better to have some kind of landmark, such as a semicolon, a closing brace, or another boundary delimiter, that can be used as a synchronization point.

Panic mode recovery

An alternative error recovery scheme is panic mode recovery, in which tokens are discarded up to a point where the parser can hope to recover in some sensible way.

Panic mode recovery is implemented entirely in the error() method. For example, the following method discards tokens until it reaches a closing '}', then restarts the parser from its initial state:

def error(self, p):
    print("Whoa. You are seriously hosed.")
    if not p:
        print("End of File!")
        return

    # Read ahead looking for a closing '}'
    while True:
        tok = next(self.tokens, None)
        if not tok or tok.type == 'RBRACE':
            break
    self.restart()

The next function simply discards the bad token and tells the parser that the error has been handled:

def error(self, p):
    if p:
        print("Syntax error at token", p.type)
        # Just discard the token and tell the parser it's okay.
        self.errok()
    else:
        print("Syntax error at EOF")

Here are details about the attributes and methods used:

  - self.errok(): Resets the parser so that it is no longer in error recovery mode. This suppresses the generation of an error token and resets the internal error counters so that error() can be called again if another syntax error is found.
  - self.tokens: The iterable sequence of tokens being parsed. Calling next(self.tokens) advances to the next token.
  - self.restart(): Discards the entire parsing stack and resets the parser to its initial state.

error() can also supply the next lookahead token to the parser by returning a token. This is useful when trying to synchronize on particular characters. For example:

def error(self, tok):
    # Read ahead looking for a terminating ";"
    while True:
        tok = next(self.tokens, None)           # Get the next token
        if not tok or tok.type == 'SEMI':
            break
        self.errok()

    # Return SEMI to the parser as the next lookahead token
    return tok

When syntax errors are reported

In most cases, SLY handles a syntax error as soon as a bad input token is detected. Be aware, however, that SLY may choose to delay error handling until after one or more grammar rules have been reduced first. This behavior can be unexpected and is related to special states in the underlying parsing table known as "defaulted states". A defaulted state is a parsing state in which the same grammar rule will be reduced regardless of what valid token comes next. In such states, SLY chooses to reduce the grammar rule without reading the next input token. If the next token turns out to be bad, SLY eventually reads it and reports a syntax error; it is just slightly unusual that some of your grammar rules may appear to fire immediately before the syntax error is reported.

General comments on error handling

For normal languages, error recovery with error rules and resynchronization characters is probably the most reliable technique. It lets you catch errors within the grammar, recover relatively easily, and continue parsing. Panic mode recovery is really only useful in special applications where you want to discard large portions of the input text and find a place to resume.

Tracking line numbers and location

Position tracking is often a tricky problem when writing compilers. By default, SLY tracks the line number and position of every token. Within a production rule, the following attributes are useful:

  - p.lineno: The line number of the left-most terminal symbol in the production.
  - p.index: The lexing index of the left-most terminal symbol in the production.

For example:

@_('expr PLUS expr')
def expr(self, p):
    line   = p.lineno      # line number of the PLUS token
    index  = p.index       # Index of the PLUS token in input text

SLY does not propagate line number information to non-terminal symbols. If you need this, you must store the line numbers yourself and propagate them through AST nodes or some other data structure, for instance as in the sketch below.
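
As a hedged sketch of carrying the line number into an AST node (the BinOp class here is an illustration, similar to the one used in the next section), a rule could record the position like this:

class BinOp:
    def __init__(self, op, left, right, lineno=None):
        self.op = op
        self.left = left
        self.right = right
        self.lineno = lineno      # position recorded by the grammar rule

@_('expr PLUS expr')
def expr(self, p):
    # Store the line number SLY reports for this production on the node
    # so that later passes can report positions.
    return BinOp('+', p.expr0, p.expr1, lineno=p.lineno)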

Constructing an AST (abstract syntax tree)

SLY provides no special functions for constructing an abstract syntax tree. However, such construction is easy enough to do on your own.

A minimal way to construct a tree is to create and propagate a tuple or list in each grammar rule function. There are many possible ways to do this, but here is one:

@_('expr PLUS expr',
   'expr MINUS expr',
   'expr TIMES expr',
   'expr DIVIDE expr')
def expr(self, p):
    return ('binary-expression', p[1], p.expr0, p.expr1)

@_('LPAREN expr RPAREN')
def expr(self, p):
    return ('group-expression', p.expr)

@_('NUMBER')
def expr(self, p):
    return ('number-expression', p.NUMBER)

Another approach is to create a separate data structure for each kind of abstract syntax tree node and construct the appropriate node type in each rule:

class Expr:
    pass

class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

class Number(Expr):
    def __init__(self, value):
        self.value = value

@_('expr PLUS expr',
   'expr MINUS expr',
   'expr TIMES expr',
   'expr DIVIDE expr')
def expr(self, p):
    return BinOp(p[1], p.expr0, p.expr1)

@_('LPAREN expr RPAREN')
def expr(self, p):
    return p.expr

@_('NUMBER')
def expr(self, p):
    return Number(p.NUMBER)

The advantage of this approach is that it makes it easier to attach richer semantic information, type checking, code generation, and other capabilities to the node classes.
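
As a brief illustration of that point (a sketch only, re-stating the node classes above and assuming ordinary Python arithmetic), an evaluation method can be attached directly to the nodes:

class Expr:
    pass

class Number(Expr):
    def __init__(self, value):
        self.value = value

    def evaluate(self):
        return self.value

class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

    def evaluate(self):
        # Evaluate both children first, then apply this node's operator.
        ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
               '*': lambda a, b: a * b, '/': lambda a, b: a / b}
        return ops[self.op](self.left.evaluate(), self.right.evaluate())

With rules like the ones above, parser.parse(lexer.tokenize(text)) would then return a tree whose evaluate() method computes the result.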

Changing the start symbol

Normally, the first rule found in a parser class defines the starting (top-level) grammar rule. To change this, add a start specifier to the class. For example:

class CalcParser(Parser):
    start = 'foo'

    @_('A B')
    def bar(self, p):
        ...

    @_('bar X')
    def foo(self, p):     # Parsing starts here (start symbol above)
        ...

A start specifier is useful for debugging, since it lets you work with just a part of a larger grammar.

Embedded actions

The parsing technique used by SLY only allows actions to be executed at the end of a rule. For example, suppose you have a rule like this:

@_('A B C D')
def foo(self, p):
    print("Parsed a foo", p.A, p.B, p.C, p.D)

In this case, the supplied action code executes only after all of the symbols A, B, C, and D have been parsed. Sometimes, however, it is useful to execute a small code fragment during an intermediate stage of parsing. For example, suppose you want to perform some action immediately after A has been parsed. To do this, write an empty rule:

@_('A seen_A B C D')
def foo(self, p):
    print("Parsed a foo", p.A, p.B, p.C, p.D)
    print("seen_A returned", p.seen_A])

@_('')
def seen_A(self, p):
    print("Saw an A = ", p[-1])   # Access grammar symbol to the left
    return 'some_value'           # Assign value to seen_A

In this example, the empty seen_A rule executes immediately after A is shifted onto the parsing stack. Within that rule, p[-1] refers to the symbol immediately to the left of seen_A on the parsing stack; in the foo rule above, that is the value of A. Like other rules, an embedded action can return a value, and that value becomes the value associated with seen_A.

The use of embedded actions can sometimes introduce extra shift/reduce conflicts. For example, the following grammar has no conflicts:

@_('abcd',
   'abcx')
def foo(self, p):
    pass

@_('A B C D')
def abcd(self, p):
    pass

@_('A B C X')
def abcx(self, p):
    pass

Now suppose that an embedded action is inserted into one of the rules, like this:

@_('abcd',
   'abcx')
def foo(self, p):
    pass

@_('A B C D')
def abcd(self, p):
    pass

@_('A B seen_AB C X')
def abcx(self, p):
    pass

@_('')
def seen_AB(self, p):
    pass

This introduces an extra shift/reduce conflict. The conflict arises because the same symbol C appears next in both the abcd and abcx rules. The parser can either shift the symbol (the abcd rule) or reduce the empty rule seen_AB (the abcx rule).

A common use of embedded rules is to control other aspects of parsing, such as the scoping of local variables. For example, when parsing C code, you might write code like this:

@_('LBRACE new_scope statements RBRACE')
def statements(self, p):
    # Action code
    ...
    pop_scope()        # Return to previous scope

@_('')
def new_scope(self, p):
    # Create a new scope for local variables
    create_scope()
    ...

In this case, new_scope executes immediately after an LBRACE ({) symbol is parsed. This might adjust internal symbol tables or other parser state. Upon completion of the rule statements, the code undoes the operations performed in the embedded action (for example, pop_scope()).
