Kinx Library-Parser Combinator (Extra Edition)

Introduction

** "Looks like JavaScript, brain (contents) is Ruby, (stability is AC / DC)" ** Scripting language Kinx ). This time is also a parser combinator.

Reference
First motivation ... Script language KINX (introduction)
All links to individual articles are collected here.
Repository ... https://github.com/Kray-G/kinx
We are waiting for Pull Request etc.

Not long ago, I ran it with Kinx Library-Parser Combinator (Part 2). What is this time?

Yes, I would like to introduce the interface. I'm sorry, I forgot to do it last time.

Parsek

Instances of the Parsek class have the following methods: The $ in the statement indicates the actual Parsek instance that was instantiated.

Predefined parser

The following predefined parsers already exist in the instance of the Parsek class.

Predefined parser	Contents
$.eof	A parser that succeeds if it is EOF.
$.any	A parser that succeeds with any single character.
$.all	A parser that processes everything until the end and succeeds.
$.index	Returns the current position (`{ offset, line, column}`). offset is 0 starting point, line/column is one starting point.
$.letter	`$.regex(/[a-zA-Z]/)`Same as.
$.letters	`$.regex(/[a-zA-Z]*/)`Same as.
$.digit	`$.regex(/[0-9]/)`Same as.
$.digits	`$.regex(/[0-9]*/)`Same as.
$.whitespace	`$.regex(/\s+/)`Same as.
$.optWhitespace	`$.regex(/\s*/)`Same as.
$.cr	`$.string("\r")`Same as.
$.lf	`$.string("\n")`Same as.
$.crlf	`$.string("\r\n")`Same as.
$.newline	`$.alt($.crlf, $.lf, $.cr)`Same as.
$.end	`$.alt($.newline, $.eof)`Same as.

Parser generation function

An instance of the Parsek class has the ability to generate a parser with the following methods:

Parser generation function	Contents
$.string(str)	`str`Creates a parser that succeeds if it matches the string specified in.
$.regex(re, groupIndex)	`re`Creates a parser that succeeds if it matches the regular expression specified in. value is`groupIndex`Contains the capture result of the group specified in.
$.succeed(result)	`result`To create a parser that always succeeds.
$.fail(message)	`message`Create a parser that fails by leaving a message.
$.oneOf(chars)	`chars`Create a parser that succeeds if it matches one character contained in.
$.noneOf(chars)	`chars`Create a parser that succeeds if it does not match one character contained in.
$.sepBy(content, separator)	`content`A string that can be parsed with`separator`Create a parser that returns the result divided by the result that matches.
$.sepBy1(content, separator)	`sepBy`Is allowed 0,`sepBy1`Fails if at least one does not match.
$.lazy(description, f)	`f`Specifies a function that returns a parser to, and creates a parser when it is first used.
$.alt(...parsers)	Creates a parser that succeeds if it matches any of the specified parsers. The application is done in the order specified.
$.takeWhile(predicate)	The next character is`predicate`Create a parser that will continue to succeed while it is true.
$.seq(...parsers)	Creates a parser that succeeds when multiple specified parsers are matched in sequence. Results are returned as an array of all results.
$.seqMap(...parsers, mapf)	Creates a parser that succeeds when multiple specified parsers are matched in sequence. The result is an array of all the results, but it returns the value returned by the function specified by mapf.

ParsekParser

The parser generated by the method of the Parsek class is an instance of the ParsekParser class. This class defines methods for chaining parsers.

Execution of perspective

Once all the parsers are ready, run the parse with the following method.

Method	Contents
parser.parseAll(target)	`target`Against`parser`To parse using.

Parser generation

Parser generation function	Contents
parser.desc(description)	`parser`Set the message to be displayed when parsing by is unsuccessful.
parser.or(nextParser)	`parser`If parsing by`nextParser`Create a parser that attempts to.
parser.and(nextParser)	`parser`If you succeed in parsing by`nextParser`Create a parser to execute. The result is returned as an array.
parser.then(nextParser)	`parser`If you succeed in parsing by`nextParser`Create a parser to execute. Result is`nextParser`Return the one`parser`The result of is ignored.
parser.skip(nextParser)	`parser`If you succeed in parsing by`nextParser`Create a parser to execute. Result is`parser`Return the one`nextParser`The result of is ignored.
parser.many()	`parser`Create a parser that parses the occurrence of 0 or more times.
parser.many1()	`parser`Create a parser that parses that appears more than once.
parser.times(min, max)	`parser`But`min`More than once`max`Create a parser that parses the appearance repeatedly no more than once. ..`max`If is omitted, just`min`Success in case of number of times.
parser.sepBy(separator)	`parser`A string that can be parsed with`separator`Create a parser that returns the result divided by the result that matches.
parser.sepBy1(separator)	`sepBy`Is allowed 0,`sepBy1`Fails if at least one does not match.
parser.map(f)	`parser`Use the result of the above converted by the function f.
parser.chain(f)	`parser`Pass the result of to the function f, and connect the parsers to use the new parser returned by f as the next parser.
parser./(nextParser)	`parser.or(nextParser)`Another name for.
parser.+(nextParser)	`nextParser`Without specifying`parser.+()`When used as`parser.many1()`Another name for.`nextParser`If you specify`parser.and(nextParser)`Another name for.
parser.*()	`parser.many()`Another name for.

Packrat Parsing

Supports Packrat Parsing.

Parser combinators usually proceed with analysis in the backtrack. So, for example, in the case of $ .alt (a, b, c), if the parser of ʻa fails, you can proceed with b`, but the same parser may be repeated many times along the way.

Therefore, when backtracking, performance improvement can be expected by caching some results and reusing them by using the property that ** if analyzed in the same place and under the same conditions, the same result will be obtained **. .. It's memoization.

Parsek supports this feature. By the way, it can be disabled by specifying {disablePackratParsing: true} in the constructor when creating a Parsek instance.

Let's see its performance with a benchmark. Refer to here to handle the grammar that causes a tremendous number of backtracks. The grammar is as follows.

S <- A
A <- P "+" A / P "-" A / P
P <- "(" A ")" / "1"

This will parse the following string:

(((((((((((((1)))))))))))))

The explanation is easy to understand, so I will quote it as it is.

The essence of this rule and string combination is that it must go through all A and P combinations before being finally parsed. In other words, the worst case of backtrack can be almost reproduced. So if you frankly backtrack, the processing time should increase exponentially with the number of parentheses correspondence.

No, it's amazing.

Now, the benchmark source code. Since the parser itself is the same, the same test is done with the $ D parser of{disablePackratParsing: true}and the $ E parser enabled.

using Parsek;

var $E = new Parsek();
var $D = new Parsek({ disablePackratParsing: true });
var tmr = new SystemTimer();

// rule:
//     S <- A
//     A <- P "+" A / P "-" A / P
//     P <- "(" A ")" / "1"
// input:
//     (((((((((((((1)))))))))))))

function generateParser($) {
    var S, A, P;
    S = $.lazy(&() => A);
    A = $.lazy(&() => $.alt($.seq(P, $.string('+'), A), $.seq(P, $.string('-'), A), P));
    P = $.alt($.seq($.string('('), A, $.string(')')), $.string('1'));
    return S;
}

function test($) {
    for (var i = 0; i <= 10; ++i) {
        tmr.restart();
        var r = generateParser($).parseAll(('(' * i) + '1' + (')' * i));
        var elapsed = tmr.elapsed();
        System.println(["%8.5f" % elapsed, r]);
    }
}

test($E);
test($D);

Let's do it!

`If valid`


[" 0.00090", {"position":1,"status":1,"value":"1"}]
[" 0.00041", {"position":3,"status":1,"value":["(","1",")"]}]
[" 0.00044", {"position":5,"status":1,"value":["(",["(","1",")"],")"]}]
[" 0.00061", {"position":7,"status":1,"value":["(",["(",["(","1",")"],")"],")"]}]
[" 0.00083", {"position":9,"status":1,"value":["(",["(",["(",["(","1",")"],")"],")"],")"]}]
[" 0.00055", {"position":11,"status":1,"value":["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"]}]
[" 0.00061", {"position":13,"status":1,"value":["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"]}]
[" 0.00071", {"position":15,"status":1,"value":["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"]}]
[" 0.00200", {"position":17,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"]}]
[" 0.00101", {"position":19,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"],")"]}]
[" 0.00196", {"position":21,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"],")"],")"]}]

`When disabled`


[" 0.00022", {"position":1,"status":1,"value":"1"}]
[" 0.00034", {"position":3,"status":1,"value":["(","1",")"]}]
[" 0.00097", {"position":5,"status":1,"value":["(",["(","1",")"],")"]}]
[" 0.00353", {"position":7,"status":1,"value":["(",["(",["(","1",")"],")"],")"]}]
[" 0.01142", {"position":9,"status":1,"value":["(",["(",["(",["(","1",")"],")"],")"],")"]}]
[" 0.02686", {"position":11,"status":1,"value":["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"]}]
[" 0.09525", {"position":13,"status":1,"value":["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"]}]
[" 0.26821", {"position":15,"status":1,"value":["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"]}]
[" 0.72229", {"position":17,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"]}]
[" 2.44629", {"position":19,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"],")"]}]
[" 7.48334", {"position":21,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"],")"],")"]}]

If Packrat Parsing is enabled, 10 parentheses will complete within 2 milliseconds. On the other hand, when disabled, it takes 7.5 seconds when the last 10 parentheses are reached.

in conclusion

Parser combinator, it's convenient to try this. I haven't used it much, but I think I'll use it a little more in the future. You can easily parse strings that cannot be expressed by regular expressions.

You can use it to create a language processing system. Kinx uses yacc for parsing and handwriting for lexical analysis.

Originally, I like parsing itself and handwritten recursive descent parser, but I use yacc exclusively because it is difficult to change the grammar definition. However, the parser combinator can also perform lexical analysis, which is convenient.

see you!