Kinx Library-Parser Combinator (Part 1)

Introduction

** "Looks like JavaScript, brain (contents) is Ruby, (stability is AC / DC)" ** Scripting language Kinx ). This time is a parser combinator.

Reference
First motivation ... Script language KINX (introduction)
All links to individual articles are collected here.
Repository ... https://github.com/Kray-G/kinx
We are waiting for Pull Request etc.

Last time, I introduced JIT Library, but I ended it with the following words.

With this, if you implement and combine with a parser combinator, you can create a little language processing system with JIT.

Yes, I made it in a hurry there. Parser combinator library. Its name is ** Parsek **. Parsek, not Parsec.

The interface is based on Parsimmon, but the implementation is completely unique. The API isn't that rich, but it works as it should. It's easy to add an interface, so let's add it as needed.

It's going to be long, so I'll divide the article into two parts.

This time, let's use this to parse the four arithmetic grammars and create an AST (Abstract Syntax Tree). Next time, at the end, I will go to the point of compiling and executing JIT.

What is a parser combinator?

A library for combining small (simple) parsers to create a large parser. I will leave the details to other articles, but let's understand it while looking at the sample below.

sample

This time, I will change the taste and explain using a sample. The sample is an arithmetic operation with a positive integer (natural number). Negative numbers are not treated for simplicity (results can be negative).

Normally, I would go into the explanation of BNF or PEG here, but ignore it. Start by moving the sample first.

using Parsek

The parser combinator library is not built-in as standard, so let's use it. Also, the library is provided as a class, so let's instantiate it.

using Parsek;

var $ = new Parsek();

You can use $ as a variable name.

What is a small (simple) parser?

Let me give you an example one by one.

Parser for parsing numbers

First, let's define a positive integer. This is the first small (simple) parser. The first is a regular expression. Well, it's not that difficult, so it's easy to understand.

The only pitfall is that the engine (= demon car) you are using is not a POSIX NFA **, so you have to write the one that matches longer. Simply put, in the example below, " 123 " matches properly with"123", but the opposite (/ [0-9] | [1-9] [0-9] * / ) Will match the first written [0-9] and stop the search, so it will be " 1 " and will not match " 23 ". Let's watch out.

var number = $.regex(/[1-9][0-9]*|[0-9]/);

This parser, number, can now parse numbers (the string in which they are written). let's do it.

The actual parsing is done by the parseAll () method. There is also parse (), but this is a method that succeeds even if it ends in the middle, and is usually used internally. In the case of parseAll (), all analysis is completed, post-processing is performed, and the result is returned.

using Parsek;

var $ = new Parsek();
var number = $.regex(/[1-9][0-9]*|[0-9]/);

System.println(number.parseAll("0"));       // => {"position":1,"status":1,"value":"0"}
System.println(number.parseAll("10"));      // => {"position":2,"status":1,"value":"10"}
System.println(number.parseAll("129"));     // => {"position":3,"status":1,"value":"129"}
System.println(number.parseAll("abc"));     // => {"position":0,"status":0,"value":null}
System.println(number.parseAll("0129"));    // => {"position":1,"status":0,"value":null}

The return value position is the completion position of the parsed character string, status is the success / failure (1 is success), and value is the character string that was actually parsed. As you can see, if it fails, value is null.

But if you look closely, value is a string. It's natural because it's just interpreting strings. Here, the method that converts to value is.map (). Give the conversion function as follows.

using Parsek;

var $ = new Parsek();
var number = $.regex(/[1-9][0-9]*|[0-9]/).map(&(value) => Integer.parseInt(value));

System.println(number.parseAll("129"));     // => {"position":3,"status":1,"value":129}

It's a number. In the above case, you're just passing through the value, so passing ʻInteger.parseInt` directly is the same.

var number = $.regex(/[1-9][0-9]*|[0-9]/).map(Integer.parseInt);

This is more concise.

Parser that parses operators of four arithmetic operations

Since the priority differs depending on the operator, divide it into two.

+ Or -
* Or / or %

A convenient way to interpret the single character or is $ .oneOf (). Use it as follows.

var addsub = $.oneOf("+-");
var muldiv = $.oneOf("*/%");

It's easy. Let's try it right away.

using Parsek;

var $ = new Parsek();
var addsub = $.oneOf("+-");
var muldiv = $.oneOf("*/%");

System.println(addsub.parseAll("+"));       // => {"position":1,"status":1,"value":"+"}
System.println(addsub.parseAll("-"));       // => {"position":1,"status":1,"value":"-"}
System.println(addsub.parseAll("*"));       // => {"position":0,"status":0,"value":null}
System.println(muldiv.parseAll("*"));       // => {"position":1,"status":1,"value":"*"}
System.println(muldiv.parseAll("/"));       // => {"position":1,"status":1,"value":"/"}
System.println(muldiv.parseAll("%"));       // => {"position":1,"status":1,"value":"%"}
System.println(muldiv.parseAll("a"));       // => {"position":0,"status":0,"value":null}

As expected.

Parser that parses parentheses

Another thing, let's interpret the parentheses necessary for numerical operations. The parser that matches a particular string uses $ .string (). Here it is one character, but any string is OK.

var lbr = $.string("(");
var rbr = $.string(")");

This also works fine if you try it. Let's try another string to see the effect of $ .string ().

using Parsek;

var $ = new Parsek();
var lbr = $.string("(");
var rbr = $.string(")");
var hoge = $.string("hoge");

System.println(lbr.parseAll("("));          // => {"position":1,"status":1,"value":"("}
System.println(lbr.parseAll(")"));          // => {"position":0,"status":0,"value":null}
System.println(rbr.parseAll("("));          // => {"position":0,"status":0,"value":null}
System.println(rbr.parseAll(")"));          // => {"position":1,"status":1,"value":")"}
System.println(hoge.parseAll("hoge"));      // => {"position":4,"status":1,"value":"hoge"}
System.println(hoge.parseAll("fuga"));      // => {"position":0,"status":0,"value":null}

You can see that it matches correctly.

What is a combination? (Combinator)

Now you have a small (simple) parser tool.

var number = $.regex(/[1-9][0-9]*|[0-9]/).map(Integer.parseInt);
var addsub = $.oneOf("+-");
var muldiv = $.oneOf("*/%");
var lbr = $.string("(");
var rbr = $.string(")");

Let's combine these. Here is the PEG. BNF is fine, but PEG is more suitable for combinators. If you don't show that the grammar is like this, you won't know what you're doing. I will touch on the meaning one by one.

number <- regex(/[1-9][0-9]*|[0-9]/)
addsub <- '+' / '-'
muldiv <- '*' / '/' / '%'
lbr <- '('
rbr <- ')'

expression <- term (addsub term)*
term <- factor (muldiv factor)*
factor <- number / (lbr expression rbr)

The PEG prioritized selection symbol / and the division specification '/' are confusing, but you can see them by looking closely.

There are both top-down and bottom-up, but here we will build the parser from the bottom-up.

factor

First is factor.

factor <- number / (lbr expression rbr)

factor can be number or lbr expression rbr. You can drop it into the program as it is. The following methods are used here.

We haven't defined ʻexpression yet, so we'll use $ .lazy () to make it lazy. Using $ .lazy ()` will create a parser when it is actually evaluated.
The method of choosing one is $ .alt (). Returns the result of the first successful parser from multiple ones.
$ .Seq () indicates that multiple things are consecutive, such as lbr expression rbr.

Now let's write. ʻExpression` is only declared in advance.

var expression;
var factor = $.lazy(&() => $.alt(number, $.seq(lbr, expression, rbr)));

term

Next is term.

term <- factor (muldiv factor)*

This means that factor is followed by(muldiv factor)0 or more times. Since it is allowed 0 times, it is OK that nothing continues. Arranging like muldiv factor means that it is the same as the previous lbr expression rbr and is continuous. The method used here is as follows.

Repeat 0 or more times, specifies .many () for the parser.

Let's define it.

var term = $.seq(factor, $.seq(muldiv, factor).many());

You have now defined term.

expression

Finally, ʻexpression. The shape is the same as term`.

expression <- term (addsub term)*

Let's write it as it is.

expression = $.seq(term, $.seq(addsub, term).many());

Now you have a parser. Let's try parsing!

Perth

I will put all the source code once. That's not so much.

using Parsek;

var $ = new Parsek();
var number = $.regex(/[1-9][0-9]*|[0-9]/).map(Integer.parseInt);
var addsub = $.oneOf("+-");
var muldiv = $.oneOf("*/%");
var lbr = $.string("(");
var rbr = $.string(")");

var expression;
var factor = $.lazy(&() => $.alt(number, $.seq(lbr, expression, rbr)));
var term = $.seq(factor, $.seq(muldiv, factor).many());
expression = $.seq(term, $.seq(addsub, term).many());

// parse expression!
System.println(expression.parseAll("1+2*3+2*(14-2)"));
// => {"position":14,"status":1,"value":[[1,{}],[["+",[2,[["*",3]]]],["+",[2,[["*",["(",[[14,{}],[["-",[2,{}]]]],")"]]]]]]]}
System.println(expression.parseAll("1+2*3+2*(14-2-)"));
// => {"position":7,"status":0,"value":null}

The first one turns out to be successful (albeit long). It's hard to read the results, but let's shape this later. And you can see that the second one is failing. That's because the last (14-2-) doesn't match any of the rules.

Let's format this result. The one that works is .map () used in number.

Bracket formula

First, the $ .seq (lbr, expression, rbr) part. $ .seq () returns the resulting array as a value. The parenthesis expression does not need parentheses as a value, and only the result of the expression inside is sufficient. So, change it as follows.

var factor = $.lazy(&() => $.alt(number, $.seq(lbr, expression, rbr).map(&(value) => value[1])));

After modification, the result will be as follows.

System.println(expression.parseAll("1+2*3+2*(14-2)"));
// => {"position":14,"status":1,"value":[[1,{}],[["+",[2,[["*",3]]]],["+",[2,[["*",[[14,{}],[["-",[2,{}]]]]]]]]]]}

It's a little shorter.

term、expression

Next are term and ʻexpression`. Here, let's format it in the form of AST (Abstract Syntax Tree) for later analysis. Since it is basically a binary operator, create an object that is a combination of LHS (Left Hand Side = left side), RHS (Right Hand Side = right side), and operator (Operator).

Now change to use $ .seqMap () instead of $ .seq (). It's like a combination of $ .seq () and .map (), a convenient method that takes the result list as an argument and calls back to the function specified as the last argument. I use it like this.

var term = $.seqMap(factor, $.seq(muldiv, factor).many(), &(first, rest) => {
    var expr = first;
    for (var i = 0, l = rest.length(); i < l; ++i) {
        expr = { lhs: expr, op: rest[i][0], rhs: rest[i][1] };
    }
    return expr;
});

first is the result of factor and rest is the result of$ .seq (muldiv, factor) .many (). So rest is an array in the form of [operator, right-hand side] for each element (it can be an empty array). It is shaped into an AST. As a result, something like " 2 * 3 * 4 " is formatted as follows.

At the time of callback
First, first is 2
rest is[['*', 3], ['*', 4]]
ʻExprcontains2`
Enter the loop and ʻexpr becomes {lhs: 2, op:'*', rhs: 3} `.
Since there is another element, ʻexpr becomes {lhs: {lhs: 2, op:'', rhs: 3}, op:'', rhs: 4} `.

The AST has a shape in which the left branch grows (this is called a left connection). All operators this time are left-associative.

Since ʻexpression` is also the same, let's write it in the same way. The contents are exactly the same, so let's make it a function and use it.

function makeAST(first, rest) {
    var expr = first;
    for (var i = 0, l = rest.length(); i < l; ++i) {
        expr = { lhs: expr, op: rest[i][0], rhs: rest[i][1] };
    }
    return expr;
}
var term = $.seqMap(factor, $.seq(muldiv, factor).many(), makeAST);
expression = $.seqMap(term, $.seq(addsub, term).many(), makeAST);

I feel refreshed.

Then, it is a set of programs. With just this definition, you can parse four arithmetic operations (taking into account operator precedence). It is wonderful!

using Parsek;

function makeAST(first, rest) {
    var expr = first;
    for (var i = 0, l = rest.length(); i < l; ++i) {
        expr = { lhs: expr, op: rest[i][0], rhs: rest[i][1] };
    }
    return expr;
}

var $ = new Parsek();
var number = $.regex(/[1-9][0-9]*|[0-9]/).map(Integer.parseInt);
var addsub = $.oneOf("+-");
var muldiv = $.oneOf("*/%");
var lbr = $.string("(");
var rbr = $.string(")");

var expression;
var factor = $.lazy(&() => $.alt(number, $.seq(lbr, expression, rbr).map(&(value) => value[1])));
var term = $.seqMap(factor, $.seq(muldiv, factor).many(), makeAST);
expression = $.seqMap(term, $.seq(addsub, term).many(), makeAST);

// test
System.println(expression.parseAll("1+2*3+2*(14-2)").value.toJsonString(true));

The result is like this. It's working!

"lhs": {
    "lhs": 1,
    "op": "+",
    "rhs": {
        "lhs": 2,
        "op": "*",
        "rhs": 3
    }
},
"op": "+",
"rhs": {
    "lhs": 2,
    "op": "*",
    "rhs": {
        "lhs": 14,
        "op": "-",
        "rhs": 2
    }
}

in conclusion

Now you have the desired AST. Next time, I will interpret this and let it run. I will also use the JIT library I made!

See you next time!