There is parsing in the process of the compiler implementation. Continuing from Implementing simple lexical analysis in Java, I would like to implement a very primitive parsing in Java.
Confirm what you want to do in implementation. Suppose that the character string in the program is an arithmetic operation. Parsing analyzes the order in which the four arithmetic operations are calculated. Consider the following formula as an example.
a = 3 + 4 * 5
As you can see at a glance, the calculation is done from the part in parentheses in the order below.
(4 * 5)
(3 + (4 * 5))
(a = (3 + (4 * 5)))
Parsing does this ordering. In an actual program, there are function definitions and calls. For the sake of simplicity, we first aim to be able to analyze the four arithmetic operations and assignments to variables.
I understand that I want to order.
Think about how to express the ordered state in the implementation.
Let's start with a part of the example formula, 4 * 5
.
In Previous article, I decomposed the expression into tokens.
When 4 * 5
is decomposed into tokens, it becomes 4
, *
, and 5
.
Here, in order to express the state of 4 * 5
, we added a field variable so that we can see that we have a 4
token on the left and a 5
token on the right for the*
token. I'll give it to you.
Specifically, I will add left
and right
to the implementation of the Token class in the previous article.
Token.java
public class Token {
public String kind;
public String value;
public Token left;
public Token right;
@Override
public String toString() {
return kind + " \"" + value + "\"";
}
public String paren() {
if (left == null && right == null) {
return value;
} else {
StringBuilder b = new StringBuilder();
b.append("(");
if (left != null) {
b.append(left.paren()).append(" ");
}
b.append(value);
if (right != null) {
b.append(" ").append(right.paren());
}
b.append(")");
return b.toString();
}
}
}
Now you can express the state of 4 * 5
by giving the left
of the *
token the 4
token and the right
the 5
token.
Next, the state of 3 + (4 * 5)
is expressed.
In the same way, let the +
token left
have the 3
token and the right
have the*
token.
Here the *
token has the 4
and 5
tokens in the left
and right
in the previous description.
In other words, it is 4 * 5
.
In summary, there are +
tokens, 3
tokens in left
, 4 * 5
tokens in right
, and the state of 3 + (4 * 5)
can be expressed.
On a different note, I also added paren (), which represents the state in parentheses, to the Token class to check the contents.
Next, let's consider how to do this ordering.
In the example expression, three operators appear. There are three: =
, +
, and *
.
It is these operators that are the basis of the order.
In the order I checked for what I wanted to do in parsing, *
was number 1, +
was number 2, and =
was number 3.
Here, the operator that comes first in this order is assigned a numerical value of the degree that comes first.
Specifically, assign 60
to*
, 50
to +
, and 10
to=
.
Based on this degree, we will put parentheses first from the one with the largest number.
Since *
is 60
and the degree is the largest, it is bracketed first,
=
Is 10
, which is the smallest degree, so it will be parenthesized at the end.
Move on to implementation. Let's take a look at some of the classes that do parsing.
The first defines how it works with respect to the meaning of the token.
degrees
defines the degrees of operator order.
factorKinds
defines the kind
of tokens at the left and right ends of the expression, such as the ʻa token and the
3token.
binaryKinds defines the
kindof the token that comes in the middle of the expression, such as the
=token or the
+token.
rightAssocs is an ordering to
= `tokens that needs special treatment and is the definition for that.
Parser.java
public class Parser {
private Map<String, Integer> degrees;
private List<String> factorKinds;
private List<String> binaryKinds;
private List<String> rightAssocs;
public Parser() {
degrees = new HashMap<>();
degrees.put("*", 60);
degrees.put("/", 60);
degrees.put("+", 50);
degrees.put("-", 50);
degrees.put("=", 10);
factorKinds = Arrays.asList(new String[] { "digit", "variable" });
binaryKinds = Arrays.asList(new String[] { "sign" });
rightAssocs = Arrays.asList(new String[] { "=" });
}
This is the initialization part before parsing.
You will receive a list of tokens decomposed by lexical analysis.
Parse this token list.
Since it is convenient to handle the termination in parsing, add the ʻeob token that represents the termination to the end of the token list. Also, set the index ʻi
to 0 to refer to the token list from the beginning to the end.
Parser.java
private List<Token> tokens;
private int i;
public Parser init(List<Token> tokens) {
i = 0;
this.tokens = new ArrayList<Token>(tokens);
Token eob = new Token();
eob.kind = "eob";
eob.value = "(eob)";
this.tokens.add(eob);
return this;
}
It is a function to refer to the token list in order.
token ()
is the token you are currently looking at in the token list.
next ()
returns the token of interest and advances the position of interest to the next.
Parser.java
private Token token() throws Exception {
if (tokens.size() <= i) {
throw new Exception("No more token");
}
return tokens.get(i);
}
private Token next() throws Exception {
Token t = token();
++i;
return t;
}
I will go into the explanation of the part to be analyzed.
lead ()
parses the tokens at the left and right ends of the expression, such as ʻa and
3`, and returns that token.
Parser.java
private Token lead(Token token) throws Exception {
if (factorKinds.contains(token.kind)) {
return token;
} else {
throw new Exception("The token cannot place there.");
}
}
degree ()
returns the token priority.
bind ()
allocates the left and right tokens to the operator token and returns the assigned operator token.
Parser.java
private int degree(Token t) {
if (degrees.containsKey(t.value)) {
return degrees.get(t.value);
} else {
return 0;
}
}
private Token bind(Token left, Token operator) throws Exception {
if (binaryKinds.contains(operator.kind)) {
operator.left = left;
int leftDegree = degree(operator);
if (rightAssocs.contains(operator.value)) {
leftDegree -= 1;
}
operator.right = expression(leftDegree);
return operator;
} else {
throw new Exception("The token cannot place there.");
}
}
Let's take a closer look at bind ()
.
bind ()
takes the token to the left of the expression and the operator token of the expression as arguments.
bind ()
first assigns the token to the left to the operator token ʻoperator.left. The token assigned to ʻoperator.right
assigns the return value that called ʻexpression (). Let's explain what token ʻexpression ()
returns.
When calling ʻexpression (), pass the precedence of the operator token as an argument. The degree passed is compared in ʻexpression ()
with the degree of the operator that follows in the token list.
If more or less passed, ʻexpression ()returns the token before the operator token that follows. For example, in the formula below, consider the case where
bind ()is used as an argument,
left is
6, and ʻoperator
is+
.
6 + 7 - 8
ʻExpression ()is called with a degree of
+of
50 and is compared to the subsequent degree of
-``50. Since the degree is the same, it returns the token
7 that precedes
-. Then it returns to
bind ()and
7 is assigned to ʻoperator.right
.
Now bind ()
returns the+
associated with 6 + 7
.
If leftDegree
is small in the degree comparison, it will be explained later.
ʻExpression () uses operator precedence to control token association. ʻExpress ()
calls lead ()
and bind ()
as described above.
Parser.java
public Token expression(int leftDegree) throws Exception {
Token left = lead(next());
int rightDegree = degree(token());
while (leftDegree < rightDegree) {
Token operator = next();
left = bind(left, operator);
rightDegree = degree(token());
}
return left;
}
Let's take the movement of ʻexpression ()` as an example of some token lists.
The first call to ʻexpression ()calls
leftDegree with
0. For ʻa
only, a call tolead ()
returns ʻa and
left is determined by ʻa
.
The next degree ()
returns the degree of(eob)
added to make it easier to terminate, and rightDegree
is 0
.
leftDegree
is 0
, rightDegree
is 0
, while
is not established, and left
ʻa is returned as it is. In other words, the token list of only ʻa
could be parsed.
If ʻa = 3, the call to
lead () returns ʻa
and left
is determined by ʻa. The next
degree () returns the degree of
= and
rightDegreeis
10. As in the previous case, ʻexpression ()
holds while
when leftDegree
is called with 0
.
Then it calls bind ()
with ʻaand
=as arguments. As explained in
bind (),
bind () calls ʻexpression ()
to parse the token on the right side of the expression.
When calling ʻexpression ()with
bind (), it is possible to parse up to ʻa =
, so the only token that remains in the token list is 3
.
This is the same situation as "when the token list is only ʻa" explained earlier. That is, ʻexpression ()
called by bind ()
returns 3
, and ʻoperator.right is determined by
3.
bind () returns
= to the caller's ʻexpression ()
,
The left
of the caller's ʻexpression ()is determined by
=. The
degree ()on the line following the
bind ()call returns the degree
0 of
(eob) , so exit
while. With this, ʻexpression ()
returns the left
determined to be=
, and the analysis is completed.
This explanation is also the explanation when leftDegree
is small in the degree comparison that was explained later.
This is the final explanation of the part to be analyzed.
block ()
calls ʻexpression ()until the token list becomes
(eob)and adds the parsed result to the list. The analysis result is added to
blk` as many as the number of expressions in the token list.
Parser.java
public List<Token> block() throws Exception {
List<Token> blk = new ArrayList<Token>();
while (!token().kind.equals("eob")) {
blk.add(expression(0));
}
return blk;
}
Using the above implementation, the string that is the example program
a = 3 + 4 * 5
Parses and prints to standard output.
Parser.java
public static void main(String[] args) throws Exception {
String text = "a = 3 + 4 * 5";
List<Token> tokens = new Lexer().init(text).tokenize();
List<Token> blk = new Parser().init(tokens).block();
for (Token ast : blk) {
System.out.println(ast.paren());
}
// --> (a = (3 + (4 * 5)))
}
}
That's all for the implementation. Thank you very much.
The source is available here.
Calc https://github.com/quwahara/Calc/tree/article-2-parser/Calc/src/main/java
There is a continuation article.
** Implement a simple interpreter in Java ** http://qiita.com/quwahara/items/30e93dfd2690913d66c0
Finally, I will give you a summary of the Parser classes.
Parser.java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class Parser {
private Map<String, Integer> degrees;
private List<String> factorKinds;
private List<String> binaryKinds;
private List<String> rightAssocs;
public Parser() {
degrees = new HashMap<>();
degrees.put("*", 60);
degrees.put("/", 60);
degrees.put("+", 50);
degrees.put("-", 50);
degrees.put("=", 10);
factorKinds = Arrays.asList(new String[] { "digit", "variable" });
binaryKinds = Arrays.asList(new String[] { "sign" });
rightAssocs = Arrays.asList(new String[] { "=" });
}
private List<Token> tokens;
private int i;
public Parser init(List<Token> tokens) {
i = 0;
this.tokens = new ArrayList<Token>(tokens);
Token eob = new Token();
eob.kind = "eob";
eob.value = "(eob)";
this.tokens.add(eob);
return this;
}
private Token token() throws Exception {
if (tokens.size() <= i) {
throw new Exception("No more token");
}
return tokens.get(i);
}
private Token next() throws Exception {
Token t = token();
++i;
return t;
}
private Token lead(Token token) throws Exception {
if (factorKinds.contains(token.kind)) {
return token;
} else {
throw new Exception("The token cannot place there.");
}
}
private int degree(Token t) {
if (degrees.containsKey(t.value)) {
return degrees.get(t.value);
} else {
return 0;
}
}
private Token bind(Token left, Token operator) throws Exception {
if (binaryKinds.contains(operator.kind)) {
operator.left = left;
int leftDegree = degree(operator);
if (rightAssocs.contains(operator.value)) {
leftDegree -= 1;
}
operator.right = expression(leftDegree);
return operator;
} else {
throw new Exception("The token cannot place there.");
}
}
public Token expression(int leftDegree) throws Exception {
Token left = lead(next());
int rightDegree = degree(token());
while (leftDegree < rightDegree) {
Token operator = next();
left = bind(left, operator);
rightDegree = degree(token());
}
return left;
}
public List<Token> block() throws Exception {
List<Token> blk = new ArrayList<Token>();
while (!token().kind.equals("eob")) {
blk.add(expression(0));
}
return blk;
}
public static void main(String[] args) throws Exception {
String text = "a = 3 + 4 * 5";
List<Token> tokens = new Lexer().init(text).tokenize();
List<Token> blk = new Parser().init(tokens).block();
for (Token ast : blk) {
System.out.println(ast.paren());
}
// --> (a = (3 + (4 * 5)))
}
}
Recommended Posts