There is parsing in the process of the compiler implementation. Continuing from Implementing simple lexical analysis in Java, I would like to implement a very primitive parsing in Java.
Confirm what you want to do in implementation. Suppose that the character string in the program is an arithmetic operation. Parsing analyzes the order in which the four arithmetic operations are calculated. Consider the following formula as an example.
a = 3 + 4 * 5
As you can see at a glance, the calculation is done from the part in parentheses in the order below.
(4 * 5)(3 + (4 * 5))(a = (3 + (4 * 5)))Parsing does this ordering. In an actual program, there are function definitions and calls. For the sake of simplicity, we first aim to be able to analyze the four arithmetic operations and assignments to variables.
I understand that I want to order.
Think about how to express the ordered state in the implementation.
Let's start with a part of the example formula, 4 * 5.
In Previous article, I decomposed the expression into tokens.
When 4 * 5 is decomposed into tokens, it becomes 4, *, and 5.
Here, in order to express the state of 4 * 5, we added a field variable so that we can see that we have a 4 token on the left and a 5 token on the right for the*token. I'll give it to you.
Specifically, I will add left and right to the implementation of the Token class in the previous article.
Token.java
public class Token {
public String kind;
public String value;
public Token left;
public Token right;
@Override
public String toString() {
return kind + " \"" + value + "\"";
}
public String paren() {
if (left == null && right == null) {
return value;
} else {
StringBuilder b = new StringBuilder();
b.append("(");
if (left != null) {
b.append(left.paren()).append(" ");
}
b.append(value);
if (right != null) {
b.append(" ").append(right.paren());
}
b.append(")");
return b.toString();
}
}
}
Now you can express the state of 4 * 5 by giving the left of the * token the 4 token and the right the 5 token.
Next, the state of 3 + (4 * 5) is expressed.
In the same way, let the + token left have the 3 token and the right have the*token.
Here the * token has the 4 and 5 tokens in the left and right in the previous description.
In other words, it is 4 * 5.
In summary, there are + tokens, 3 tokens in left, 4 * 5 tokens in right, and the state of 3 + (4 * 5) can be expressed.
On a different note, I also added paren (), which represents the state in parentheses, to the Token class to check the contents.
Next, let's consider how to do this ordering.
In the example expression, three operators appear. There are three: =, +, and *.
It is these operators that are the basis of the order.
In the order I checked for what I wanted to do in parsing, * was number 1, + was number 2, and = was number 3.
Here, the operator that comes first in this order is assigned a numerical value of the degree that comes first.
Specifically, assign 60 to*, 50 to +, and 10 to=.
Based on this degree, we will put parentheses first from the one with the largest number.
Since * is 60 and the degree is the largest, it is bracketed first,
= Is 10, which is the smallest degree, so it will be parenthesized at the end.
Move on to implementation. Let's take a look at some of the classes that do parsing.
The first defines how it works with respect to the meaning of the token.
degrees defines the degrees of operator order.
factorKinds defines the kind of tokens at the left and right ends of the expression, such as the ʻa token and the 3token. binaryKinds defines the kindof the token that comes in the middle of the expression, such as the=token or the+token. rightAssocs is an ordering to = `tokens that needs special treatment and is the definition for that.
Parser.java
public class Parser {
private Map<String, Integer> degrees;
private List<String> factorKinds;
private List<String> binaryKinds;
private List<String> rightAssocs;
public Parser() {
degrees = new HashMap<>();
degrees.put("*", 60);
degrees.put("/", 60);
degrees.put("+", 50);
degrees.put("-", 50);
degrees.put("=", 10);
factorKinds = Arrays.asList(new String[] { "digit", "variable" });
binaryKinds = Arrays.asList(new String[] { "sign" });
rightAssocs = Arrays.asList(new String[] { "=" });
}
This is the initialization part before parsing.
You will receive a list of tokens decomposed by lexical analysis.
Parse this token list.
Since it is convenient to handle the termination in parsing, add the ʻeob token that represents the termination to the end of the token list. Also, set the index ʻi to 0 to refer to the token list from the beginning to the end.
Parser.java
private List<Token> tokens;
private int i;
public Parser init(List<Token> tokens) {
i = 0;
this.tokens = new ArrayList<Token>(tokens);
Token eob = new Token();
eob.kind = "eob";
eob.value = "(eob)";
this.tokens.add(eob);
return this;
}
It is a function to refer to the token list in order.
token () is the token you are currently looking at in the token list.
next () returns the token of interest and advances the position of interest to the next.
Parser.java
private Token token() throws Exception {
if (tokens.size() <= i) {
throw new Exception("No more token");
}
return tokens.get(i);
}
private Token next() throws Exception {
Token t = token();
++i;
return t;
}
I will go into the explanation of the part to be analyzed.
lead () parses the tokens at the left and right ends of the expression, such as ʻa and 3`, and returns that token.
Parser.java
private Token lead(Token token) throws Exception {
if (factorKinds.contains(token.kind)) {
return token;
} else {
throw new Exception("The token cannot place there.");
}
}
degree () returns the token priority.
bind () allocates the left and right tokens to the operator token and returns the assigned operator token.
Parser.java
private int degree(Token t) {
if (degrees.containsKey(t.value)) {
return degrees.get(t.value);
} else {
return 0;
}
}
private Token bind(Token left, Token operator) throws Exception {
if (binaryKinds.contains(operator.kind)) {
operator.left = left;
int leftDegree = degree(operator);
if (rightAssocs.contains(operator.value)) {
leftDegree -= 1;
}
operator.right = expression(leftDegree);
return operator;
} else {
throw new Exception("The token cannot place there.");
}
}
Let's take a closer look at bind ().
bind () takes the token to the left of the expression and the operator token of the expression as arguments.
bind () first assigns the token to the left to the operator token ʻoperator.left. The token assigned to ʻoperator.right assigns the return value that called ʻexpression (). Let's explain what token ʻexpression () returns.
When calling ʻexpression (), pass the precedence of the operator token as an argument. The degree passed is compared in ʻexpression () with the degree of the operator that follows in the token list.
If more or less passed, ʻexpression ()returns the token before the operator token that follows. For example, in the formula below, consider the case wherebind ()is used as an argument,left is 6, and ʻoperator is+.
6 + 7 - 8
ʻExpression ()is called with a degree of+of50 and is compared to the subsequent degree of -``50. Since the degree is the same, it returns the token 7 that precedes -. Then it returns to bind ()and7 is assigned to ʻoperator.right.
Now bind () returns the+associated with 6 + 7.
If leftDegree is small in the degree comparison, it will be explained later.
ʻExpression () uses operator precedence to control token association. ʻExpress () calls lead () and bind () as described above.
Parser.java
public Token expression(int leftDegree) throws Exception {
Token left = lead(next());
int rightDegree = degree(token());
while (leftDegree < rightDegree) {
Token operator = next();
left = bind(left, operator);
rightDegree = degree(token());
}
return left;
}
Let's take the movement of ʻexpression ()` as an example of some token lists.
The first call to ʻexpression ()callsleftDegree with 0. For ʻa only, a call tolead ()returns ʻa and left is determined by ʻa.
The next degree () returns the degree of(eob)added to make it easier to terminate, and rightDegree is 0.
leftDegree is 0, rightDegree is 0, while is not established, and left ʻa is returned as it is. In other words, the token list of only ʻa could be parsed.
If ʻa = 3, the call to lead () returns ʻa and left is determined by ʻa. The next degree () returns the degree of = and rightDegreeis10. As in the previous case, ʻexpression () holds while when leftDegree is called with 0.
Then it calls bind () with ʻaand=as arguments. As explained inbind (), bind () calls ʻexpression () to parse the token on the right side of the expression.
When calling ʻexpression ()withbind (), it is possible to parse up to ʻa =, so the only token that remains in the token list is 3.
This is the same situation as "when the token list is only ʻa" explained earlier. That is, ʻexpression () called by bind () returns 3, and ʻoperator.right is determined by 3. bind () returns = to the caller's ʻexpression (),
The left of the caller's ʻexpression ()is determined by=. The degree ()on the line following thebind ()call returns the degree0 of (eob) , so exit while. With this, ʻexpression () returns the left determined to be=, and the analysis is completed.
This explanation is also the explanation when leftDegree is small in the degree comparison that was explained later.
This is the final explanation of the part to be analyzed.
block () calls ʻexpression ()until the token list becomes(eob)and adds the parsed result to the list. The analysis result is added toblk` as many as the number of expressions in the token list.
Parser.java
public List<Token> block() throws Exception {
List<Token> blk = new ArrayList<Token>();
while (!token().kind.equals("eob")) {
blk.add(expression(0));
}
return blk;
}
Using the above implementation, the string that is the example program
a = 3 + 4 * 5
Parses and prints to standard output.
Parser.java
public static void main(String[] args) throws Exception {
String text = "a = 3 + 4 * 5";
List<Token> tokens = new Lexer().init(text).tokenize();
List<Token> blk = new Parser().init(tokens).block();
for (Token ast : blk) {
System.out.println(ast.paren());
}
// --> (a = (3 + (4 * 5)))
}
}
That's all for the implementation. Thank you very much.
The source is available here.
Calc https://github.com/quwahara/Calc/tree/article-2-parser/Calc/src/main/java
There is a continuation article.
** Implement a simple interpreter in Java ** http://qiita.com/quwahara/items/30e93dfd2690913d66c0
Finally, I will give you a summary of the Parser classes.
Parser.java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class Parser {
private Map<String, Integer> degrees;
private List<String> factorKinds;
private List<String> binaryKinds;
private List<String> rightAssocs;
public Parser() {
degrees = new HashMap<>();
degrees.put("*", 60);
degrees.put("/", 60);
degrees.put("+", 50);
degrees.put("-", 50);
degrees.put("=", 10);
factorKinds = Arrays.asList(new String[] { "digit", "variable" });
binaryKinds = Arrays.asList(new String[] { "sign" });
rightAssocs = Arrays.asList(new String[] { "=" });
}
private List<Token> tokens;
private int i;
public Parser init(List<Token> tokens) {
i = 0;
this.tokens = new ArrayList<Token>(tokens);
Token eob = new Token();
eob.kind = "eob";
eob.value = "(eob)";
this.tokens.add(eob);
return this;
}
private Token token() throws Exception {
if (tokens.size() <= i) {
throw new Exception("No more token");
}
return tokens.get(i);
}
private Token next() throws Exception {
Token t = token();
++i;
return t;
}
private Token lead(Token token) throws Exception {
if (factorKinds.contains(token.kind)) {
return token;
} else {
throw new Exception("The token cannot place there.");
}
}
private int degree(Token t) {
if (degrees.containsKey(t.value)) {
return degrees.get(t.value);
} else {
return 0;
}
}
private Token bind(Token left, Token operator) throws Exception {
if (binaryKinds.contains(operator.kind)) {
operator.left = left;
int leftDegree = degree(operator);
if (rightAssocs.contains(operator.value)) {
leftDegree -= 1;
}
operator.right = expression(leftDegree);
return operator;
} else {
throw new Exception("The token cannot place there.");
}
}
public Token expression(int leftDegree) throws Exception {
Token left = lead(next());
int rightDegree = degree(token());
while (leftDegree < rightDegree) {
Token operator = next();
left = bind(left, operator);
rightDegree = degree(token());
}
return left;
}
public List<Token> block() throws Exception {
List<Token> blk = new ArrayList<Token>();
while (!token().kind.equals("eob")) {
blk.add(expression(0));
}
return blk;
}
public static void main(String[] args) throws Exception {
String text = "a = 3 + 4 * 5";
List<Token> tokens = new Lexer().init(text).tokenize();
List<Token> blk = new Parser().init(tokens).block();
for (Token ast : blk) {
System.out.println(ast.paren());
}
// --> (a = (3 + (4 * 5)))
}
}
Recommended Posts