Introduction

There is lexical analysis in the processing process of the compiler implementation. I would like to implement a very primitive lexical analysis in Java.

I was inspired by the Compiler Study Group held on January 28, 2017. Although it is far from the advanced content of the presentation of this study session and I do not have such knowledge, I thought it would be fun if there was an article about the compiler implementation Hello, World !.

What you want to do with lexical analysis

When implementing, let's first confirm what you want to do. Lexical analysis breaks down a programmed string into what are called lexical or tokens. Let's look at an example. There is a programmed string that assigns the result of the addition to a variable.

ans1 = 10 + 20

The purpose of lexical analysis is to decompose this string into five parts: ʻans1, =, 10, + , and 20. The decomposed string is called a lexical or token. In addition, each of these five tokens is a meaningful unit. You can intuitively understand the meaning, but ʻans1 is a variable, = and + are operators, and 10 and 20 are numbers. It is also the purpose of lexical analysis to give such meanings to tokens. In the implementation, "string is ʻans1 and meaning is variable", "string is = `and meaning is operator", etc. It is more convenient to put the strings and their meanings together. Therefore, in the implementation, the character string and the meaning are combined into a token object. To summarize what I want to do with lexical analysis again, "I want to decompose the programmatic character string into a character string and a token that represents the meaning."

How to disassemble

Now that I know what I want to do, I think about how to do it. To disassemble, constrain the token and use it. The restrictions are as follows.

Looking at the first letter of a token determines the meaning of that token
The first character that determines the meaning of the token is different depending on the meaning and there is no duplication

Consider this constraint on variable names, operators, and numbers. Determine the constraint for the first character.

The first character of the variable name is the alphabet
The first character of the operator is = or +
The first letter of the number is the number from 0 to 9

It also shows the restriction that there is no duplication in the first character.

Let's use this constraint to break it down into tokens. Let's look at the previous example again.

ans1 = 10 + 20

Trace the character string in the example program character by character in order from the left. Trace in the order of ʻa → n→s → 1... until the last0. When you trace the first ʻa here, ʻa is an alphabet, so Due to constraints, ʻa is determined to be the first character of the variable name. Since the variable name has been set, the following n → s → 1 is also traced in order as a string of variable name strings. The character after 1 is not part of the variable name, so it can be decomposed into tokens as a string corresponding to the variable name. I will continue to trace. The space before = is not related to any token, so simply skip it. And I meet =. This is also determined as an operator from the constraint, and the operator token can be decomposed. To summarize how to decompose, create a token constraint, trace the character string in the program, and decompose it into tokens with the meaning that meets the constraint of the first character in order to the end.

Try to implement in Java

Move on to implementation. First is the class that represents the token. There are two field variables in the class: a kind that represents the meaning and a value that holds the string. In the class that performs lexical analysis, the character string that is the program is decomposed into this Token. Implemented toString () to check the contents.

`Token.java`



public class Token {

    public String kind;
    public String value;

    @Override
    public String toString() {
        return kind + " \"" + value + "\"";
    }

}

Let's look at a class that performs lexical analysis. Let's take a partial look. This is the beginning of the process. There is a text field that holds the programmatic string. And there is an i-field that serves as an index for tracing that string. Initialize them with the init method.

`Lexer.java`



import java.util.ArrayList;
import java.util.List;

public class Lexer {

    private String text;
    private int i;

    public Lexer init(String text) {
        i = 0;
        this.text = text;
        return this;
    }

This is the part of the mechanism that traces the character string. ʻIs EOTmeans that there are no more characters to trace. c ()is a programmatic string, the character at the point of interest while tracing. next ()` returns the character of interest and advances the position of interest to the next.

`Lexer.java`


    private boolean isEOT() {
        return text.length() <= i;
    }

    private char c() throws Exception {
        if (isEOT()) {
            throw new Exception("No more character");
        }
        return text.charAt(i);
    }

    private char next() throws Exception {
        char c = c();
        ++i;
        return c;
    }

I will explain in order the part that uses the mechanism of tracing the character string. skipSpace () skips characters that are not related to tokens, such as spaces. No matter how many spaces are in the beginning or end of the program string or in the middle of the expression, this is OK.

`Lexer.java`


    private void skipSpace() throws Exception {
        while (!isEOT() && Character.isWhitespace(c())) {
            next();
        }
    }

This is the part that determines the constraint of the first character. In order, it determines whether it is the first character of the operator, number, and variable name.

`Lexer.java`


    private boolean isSignStart(char c) {
        return c == '=' || c == '+' || c == '-' || c == '*' || c == '/';
    }

    private boolean isDigitStart(char c) throws Exception {
        return Character.isDigit(c);
    }

    private boolean isVariableStart(char c) throws Exception {
        return Character.isAlphabetic(c);
    }

This is the part that breaks down into tokens. In order, it decomposes into tokens of operators, numbers, and variable names. The position of interest ʻi in the programmed string is by next () `. If you can break it down into tokens, you can probably proceed.

`Lexer.java`


    private Token sign() throws Exception {
        Token t = new Token();
        t.kind = "sign";
        t.value = Character.toString(next());
        return t;
    }

    private Token digit() throws Exception {
        StringBuilder b = new StringBuilder();
        b.append(next());
        while (!isEOT() && Character.isDigit(c())) {
            b.append(next());
        }
        Token t = new Token();
        t.kind = "digit";
        t.value = b.toString();
        return t;
    }

    private Token variable() throws Exception {
        StringBuilder b = new StringBuilder();
        b.append(next());
        while (!isEOT() && (Character.isAlphabetic(c()) || Character.isDigit(c()))) {
            b.append(next());
        }
        Token t = new Token();
        t.kind = "variable";
        t.value = b.toString();
        return t;
    }

Using the part that judges the constraint of the first character above and the part that decomposes it into tokens, Returns the first token found from the position of interest. First the space is skipped, The character at the position of interest determines the token and decomposes it into tokens.

`Lexer.java`


    public Token nextToken() throws Exception {
        skipSpace();
        if (isEOT()) {
            return null;
        } else if (isSignStart(c())) {
            return sign();
        } else if (isDigitStart(c())) {
            return digit();
        } else if (isVariableStart(c())) {
            return variable();
        } else {
            throw new Exception("Not a character for tokens");
        }
    }

Use the above nextToken () to decompose all the programmatic strings into Tokens.

`Lexer.java`


    public List<Token> tokenize() throws Exception {
        List<Token> tokens = new ArrayList<>();
        Token t = nextToken();
        while (t != null) {
            tokens.add(t);
            t = nextToken();
        }
        return tokens;
    }

Using the above implementation, the string that is the example program

 ans1 = 10 + 20

Is lexically analyzed and printed to standard output.

`Lexer.java`


    public static void main(String[] args) throws Exception {
        String text = " ans1 = 10 + 20 ";
        List<Token> tokens = new Lexer().init(text).tokenize();
        for (Token token : tokens) {
            System.out.println(token.toString());
        }
        // --> variable "ans1"
        // --> sign "="
        // --> digit "10"
        // --> sign "+"
        // --> digit "20"
    }

}

That's all for the implementation. Thank you very much.

in conclusion

The source is available here.

Calc https://github.com/quwahara/Calc/tree/lexer/Calc/src/main/java

There is a continuation article.

** Implement simple parsing in Java ** http://qiita.com/quwahara/items/9bf468ff4286b28d2a24

Just in case, I'll give you a summary of class Lexer.

`Lexer.java`



import java.util.ArrayList;
import java.util.List;

public class Lexer {

    private String text;
    private int i;

    public Lexer init(String text) {
        i = 0;
        this.text = text;
        return this;
    }

    private boolean isEOT() {
        return text.length() <= i;
    }

    private char c() throws Exception {
        if (isEOT()) {
            throw new Exception("No more character");
        }
        return text.charAt(i);
    }

    private char next() throws Exception {
        char c = c();
        ++i;
        return c;
    }

    private void skipSpace() throws Exception {
        while (!isEOT() && Character.isWhitespace(c())) {
            next();
        }
    }

    private boolean isSignStart(char c) {
        return c == '=' || c == '+' || c == '-' || c == '*' || c == '/';
    }

    private boolean isDigitStart(char c) throws Exception {
        return Character.isDigit(c);
    }

    private boolean isVariableStart(char c) throws Exception {
        return Character.isAlphabetic(c);
    }

    private Token sign() throws Exception {
        Token t = new Token();
        t.kind = "sign";
        t.value = Character.toString(next());
        return t;
    }

    private Token digit() throws Exception {
        StringBuilder b = new StringBuilder();
        b.append(next());
        while (!isEOT() && Character.isDigit(c())) {
            b.append(next());
        }
        Token t = new Token();
        t.kind = "digit";
        t.value = b.toString();
        return t;
    }

    private Token variable() throws Exception {
        StringBuilder b = new StringBuilder();
        b.append(next());
        while (!isEOT() && (Character.isAlphabetic(c()) || Character.isDigit(c()))) {
            b.append(next());
        }
        Token t = new Token();
        t.kind = "variable";
        t.value = b.toString();
        return t;
    }

    public Token nextToken() throws Exception {
        skipSpace();
        if (isEOT()) {
            return null;
        } else if (isSignStart(c())) {
            return sign();
        } else if (isDigitStart(c())) {
            return digit();
        } else if (isVariableStart(c())) {
            return variable();
        } else {
            throw new Exception("Not a character for tokens");
        }
    }

    public List<Token> tokenize() throws Exception {
        List<Token> tokens = new ArrayList<>();
        Token t = nextToken();
        while (t != null) {
            tokens.add(t);
            t = nextToken();
        }
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        String text = " ans1 = 10 + 20 ";
        List<Token> tokens = new Lexer().init(text).tokenize();
        for (Token token : tokens) {
            System.out.println(token.toString());
        }
        // --> variable ,"ans1"
        // --> sign "="
        // --> digit "10"
        // --> sign "+"
        // --> digit "20"
    }

}

1 Implement simple lexical analysis in Java

Introduction

What you want to do with lexical analysis

How to disassemble

Try to implement in Java

Token.java

Lexer.java

Lexer.java

Lexer.java

Lexer.java

Lexer.java

Lexer.java

Lexer.java

Lexer.java

in conclusion

Lexer.java

`Token.java`

`Lexer.java`

`Lexer.java`

`Lexer.java`

`Lexer.java`

`Lexer.java`

`Lexer.java`

`Lexer.java`

`Lexer.java`

`Lexer.java`