Introduction

I made a simple lexical analyzer for use in my own Java console program. I created the token class in the previous article, so this time I cover the lexical analyzer itself. Lexical analysis is easier with an existing library, but this code is used at a workplace where downloading from external sites is prohibited, so I wrote it from scratch using only standard Java 8 features.

2018/10/22: Revised the source. 2018/10/24: Changed the constructor to store a copy of the string to be parsed.

In creating it, I referred to "Implementing simple lexical analysis in Java".

Note

The source and its explanation are long this time, so for the Token class please see the previous article.

Lexical analysis class

`LexicalAnalyzer.java`


package console;

public class LexicalAnalyzer {
    public static LexicalAnalyzer create(String target) {
        if ( target == null  || target.trim().isEmpty() ) { target = ""; }
        return new LexicalAnalyzer(target);
    }

    public java.util.List<Token> analyze() {
        char c;
        while ( (c = next()) != '\0' ) {
            if ( isSymbol_1(c) ) {
                tokens_.add( Token.create(c) );
                continue;
            }
            if ( isQuote(c) ) {
                quotedText(c);
                continue;
            }
            text(c);
        }
        return new java.util.ArrayList<>(tokens_);
    }

    // query methods ================================================================================
    public boolean isEmpty() { return tokens_.size() == 0;}

    public boolean isValid() {
        return !isEmpty()  &&  tokens_.stream().noneMatch( e -> e.kind() == Token.Kinds.Unknown );
    }

    // internal methods ======================================================================
    /**Cut out up to the next quote character as a block of tokens. */
    private void quotedText(char quote) {
        tokens_.add( Token.create(quote));  // create token of begin quote

        java.lang.StringBuilder builder = new java.lang.StringBuilder();
        char c;
        while ( (c = nextAll()) != '\0'  &&  c != quote) { builder.append(c); }
        if ( builder.length() != 0 ) {
            tokens_.add( Token.create(builder.toString()) );  // append string
        }

        tokens_.add( Token.create(c) );  // append token of end quote
    }

    /**Cut out the text up to the next separator or whitespace character as a single token. */
    private void text(char first) {
        java.lang.StringBuilder builder = new java.lang.StringBuilder();
        builder.append(first);

        char c;
        while ( (c = nextAll()) != '\0'  &&  !isSeparator(c)  &&  !isWhitespace(c) ) { 
            builder.append(c);
        }
        tokens_.add( Token.create(builder.toString()) );

        // append separator token, if not end of text
        if ( isEnd() ) { return; }

        tokens_.add( isWhitespace(c) ? Token.create(' ') : Token.create(c) );
    }

    private char next() {
        skipSpace();
        return nextAll();
    }

    private char nextAll() {
        char c = aChar();
        ++pos_;
        return c;
    }

    private char aChar() { return isEnd() ? '\0' : target_.charAt(pos_); }

    private void skipSpace() {
        while ( !isEnd()  &&  Character.isWhitespace(aChar()) ) { pos_++; }
    }

    private boolean isEnd()                 { return length_ <= pos_; }
    private boolean isSeparator(char c)     { return exists(separators_, c);    }
    private boolean isQuote(char c)         { return exists(quotes_, c);        }
    private boolean isSymbol_1(char c)      { return exists(symbol1_, c);       }
    private boolean isWhitespace(char c)    { return Character.isWhitespace(c); }

    private boolean exists(char[] arr, char c) {
        return java.util.Arrays.binarySearch(arr, c) >= 0;
    }

    private LexicalAnalyzer(String target) {
        target_ = target;
        length_ = target.length();
    }

    // internal fields ======================================================================
    // Arrays.binarySearch() requires sorted input, so each array is sorted right after its declaration.
    private static final char[] separators_ = { ':', ',', '=', '(', ')', '{', '}' };
    static { java.util.Arrays.sort(separators_); }

    private static final char[] quotes_ = { '"', '\'' };
    static { java.util.Arrays.sort(quotes_); }

    private static final char[] symbol1_ = {'(', ')', '{', '}', ':', ',', '=', '&' };
    static { java.util.Arrays.sort(symbol1_); }

    final String target_;       // analyze target string
    final int    length_;       // length of target string
    int          pos_ = 0;      // "next" analyzing position

    java.util.List<Token> tokens_ = new java.util.ArrayList<>();  // result
}

Commentary

Factory method

Ideally the string to be parsed would be a mandatory argument, but there is no way to force the caller to supply one, so when null or a blank string is passed, an empty string is used as the target instead. This frees the caller from null checks and exception handling. Because the factory method is the only way to create an instance, the constructor does not need to check its argument.

The constructor stores a copy of the argument (new String(target)) rather than the reference itself. Since java.lang.String is immutable, the caller cannot alter the target through the original reference in any case, so the copy is a defensive habit rather than a strict requirement.

    public static LexicalAnalyzer create(String target) {
        if ( target == null  || target.trim().isEmpty() ) { target = ""; }
        return new LexicalAnalyzer(target);
    }

    private LexicalAnalyzer(String target) {
        target_ = new String(target);
        length_ = target.length();
    }

    final String target_;       // analyze target string
    final int    length_;       // length of target string
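
The factory-method idea can be tried in isolation. The following is a minimal sketch, assuming a hypothetical holder class named `Target` (it is not part of the analyzer): null and blank input are normalized to an empty string in one place, so the private constructor can trust its argument.

```java
/** Hypothetical holder used only to illustrate the factory-method pattern. */
final class Target {
    private final String text;

    private Target(String text) {
        // No argument check here: create() is the only caller and never passes null.
        this.text = text;
    }

    /** Normalizes null or whitespace-only input to an empty string. */
    static Target create(String text) {
        if (text == null || text.trim().isEmpty()) { text = ""; }
        return new Target(text);
    }

    String text() { return text; }

    public static void main(String[] args) {
        System.out.println(Target.create(null).text().isEmpty());   // true
        System.out.println(Target.create("   ").text().isEmpty());  // true
        System.out.println(Target.create("a=b").text());            // a=b
    }
}
```

The caller can pass whatever it has and always receives a usable object, which is the property the factory method above is aiming for.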

Support method

Since this is a lexical analyzer, it takes the target string apart one character at a time. The following methods provide the basic operations for that.

nextAll() returns the character at the reading position and advances the position (pos_) by one. Unlike next(), it returns every character, including whitespace. It is used in text() and quotedText(), described below.
next() skips whitespace and returns the first non-whitespace character it reaches.
aChar() returns the character at the reading position. It returns a null character ('\0') once the target string is exhausted.
skipSpace() skips syntactically meaningless whitespace characters by advancing the reading position.
isEnd() returns true when the end of the target string has been reached.
isSeparator() returns true if the argument is one of the delimiter symbols registered in separators_ (colon, comma, equals sign, parentheses, and braces). It is used to find the end of an unquoted string such as an identifier.
Before the fix it compared the argument directly with char literals; now it searches the array separators_ with Arrays.binarySearch() and returns true if the character is found. Arrays.binarySearch() requires the array to be sorted, so each array is sorted in a static initializer (static { ... }) placed immediately after its declaration.
There are fewer delimiter symbols here than symbol kinds defined in the Token class of the previous article. This matches the requirements of the higher-level program that uses the analyzer. (Square brackets, for example, are not expected to be used as delimiters.)
isQuote() returns true if the argument is a quote character.
As with isSeparator(), the target characters are registered in a static array, in this case quotes_.
isSymbol_1() returns true for symbols that form a token with a single character (the characters in the static array symbol1_).
isWhitespace() is not strictly necessary, but I added it because mixing instance methods and direct Java API calls in the main processing looked untidy.
exists() implements the array search. (Calling Arrays.binarySearch() in several places seemed verbose.)
The static arrays separators_, quotes_, and symbol1_ are searched by isSeparator(), isQuote(), and isSymbol_1() respectively. I also added some characters that were missing.
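
To see the difference between next() and nextAll() in action, here is a small stand-alone sketch. The class name `CharCursor` and its unsuffixed field names are assumptions made for the illustration; the logic mirrors the methods described above.

```java
/** Hypothetical stand-alone cursor mirroring next(), nextAll(), aChar() and isEnd(). */
final class CharCursor {
    private final String target;
    private int pos = 0;   // index of the next character to read

    CharCursor(String target) { this.target = target; }

    boolean isEnd() { return target.length() <= pos; }

    /** Character at the read position, or '\0' past the end. */
    char aChar() { return isEnd() ? '\0' : target.charAt(pos); }

    /** Returns the current character (whitespace included) and advances. */
    char nextAll() {
        char c = aChar();
        ++pos;
        return c;
    }

    /** Skips whitespace, then returns the next character and advances. */
    char next() {
        while (!isEnd() && Character.isWhitespace(aChar())) { pos++; }
        return nextAll();
    }

    public static void main(String[] args) {
        CharCursor cursor = new CharCursor("  a b");
        System.out.println(cursor.next());           // a  (leading spaces skipped)
        System.out.println((int) cursor.nextAll());  // 32 (the space is returned as is)
        System.out.println(cursor.next());           // b
        System.out.println(cursor.next() == '\0');   // true (end of input)
    }
}
```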

    private char next() {
        skipSpace();
        return nextAll();
    }

    private char nextAll() {
        char c = aChar();
        ++pos_;
        return c;
    }

    private char aChar() { return isEnd() ? '\0' : target_.charAt(pos_); }

    private void skipSpace() {
        while ( !isEnd()  &&  Character.isWhitespace(aChar()) ) { pos_++; }
    }

    private boolean isEnd() { return length_ <= pos_; }
    private boolean isSeparator(char c)     { return exists(separators_, c);    }
    private boolean isQuote(char c)         { return exists(quotes_, c);        }
    private boolean isSymbol_1(char c)      { return exists(symbol1_, c);       }
    private boolean isWhitespace(char c)    { return Character.isWhitespace(c); }

    private boolean exists(char[] arr, char c) {
        return java.util.Arrays.binarySearch(arr, c) >= 0;
    }

    private static final char[] separators_ = { ',', '=', '(', ')', '{', '}', ':' };
    static { java.util.Arrays.sort(separators_); }

    private static final char[] quotes_ = { '"', '\'' };
    static { java.util.Arrays.sort(quotes_); }

    private static final char[] symbol1_ = { '(', ')', '{', '}', ':', ',', '=', '&' };
    static { java.util.Arrays.sort(symbol1_); }


    int          pos_ = 0;      // "next" analyzing position
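
The sort-then-search requirement of Arrays.binarySearch() can be checked with a short sketch. `CharSet` is a hypothetical helper name introduced here; it sorts once in the constructor where the analyzer uses static initializers.

```java
import java.util.Arrays;

/** Hypothetical helper showing why the lookup arrays are sorted before searching. */
final class CharSet {
    private final char[] sorted;

    CharSet(char... chars) {
        sorted = chars.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);      // binarySearch requires ascending order
    }

    boolean contains(char c) {
        // binarySearch returns an index (>= 0) when found, a negative value otherwise
        return Arrays.binarySearch(sorted, c) >= 0;
    }

    public static void main(String[] args) {
        CharSet separators = new CharSet(':', ',', '=', '(', ')', '{', '}');
        System.out.println(separators.contains('='));  // true
        System.out.println(separators.contains('&'));  // false
    }
}
```

If the array were left unsorted, the result of binarySearch() would be undefined, so a character that is present might not be found.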

Analysis processing

analyze() performs the lexical analysis and returns the result as a list of the Token objects created in the previous article. The return value is a copy of the internal token list tokens_. If the internal list itself were returned, the caller could add or remove elements (it is a standard list) and thereby change the tokens_ held by the object, because both would refer to the same list object. If analyze() is called a second time or later, it returns a copy of the token list that has already been built. The input is not analyzed again.

next() gets the next character to be examined. Once the end of the string has been reached it returns a null character, which ends the loop (more precisely, the continuation condition no longer holds).
Whitespace is now skipped inside next().
If the character is a one-character token other than a quote (isSymbol_1() returns true), a one-character token is created and added to the list.
If the character is a quote, processing is handed to quotedText(char), with the quote that was read passed as the argument.
If neither applies, processing is handed to text(char). The first character is passed as the argument so that it can be included in the string token built inside text().
After the loop exits, a copy of the token list is returned to the caller.
If an invalid argument was passed to the factory method, the string to be parsed is empty, so the loop never runs and an empty list is returned.

Changing next() to skip whitespace and adding the predicate methods made the code cleaner than the earlier switch-based control. The intent of each check is also reflected in the method name, so no comments are needed. (Are they?)

    public java.util.List<Token> analyze() {
        char c;
        while ( (c = next()) != '\0' ) {
            if ( isSymbol_1(c) ) {
                tokens_.add( Token.create(c) );
                continue;
            }
            if ( isQuote(c) ) {
                quotedText(c);
                continue;
            }
            text(c);
        }
        return new java.util.ArrayList<>(tokens_);
    }

    java.util.List<Token> tokens_ = new java.util.ArrayList<>();  // result
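
The effect of returning a copy can be demonstrated without the Token class. In this sketch `TokenStore` and `snapshot()` are hypothetical names, and String stands in for Token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical result holder; String stands in for the Token class. */
final class TokenStore {
    private final List<String> tokens = new ArrayList<>();

    void add(String token) { tokens.add(token); }

    /** Returns a new list, so changes made by the caller stay with the caller. */
    List<String> snapshot() { return new ArrayList<>(tokens); }

    int size() { return tokens.size(); }

    public static void main(String[] args) {
        TokenStore store = new TokenStore();
        store.add("(");
        store.add("seq");

        List<String> copy = store.snapshot();
        copy.clear();                      // only the copy is emptied

        System.out.println(copy.size());   // 0
        System.out.println(store.size());  // 2
    }
}
```

Note that the copy is shallow: the list is new, but the elements in it are the same objects as in the internal list.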

Quoting string processing

A quoted string becomes a single token, whitespace included. Whitespace between the opening quote and the first character, and between the last character and the closing quote, is removed, but all other whitespace inside the string, including newline and tab characters, is retained.

First, at the entrance of the method, the opening quote is registered in the token list.
Then a loop appends characters to a StringBuilder until the end of the target string or the same quote character as the one passed in the argument is reached. Whitespace must not be skipped here, so nextAll() is used, which returns whitespace as well.
After the loop, if at least one character was appended to the StringBuilder, it is registered as a string token. If the quotes are empty (""), no string token is added.
Finally, the character read when the loop ended is registered. Normally this is the closing quote. If the target string ended first, it is the null character, which is registered as an unknown token.

    /**Cut out up to the next quote character as a block of tokens. */
    private void quotedText(char quote) {
        tokens_.add( Token.create(quote));  // create token of begin quote

        java.lang.StringBuilder builder = new java.lang.StringBuilder();
        char c;
        while ( (c = nextAll()) != '\0'  &&  c != quote) { builder.append(c); }

        if ( builder.length() != 0 ) {
            tokens_.add( Token.create(builder.toString()) );  // append string
        }

        tokens_.add( Token.create(c) );  // append token of end quote
    }
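
The closed and unclosed cases can be compared with a stand-alone sketch. `QuotedReader`, `Result`, and the index-based `read()` signature are assumptions made for this illustration; the article's method works on the analyzer's own position and Token class instead.

```java
/** Hypothetical stand-alone sketch of reading a quoted block. */
final class QuotedReader {
    /** The text between the quotes and whether a closing quote was found. */
    static final class Result {
        final String content;
        final boolean closed;
        Result(String content, boolean closed) {
            this.content = content;
            this.closed = closed;
        }
    }

    /**
     * Reads from index start (the first character after the opening quote)
     * up to the matching quote, keeping all whitespace.
     */
    static Result read(String target, int start, char quote) {
        StringBuilder builder = new StringBuilder();
        int pos = start;
        while (pos < target.length() && target.charAt(pos) != quote) {
            builder.append(target.charAt(pos));
            pos++;
        }
        boolean closed = pos < target.length();  // false: input ended before the closing quote
        return new Result(builder.toString(), closed);
    }

    public static void main(String[] args) {
        Result ok = read("'seq 5  ' rest", 1, '\'');
        System.out.println("[" + ok.content + "] closed=" + ok.closed);      // [seq 5  ] closed=true

        Result open = read("\"ss", 1, '"');
        System.out.println("[" + open.content + "] closed=" + open.closed);  // [ss] closed=false
    }
}
```

The unclosed case corresponds to the situation in which quotedText() registers the null character as an unknown token.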

Character string (identifier) processing

This method cuts out a string token that is not enclosed in quotation marks.

The first character of the string arrives as the argument, so it is appended to the StringBuilder first.
Characters are then read and appended until the end of the target string, a delimiter, or a whitespace character is reached.
If whitespace were skipped, the end of a token marked by a space could not be detected, so the characters are read with nextAll().
When the loop finishes, the accumulated text is registered in the list as a string token.
If the end of the target string has been reached, processing ends here. In that case c is typically '\0', and registering it as is would add an unknown token, so the registration below has to be suppressed. Note that isEnd() is also true when a delimiter is the very last character of the input, so such a trailing delimiter is not registered either.
Finally, the delimiter that ended the loop is registered as a token. There are several kinds of whitespace characters, but all of them are registered as a token holding a single space character.

    /**Cut out the text up to the next separator or whitespace character as a single token. */
    private void text(char first) {
        java.lang.StringBuilder builder = new java.lang.StringBuilder();
        builder.append(first);

        char c;
        while ( (c = nextAll()) != '\0'  &&  !isSeparator(c)  &&  !Character.isWhitespace(c) ) {
            builder.append(c);
        }
        tokens_.add( Token.create(builder.toString()) );

        if ( isEnd() ) { return; }

        tokens_.add( isWhitespace(c) ? Token.create(' ') : Token.create(c) );
    }
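
For comparison, the same scan can be written with indexes. This is a sketch of a variation, not the article's method: `WordReader` and `endOfWord()` are hypothetical names, and the delimiter set is copied from separators_. Because the end index is compared with the string length, a delimiter that is the last character of the input remains distinguishable from the end of the input.

```java
/** Hypothetical index-based variation of the unquoted-text scan. */
final class WordReader {
    private static final String SEPARATORS = ":,=(){}";

    /** Returns the end index (exclusive) of the word that starts at start. */
    static int endOfWord(String target, int start) {
        int pos = start;
        while (pos < target.length()) {
            char c = target.charAt(pos);
            if (SEPARATORS.indexOf(c) >= 0 || Character.isWhitespace(c)) { break; }
            pos++;
        }
        return pos;
    }

    /** True when a delimiter or whitespace character follows the word. */
    static boolean hasDelimiterAfter(String target, int end) {
        return end < target.length();
    }

    public static void main(String[] args) {
        String s = "key=value";
        int end = endOfWord(s, 0);
        System.out.println(s.substring(0, end));             // key
        System.out.println(s.charAt(end));                   // =

        String t = "abc,";
        int endT = endOfWord(t, 0);
        System.out.println(hasDelimiterAfter(t, endT));      // true (trailing comma is still visible)
        System.out.println(hasDelimiterAfter("abc", endOfWord("abc", 0)));  // false
    }
}
```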

Status acquisition

These methods report the state of the LexicalAnalyzer.

isEmpty() returns true if the token list is empty. The list is also empty before analyze() is executed, so it returns true then as well.
isValid() checks the analysis result for invalid tokens. It returns false in the following three cases.
The list contains at least one token for which Token.kind() == Kinds.Unknown.
The string to be parsed is an empty string.
analyze() has not been executed yet (the state is unknown).

    public boolean isEmpty() { return tokens_.size() == 0;}

    public boolean isValid() {
        return !isEmpty()  &&  tokens_.stream().noneMatch( e -> e.kind() == Token.Kinds.Unknown );
    }
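
The three cases can be reproduced with a small sketch that uses a hypothetical `Kind` enum in place of Token.Kinds. An empty list covers both the empty-input case and the not-yet-analyzed case.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical check mirroring isValid(); Kind stands in for Token.Kinds. */
final class ValidityCheck {
    enum Kind { STRING, SEPARATOR, UNKNOWN }

    static boolean isValid(List<Kind> kinds) {
        // Valid only when there is at least one token and none of them is UNKNOWN.
        return !kinds.isEmpty() && kinds.stream().noneMatch(k -> k == Kind.UNKNOWN);
    }

    public static void main(String[] args) {
        List<Kind> none = Collections.emptyList();
        System.out.println(isValid(none));                                       // false
        System.out.println(isValid(Arrays.asList(Kind.STRING, Kind.SEPARATOR))); // true
        System.out.println(isValid(Arrays.asList(Kind.STRING, Kind.UNKNOWN)));   // false
    }
}
```

The explicit isEmpty() test matters because noneMatch() returns true for an empty stream.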

Test

Last but not least, here is the main() method used for testing, together with its output. ~~I don't know JUnit~~ Delimiter tokens are filtered out to keep the output short.

    public static void main(String[] args) {
        String s;
        LexicalAnalyzer lex;

        System.out.println("test 1 -----------------------------------------------------------------------------------");
        // keep the instance so that isEmpty() can be checked after analyze()
        lex = LexicalAnalyzer.create(null);
        lex.analyze().stream().filter( e -> e.kind() != Token.Kinds.Separator )
                              .forEach( e -> System.out.println(e) );
        System.out.printf("isEmpty()  after analyze  : %s\r\n", lex.isEmpty());

        System.out.println("\r\ntest 2 -------------------------------------------------------------------------------");
        s = "s";
        lex = LexicalAnalyzer.create(s);
        System.out.printf("isEmpty() before analyze() : %s\r\n", lex.isEmpty());
        lex.analyze().stream().filter( e -> e.kind() != Token.Kinds.Separator )
                              .forEach( e -> System.out.println(e) );
        System.out.printf("isEmpty() after analyze()  : %s\r\n", lex.isEmpty());

        System.out.println("\r\ntest 3 -------------------------------------------------------------------------------");
        s = "   [some] -c  \" text  document  \", \r\n (sequence1, \"seq 2\r\n  quoted\",seq3 seq4 'seq 5  ')\"ss";
        lex = LexicalAnalyzer.create(s);
        System.out.printf("isValid() before analyze() : %s\r\n", lex.isValid());
        lex.analyze().stream().filter( e -> e.kind() != Token.Kinds.Separator )
                              .forEach( e -> System.out.println(e) );
        System.out.printf("isValid() after  analyze() : %s\r\n", lex.isValid());
    }

Execution result

test 1 ... Passing null as the argument

The factory method substitutes an empty string, so analyze() and the subsequent stream() run without error. No tokens are printed. The "isEmpty() ..." line shows the result of the isEmpty() method.

test 2 ... Parsing a single character

This test is less about the analysis result than about confirming how isEmpty() behaves before and after analyze() is executed.

test 3 ... Parsing a more complex string

Because "seq 2 ..." contains a line break inside the quotation marks, the output also contains a line break. The final isValid() is false because parsing of a quoted string began but the target string ended before the closing quote appeared. If the quote is closed properly, the result is true.

test 1 -------------------------------------------------------------------------------
isEmpty()  after analyze  : true

test 2 -------------------------------------------------------------------------------
isEmpty() before analyze() : true
[String          : "s"]
isEmpty() after analyze()  : false

test 3 -------------------------------------------------------------------------------
isValid() before analyze() : false
[String          : "[some]"]
[String          : "-c"]
[DoubleQuote     : """]
[String          : "text  document"]
[DoubleQuote     : """]
[LeftParenthesis : "("]
[String          : "sequence1"]
[DoubleQuote     : """]
[String          : "seq 2
  quoted"]
[DoubleQuote     : """]
[String          : "seq3"]
[String          : "seq4"]
[SingleQuote     : "'"]
[String          : "seq 5"]
[SingleQuote     : "'"]
[RightParenthesis: ")"]
[DoubleQuote     : """]
[String          : "ss"]
[Unknown         : "**Unknown**"]
isValid() after  analyze() : false

Summary

It has been a long road, but this completes the lexical analyzer.

Having come this far, the natural next step would be parsing. However, I cannot write that yet because the grammar it has to handle has not been finalized. The expected grammar is settled to some extent, but the groundwork is not keeping up, so I have not decided when I will write it. (I have to finish the console class first ...)

Creating lexical analysis in Java 8 (Part 2)