Creating a lexical analyzer in Java 8 (Part 2)

Introduction

I wrote a simple lexical analyzer for use in my own Java console program. The token class was covered in the previous article, so this time I cover the lexical analyzer itself. Lexical analysis is easier with existing libraries, but this code is used in a workplace where downloading from external sites is prohibited, so it is written from scratch using only Java 8 features.

2018/10/22: Modified the source.
2018/10/24: Changed the constructor to save a copy of the string to be analyzed.

I used "Implementing simple lexical analysis in Java" as a reference when writing it.

Note

The source and its explanation are long this time. For the Token class, please see the previous article.

Lexical analysis class

LexicalAnalyzer.java


package console;

public class LexicalAnalyzer {
    public static LexicalAnalyzer create(String target) {
        if ( target == null  || target.trim().isEmpty() ) { target = ""; }
        return new LexicalAnalyzer(target);
    }

    public java.util.List<Token> analyze() {
        char c;
        while ( (c = next()) != '\0' ) {
            if ( isSymbol_1(c) ) {
                tokens_.add( Token.create(c) );
                continue;
            }
            if ( isQuote(c) ) {
                quotedText(c);
                continue;
            }
            text(c);
        }
        return new java.util.ArrayList<>(tokens_);
    }

    // query methods ================================================================================
    public boolean isEmpty() { return tokens_.size() == 0;}

    public boolean isValid() {
        return !isEmpty()  &&  tokens_.stream().noneMatch( e -> e.kind() == Token.Kinds.Unknown );
    }

    // internal methods ======================================================================
    /** Cuts out the text up to the next quote character as a single token. */
    private void quotedText(char quote) {
        tokens_.add( Token.create(quote));  // create token of begin quote

        java.lang.StringBuilder builder = new java.lang.StringBuilder();
        char c;
        while ( (c = nextAll()) != '\0'  &&  c != quote) { builder.append(c); }
        if ( builder.length() != 0 ) {
            tokens_.add( Token.create(builder.toString()) );  // append string
        }

        tokens_.add( Token.create(c) );  // append token of end quote
    }

    /** Cuts out the text up to the next separator or whitespace character as a single token. */
    private void text(char first) {
        java.lang.StringBuilder builder = new java.lang.StringBuilder();
        builder.append(first);

        char c;
        while ( (c = nextAll()) != '\0'  &&  !isSeparator(c)  &&  !isWhitespace(c) ) { 
            builder.append(c);
        }
        tokens_.add( Token.create(builder.toString()) );

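        // append the separator token unless the text ended;
        // c is checked instead of isEnd() because pos_ is already past a separator
        // that happens to be the last character of the input
        if ( c == '\0' ) { return; }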

        tokens_.add( isWhitespace(c) ? Token.create(' ') : Token.create(c) );
    }

    private char next() {
        skipSpace();
        return nextAll();
    }

    private char nextAll() {
        char c = aChar();
        ++pos_;
        return c;
    }

    private char aChar() { return isEnd() ? '\0' : target_.charAt(pos_); }

    private void skipSpace() {
        while ( !isEnd()  &&  Character.isWhitespace(aChar()) ) { pos_++; }
    }

    private boolean isEnd()                 { return length_ <= pos_; }
    private boolean isSeparator(char c)     { return exists(separators_, c);    }
    private boolean isQuote(char c)         { return exists(quotes_, c);        }
    private boolean isSymbol_1(char c)      { return exists(symbol1_, c);       }
    private boolean isWhitespace(char c)    { return Character.isWhitespace(c); }

    private boolean exists(char[] arr, char c) {
        return java.util.Arrays.binarySearch(arr, c) >= 0;
    }

    private LexicalAnalyzer(String target) {
        target_ = new String(target);   // keep a copy (2018/10/24 change)
        length_ = target.length();
    }

    // internal fields ======================================================================
    // each array must be sorted because exists() uses binarySearch
    private static final char[] separators_ = { ':', ',', '=', '(', ')', '{', '}' };
    static { java.util.Arrays.sort(separators_); }

    private static final char[] quotes_ = { '"', '\'' };
    static { java.util.Arrays.sort(quotes_); }

    private static final char[] symbol1_ = {'(', ')', '{', '}', ':', ',', '=', '&' };
    static { java.util.Arrays.sort(symbol1_); }

    final String target_;       // analyze target string
    final int    length_;       // length of target string
    int          pos_ = 0;      // "next" analyzing position

    java.util.List<Token> tokens_ = new java.util.ArrayList<>();  // result
}

Commentary

Factory method

Ideally the string to analyze would be a mandatory argument, but there is no way to force the caller to pass a non-null value, so the factory method substitutes an empty string when the argument is null or blank. This frees the caller from null checks and exception handling. As a result, the constructor does not need to check its argument.

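Update (2018/10/24): the constructor now stores a copy of the argument rather than the caller's reference. Note that java.lang.String is immutable, so the caller could not change the contents through its own reference anyway; the copy is a defensive measure rather than a requirement.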

    public static LexicalAnalyzer create(String target) {
        if ( target == null  || target.trim().isEmpty() ) { target = ""; }
        return new LexicalAnalyzer(target);
    }

    private LexicalAnalyzer(String target) {
        target_ = new String(target);
        length_ = target.length();
    }

    final String target_;       // analyze target string
    final int    length_;       // length of target string
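
As a small check of the factory behavior described above, here is a minimal sketch. The class StringCopyDemo and the method normalize are hypothetical names used only for illustration; they are not part of LexicalAnalyzer. It assumes the same rule as create(): null or blank input becomes an empty string, and anything else is copied.

```java
class StringCopyDemo {
    /** Returns "" for null or blank input, otherwise a copy of the argument (hypothetical helper). */
    static String normalize(String target) {
        if (target == null || target.trim().isEmpty()) { return ""; }
        return new String(target);
    }

    public static void main(String[] args) {
        String original = "a = b";
        String copy = normalize(original);

        System.out.println(copy.equals(original));     // same content
        System.out.println(copy != original);          // but a different object
        System.out.println(normalize(null).isEmpty()); // null is treated as empty
    }
}
```

All three lines print true. Because String is immutable, the second line only shows that the copy is a separate object; it does not mean the original could have been modified.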

Support method

Being a lexical analyzer, it takes characters from the target string one at a time and processes them. The following methods and fields provide that basic functionality.

    private char next() {
        skipSpace();
        return nextAll();
    }

    private char nextAll() {
        char c = aChar();
        ++pos_;
        return c;
    }

    private char aChar() { return isEnd() ? '\0' : target_.charAt(pos_); }

    private void skipSpace() {
        while ( !isEnd()  &&  Character.isWhitespace(aChar()) ) { pos_++; }
    }

    private boolean isEnd() { return length_ <= pos_; }
    private boolean isSeparator(char c)     { return exists(separators_, c);    }
    private boolean isQuote(char c)         { return exists(quotes_, c);        }
    private boolean isSymbol_1(char c)      { return exists(symbol1_, c);       }
    private boolean isWhitespace(char c)    { return Character.isWhitespace(c); }

    private boolean exists(char[] arr, char c) {
        return java.util.Arrays.binarySearch(arr, c) >= 0;
    }

    // each array must be sorted because exists() uses binarySearch
    private static final char[] separators_ = { ',', '=', '(', ')', '{', '}', ':' };
    static { java.util.Arrays.sort(separators_); }

    private static final char[] quotes_ = { '"', '\'' };
    static { java.util.Arrays.sort(quotes_); }

    private static final char[] symbol1_ = { '(', ')', '{', '}', ':', ',', '=', '&' };
    static { java.util.Arrays.sort(symbol1_); }


    int          pos_ = 0;      // "next" analyzing position
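
The character-set checks rely on java.util.Arrays.binarySearch, which only gives a defined result when the array is sorted. That is why each array is followed by a static initializer that sorts it. The following is a minimal standalone sketch of the same pattern. CharSetDemo and SEPARATORS are hypothetical names; the characters are the ones from separators_.

```java
import java.util.Arrays;

class CharSetDemo {
    // Written in arbitrary order, then sorted once when the class is loaded.
    private static final char[] SEPARATORS = { ':', ',', '=', '(', ')', '{', '}' };
    static { Arrays.sort(SEPARATORS); }

    /** Membership test; binarySearch returns a non-negative index only when the key is found. */
    static boolean isSeparator(char c) {
        return Arrays.binarySearch(SEPARATORS, c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isSeparator(','));  // true
        System.out.println(isSeparator('}'));  // true
        System.out.println(isSeparator('a'));  // false
    }
}
```

For arrays this small a linear scan would work just as well; sorting plus binary search simply keeps the lookup independent of the order in which the characters were written.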

Analysis processing

analyze() performs lexical analysis and returns the result as a list of the Token objects created in the previous article. The return value is a copy of the internal token list tokens_. Returning tokens_ itself would hand the caller a standard, mutable list that refers to the very same object, so adding or removing elements would change the analyzer's internal state. Calling analyze() a second or later time returns a copy of the token list that has already been built; no reanalysis is performed.

Changing next() so that it skips whitespace and adding predicate methods makes the code cleaner than the earlier switch-based control. The intent of each check is reflected in the method name, so no comments should be needed. (Right?)

    public java.util.List<Token> analyze() {
        char c;
        while ( (c = next()) != '\0' ) {
            if ( isSymbol_1(c) ) {
                tokens_.add( Token.create(c) );
                continue;
            }
            if ( isQuote(c) ) {
                quotedText(c);
                continue;
            }
            text(c);
        }
        return new java.util.ArrayList<>(tokens_);
    }

    java.util.List<Token> tokens_ = new java.util.ArrayList<>();  // result
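
The effect of returning a copy can be seen in isolation with a small sketch. DefensiveCopyDemo is a hypothetical class that uses plain strings in place of Token objects; only the return statement mirrors analyze().

```java
import java.util.ArrayList;
import java.util.List;

class DefensiveCopyDemo {
    private final List<String> tokens = new ArrayList<>();

    DefensiveCopyDemo() {
        tokens.add("a");
        tokens.add("=");
        tokens.add("b");
    }

    /** Returns a copy, so callers cannot change the internal list. */
    List<String> analyze() { return new ArrayList<>(tokens); }

    int size() { return tokens.size(); }

    public static void main(String[] args) {
        DefensiveCopyDemo demo = new DefensiveCopyDemo();
        List<String> result = demo.analyze();
        result.clear();                     // modifies only the caller's copy
        System.out.println(result.size());  // 0
        System.out.println(demo.size());    // 3
    }
}
```

The copy is shallow: the list is new, but the elements are shared. That is enough here as long as the elements themselves cannot be modified by the caller.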

Quoted string processing

A quoted string becomes a single token, whitespace included. Whitespace between the opening quote and the first character, and between the last character and the closing quote, is removed, but all other whitespace inside the string, including line breaks and tabs, is retained.

    /** Cuts out the text up to the next quote character as a single token. */
    private void quotedText(char quote) {
        tokens_.add( Token.create(quote));  // create token of begin quote

        java.lang.StringBuilder builder = new java.lang.StringBuilder();
        char c;
        while ( (c = nextAll()) != '\0'  &&  c != quote) { builder.append(c); }

        if ( builder.length() != 0 ) {
            tokens_.add( Token.create(builder.toString()) );  // append string
        }

        tokens_.add( Token.create(c) );  // append token of end quote
    }
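
The scanning loop can also be tried on its own. The sketch below is a simplified stand-in, not the method above: QuotedScanDemo and readQuoted are hypothetical names, it works on an index instead of pos_, it does not trim the result, and it returns null for a missing closing quote where the analyzer adds an Unknown token.

```java
class QuotedScanDemo {
    /**
     * Reads the characters that follow the opening quote at index open.
     * Returns the enclosed text, or null when the closing quote is missing.
     */
    static String readQuoted(String s, int open) {
        char quote = s.charAt(open);
        StringBuilder builder = new StringBuilder();
        for (int i = open + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == quote) { return builder.toString(); }
            builder.append(c);  // whitespace and line breaks are kept as they are
        }
        return null;  // reached the end of the input without a closing quote
    }

    public static void main(String[] args) {
        System.out.println(readQuoted("\"seq 2\r\n  quoted\",seq3", 0));
        System.out.println(readQuoted("'seq 5  ')", 0));
        System.out.println(readQuoted("\"ss", 0));  // unterminated, prints null
    }
}
```

Because only the same quote character ends the scan, a single quote inside a double-quoted string (or the reverse) is treated as ordinary text.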

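Character string (identifier) processing

Cuts out a string token that is not enclosed in quotation marks. The token ends at the next separator or whitespace character, and that terminating character is added as a token of its own.

    /** Cuts out the text up to the next separator or whitespace character as a single token. */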
    private void text(char first) {
        java.lang.StringBuilder builder = new java.lang.StringBuilder();
        builder.append(first);

        char c;
        while ( (c = nextAll()) != '\0'  &&  !isSeparator(c)  &&  !Character.isWhitespace(c) ) {
            builder.append(c);
        }
        tokens_.add( Token.create(builder.toString()) );

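        // c is checked instead of isEnd() because pos_ is already past a separator
        // that happens to be the last character of the input
        if ( c == '\0' ) { return; }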

        tokens_.add( isWhitespace(c) ? Token.create(' ') : Token.create(c) );
    }

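Status acquisition

Methods for querying the state of LexicalAnalyzer. isEmpty() reports whether no tokens have been produced yet, and isValid() reports whether tokens exist and none of them is of kind Unknown.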

    public boolean isEmpty() { return tokens_.size() == 0;}

    public boolean isValid() {
        return !isEmpty()  &&  tokens_.stream().noneMatch( e -> e.kind() == Token.Kinds.Unknown );
    }
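
The isValid() condition can be checked without the Token class. In this sketch, ValidityDemo and its Kind enum are hypothetical stand-ins for Token.Kinds, reduced to three values; only the boolean expression follows the method above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ValidityDemo {
    enum Kind { STRING, QUOTE, UNKNOWN }  // hypothetical stand-in for Token.Kinds

    /** Valid means: at least one token, and none of them is UNKNOWN. */
    static boolean isValid(List<Kind> kinds) {
        return !kinds.isEmpty() && kinds.stream().noneMatch(k -> k == Kind.UNKNOWN);
    }

    public static void main(String[] args) {
        List<Kind> none = Collections.emptyList();
        System.out.println(isValid(none));                                                // false
        System.out.println(isValid(Arrays.asList(Kind.QUOTE, Kind.STRING, Kind.QUOTE)));  // true
        System.out.println(isValid(Arrays.asList(Kind.QUOTE, Kind.STRING, Kind.UNKNOWN))); // false
    }
}
```

The explicit isEmpty() check matters because noneMatch returns true for an empty stream; without it, an analyzer that has not run yet would report itself as valid.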

Test

Last but not least, here are the main() method used for testing and its results. (I don't know JUnit.) Separator tokens are filtered out to reduce the number of lines in the output.

    public static void main(String[] args) {
        String s;
        LexicalAnalyzer lex;

        System.out.println("test 1 -----------------------------------------------------------------------------------");
        lex = LexicalAnalyzer.create(null);  // null is replaced with an empty string
        lex.analyze().stream().filter( e -> e.kind() != Token.Kinds.Separator )
                              .forEach( e -> System.out.println(e) );
        System.out.printf("isEmpty()  after analyze  : %s\r\n", lex.isEmpty());

        System.out.println("\r\ntest 2 -------------------------------------------------------------------------------");
        s = "s";
        lex = LexicalAnalyzer.create(s);
        System.out.printf("isEmpty() before analyze() : %s\r\n", lex.isEmpty());
        lex.analyze().stream().filter( e -> e.kind() != Token.Kinds.Separator )
                              .forEach( e -> System.out.println(e) );
        System.out.printf("isEmpty() after analyze()  : %s\r\n", lex.isEmpty());

        System.out.println("\r\ntest 3 -------------------------------------------------------------------------------");
        s = "   [some] -c  \" text  document  \", \r\n (sequence1, \"seq 2\r\n  quoted\",seq3 seq4 'seq 5  ')\"ss";
        lex = LexicalAnalyzer.create(s);
        System.out.printf("isValid() before analyze() : %s\r\n", lex.isValid());
        lex.analyze().stream().filter( e -> e.kind() != Token.Kinds.Separator )
                              .forEach( e -> System.out.println(e) );
        System.out.printf("isValid() after  analyze() : %s\r\n", lex.isValid());
    }

Execution result

test 1: null is passed as the argument

The factory method replaces null with an empty string, so analyze() and the subsequent stream() run without error, and no tokens are printed. The "isEmpty() ..." line shows the result of the isEmpty() method.

test 2: analyzing a single character

This test is less about the analysis result than about confirming how isEmpty() behaves before and after analyze() runs.

test 3: analyzing a longer string

Because "seq 2 ..." contains a line break inside the quotation marks, the output also contains a line break. The final isValid() is false because analysis of a quoted string began but the input ended without a closing quote. If the quote were closed properly, it would be true.

test 1 -------------------------------------------------------------------------------
isEmpty()  after analyze  : true

test 2 -------------------------------------------------------------------------------
isEmpty() before analyze() : true
[String          : "s"]
isEmpty() after analyze()  : false

test 3 -------------------------------------------------------------------------------
isValid() before analyze() : false
[String          : "[some]"]
[String          : "-c"]
[DoubleQuote     : """]
[String          : "text  document"]
[DoubleQuote     : """]
[LeftParenthesis : "("]
[String          : "sequence1"]
[DoubleQuote     : """]
[String          : "seq 2
  quoted"]
[DoubleQuote     : """]
[String          : "seq3"]
[String          : "seq4"]
[SingleQuote     : "'"]
[String          : "seq 5"]
[SingleQuote     : "'"]
[RightParenthesis: ")"]
[DoubleQuote     : """]
[String          : "ss"]
[Unknown         : "**Unknown**"]
isValid() after  analyze() : false

Summary

It took a while, but this completes the lexical analyzer.

So what about parsing? I cannot write that yet because the syntax the user will type has not been decided. The expected grammar is settled to some extent, but the preparation has not caught up, so I do not know when I will write it. (I have to finish the console class first...)
