Creating lexical analysis in Java 8 (Part 1)

Introduction

I made a simple lexical analyzer for use in my own Java console program. Using an existing library would be easier, but the program runs in a workplace where downloading from external sites is prohibited, so I wrote it from scratch using only what Java 8 itself provides.

In creating it, I referred to "Implementing simple lexical analysis in Java".

Important points

Requirements

First, the requirements for the lexical analyzer:

- Implement it using only Java 8 features.
- It is not meant for parsing a full-fledged programming language, so it performs only simple lexical analysis (roughly halfway between the Windows command line and a Unix shell).
- Only single-character symbols and strings are recognized as tokens.

Program style

Please note that the code is written in my own personal style, which deviates from common Java conventions. The reasons are as follows.

- To keep the calling code simple, avoid throwing exceptions or returning null as much as possible. I do not want the main processing to be buried under a pile of error checks.
- Minimize scope.
- Use immutable objects wherever possible.
- Do not use generic getters and setters. (Getters and setters are evil. Period. See the original article, [Getters/Setters. Evil. Period.](https://www.yegor256.com/2014/09/16/getters-and-setters-are-evil.html))
- Do not use annotations, to keep the source compact.
- In principle, do not use import statements, so as to get familiar with the Java standard library. The exception is Stream-related classes, whose fully qualified names tend to be verbose; those are imported.

Token class

First, I'll show you the full text of the basic token class.

Token.java


package console;

final class Token {
    enum Kinds {
        Unknown,
        Empty,
        Ampersand,          // "&"
        Assign,             // "="
        Plus,               // "+"
        Minus,              // "-"
        Asterisk,           // "*"
        Slash,              // "/"
        Separator,          // "," or " "
        LeftParenthesis,    // "("
        RightParenthesis,   // ")"
        LeftCurlyBracket,   // "{"
        RightCurlyBracket,  // "}"
        LeftSquareBracket,  // "["
        RightSquareBracket, // "]"
        Colon,              // ":"
        BackSlash,          // "\"
        DoubleQuote,        // """
        SingleQuote,        // "'"
        String,
    }

    static Token create(char c) {
        final String s = Character.toString(c);
        switch(c) {
            case '&'  : return new Token(s, Kinds.Ampersand         );
            case '='  : return new Token(s, Kinds.Assign            );
            case '+'  : return new Token(s, Kinds.Plus              );
            case '-'  : return new Token(s, Kinds.Minus             );
            case '*'  : return new Token(s, Kinds.Asterisk          );
            case '/'  : return new Token(s, Kinds.Slash             );
            case ','  : // fall through
            case ' '  : return new Token(s, Kinds.Separator         );
            case '('  : return new Token(s, Kinds.LeftParenthesis   );
            case ')'  : return new Token(s, Kinds.RightParenthesis  );
            case '{'  : return new Token(s, Kinds.LeftCurlyBracket  );
            case '}'  : return new Token(s, Kinds.RightCurlyBracket );
            case '['  : return new Token(s, Kinds.LeftSquareBracket );
            case ']'  : return new Token(s, Kinds.RightSquareBracket);
            case ':'  : return new Token(s, Kinds.Colon             );
            case '\\' : return new Token(s, Kinds.BackSlash         );
            case '\"' : return new Token(s, Kinds.DoubleQuote       );
            case '\'' : return new Token(s, Kinds.SingleQuote       );
        }
        return unknown_;
    }

    static Token create(String s) {
        if ( s == null  ||  s.trim().isEmpty() ) { return empty_; }
        // Safety net in case a symbol is mistakenly passed as a string
        if ( s.length() == 1 ) {
            Token t = Token.create(s.charAt(0));
            if ( t.kind() != Kinds.Unknown ) { return t; }
        }
        return new Token(s.trim(), Kinds.String);
    }

    final String value()     { return value_; }
    final Kinds  kind()      { return kind_; }
    final String kindName()  { return kind_.toString(); }
    public String toString() {
        return String.format("[%-14s: \"%s\"]", kindName(), value());
    }

    private Token(String s, Kinds k) {
        kind_  = k;
        value_ = s;
    }

    private static final Token empty_   = new Token("", Kinds.Empty);                // empty token
    private static final Token unknown_ = new Token("**Unknown**", Kinds.Unknown);   // unknown token

    private final Kinds  kind_;
    private final String value_;
}

Commentary

The source is short, so it hardly needs explanation, but ~~since I will forget it otherwise~~ I'll briefly go through each part.

Token type

The token kinds are registered in an enum. Each element is a marker that will be used later during parsing, so characters and elements do not have to correspond one-to-one.

    enum Kinds {
        Unknown,
        Empty,
        Ampersand,          // "&"
        Assign,             // "="
        Plus,               // "+"
        // ... (omitted)
        String,
    }
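As a rough illustration of how these kinds can serve as markers for a later parsing stage, the sketch below checks parenthesis balance by looking only at token kinds, never at the characters. The class name KindsSketch and the helper balanced are hypothetical and not part of the original program; the enum is a trimmed copy of the one above.

```java
import java.util.Arrays;
import java.util.List;

final class KindsSketch {
    // Trimmed copy of the article's Kinds enum so that this sketch compiles on its own.
    enum Kinds { Unknown, Empty, Plus, Separator, LeftParenthesis, RightParenthesis, String }

    // Hypothetical parser-side helper: true when every ")" closes an earlier "(".
    static boolean balanced(List<Kinds> kinds) {
        int depth = 0;
        for (Kinds k : kinds) {
            switch (k) {
                case LeftParenthesis:  depth++; break;
                case RightParenthesis: depth--; break;
                default: break; // every other kind is irrelevant to this check
            }
            if (depth < 0) { return false; } // closed before it was opened
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(balanced(Arrays.asList(
            Kinds.LeftParenthesis, Kinds.String, Kinds.RightParenthesis))); // true
        System.out.println(balanced(Arrays.asList(
            Kinds.RightParenthesis, Kinds.LeftParenthesis)));               // false
    }
}
```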

Factory method

Symbol tokens are created through a factory method. The lexical analyzer passes each character it reads directly as a char, which keeps this factory separate from the string token factory. If the character is not matched by any case of the switch statement, the pre-created unknown token is returned. Because Token is immutable, its contents can never change, so the pre-created object can be reused safely. Commas and spaces are an example of the non-one-to-one correspondence between enum elements and characters mentioned above: both are created as Kinds.Separator.

    static Token create(char c) {
        final String s = Character.toString(c);
        switch(c) {
            case '&'  : return new Token(s, Kinds.Ampersand         );
            case '='  : return new Token(s, Kinds.Assign            );
            case '+'  : return new Token(s, Kinds.Plus              );
            case '-'  : return new Token(s, Kinds.Minus             );
            case '*'  : return new Token(s, Kinds.Asterisk          );
            case '/'  : return new Token(s, Kinds.Slash             );
            case ','  : // fall through
            case ' '  : return new Token(s, Kinds.Separator         );
            case '('  : return new Token(s, Kinds.LeftParenthesis   );
            case ')'  : return new Token(s, Kinds.RightParenthesis  );
            case '{'  : return new Token(s, Kinds.LeftCurlyBracket  );
            case '}'  : return new Token(s, Kinds.RightCurlyBracket );
            case '['  : return new Token(s, Kinds.LeftSquareBracket );
            case ']'  : return new Token(s, Kinds.RightSquareBracket);
            case ':'  : return new Token(s, Kinds.Colon             );
            case '\\' : return new Token(s, Kinds.BackSlash         );
            case '\"' : return new Token(s, Kinds.DoubleQuote       );
            case '\'' : return new Token(s, Kinds.SingleQuote       );
        }
        return unknown_;
    }

    private static final Token unknown_ = new Token("**Unknown**", Kinds.Unknown);   // unknown token
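To see the behaviors described above in isolation, here is a cut-down sketch. The class SymbolToken and its three kinds are stand-ins made up for illustration; the real Token has many more cases. It shows that a comma and a space yield the same kind, and that every unrecognized character yields the same shared unknown instance.

```java
final class SymbolToken {
    enum Kinds { Unknown, Plus, Separator }

    static SymbolToken create(char c) {
        final String s = Character.toString(c);
        switch (c) {
            case '+' : return new SymbolToken(s, Kinds.Plus);
            case ',' : // fall through
            case ' ' : return new SymbolToken(s, Kinds.Separator);
        }
        return unknown_; // shared, pre-created instance
    }

    final String value() { return value_; }
    final Kinds  kind()  { return kind_; }

    private SymbolToken(String s, Kinds k) {
        kind_  = k;
        value_ = s;
    }

    private static final SymbolToken unknown_ = new SymbolToken("**Unknown**", Kinds.Unknown);

    private final Kinds  kind_;
    private final String value_;

    public static void main(String[] args) {
        System.out.println(create(',').kind() == create(' ').kind()); // true: both are Separator
        System.out.println(create('#') == create('@'));               // true: same unknown instance
        System.out.println(create('+') == create('+'));               // false: a new object per call
    }
}
```

Recognized symbols still allocate a new object on every call; only the unknown token is shared.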

Next is the factory method for string tokens. The lexical analyzer reads to the end of a string before handing it to the factory, so unlike the symbol factory, this overload takes a String argument. If null or an empty string (including a whitespace-only string) is passed, the pre-created empty token is returned, which spares the caller extra error checks. Single-character symbols are supposed to be passed as char, but as a safety net, when the string passed in is exactly one character long, the symbol token factory is called first. If that factory does not recognize the character, it returns the unknown token, and in that case the value is created as a string token after all.

Although it is slightly out of order, note also that the constructor is declared private so that objects can be created only via the factory methods.

    static Token create(String s) {
        if ( s == null  ||  s.trim().isEmpty() ) { return empty_; }
        // Safety net in case a symbol is mistakenly passed as a string
        if ( s.length() == 1 ) {
            Token t = Token.create(s.charAt(0));
            if ( t.kind() != Kinds.Unknown ) { return t; }
        }
        return new Token(s.trim(), Kinds.String);
    }

    private Token(String s, Kinds k) {
        kind_  = k;
        value_ = s;
    }

    private static final Token empty_   = new Token("", Kinds.Empty);                // empty token
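The edge cases of the string factory are easier to follow with concrete inputs. The following is a minimal sketch that keeps the same logic as create(String) above but uses a made-up class name (TextToken) and recognizes only "+" as a symbol, purely to keep it short.

```java
final class TextToken {
    enum Kinds { Unknown, Empty, Plus, String }

    // Reduced symbol factory: only '+' is known in this sketch.
    static TextToken create(char c) {
        if (c == '+') { return new TextToken("+", Kinds.Plus); }
        return unknown_;
    }

    // Same logic as the article's string factory.
    static TextToken create(String s) {
        if ( s == null  ||  s.trim().isEmpty() ) { return empty_; }
        if ( s.length() == 1 ) {
            TextToken t = create(s.charAt(0));
            if ( t.kind() != Kinds.Unknown ) { return t; }
        }
        return new TextToken(s.trim(), Kinds.String);
    }

    final String value() { return value_; }
    final Kinds  kind()  { return kind_; }

    private TextToken(String s, Kinds k) { kind_ = k; value_ = s; }

    private static final TextToken empty_   = new TextToken("", Kinds.Empty);
    private static final TextToken unknown_ = new TextToken("**Unknown**", Kinds.Unknown);

    private final Kinds  kind_;
    private final String value_;

    public static void main(String[] args) {
        String[] inputs = { null, "   ", "+", "x", "  dir  ", " + " };
        for (String in : inputs) {
            TextToken t = create(in);
            System.out.println(t.kind() + " \"" + t.value() + "\"");
        }
        // Empty ""
        // Empty ""
        // Plus "+"
        // String "x"
        // String "dir"
        // String "+"   (the length check runs before trimming)
    }
}
```

The last input shows a small quirk of this logic: a symbol surrounded by spaces is longer than one character, so it skips the safety net and becomes a string token.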

Query methods

Besides its value, a token reports its kind in two forms: the enum element itself and the string representation of that element ("Plus" is returned for Kinds.Plus).

    final String value()     { return value_; }
    final Kinds  kind()      { return kind_; }
    final String kindName()  { return kind_.toString(); }

    public String toString() {
        return String.format("[%-14s: \"%s\"]", kindName(), value());
    }
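For reference, here is what the format string in toString() produces. This is a small standalone sketch; the class FormatSketch and the method show are hypothetical names, and only the format string is taken from the code above. Note that %-14s sets a minimum width and does not truncate, so kind names longer than 14 characters (RightSquareBracket has 18) simply push the colon to the right.

```java
final class FormatSketch {
    // Applies the same format string as Token.toString() to arbitrary inputs.
    static String show(String kindName, String value) {
        return String.format("[%-14s: \"%s\"]", kindName, value);
    }

    public static void main(String[] args) {
        System.out.println(show("Plus", "+"));
        // [Plus          : "+"]
        System.out.println(show("RightSquareBracket", "]"));
        // [RightSquareBracket: "]"]
    }
}
```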

Summary

This concludes the explanation of the token class.

For this article I increased the number of symbol tokens compared with the program I actually wrote, and even at this scale I found it tedious to keep the enum and the factory's switch statement in sync. I considered generalizing this part, but decided against it this time because it would mean more classes and I was short on time.
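One possible way to avoid that synchronization, shown only as a sketch and not as what the original program does, is to let each enum element carry the characters it stands for and build a lookup table from the enum itself. The names KindTable, symbols_, table_, and kindOf are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class KindTable {
    enum Kinds {
        Unknown(""),
        Plus("+"),
        Minus("-"),
        Separator(", "),        // two characters map to one kind
        LeftParenthesis("("),
        RightParenthesis(")");

        private final String symbols_; // every character that produces this kind
        Kinds(String symbols) { symbols_ = symbols; }
    }

    // Built once from the enum, so adding an element needs no second edit elsewhere.
    private static final Map<Character, Kinds> table_ = new HashMap<>();
    static {
        for (Kinds k : Kinds.values()) {
            for (char c : k.symbols_.toCharArray()) { table_.put(c, k); }
        }
    }

    // Returns Kinds.Unknown instead of null for unregistered characters.
    static Kinds kindOf(char c) {
        return table_.getOrDefault(c, Kinds.Unknown);
    }

    public static void main(String[] args) {
        System.out.println(kindOf(',')); // Separator
        System.out.println(kindOf(' ')); // Separator
        System.out.println(kindOf('#')); // Unknown
    }
}
```

The trade-off is the one mentioned above: the mapping lives in one place, at the cost of a little more machinery than a plain switch statement.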

Next time I will cover the main body of the lexical analyzer.

Update: "Creating lexical analysis in Java 8 (Part 2)" has been posted.
