Introduction

I wrote a simple lexical analyzer (lexer) for my own Java console program. Lexical analysis is usually easier with an existing library, but the program is used at a workplace where downloading from external sites is prohibited, so I built it from scratch using only what Java 8 provides out of the box.

While writing it, I referred to the article "Implementing simple lexical analysis in Java".

Important points

Requirements

First, here are the requirements for the lexical analyzer.

- Implement it using only Java 8 features.
- It is not meant for parsing a full-fledged programming language, so it performs only simple lexical analysis (roughly halfway between the Windows command line and a Unix shell).
- Only single-character symbols and strings are recognized as tokens.

Program style

Please note that the code is written in my own idiosyncratic style, which deviates from common Java conventions. The reasons are as follows.

- To keep the calling code simple, avoid throwing exceptions or returning null wherever possible. I don't want the main logic buried under a pile of error checks.
- Minimize scope.
- Use immutable objects wherever possible.
- Do not use conventional getters and setters. (Getters and setters are evil. Period. See the original: [Getters/Setters. Evil. Period.](https://www.yegor256.com/2014/09/16/getters-and-setters-are-evil.html))
- Do not use annotations, to keep the source compact.
- In principle, do not use import statements, so that I get used to the Java standard library. The exception is Stream-related classes, whose fully qualified names tend to get verbose; those are imported.

Token class

First, here is the full source of the basic token class.

`Token.java`


package console;

final class Token {
    enum Kinds {
        Unknown,
        Empty,
        Ampersand,          // "&"
        Assign,             // "="
        Plus,               // "+"
        Minus,              // "-"
        Asterisk,           // "*"
        Slash,              // "/"
        Separator,          // "," or whitespace
        LeftParenthesis,    // "("
        RightParenthesis,   // ")"
        LeftCurlyBracket,   // "{"
        RightCurlyBracket,  // "}"
        LeftSquareBracket,  // "["
        RightSquareBracket, // "]"
        Colon,              // ":"
        BackSlash,          // "\"
        DoubleQuote,        // """
        SingleQuote,        // "'"
        String,
    }

    static Token create(char c) {
        final String s = Character.toString(c);
        switch(c) {
            case '&'  : return new Token(s, Kinds.Ampersand         );
            case '='  : return new Token(s, Kinds.Assign            );
            case '+'  : return new Token(s, Kinds.Plus              );
            case '-'  : return new Token(s, Kinds.Minus             );
            case '*'  : return new Token(s, Kinds.Asterisk          );
            case '/'  : return new Token(s, Kinds.Slash             );
            case ','  : // fall through
            case ' '  : return new Token(s, Kinds.Separator         );
            case '('  : return new Token(s, Kinds.LeftParenthesis   );
            case ')'  : return new Token(s, Kinds.RightParenthesis  );
            case '{'  : return new Token(s, Kinds.LeftCurlyBracket  );
            case '}'  : return new Token(s, Kinds.RightCurlyBracket );
            case '['  : return new Token(s, Kinds.LeftSquareBracket );
            case ']'  : return new Token(s, Kinds.RightSquareBracket);
            case ':'  : return new Token(s, Kinds.Colon             );
            case '\\' : return new Token(s, Kinds.BackSlash         );
            case '\"' : return new Token(s, Kinds.DoubleQuote       );
            case '\'' : return new Token(s, Kinds.SingleQuote       );
        }
        return unknown_;
    }

    static Token create(String s) {
        if ( s == null  ||  s.trim().isEmpty() ) { return empty_; }
        // Safeguard in case a symbol is mistakenly passed as a string
        if ( s.length() == 1 ) {
            Token t = Token.create(s.charAt(0));
            if ( t.kind() != Kinds.Unknown ) { return t; }
        }
        return new Token(s.trim(), Kinds.String);
    }

    final String value()     { return value_; }
    final Kinds  kind()      { return kind_; }
    final String kindName()  { return kind_.toString(); }
    public String toString() {
        return String.format("[%-14s: \"%s\"]", kindName(), value());
    }

    private Token(String s, Kinds k) {
        kind_  = k;
        value_ = s;
    }

    private static final Token empty_   = new Token("", Kinds.Empty);                // empty token
    private static final Token unknown_ = new Token("**Unknown**", Kinds.Unknown);   // unknown token

    private final Kinds  kind_;
    private final String value_;
}

Commentary

It's a short source file, so it probably needs no explanation, but ~~since I'll forget it myself~~ I'll briefly walk through each part.

Token type

The token kinds are registered in an enum. They are markers that will be used later during parsing, so characters and enum constants do not have to correspond one-to-one.

    enum Kinds {
        Unknown,
        Empty,
        Ampersand,          // "&"
        Assign,             // "="
        Plus,               // "+"
        // (rest omitted)
        String,
    }
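
To illustrate why the kinds work as markers for a later parsing stage, here is a minimal sketch. The class `KindsDemo` and the methods `kindOf` and `describe` are hypothetical names that do not appear in the original program, and the enum is cut down to a few constants. It only shows that code consuming the tokens can branch on the kind and does not need to know which character produced it.

```java
final class KindsDemo {
    // Reduced, hypothetical version of the article's enum.
    enum Kinds { Unknown, Plus, Separator }

    // Two different characters (',' and ' ') map to the same kind.
    static Kinds kindOf(char c) {
        switch (c) {
            case '+' : return Kinds.Plus;
            case ',' : // fall through
            case ' ' : return Kinds.Separator;
            default  : return Kinds.Unknown;
        }
    }

    // A later stage branches on the kind only, not on the character.
    static String describe(Kinds k) {
        switch (k) {
            case Plus      : return "operator";
            case Separator : return "skip";
            default        : return "other";
        }
    }
}
```

With this sketch, `describe(kindOf(','))` and `describe(kindOf(' '))` take the same branch, which is the point of not tying constants to characters one-to-one.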

Factory method

Symbol tokens are created through a factory method. The lexical analyzer passes each character it reads directly as a char, which keeps this factory separate from the one for string tokens. If the character matches no case in the switch statement, the unknown token created in advance is returned. Because Token is immutable, its contents can never change, so the pre-created object can be reused safely.

The comma and the space are an example of the non-one-to-one correspondence between characters and enum constants mentioned above: both are created as Kinds.Separator.

    static Token create(char c) {
        final String s = Character.toString(c);
        switch(c) {
            case '&'  : return new Token(s, Kinds.Ampersand         );
            case '='  : return new Token(s, Kinds.Assign            );
            case '+'  : return new Token(s, Kinds.Plus              );
            case '-'  : return new Token(s, Kinds.Minus             );
            case '*'  : return new Token(s, Kinds.Asterisk          );
            case '/'  : return new Token(s, Kinds.Slash             );
            case ','  : // fall through
            case ' '  : return new Token(s, Kinds.Separator         );
            case '('  : return new Token(s, Kinds.LeftParenthesis   );
            case ')'  : return new Token(s, Kinds.RightParenthesis  );
            case '{'  : return new Token(s, Kinds.LeftCurlyBracket  );
            case '}'  : return new Token(s, Kinds.RightCurlyBracket );
            case '['  : return new Token(s, Kinds.LeftSquareBracket );
            case ']'  : return new Token(s, Kinds.RightSquareBracket);
            case ':'  : return new Token(s, Kinds.Colon             );
            case '\\' : return new Token(s, Kinds.BackSlash         );
            case '\"' : return new Token(s, Kinds.DoubleQuote       );
            case '\'' : return new Token(s, Kinds.SingleQuote       );
        }
        return unknown_;
    }

    private static final Token unknown_ = new Token("**Unknown**", Kinds.Unknown);   // unknown token
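
The reuse of the pre-created unknown token can be checked with a small experiment. The sketch below is a cut-down stand-in for the class above: the name `SharedToken` is hypothetical, only one symbol is kept, and the field is declared `final` as an assumption of mine. It is meant to show the sharing behavior, not to replace the real class.

```java
final class SharedToken {
    enum Kinds { Unknown, Plus }

    static SharedToken create(char c) {
        if (c == '+') { return new SharedToken("+", Kinds.Plus); }
        return unknown_;   // one shared instance for every unrecognized character
    }

    Kinds  kind()  { return kind_; }
    String value() { return value_; }

    private SharedToken(String s, Kinds k) {
        kind_  = k;
        value_ = s;
    }

    // Safe to share because the object has no way to change after construction.
    private static final SharedToken unknown_ = new SharedToken("**Unknown**", Kinds.Unknown);

    private final Kinds  kind_;
    private final String value_;
}
```

In this sketch `SharedToken.create('?') == SharedToken.create('#')` is true because both calls return the same object, while two calls to `create('+')` return two distinct objects. Callers should therefore compare tokens through `kind()` and not by reference. Note also that the unknown token carries the fixed text `**Unknown**`, so the offending character itself is not kept.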

Next is the factory method for string tokens. The lexical analyzer reads to the end of a string and then hands the whole string to the factory, so unlike the symbol token factory, this overload takes a String argument. If null or a blank string (empty or whitespace only) is passed, a pre-created empty token is returned. This spares the caller extra error checks.

Single-character symbols are supposed to be passed as char, but as a safeguard in case one arrives as a String, a one-character string is first handed to the symbol token factory. If that factory does not recognize the character, it returns the unknown token, and in that case the string is created as a string token instead.

Although it is a bit out of order to mention it here, the constructor is declared private so that objects can be created only through the factory methods.

    static Token create(String s) {
        if ( s == null  ||  s.trim().isEmpty() ) { return empty_; }
        // Safeguard in case a symbol is mistakenly passed as a string
        if ( s.length() == 1 ) {
            Token t = Token.create(s.charAt(0));
            if ( t.kind() != Kinds.Unknown ) { return t; }
        }
        return new Token(s.trim(), Kinds.String);
    }

    private Token(String s, Kinds k) {
        kind_  = k;
        value_ = s;
    }

    private static final Token empty_   = new Token("", Kinds.Empty);                // empty token
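
To make the branches of the string factory concrete, the following sketch reproduces both factories on a reduced kind list. `MiniToken` is a hypothetical name used only for this illustration, and the `final` modifiers on the static fields are my assumption. The logic of the two `create` methods follows the code shown above.

```java
final class MiniToken {
    enum Kinds { Unknown, Empty, Plus, Separator, String }

    static MiniToken create(char c) {
        final String s = Character.toString(c);
        switch (c) {
            case '+' : return new MiniToken(s, Kinds.Plus);
            case ',' : // fall through
            case ' ' : return new MiniToken(s, Kinds.Separator);
        }
        return unknown_;
    }

    static MiniToken create(String s) {
        // null and blank input both map to the shared empty token
        if ( s == null  ||  s.trim().isEmpty() ) { return empty_; }
        // a one-character string is first offered to the symbol factory
        if ( s.length() == 1 ) {
            MiniToken t = create(s.charAt(0));
            if ( t.kind() != Kinds.Unknown ) { return t; }
        }
        return new MiniToken(s.trim(), Kinds.String);
    }

    Kinds  kind()  { return kind_; }
    String value() { return value_; }

    private MiniToken(String s, Kinds k) {
        kind_  = k;
        value_ = s;
    }

    private static final MiniToken empty_   = new MiniToken("", Kinds.Empty);
    private static final MiniToken unknown_ = new MiniToken("**Unknown**", Kinds.Unknown);

    private final Kinds  kind_;
    private final String value_;
}
```

With this sketch, `create((String) null)` and `create("   ")` return the empty token, `create("+")` returns a Plus token, `create("x")` returns a String token, and `create(" dir ")` returns a String token whose value is `dir`. One detail worth knowing: the length check is made on the untrimmed input, so `create(" + ")` yields a String token with value `+` and not a Plus token. If that matters for your input, trimming before the length check would be one way to change it.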

Inquiry

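The kind of a token can be queried in two forms: as the enum constant itself, through kind(), and as the string representation of that constant, through kindName() ("Plus" is returned for Kinds.Plus). The token text is returned by value().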

    final String value()     { return value_; }
    final Kinds  kind()      { return kind_; }
    final String kindName()  { return kind_.toString(); }

    public String toString() {
        return String.format("[%-14s: \"%s\"]", kindName(), value());
    }
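
The format string in `toString()` is easy to misread, so here is a small sketch that isolates it. `TokenFormatDemo` and its `format` method are hypothetical; the two parameters stand in for the results of `kindName()` and `value()`.

```java
final class TokenFormatDemo {
    // Same format string as Token.toString().
    // %-14s left-justifies the kind name in a field at least 14 characters wide.
    static String format(String kindName, String value) {
        return String.format("[%-14s: \"%s\"]", kindName, value);
    }
}
```

For a short name such as `Plus`, the name is padded with spaces up to 14 characters before the colon. The width is a minimum and does not truncate, so a longer name such as `LeftParenthesis` (15 characters) is printed in full as `[LeftParenthesis: "("]`, and the colons of such lines will not line up with the others.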

Summary

This concludes the explanation of the token class.

For this article I actually added more symbol tokens than the program I really wrote has, but even at this size I found it tedious to keep the enum and the factory's switch statement in sync. I considered generalizing this part, but decided against it this time because it would add more classes and I was short on time.

Next time, I'll cover the main body of the lexical analyzer.

→ Creating lexical analysis with Java 8 (Part 2) has been posted.
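
For reference, one possible way to avoid maintaining the enum and the switch statement separately is to let each enum constant carry its own characters and build a lookup table from them. This is only a sketch under my own assumptions, not the design of the original program: the names `KindTable`, `symbols_`, `table_` and `kindOf` are hypothetical, and the kind list is cut down.

```java
final class KindTable {
    enum Kinds {
        Unknown,
        Plus('+'),
        Minus('-'),
        Separator(',', ' '),   // one kind, two characters
        String;

        private final char[] symbols_;
        Kinds(char... symbols) { symbols_ = symbols; }
    }

    // Lookup table filled once from the enum itself, so a new symbol
    // only has to be added in one place.
    private static final java.util.Map<Character, Kinds> table_ = new java.util.HashMap<>();
    static {
        for (Kinds k : Kinds.values()) {
            for (char c : k.symbols_) { table_.put(c, k); }
        }
    }

    // Unregistered characters map to Unknown instead of null.
    static Kinds kindOf(char c) {
        return table_.getOrDefault(c, Kinds.Unknown);
    }
}
```

The factory could then create a token from `kindOf(c)` without a switch statement. The trade-off is that the characters now live inside the enum declaration, which ties the marker type more closely to the input syntax than the original design does.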

Creating lexical analysis in Java 8 (Part 1)