I made a simple lexical analyzer for my own Java console program. I created the Token class in the previous article, so this time I cover the lexical analyzer itself. Lexical analysis is easier with one of the many available libraries, but this code is used at a workplace where downloading from external sites is prohibited, so it is written from scratch using only Java 8 features.
2018/10/22: The source was modified. 2018/10/24: Fixed so that the constructor saves a copy of the string to be parsed.
While writing it, I referred to "Implementing simple lexical analysis in Java".
The source and its explanation are long this time, so please see the previous article for the Token class.
LexicalAnalyzer.java
package console;
public class LexicalAnalyzer {
public static LexicalAnalyzer create(String target) {
if ( target == null || target.trim().isEmpty() ) { target = ""; }
return new LexicalAnalyzer(target);
}
public java.util.List<Token> analyze() {
char c;
while ( (c = next()) != '\0' ) {
if ( isSymbol_1(c) ) {
tokens_.add( Token.create(c) );
continue;
}
if ( isQuote(c) ) {
quotedText(c);
continue;
}
text(c);
}
return new java.util.ArrayList<>(tokens_);
}
// query methods ================================================================================
public boolean isEmpty() { return tokens_.size() == 0;}
public boolean isValid() {
return !isEmpty() && tokens_.stream().noneMatch( e -> e.kind() == Token.Kinds.Unknown );
}
// internal methods ======================================================================
/**Cut out up to the next quote character as a block of tokens. */
private void quotedText(char quote) {
tokens_.add( Token.create(quote)); // create token of begin quote
java.lang.StringBuilder builder = new java.lang.StringBuilder();
char c;
while ( (c = nextAll()) != '\0' && c != quote) { builder.append(c); }
if ( builder.length() != 0 ) {
tokens_.add( Token.create(builder.toString()) ); // append string
}
tokens_.add( Token.create(c) ); // append token of end quote
}
/** Cut out the text up to the next separator or whitespace character as a single token. */
private void text(char first) {
java.lang.StringBuilder builder = new java.lang.StringBuilder();
builder.append(first);
char c;
while ( (c = nextAll()) != '\0' && !isSeparator(c) && !isWhitespace(c) ) {
builder.append(c);
}
tokens_.add( Token.create(builder.toString()) );
// append separator token, if not end of text
if ( isEnd() ) { return; }
tokens_.add( isWhitespace(c) ? Token.create(' ') : Token.create(c) );
}
private char next() {
skipSpace();
return nextAll();
}
private char nextAll() {
char c = aChar();
++pos_;
return c;
}
private char aChar() { return isEnd() ? '\0' : target_.charAt(pos_); }
private void skipSpace() {
while ( !isEnd() && Character.isWhitespace(aChar()) ) { pos_++; }
}
private boolean isEnd() { return length_ <= pos_; }
private boolean isSeparator(char c) { return exists(separators_, c); }
private boolean isQuote(char c) { return exists(quotes_, c); }
private boolean isSymbol_1(char c) { return exists(symbol1_, c); }
private boolean isWhitespace(char c) { return Character.isWhitespace(c); }
private boolean exists(char[] arr, char c) {
return java.util.Arrays.binarySearch(arr, c) >= 0;
}
private LexicalAnalyzer(String target) {
target_ = new String(target); // keep a private copy of the target string (2018/10/24 fix)
length_ = target.length();
}
// internal fields ======================================================================
// Each array is sorted right after initialization because exists() uses binarySearch.
private static final char[] separators_ = { ':', ',', '=', '(', ')', '{', '}' };
static { java.util.Arrays.sort(separators_); }
private static final char[] quotes_ = { '"', '\'' };
static { java.util.Arrays.sort(quotes_); }
private static final char[] symbol1_ = {'(', ')', '{', '}', ':', ',', '=', '&' };
static { java.util.Arrays.sort(symbol1_); }
final String target_; // analyze target string
final int length_; // length of target string
int pos_ = 0; // "next" analyzing position
java.util.List<Token> tokens_ = new java.util.ArrayList<>(); // result
}
Ideally the string to be parsed would be a mandatory argument, but there is no way to force that on the caller, so the factory method substitutes an empty string when the argument is null or blank. This frees the caller from null checks and exception handling. It also means the constructor does not need to check its argument.
public static LexicalAnalyzer create(String target) {
if ( target == null || target.trim().isEmpty() ) { target = ""; }
return new LexicalAnalyzer(target);
}
private LexicalAnalyzer(String target) {
target_ = new String(target);
length_ = target.length();
}
final String target_; // analyze target string
final int length_; // length of target string
Being a lexical analyzer, it takes the target string one character at a time and processes it. The following methods provide the basic operations for that.
exists() looks characters up with Arrays.binarySearch(), which requires a sorted array, so each array is sorted in a static initializer (static {...}) immediately after it is initialized.
private char next() {
skipSpace();
return nextAll();
}
private char nextAll() {
char c = aChar();
++pos_;
return c;
}
private char aChar() { return isEnd() ? '\0' : target_.charAt(pos_); }
private void skipSpace() {
while ( !isEnd() && Character.isWhitespace(aChar()) ) { pos_++; }
}
private boolean isEnd() { return length_ <= pos_; }
private boolean isSeparator(char c) { return exists(separators_, c); }
private boolean isQuote(char c) { return exists(quotes_, c); }
private boolean isSymbol_1(char c) { return exists(symbol1_, c); }
private boolean isWhitespace(char c) { return Character.isWhitespace(c); }
private boolean exists(char[] arr, char c) {
return java.util.Arrays.binarySearch(arr, c) >= 0;
}
private static final char[] separators_ = { ',', '=', '(', ')', '{', '}', ':' };
static { java.util.Arrays.sort(separators_); }
private static final char[] quotes_ = { '"', '\'' };
static { java.util.Arrays.sort(quotes_); }
private static final char[] symbol1_ = { '(', ')', '{', '}', ':', ',', '=', '&' };
static { java.util.Arrays.sort(symbol1_); }
int pos_ = 0; // "next" analyzing position
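As a small illustration of why the arrays are sorted before exists() is used, here is a minimal sketch. The class name SortedCharLookup and the constant SEPARATORS are made up for this example and are not part of the article's source; Arrays.sort and Arrays.binarySearch are the standard JDK methods.

```java
import java.util.Arrays;

public class SortedCharLookup {
    // Hypothetical copy of the separator set, declared in arbitrary order.
    private static final char[] SEPARATORS = { ',', '=', '(', ')', '{', '}', ':' };

    // Arrays.binarySearch only gives a defined result on an array sorted in ascending order.
    static { Arrays.sort(SEPARATORS); }

    /** Returns true if c is one of the separator characters. */
    public static boolean isSeparator(char c) {
        return Arrays.binarySearch(SEPARATORS, c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(new String(SEPARATORS)); // prints "(),:={}"
        System.out.println(isSeparator(':'));       // prints "true"
        System.out.println(isSeparator('a'));       // prints "false"
    }
}
```

Sorting in a static initializer means the declaration order of the characters does not matter, which is probably why the two listings can declare separators_ in different orders.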
analyze() performs lexical analysis and returns the result as a list of the Token objects created in the previous article. The return value is a copy of the internal token list tokens_. Because it is an ordinary list, the caller can add or remove elements without affecting the tokens_ held by the analyzer; the elements themselves, however, refer to the same Token objects. Calling analyze() a second time or later returns a copy of the token list that has already been built. No reanalysis is performed.
By changing next() so that it skips whitespace characters, and by adding predicate methods, the code is cleaner than the previous version, which was controlled by a switch statement. The intent of each check is reflected in the method name, so no comments should be needed. (Right?)
public java.util.List<Token> analyze() {
char c;
while ( (c = next()) != '\0' ) {
if ( isSymbol_1(c) ) {
tokens_.add( Token.create(c) );
continue;
}
if ( isQuote(c) ) {
quotedText(c);
continue;
}
text(c);
}
return new java.util.ArrayList<>(tokens_);
}
java.util.List<Token> tokens_ = new java.util.ArrayList<>(); // result
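The effect of returning new ArrayList<>(tokens_) can be sketched as follows. DefensiveCopyDemo, add(), snapshot() and size() are hypothetical names used only for this illustration, and String stands in for Token.

```java
import java.util.ArrayList;
import java.util.List;

public class DefensiveCopyDemo {
    private final List<String> items = new ArrayList<>(); // plays the role of tokens_

    public DefensiveCopyDemo add(String item) {
        items.add(item);
        return this;
    }

    /** Returns a new list on every call; the list is separate, the elements are shared. */
    public List<String> snapshot() {
        return new ArrayList<>(items);
    }

    public int size() {
        return items.size();
    }

    public static void main(String[] args) {
        DefensiveCopyDemo demo = new DefensiveCopyDemo().add("a").add("b");
        List<String> copy = demo.snapshot();
        copy.clear();                    // modifies only the caller's copy
        System.out.println(copy.size()); // prints "0"
        System.out.println(demo.size()); // prints "2"
    }
}
```

The copy is shallow, so this protects the internal list itself but not the objects in it. That should be sufficient as long as the elements cannot be modified by the caller.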
A quoted string becomes a single token, including its whitespace characters. Whitespace between the opening quote and the first character, and between the last character and the closing quote, is removed, but any other whitespace inside the string, including newline and tab characters, is retained.
/**Cut out up to the next quote character as a block of tokens. */
private void quotedText(char quote) {
tokens_.add( Token.create(quote)); // create token of begin quote
java.lang.StringBuilder builder = new java.lang.StringBuilder();
char c;
while ( (c = nextAll()) != '\0' && c != quote) { builder.append(c); }
if ( builder.length() != 0 ) {
tokens_.add( Token.create(builder.toString()) ); // append string
}
tokens_.add( Token.create(c) ); // append token of end quote
}
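The scan up to the closing quote can be tried in isolation with a sketch like the one below. QuotedScanDemo and readQuoted() are hypothetical names. Unlike quotedText(), this version returns null for an unterminated quote instead of emitting a token, and it does not trim the result.

```java
public class QuotedScanDemo {
    /**
     * Returns the text between the quote character at openIndex and the next
     * occurrence of the same quote character, or null if the input ends first.
     */
    public static String readQuoted(String s, int openIndex) {
        char quote = s.charAt(openIndex);
        StringBuilder builder = new StringBuilder();
        for (int pos = openIndex + 1; pos < s.length(); pos++) {
            char c = s.charAt(pos);
            if (c == quote) {
                return builder.toString(); // closing quote found
            }
            builder.append(c);             // whitespace inside the quotes is kept
        }
        return null;                       // input ended before the closing quote
    }

    public static void main(String[] args) {
        System.out.println(readQuoted("say 'a b' now", 4)); // prints "a b"
        System.out.println(readQuoted("\"ss", 0));          // prints "null"
    }
}
```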
text() cuts out a string token that is not enclosed in quotation marks.
/** Cut out the text up to the next separator or whitespace character as a single token. */
private void text(char first) {
java.lang.StringBuilder builder = new java.lang.StringBuilder();
builder.append(first);
char c;
while ( (c = nextAll()) != '\0' && !isSeparator(c) && !Character.isWhitespace(c) ) {
builder.append(c);
}
tokens_.add( Token.create(builder.toString()) );
if ( isEnd() ) { return; }
tokens_.add( isWhitespace(c) ? Token.create(' ') : Token.create(c) );
}
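Reading the listing, text() checks isEnd() after the terminating character has already been consumed, so a separator that happens to be the very last character of the input looks like it would not be emitted. The sketch below is one way to cut unquoted text at separators and whitespace without depending on the position check. WordScanDemo and split() are hypothetical names, and unlike text() this version simply drops whitespace instead of emitting a space token.

```java
import java.util.ArrayList;
import java.util.List;

public class WordScanDemo {
    private static final String SEPARATORS = ":,=(){}"; // same characters as separators_

    /** Splits unquoted text into words and single-character separators. */
    public static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int pos = 0; pos < s.length(); pos++) {
            char c = s.charAt(pos);
            boolean separator = SEPARATORS.indexOf(c) >= 0;
            if (separator || Character.isWhitespace(c)) {
                if (word.length() > 0) {       // a word ends here
                    out.add(word.toString());
                    word.setLength(0);
                }
                if (separator) {               // decide by the character, not by the position
                    out.add(String.valueOf(c));
                }
            } else {
                word.append(c);
            }
        }
        if (word.length() > 0) {               // flush the last word at end of input
            out.add(word.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("key = a,b)"));
    }
}
```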
These are the state query methods of LexicalAnalyzer. isValid() returns false when nothing has been analyzed yet, or when any token satisfies Token.kind() == Kinds.Unknown.
public boolean isEmpty() { return tokens_.size() == 0;}
public boolean isValid() {
return !isEmpty() && tokens_.stream().noneMatch( e -> e.kind() == Token.Kinds.Unknown );
}
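The !isEmpty() guard in isValid() matters because Stream.noneMatch returns true for an empty stream. A minimal sketch, with a hypothetical Kind enum standing in for Token.Kinds:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ValidityCheckDemo {
    /** Hypothetical stand-in for Token.Kinds. */
    public enum Kind { STRING, SEPARATOR, UNKNOWN }

    /** Valid means: at least one token, and none of them UNKNOWN. */
    public static boolean isValid(List<Kind> kinds) {
        return !kinds.isEmpty() && kinds.stream().noneMatch(k -> k == Kind.UNKNOWN);
    }

    public static void main(String[] args) {
        List<Kind> none = Collections.emptyList();
        // noneMatch alone is true for an empty stream, so the emptiness check decides here.
        System.out.println(none.stream().noneMatch(k -> k == Kind.UNKNOWN));       // prints "true"
        System.out.println(isValid(none));                                         // prints "false"
        System.out.println(isValid(Arrays.asList(Kind.STRING, Kind.SEPARATOR)));   // prints "true"
        System.out.println(isValid(Arrays.asList(Kind.STRING, Kind.UNKNOWN)));     // prints "false"
    }
}
```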
Last but not least, here is the main() method used for testing, along with its output. ~~I don't know JUnit~~ Separator tokens are hidden to keep the output short.
public static void main(String[] args) {
String s;
java.util.List<Token> tokens;
LexicalAnalyzer lex;
System.out.println("test 1 -----------------------------------------------------------------------------------");
lex = LexicalAnalyzer.create(null); // keep the instance so that isEmpty() can be called on it below
lex.analyze().stream().filter( e -> e.kind() != Token.Kinds.Separator )
.forEach( e -> System.out.println(e) );
System.out.printf("isEmpty() after analyze : %s\r\n", lex.isEmpty());
System.out.println("\r\ntest 2 -------------------------------------------------------------------------------");
s = "s";
lex = LexicalAnalyzer.create(s);
System.out.printf("isEmpty() before analyze() : %s\r\n", lex.isEmpty());
lex.analyze().stream().filter( e -> e.kind() != Token.Kinds.Separator )
.forEach( e -> System.out.println(e) );
System.out.printf("isEmpty() after analyze() : %s\r\n", lex.isEmpty());
System.out.println("test 3 -----------------------------------------------------------------------------------");
s = " [some] -c \" text document \", \r\n (sequence1, \"seq 2\r\n quoted\",seq3 seq4 'seq 5 ')\"ss";
lex = LexicalAnalyzer.create(s);
System.out.printf("isValid() before analyze() : %s\r\n", lex.isValid());
lex.analyze().stream().filter( e -> e.kind() != Token.Kinds.Separator )
.forEach( e -> System.out.println(e) );
System.out.printf("isValid() after analyze() : %s\r\n", lex.isValid());
}
Test 1: The factory method substitutes an empty string for null, so analyze() and the subsequent stream() run without error and no tokens are displayed. The "isEmpty() ..." line is the result of the isEmpty() method.
Test 2: Rather than checking the analysis result, this checks how isEmpty() behaves before and after analyze() is executed.
Test 3: Since "seq 2 ..." contains a line break inside the quotation marks, the processing result also contains a line break. The last isValid() is false because parsing of a quoted string began at the final double quote, but the input ended before a matching closing quote appeared. If the quote is closed properly, it will be true.
test 1 -------------------------------------------------------------------------------
isEmpty() after analyze : true
test 2 -------------------------------------------------------------------------------
isEmpty() before analyze() : true
[String : "s"]
isEmpty() after analyze() : false
test 3 -------------------------------------------------------------------------------
isValid() before analyze() : false
[String : "[some]"]
[String : "-c"]
[DoubleQuote : """]
[String : "text document"]
[DoubleQuote : """]
[LeftParenthesis : "("]
[String : "sequence1"]
[DoubleQuote : """]
[String : "seq 2
quoted"]
[DoubleQuote : """]
[String : "seq3"]
[String : "seq4"]
[SingleQuote : "'"]
[String : "seq 5"]
[SingleQuote : "'"]
[RightParenthesis: ")"]
[DoubleQuote : """]
[String : "ss"]
[Unknown : "**Unknown**"]
isValid() after analyze() : false
It has been a long article, but that completes the lexical analyzer.
Parsing would be the natural next step, but I can't write it yet because the user-facing syntax has not been decided. The grammar is settled to some extent, but the preparation for it is not keeping up, so I don't know when I will write it. (I have to finish the console class first...)