CSV parsing with newline characters in fields

This is a story of finding a purpose to fit the means.

For some time I had wanted to write a parser with JavaCC, under the influence of the book "Let's make a normal compiler", which I came across at the library.

"The CSV file that dumps the event DB of jp1base contains line breaks and commas in the fields surrounded by double quotes, so it is difficult to paste it in Excel."

So I decided to write a parser right away.


Things to do

Convert CSV data whose fields contain commas and line breaks into a form that is easy to read in Excel, and write it to standard output.

Approach

  1. Rough flow
     - Parse the CSV data into array data.
     - After reading one line of data, format it and output it.
     - Implement the above in Java.

  2. Definition of CSV data
     - Data is separated by commas.
     - Fields enclosed in double quotes (") may contain commas and line breaks.
     - To represent a double quote itself inside a double-quoted field, escape it with a backslash (\).

  3. Output format
     - Output the fields separated by commas.
     - Enclose each field in double quotes.
     - In each field's data, replace newline characters with spaces and remove double quotes.

Environment

  1. OS: Any OS that the JDK supports is fine. (What I built on my Mac at home ran unchanged on the RHEL6 server and Windows 10 at work.)

  2. JDK
    This article has been confirmed to work in a JDK 1.6 environment. According to the official site (as of April 2020), JavaCC is 100% pure Java, so there is no runtime dependency.

  3. JavaCC: I used 6.0. For the setup method, please refer to the reference article.

Scanner definition

First of all, define the scanner. The scanner is the part responsible for lexical analysis: it turns a sequence of characters into meaningful lexical units (tokens, written TOKEN). The scanner definition goes in a text file with the extension ".jj". The following is the scanner definition I created this time.

CSVParser.jj


//This SKIP definition tells the scanner to ignore whitespace.
//No TOKEN is generated from skipped characters.
//Here, spaces and tabs are ignored.
SKIP : {
    " "
  | "\t"
}

//Definition for reading strings enclosed in double quotes.
//The MORE directive says: when a double quote is found,
//switch to the "IN_DOUBLE_QUOTE" mode and keep reading.
MORE : {
    "\"" : IN_DOUBLE_QUOTE
}

//Rules for reading the contents of a double-quoted string.
//In IN_DOUBLE_QUOTE mode:
// 1. If the character is neither a backslash nor a double quote, keep it and read on.
// 2. If the character is a backslash, keep it together with whatever character follows.
//(Under rule 2, a double quote that follows a backslash is read as an ordinary character.)
//The double quote is excluded from rule 1 so that the closing quote can match DQFIELD below.
<IN_DOUBLE_QUOTE> MORE: {
    < ~["\"", "\\"] >
  | < "\\" ~[] >
}

//Next, the definition for leaving the double-quoted string.
//In IN_DOUBLE_QUOTE mode, when a lone double quote appears,
//generate a token called DQFIELD and return to the default mode (DEFAULT).
//"DQFIELD" is a name I chose myself.
//So is IN_DOUBLE_QUOTE, by the way.
<IN_DOUBLE_QUOTE> TOKEN: {
    <DQFIELD : "\""> : DEFAULT
}

//In default mode:
//a run of characters containing no comma, double quote, or newline character is a STDFIELD token;
//a comma is a <SEPARATOR> token;
//a run of "\n" or "\r" characters, which marks the end of a line, is an <EOL> token.
TOKEN : {
    <STDFIELD : (~["\"", ",", "\r", "\n"])+ >
  | <SEPARATOR : "," >
  | <EOL : (["\r", "\n"])+ >
}
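
To make the mode switching concrete, here is a hand-written Java sketch that imitates the rules above. It is not what JavaCC generates: the class name LexerSketch, the tokenize method, and the "KIND:image" string format are assumptions made up for illustration, and the SKIP rule is left out for brevity. It also assumes that a quoted field ends at the first double quote that is not preceded by a backslash.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; JavaCC generates a table-driven token manager instead.
class LexerSketch {
    // Splits the input into token descriptions, imitating the DEFAULT and IN_DOUBLE_QUOTE modes.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '"') {
                // MORE: the opening quote switches to IN_DOUBLE_QUOTE and stays in the image.
                int start = i++;
                while (i < input.length() && input.charAt(i) != '"') {
                    // A backslash is kept together with the character that follows it.
                    i += (input.charAt(i) == '\\') ? 2 : 1;
                }
                if (i >= input.length()) {
                    throw new IllegalArgumentException("unterminated quoted field");
                }
                i++; // the closing quote ends the token and returns to DEFAULT
                tokens.add("DQFIELD:" + input.substring(start, i));
            } else if (c == ',') {
                tokens.add("SEPARATOR");
                i++;
            } else if (c == '\r' || c == '\n') {
                // A run of CR/LF characters is a single EOL token.
                while (i < input.length() && (input.charAt(i) == '\r' || input.charAt(i) == '\n')) {
                    i++;
                }
                tokens.add("EOL");
            } else {
                // Anything up to the next quote, comma, or newline is a STDFIELD.
                int start = i;
                while (i < input.length() && "\",\r\n".indexOf(input.charAt(i)) < 0) {
                    i++;
                }
                tokens.add("STDFIELD:" + input.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The line break inside the quoted field does not produce an EOL token.
        System.out.println(tokenize("abc,\"def\nghi\",jkl\n"));
    }
}
```

Note that the image of a DQFIELD still contains the surrounding double quotes, because MORE keeps accumulating characters into the token being built.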

Definition of parser

A parser analyzes the sequence of tokens generated by the scanner and performs the necessary work. Here, the aim is to return one row of CSV data as an array. First, let's define the CSV data.

CSV data definition

CSV data is a series of lines (records), each made up of data items (fields) separated by commas.

CSV data image


Field 1-1,Field 1-2,Field 1-3,・ ・ ・ ・
Field 2-1,Field 2-2,Field 2-3,・ ・ ・ ・
 :

Let's express this structure with a parser, starting with the field definition. The parser definition goes in the same file as the scanner definition, by the way. (For something like CSV, the file does not get very big.)

CSVParser.jj(Field definition)


String field() : {
} {
  (
    <DQFIELD> | <STDFIELD>
  )
}
/////Commentary//////
//Return value: the data should come out as a string, so the return type is String.
//Name: called "field" so that it is easy to recognize later.
//Contents: defined as either a string enclosed in double quotes (DQFIELD) or an ordinary string (STDFIELD).

Next is the definition of one line of data (here we call it a record).

CSVParser.jj(Record definition)


//A record is one field followed by zero or more SEPARATOR-delimited fields.
//It is returned as a list of fields, List<String>.
List<String> record() : {
} {
  //First there is a field
  field()
  //Then zero or more SEPARATORs, each optionally followed by a field (the field may be missing).
  //Note that a record starting with a comma is not handled.
  (
    <SEPARATOR>
    (field())?
  )*
}

Finally, the definition of the entire CSV file.

CSVParser.jj(CSV file definition)


//csvContents() (the contents of the CSV) means
//that records are lined up over multiple lines.
//Each line is written to standard output as it is read and nothing needs to be returned, so the return type is void.
void csvContents() : {
} {
  (
    record()
    <EOL>
  )+
  <EOF> 
}

Fleshing out the parser

Now that the structure of the CSV file is defined, the next step is to write what should happen when each part of that structure is read. Here is the code I actually wrote.

CSVParser.jj(A parser definition with actual processing added.)


//It looks very different, but basically I have only added processing to the earlier definitions.
//Field definition
String field() : {
  String data = "";  //Variable for storing the read character string (Initialize with an empty string)
  Token fieldToken;  //Variable to store the read token
} {
  (
    //When a DQFIELD is read, its image (the actual string) is stored in the variable data.
    fieldToken = <DQFIELD> {
      data = fieldToken.image;
    }
    //Or, when a STDFIELD is read, its image (the actual string) is likewise stored in data.
    | fieldToken = <STDFIELD> {
      data = fieldToken.image;
    }
  ) {
    //After reading one DQFIELD or STDFIELD, return the value of data.
    return data;
  }
}

//Record definition
List<String> record() : {
  List<String> fieldList = new ArrayList<String>();
  String fieldData;
} {
  //First comes one field
  fieldData = field(){
    //Add the first field to the list
    fieldList.add(fieldData);
  }
  //Then zero or more SEPARATORs, each optionally followed by a field (the field may be missing).
  //Note that a record starting with a comma is not handled.
  (
    <SEPARATOR>
    (fieldData = field(){
      //If a field follows the SEPARATOR, add it to the list as well.
      fieldList.add(fieldData);
    })?
  )*
  {
    //After reading one line, return the list built so far.
    return fieldList;
  }
}

//Definition of the entire CSV file
void csvContents() : {
  List<String> csvRecord; //Variable to store one line of data
} {
  (
    csvRecord = record(){
      //After one line has been read, write it to standard output.
      //CSVWriter is a class of my own. (It appears later.)
      CSVWriter.writeLine(csvRecord);
    }
    <EOL>
  )+
  <EOF> 
}
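
To see which inputs the record() production accepts, the following hand-written Java sketch mirrors `field() ( <SEPARATOR> ( field() )? )*` over a list of strings. It is only an illustration, not the code JavaCC generates: RecordSketch and its simplified token representation (a "," string stands for SEPARATOR, anything else for a field token) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the record() production; not generated by JavaCC.
class RecordSketch {
    // Mirrors: field() ( <SEPARATOR> ( field() )? )*
    static List<String> record(List<String> tokens) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        // The production requires a field first, so a leading comma is rejected.
        if (tokens.isEmpty() || tokens.get(0).equals(",")) {
            throw new IllegalArgumentException("a record must start with a field");
        }
        fields.add(tokens.get(pos++));
        // Zero or more repetitions of: SEPARATOR, then an optional field.
        while (pos < tokens.size() && tokens.get(pos).equals(",")) {
            pos++; // consume SEPARATOR
            if (pos < tokens.size() && !tokens.get(pos).equals(",")) {
                fields.add(tokens.get(pos++)); // the optional field is present
            }
        }
        if (pos != tokens.size()) {
            throw new IllegalArgumentException("unexpected token at position " + pos);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(record(Arrays.asList("a", ",", "b", ",", "c")));
        System.out.println(record(Arrays.asList("a", ",", ",", "b")));
    }
}
```

If this reading of the grammar is right, an input such as a,,b yields two fields rather than three, because a missing field between two commas adds nothing to the list. That is worth keeping in mind if the column positions matter after the data is pasted into Excel.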

Convert the contents of the CSV file and output

That covers how to write the parser, but since we have come this far, let's take it to the point where it actually runs. First, define the CSVWriter class, which writes the list of strings from earlier as comma-separated fields, each enclosed in double quotes.

CSVWriter.java


import java.util.List;
public class CSVWriter {
  public static void writeLine(List<String> record) {
    String line = "";
    String comma = "";

    for ( String field : record ) {
      //Concatenate the strings in each field separated by commas.
      line = line + comma + "\"" + sanitizeString(field) + "\"";
      //For years I have built comma-separated records this way, with an empty separator for the first field only. I would like to know if there is a better way.
      comma = ",";
    }

    System.out.println(line);

  }

  private static String sanitizeString(String input) {
    //A rough-and-ready implementation, sorry.
    //Newline characters become spaces and double quotes are removed.
    return input.replace("\n", " ").replace("\r", " ").replace("\"", "");
  }
}
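
On the question in the comment above about a better way to place the commas: one possible alternative, assuming Java 8 or later is available (the article itself was tested on JDK 1.6, where this does not compile), is to let Collectors.joining insert the separator. The class JoinSketch and the method names toLine and sanitize are hypothetical names for this sketch; the sanitizing logic is copied from CSVWriter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical alternative to the "empty first comma" idiom; requires Java 8 or later.
class JoinSketch {
    // Quotes every field and joins the results with commas.
    static String toLine(List<String> record) {
        return record.stream()
                .map(field -> "\"" + sanitize(field) + "\"")
                .collect(Collectors.joining(","));
    }

    // Same behavior as the sanitizing method in CSVWriter:
    // newline characters become spaces and double quotes are removed.
    static String sanitize(String input) {
        return input.replace("\n", " ").replace("\r", " ").replace("\"", "");
    }

    public static void main(String[] args) {
        System.out.println(toLine(Arrays.asList("abc", "\"def\nghi\"", "jkl")));
    }
}
```

On older JDKs, a StringBuilder with the same "separator starts empty" trick is still a reasonable choice and avoids creating a new String on every iteration.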

Next, here is the completed CSVParser.jj.

CSVParser.jj


//The usual boilerplate options.
options {
//  DEBUG_PARSER=true;
  UNICODE_INPUT=true;
}

//The parser class definition (Java code) is written between PARSER_BEGIN and PARSER_END.
PARSER_BEGIN(CSVParser)

import java.util.List;
import java.util.ArrayList;
import java.util.HashMap;
import java.io.InputStream;
import java.io.FileInputStream;

public class CSVParser {
  public void parseCSV() {
    try {
      csvContents();
    } catch(Exception ex) {
      System.out.println("ParseError occurred: " + ex.toString());
    }
  }
}

PARSER_END(CSVParser)

//Definition of scanner from here(Comments are omitted.)
SKIP : {
    " "
  | "\t"
}

MORE : {
    "\"" : IN_DOUBLE_QUOTE
}

<IN_DOUBLE_QUOTE> MORE: {
    < ~["\"", "\\"] >
  | < "\\" ~[] >
  | < "\"" "\"" >
}

<IN_DOUBLE_QUOTE> TOKEN: {
  <DQSTR : "\""> : DEFAULT
}

TOKEN : {
    <STDFIELD : (~["\"", ",", "\r", "\n" ])+ >
  | <SEPERATOR : "," >
  | <EOL : (["\r", "\n"])+ >
}


//Definition of parser from here(Comments are omitted.)
void csvContents() : {
  List<String> csvRecord;
} {
  (
    csvRecord = record() {
      CSVWriter.writeLine(csvRecord);
    }
    <EOL>
  )+
  <EOF> 
}

List<String> record() : {
  List<String> fieldList = new ArrayList<String>();
  String fieldData;
} {
  fieldData = field() {
    fieldList.add(fieldData);
  }
  (
    <SEPERATOR>
    (fieldData = field(){
       fieldList.add(fieldData);
    })?
  )*
  {
    return fieldList;
  }
}

String field() : {
  String data = "";
  Token fieldToken;
} {
  (
    fieldToken = <DQSTR> {
      data = fieldToken.image;
    }
  | 
    fieldToken = <STDFIELD> {
      data = fieldToken.image;
    }
  ) {
    return data;
  }

}

Finally, we need a main entry point. It is really just thrown together, but it serves as a sample of how to call the parser.

CSVConv.java


import java.io.InputStream;
import java.io.FileInputStream;

public class CSVConv {
  public static void main(String[] args) {
    if ( args.length != 1 ) {
      return;
    }

    //I teach my juniors that passing arguments straight through like this is bad practice, and yet here I am...
    //Note: try-with-resources requires Java 7 or later.
    try(InputStream csvReader = new FileInputStream(args[0])) {
      //"I don't remember defining a constructor that takes an InputStream", you might say.
      //Don't worry, JavaCC generates it for you.
      CSVParser parser = new CSVParser(csvReader, "utf8");

      //Actually parse the file.
      //Each line is written to standard output as it is read, so this single call is all that is needed.
      parser.parseCSV();
      
    } catch(Exception ex) {
      System.out.println("Error occurred: " + ex.toString());
    }
  }
}

How to compile and run

For completeness, here is the compilation procedure.

shell


#Run JavaCC => it reads CSVParser.jj and generates CSVParser.java.
javacc CSVParser.jj
#Then compile everything in one go.
javac CSVConv.java

Now create a test CSV file. (It contains some nasty data with line breaks and commas inside fields.)

hoge.csv


abc,"def
ghi",jkl,"mno,pqr"
stu,vwx,yz

Let's run it!

shell


java CSVConv hoge.csv

#You should get output like this.
# "abc","def ghi","jkl","mno,pqr"
# "stu","vwx","yz"

This turned out to be long, but I wrote it up thinking it might serve as one way to deal with the "what do I do when I'm handed data like this as text?" problem.
