Convert large XLSX files to CSV with Apache POI

Overview

I needed to convert a large XLSX file to CSV, so I used Apache POI to handle it. I will leave a note of what I was careful about. The sample project is [here]

environment

Java:1.8 POI:3.17

important point

There is no support for cell formats or formats. It does not support xls format (2003 format) (officially there was a sample source that supports xls)

Correspondence policy

Use the API that says Streaming (ala SAX) in the feature overview of the Official Site (https://poi.apache.org/spreadsheet/). It doesn't seem to consume much memory because it is read by SAX (Simple API for XML).

Spreadsheet API Feature Summary

Source

You can convert XLSX to CSV by copying and pasting the following sources and executing the convert method.

XLSX to CSV conversion sample


	/**
	 *Convert XLSX file to CSV file<br>
	 *Even if there are multiple sheets in the XLSX file, only the first sheet is targeted.
	 *
	 * @param fromXlsxPath Path of XLSX file to be converted
	 * @param toCsvPath Path of CSV file to be converted (overwrite if existing path is specified)
	 */
	public static void convert(Path fromXlsxPath, Path toCsvPath) {
		System.out.println("Start XlsxToCsvUtil#convert");

		try (OPCPackage pkg = OPCPackage.open(fromXlsxPath.toAbsolutePath().toString(), PackageAccess.READ);
				BufferedWriter bw = Files.newBufferedWriter(toCsvPath, StandardCharsets.UTF_8)) {
			ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
			XSSFReader xssfReader = new XSSFReader(pkg);
			StylesTable styleTable = xssfReader.getStylesTable();

			try (InputStream is = xssfReader.getSheetsData().next()) {
				//Handler generation for parsing
				ContentHandler handler = new XSSFSheetXMLHandler(
						styleTable, null, strings, new XlsxRowHandler(bw), new DataFormatter(), false);

				//XML reader generation for parsing
				XMLReader sheetParser = SAXHelper.newXMLReader();
				sheetParser.setContentHandler(handler);
				System.out.println("Start perspective");
				sheetParser.parse(new InputSource(is));
				System.out.println("End of perspective");
			}

		} catch (InvalidOperationException | IOException | SAXException | OpenXML4JException
				| ParserConfigurationException e) {
			System.out.println("Error XlsxToCsvUtil#convert");
		}

		System.out.println("Finished XlsxToCsvUtil#convert");
	}

	/**
	 *
	 */
	public static class XlsxRowHandler implements SheetContentsHandler {

		private final List<String> row = new ArrayList<>();

		private final BufferedWriter bw;

		public XlsxRowHandler(BufferedWriter bw) throws IOException {
			this.bw = bw;
		}

		@Override
		public void startRow(int rowNum) {
			System.out.println(rowNum + "Start reading line");
			row.clear();
		}

		@Override
		public void cell(String cellReference, String formattedValue, XSSFComment comment) {
			row.add(formattedValue);
		}

		@Override
		public void endRow(int rowNum) {
			try {
				bw.write(String.join(",", row.stream().map(c -> "\"" + c + "\"").collect(Collectors.toList())));
				bw.newLine();
			} catch (IOException e) {
			}
		}

		@Override
		public void headerFooter(String text, boolean isHeader, String tagName) {
		}
	}

The following classes are prepared by POI and the samples are almost the same. I don't understand the details.

The important thing is "XlsxRowHandler". This is a class that implements the interface prepared by POI. This class reads the contents of the cell and writes it out with BufferedWriter to output CSV. As you can see by looking at the method name, the method is executed at the following timing.

Implementation notes

The SheetContentsHandler # cell method is not called on blank cells. When dealing with XLSX files with blank cells, it is necessary to implement accordingly.

Reference site

Official site On the here page, "[XLSX2CSV](https://svn.apache.org/repos/asf/poi/trunk/" src / examples / src / org / apache / poi / xssf / eventusermodel / XLSX2CSV.java) "is the official sample.

Created sample project

Recommended Posts

Convert large XLSX files to CSV with Apache POI
Download large files with Apache JMeter
I want to avoid OutOfMemory when outputting large files with POI
[Java] Handle Excel files with Apache POI
Converting TSV files to CSV files (with BOM) in Ruby
CSV output with Apache Commons CSV
How to use Apache POI
Manipulate Excel with Apache POI
Avoid Zip bomb errors when reading large files with POI
Convert 2D array to csv format with Java 8 Stream API
Convert Markdown to HTML with flexmark-java
Handle CSV files uploaded to GCS
Edit SVG with Java + Apache Batik to convert to PNG or JPEG
Notes for reading and generating xlsx files from Java using Apache POI
How to share files with Docker Toolbox
Convert C language to JavaScript with Emscripten
Run Embulk on Docker to convert files
Convert SVG files to PNG files in Java
Convert JSON to TSV and TSV to JSON with Ruby
Convert a string to a character-by-character array with swift
How to get resource files out with spring-boot
Burn files to CD / DVD with Ubuntu 18.04 LTS