[Apache POI] Judgment of unnecessary cells

Introduction

When loading Excel using Apache POI, I wanted to skip cells with strikethroughs and hidden cells, so I tried to find out how to do it. I will write it as my own memo, but please refer to it if you like.

Table of contents

  1. [Hidden Matrix](https://qiita.com/milestone/items/8fef7ec91bab31354fdb#1-%E9%9D%9E%E8%A1%A8%E7%A4%BA%E8%A1%8C%E5 % 88% 97)
  2. [Judgment of hidden lines](https://qiita.com/milestone/items/8fef7ec91bab31354fdb#1-%E9%9D%9E%E8%A1%A8%E7%A4%BA%E8%A1%8C % E3% 81% AE% E5% 88% A4% E5% AE% 9A)
  3. [Judgment of hidden columns](https://qiita.com/milestone/items/8fef7ec91bab31354fdb#2-%E9%9D%9E%E8%A1%A8%E7%A4%BA%E5%88%97 % E3% 81% AE% E5% 88% A4% E5% AE% 9A )
  4. Cancellation
  5. [There is a strikethrough throughout the cell](https://qiita.com/milestone/items/8fef7ec91bab31354fdb#1-%E3%82%BB%E3%83%AB%E5%85%A8%E4%BD % 93% E3% 81% AB% E5% 8F% 96% E6% B6% 88% E7% B7% 9A% E3% 81% 8C% E3% 81% 82% E3% 82% 8B)
  6. [There is a strikethrough in a part of the cell](https://qiita.com/milestone/items/8fef7ec91bab31354fdb#2-%E3%82%BB%E3%83%AB%E5%86%85% E3% 81% AE% E4% B8% 80% E9% 83% A8% E3% 81% AB% E5% 8F% 96% E6% B6% 88% E7% B7% 9A% E3% 81% 8C% E3% 81% 82% E3% 82% 8B)

1. Hidden matrix

Prevents reading of row / column data that is hidden (or width is 0).

1. Judgment of hidden lines

Determines if the row is hidden.

sample.java


for(Row row : sheet){
   //Determine if a row is hidden
   if(row.getZeroHeight()){
      //If hidden, read the next line
      continue;
   }
   ...
}

2. Judgment of hidden columns

Determines if the column is hidden. Apache POI reads books ➡ sheets ➡ rows ➡ columns, so In the case of columns, it is necessary to judge by cell. In addition, since there is no hidden judgment in the Cell class for columns, it is necessary to describe the judgment process in the Sheet class.

sample.java


for(Cell cell : row){
   //Determine if the column is hidden
   if(sheet.isColumnHidden(cell.getColumnIndex())){
      //If hidden, read the next cell
      continue;
   }
   ...
}

2. Cancellation line

Prevents reading cells with strikethroughs. Since the correspondence method differs depending on whether the cell has a strikethrough or a part of the cell (a part of the character) has a strikethrough, we will introduce them separately. image.png

1. There is a strikethrough throughout the cell

This is a method for determining whether or not there is a strikethrough in the entire cell.

sample.java


for(Cell cell : row){
   //Get cell font
   CellStyle style = c.getCellStyle();
   Font font = c.getSheet().getWorkbook().getFontAt(style.getFontIndex());
   //Determine if a cell has a strikethrough
   if(font.getStrikeout()){
      //If there is a strikethrough, read the next cell
      continue;
   }
   ...
}

2. There is a strikethrough in a part of the cell

If there is a strikethrough in a part of the cell, it is necessary to use the writing method properly depending on the type of Excel extension to be read.

When the extension is xlsx

Gets the string with the strikethrough. Example) Ai ~~ Ue ~~ O → Ue

sample.java


//Get cell value in XSSFRichTextString format
XSSFRichTextString richStr = (XSSFRichTextString) cell.getRichStringCellValue();
//Get the number of font chunks that hit the string in the cell
int cnt = richStr.numFormattingRuns();
//0 when the font is not correct
if (cnt == 0 ) {
   continue;
}
for (int i = 0; i < cnt; i++;) {
   //Get the i-th font chunk index
   XSSFFont xssfFont = richStr.getFontOfFormattingRun(i);
   //If the font does not hit the chunk
   if (xssfFont == null) {
       continue;
   }
   //If the font hits the block, determine if there is a strikethrough
   if (xssfFont.getStrikeout()) {
   //If there is a strikethrough, get the string of the i-th index chunk
       System.out.println(richStr.getCTRst().getRArray(i).getT());
   }
}

As mentioned above, in the case of xlsx, it can be obtained relatively simply. I myself had a little trouble understanding the meaning of the Formatting Run system processing, but as far as I can see the processing, if it is "Ai ~~ Ue ~~ O", try dividing it into "Ai", "Ue", and "O" chunks. It seemed to be there. (However, "Ai" was obtained without font, and "Ue" and "O" were obtained by character size, character font (Meiryo, etc.), presence / absence of strikethrough, etc.)

When the extension is xls (Excel 97-2003 ver.)

HSSFFont cannot get the font itself like XSSF, probably because the type is old version, and it seems to get only the index with strikethrough.

sample.java


String cellStr = cell.getStringCellValue();
HSSFRichTextString richStr = (HSSFRichTextString)cell.getRichStringCellValue();
//Get the number of chunks the font hits
int cnt = richStr.numFormattingRuns();
if (cnt == 0){
   continue;
} else {
   int startStrikeoutIndex = -1;
   boolean isStrikeoutStartIndex = false;
   for (int i = 0; i < cnt; i++;) {
      //Get the i-th font chunk index
      Font hssfFont = wb.getFontAt(richStr.getFontOfFormattingRun(i));
      int index = richStr.getIndexOfFormattingRun(i);
      //If the font does not hit the chunk
      if (hssfFont == null) {
          if (isStrikeoutStartIndex) {
             System.out.println(cellStr.substring(startStrikeoutIndex, index));
             isStrikeoutStartIndex = false;
          }
          continue;
      }
      //If you have font settings but no strikethrough
      if (!hssfFont.getStrikeout()) {
          if (isStrikeoutStartIndex) {
             System.out.println(cellStr.substring(startStrikeoutIndex, index));
             isStrikeoutStartIndex = false; 
          }
      } 
      //If there is a strikethrough in part
      else if (hssfFont.getStrikeout()) {
          startStrikeoutIndex = index;
          isStrikeoutStartIndex = true;
          if (i + 1 == cnt) {
             //In the case of the last chunk, it is output here.
             System.out.println(cellStr.substring(startStrikeoutIndex, cellStr.length()));
          }
      }
   }
}

In my research, I implemented it using usable processing, so it became quite a skill. To briefly explain the mechanism, in the case of "Ah ~~ i ~~ u ~~ e ~~ o", the Index of the first character of the part with the strikethrough is acquired as startStrikeoutIndex (in the case of "I" 1), isStrikeoutStartIndex. Is set to true. After that, if isStrikeoutStartIndex is true, the character between startStrikeoutIndex and the Index of the first character of the next block (2 because it is the "U" part) is the character that is struck through, so get that part. I am.

in conclusion

It was a very difficult part, so I summarized it so that I can look back when needed. I hope it helps those who want to do the same.

reference

Recommended Posts

[Apache POI] Judgment of unnecessary cells
Judgment of the calendar
[Apache POI] Corresponding Excel version
Apache POI Addictive Point List
Inclusion judgment of elements of Collection
Excel operation using Apache POI
Judgment of JSONArray and JSONObject
apache POI personal notes crossfish21
How to use Apache POI
Judgment of fractions in Ruby
Manipulate Excel with Apache POI
Apache POI Excel in Kotlin
Excel output using Apache POI!