Day 21 of the Ateam cyma Advent Calendar 2019! This is @shimura_atsushi, a Cyma engineer at the EC Business Headquarters of Ateam Co., Ltd., back for my second post.

In my first post, "Challenge Cyma's challenges using the OCR service of Google Cloud Platform", I took a first step toward tackling one of Cyma's problems. This time, in the second installment, I go further toward automating the checking of delivery notes.

Review of last time

In the last post I tried simple OCR using GCP services. However, many of the delivery notes currently used by Cyma have complicated layouts, and transcription accuracy is low when OCR is simply applied to the whole page. Even when transcription succeeded, the resulting text was not labeled, so it was hard to reuse.

This time

Based on that reflection, this time I focus on "preparing an image that is easy to OCR" and build preprocessing for the images passed to OCR. Since the content continues from last time, the title stays the same, but this time the Python implementation is the main topic and Google Cloud Platform plays only a small part. Please bear with me.

On the recommendation of @NamedPython, a Cyma engineer, I use Python, which has a rich set of image processing libraries.

Prepare the environment

Development machine used this time

- Development machine: MacBook Pro 15-inch
- OS: macOS Mojave

Install everything

I'll go through this part quickly.

Installation

- Install Python
  - pyenv: lets you manage the installed versions of Python
  - python 3.8.0: the latest version at the time of writing
  - pip: Python's package management tool; it should come with Python when you install it
- Install pdf2image: used to convert PDF to PNG or JPEG
- Install poppler: used by pdf2image for PDF conversion
- Install pillow: used for image processing, mainly cropping
- Install `opencv`: used for image processing, mainly binarization

brew install pyenv 
pyenv install --list #Check the installable version
pyenv install 3.8.0
pip3 install pdf2image
brew install poppler 
pip install pillow
pip install opencv-python #The OpenCV package on PyPI is named opencv-python
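
Before moving on, it can help to confirm that everything is actually importable. The following is a minimal sketch that is not part of the original setup; the helper name `check_environment` is hypothetical. It assumes pdf2image is imported as `pdf2image`, Pillow as `PIL`, and OpenCV as `cv2`, and that poppler puts the `pdftoppm` command on the PATH.

```python
import importlib.util
import shutil

# Import names of the Python packages used in this article
REQUIRED_MODULES = ("pdf2image", "PIL", "cv2")
# Command-line tool installed by poppler and called by pdf2image
REQUIRED_COMMANDS = ("pdftoppm",)


def check_environment():
    """Return a dict mapping each requirement to True if it was found."""
    status = {}
    for name in REQUIRED_MODULES:
        # find_spec returns None when a top-level module cannot be located
        status[name] = importlib.util.find_spec(name) is not None
    for command in REQUIRED_COMMANDS:
        # shutil.which returns None when the command is not on the PATH
        status[command] = shutil.which(command) is not None
    return status


if __name__ == "__main__":
    for requirement, found in check_environment().items():
        print("{}: {}".format(requirement, "OK" if found else "MISSING"))
```

Anything reported as MISSING points back to the corresponding install command above.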

Rough flow

1. Scan documents with a multifunction device
2. Convert the scan data (PDF) to image data
3. Crop the converted image data
4. Binarize the cropped data
5. Apply OCR

Scan with a multifunction device (PDF)

I use the multifunction device at the head office. When a document is scanned, the PDF is sent as an attachment to the registered e-mail address.

In the final operation, we plan to scan at each of Cyma's factories.

Convert from PDF to image data

Since the scanned data is in PDF format, it needs to be converted to image data. The script below converts every PDF stored in the specified directory. All you have to do is import pdf2image and pass the path of the file you want to convert to `convert_from_path`, and it does the conversion for you.

`pdf2png.py`


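from pathlib import Path

from pdf2image import convert_from_path

pdf_dir = Path('./img/pdf')
png_dir = Path('./img/png')
png_dir.mkdir(parents=True, exist_ok=True)

# Pick up PDF files only; os.listdir would also return files such as .DS_Store
pdf_list = sorted(pdf_dir.glob('*.pdf'))
print(pdf_list)

for i, pdf_file_path in enumerate(pdf_list):
  # convert_from_path returns a list with one PIL image per page
  images = convert_from_path(str(pdf_file_path))
  for image in images:
    # Each scan is assumed to be a single page; with several pages,
    # later pages overwrite earlier ones because the name uses only i
    image.save(str(png_dir / '{}.png'.format(i)), 'PNG')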
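
The script above names each PNG by the PDF's position in the list, so a multi-page PDF would write every page to the same file. As a hedged sketch of one way around that, the hypothetical helper `output_png_path` below builds a name from the PDF's stem and a page number. The naming scheme and the `./img/png` default are assumptions for illustration, not the layout the rest of this article relies on.

```python
from pathlib import Path


def output_png_path(pdf_path, page_number, out_dir="./img/png"):
    """Build an output path such as img/png/<pdf stem>_p001.png.

    page_number is 1-based. The naming scheme is an assumption for
    illustration only.
    """
    stem = Path(pdf_path).stem
    return Path(out_dir) / "{}_p{:03d}.png".format(stem, page_number)


def convert_pdf(pdf_path, out_dir="./img/png"):
    """Convert every page of one PDF to its own PNG and return the paths."""
    # Imported here so output_png_path can be used without pdf2image installed
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    # convert_from_path returns a list with one PIL image per page
    pages = convert_from_path(str(pdf_path))
    for page_number, image in enumerate(pages, start=1):
        target = output_png_path(pdf_path, page_number, out_dir)
        image.save(str(target), "PNG")
        saved.append(target)
    return saved
```

If this naming were adopted, the later cropping step would need to open files by the same names.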

Crop the converted data

This step is the heart of the OCR work. Based on the reflection from last time, this is where I cut the necessary parts out of the complicated delivery note and label them.

Prepare the configuration file in JSON

Since the delivery note format is basically the same for each supplier (with different patterns for bicycles and for parts), I prepare one JSON configuration file per delivery note format, holding the coordinates needed for cropping.

The information needed from a delivery note is:

- Supplier name
- Delivery date
- Item number
- Quantity
- Unit price

The configuration file therefore holds the coordinates of where each of these appears.

JSON configuration file for each supplier

`shiiresaki_setting.json`


{
  "wholesaler_id": 2,
  "warehouse": {
    "x":10,
    "y":10,
    "height":50,
    "width":100
  },
  "date": {
    "x":20,
    "y":20,
    "height":50,
    "width":100
  },
  "product": {
    "x":30,
    "y":30,
    "height":150,
    "width":200
  },
  "figure": {
    "x":40,
    "y":40,
    "height":200,
    "width":250
  },
  "price": {
    "x":50,
    "y":50,
    "height":200,
    "width":250
  }
}

The numbers are placeholders.

Image cropping process using pillow

`crop4image.py`


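import sys

from PIL import Image

# productsetting.py (shown below) defines the ProductSetting class directly
from productsetting import ProductSetting

# The wholesaler ID is passed as the first command-line argument
args = sys.argv
p = ProductSetting(args[1])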
image = Image.open('img/png/{wholesaler_id}.png'.format(wholesaler_id=p.wholesaler_id))

rect = (
  p.warehouse['x'],
  p.warehouse['y'],
  p.warehouse['x'] + p.warehouse['width'], 
  p.warehouse['y'] + p.warehouse['height']
)
print(rect)
cropped_image = image.crop(rect)
cropped_image.save('{wholesaler_id}.png'.format(wholesaler_id=p.wholesaler_id))
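
The script above crops only the `warehouse` region. Pillow's `Image.crop` takes a box of `(left, upper, right, lower)`, while the configuration stores `x`, `y`, `width`, and `height`, so the same conversion is needed for every region. The sketch below shows one way to compute all boxes at once, keyed by label. The function names `region_to_box` and `load_boxes` are hypothetical, and treating every dictionary-valued key as a region is an assumption.

```python
import json


def region_to_box(region):
    """Convert {'x', 'y', 'width', 'height'} to (left, upper, right, lower)."""
    left = region["x"]
    upper = region["y"]
    return (left, upper, left + region["width"], upper + region["height"])


def load_boxes(config_text):
    """Parse a supplier configuration and return {label: crop box}.

    Every top-level value that is a dict is treated as a region;
    scalar entries such as wholesaler_id are skipped.
    """
    config = json.loads(config_text)
    return {
        label: region_to_box(region)
        for label, region in config.items()
        if isinstance(region, dict)
    }


sample = """
{
  "wholesaler_id": 2,
  "warehouse": {"x": 10, "y": 10, "height": 50, "width": 100},
  "date": {"x": 20, "y": 20, "height": 50, "width": 100}
}
"""
boxes = load_boxes(sample)
print(boxes["warehouse"])  # (10, 10, 110, 60)
print(boxes["date"])       # (20, 20, 120, 70)
```

Each box could then be passed to `image.crop(box)` and saved under its label, which keeps the cropped images labeled for the later OCR step.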

Class that reads the JSON configuration file

`productsetting.py`


import json

class ProductSetting:
  CONFIG_SETTING_FILE_BASE_FORMAT = './settings/product/{wholesaler_id}.json'

  def __init__(self, wholesaler):
    config_file_path = self.CONFIG_SETTING_FILE_BASE_FORMAT.format(wholesaler_id=wholesaler)
    # Use a context manager so the file handle is closed after loading
    with open(config_file_path, 'r') as config_file:
      config = json.load(config_file)
    self.wholesaler_id = config['wholesaler_id']
    self.warehouse = {
      'x': config['warehouse']['x'],
      'y': config['warehouse']['y'],
      'height': config['warehouse']['height'],
      'width': config['warehouse']['width']
    }    
    self.product = {
      'x': config['product']['x'],
      'y': config['product']['y'],
      'height': config['product']['height'],
      'width': config['product']['width']
    }    
    self.date = {
      'x': config['date']['x'],
      'y': config['date']['y'],
      'height': config['date']['height'],
      'width': config['date']['width']
    }    
    self.figure = {
      'x': config['figure']['x'],
      'y': config['figure']['y'],
      'height': config['figure']['height'],
      'width': config['figure']['width']
    }

Running this script on a scanned image like this one (Screenshot 2019-12-19 22.11.02.png) produces the cropped image.

In this way, I was able to crop at the coordinates specified in the configuration.

Binarize the cropped data

Next, to improve OCR accuracy on the cropped image, I binarize it so that the characters stand out more clearly.

Written with `opencv`, the binarization program is as simple as this.

`deeply_character.py`


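import cv2

# Load the cropped image as grayscale (the original flag 0 means cv2.IMREAD_GRAYSCALE)
img = cv2.imread('./result/png/1013/buyoption_1013.png', cv2.IMREAD_GRAYSCALE)
threshold = 100  # Pixels brighter than this become white (255), the rest black (0)
ret, img_thresh = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
cv2.imwrite('./result/deeply/test/buyoption_1013.png', img_thresh)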

Binarizing the image I cropped earlier did not show much benefit, because the sample is not a good one.

So I tried an image that looks harder to read.

After adjusting the threshold and binarizing it ...

Wow! The image is noticeably clearer.
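
Adjusting the threshold by hand for each image can be tedious. One common alternative, which the script in this article does not use, is Otsu's method: it picks the threshold that maximizes the between-class variance of the grayscale histogram. The pure-Python sketch below is for illustration only, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 0-255 gray values.

    Pixels with value <= threshold form the dark class.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    dark_count = 0
    dark_sum = 0
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        bright_count = total - dark_count
        if dark_count == 0 or bright_count == 0:
            continue  # one class is empty, so there is no split at this level
        dark_mean = dark_sum / dark_count
        bright_mean = (total_sum - dark_sum) / bright_count
        # Between-class variance, up to a constant factor of 1 / total**2
        variance = dark_count * bright_count * (dark_mean - bright_mean) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels, threshold, max_value=255):
    """Same rule as cv2.THRESH_BINARY: value > threshold -> max_value, else 0."""
    return [max_value if value > threshold else 0 for value in pixels]


# Tiny example: dark text (around 30) on a light background (around 200)
gray = [30, 32, 28, 200, 205, 198, 31, 202]
t = otsu_threshold(gray)
print(t)                  # 32
print(binarize(gray, t))  # [0, 0, 0, 255, 255, 255, 0, 255]
```

In OpenCV the same idea is available without hand-written code: passing `cv2.THRESH_BINARY + cv2.THRESH_OTSU` as the type to `cv2.threshold` makes it ignore the given threshold and return the one it chose. Whether that beats a hand-tuned value on these delivery notes would need to be checked on real scans.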

Let's feed this image to the GCP OCR created last time. The result: (Screenshot 2019-12-19 20.24.52.png)

It was transcribed as shown. Thinking about it carefully, the "delivery date" label itself is noise, so it could have been left out of the crop. Still, with this accuracy, the text looks reusable for the checking work.
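
When a printed label such as "delivery date" ends up inside the cropped region, one option besides re-cropping is to strip known labels from the OCR text afterwards. The following is a minimal sketch under that assumption. The function `strip_label`, the label strings, and the sample text are all hypothetical and are not taken from the actual OCR output.

```python
import re


def strip_label(ocr_text, labels):
    """Remove a leading printed label (and trailing separators) from OCR text."""
    text = ocr_text.strip()
    for label in labels:
        # Match the label at the start, ignoring case, followed by
        # optional whitespace or colon separators
        pattern = r"^" + re.escape(label) + r"[\s:：]*"
        stripped = re.sub(pattern, "", text, flags=re.IGNORECASE)
        if stripped != text:
            return stripped
    return text


# Hypothetical OCR results keyed by the region label used when cropping
ocr_results = {"date": "Delivery date: 2019/12/19"}
cleaned = {
    label: strip_label(text, ["delivery date"])
    for label, text in ocr_results.items()
}
print(cleaned)  # {'date': '2019/12/19'}
```

Keeping the results keyed by region label is what makes the transcribed text reusable, which was the problem identified in the review of last time.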

Summary

This time, as preprocessing for OCR, I tried the following to create conditions that favor OCR:

- Cropping out only the necessary parts
- Sharpening the cropped image by binarizing it

I had said "I'll try it!" within the division about this effort to make paperwork more efficient, but when I saw the actual delivery notes I worried whether it could be automated at all. As a result, I now feel that accuracy can be improved, and that automation has become realistic, by applying OCR after removing noise through cropping and sharpening the text through binarization.

Over two Advent Calendar posts, I have started tackling the automation of delivery note checking at Cyma with OCR-centered technology. Going forward, I would like to work toward operations that involve the factories while proceeding with the system implementation.

Finally

How was day 21 of the Ateam cyma Advent Calendar 2019? On day 22, Cyma's designer @ryo_cy will talk about CSS design using BEM, so stay tuned!

Ateam Co., Ltd. is looking for colleagues with a strong spirit of challenge to work with.

If you are an engineer and are interested, please see cyma's Qiita Jobs.

For other occupations, see Ateam Group Recruitment Site.

[PYTHON] Continue to challenge Cyma's challenges using the OCR service of Google Cloud Platform