[PYTHON] Continuing to tackle Cyma's challenges with the OCR service of Google Cloud Platform

It's day 21 of the Ateam cyma Advent Calendar 2019! I'm @shimura_atsushi, a Cyma engineer at the EC Business Headquarters of Ateam Co., Ltd., and this is my second appearance in the calendar.

In my first post, "Tackling Cyma's challenges with the OCR service of Google Cloud Platform", I took a first step toward tackling Cyma's challenges. In this second post, I push further on the task of checking delivery notes.

Reviewing last time

In the last post I tried simple OCR using GCP services. However, many of the delivery notes Cyma currently handles have complicated layouts, and simply applying OCR gives low transcription accuracy. Even when the transcription succeeds, the resulting text data is not labeled, so it is hard to reuse.

This time...

Building on that reflection, this time I focus on "preparing images that are easy to OCR" and build preprocessing for the images that are fed to OCR. Since this continues from last time the title stays the same, but the main content here is the Python implementation, and Google Cloud Platform plays only a small part. Please bear with me.

On the recommendation of @NamedPython, a Cyma engineer, I am using Python this time because of its rich set of image processing libraries.

Prepare the environment

Development terminal used this time

--Development terminal MacBook Pro 15-inch

Install everything

I'll go through this part quickly.

Installation

--Install Python
  --pyenv
    --Lets you manage the installed versions of Python
  --python 3.8.0
    --The latest version at the time of writing
  --pip
    --Python's package management tool
    --It should come bundled when you install Python
--Install pdf2image
  --Used to convert PDF to PNG or JPEG
--Install poppler
  --Used by pdf2image for PDF conversion
--Install pillow
  --Used for image processing, mainly cropping
--Install `opencv`
  --Used for image processing, mainly binarization

brew install pyenv 
pyenv install --list #Check the installable version
pyenv install 3.8.0
pip3 install pdf2image
brew install poppler 
pip install pillow
pip install opencv-python
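
After installing, it can help to confirm that everything is actually visible from Python. The following is a minimal sketch, not part of the original setup: `missing_dependencies` is a hypothetical helper name. It relies on the facts that pillow is imported as `PIL`, OpenCV's Python bindings as `cv2`, and that poppler ships the `pdftoppm` command that pdf2image calls.

```python
import importlib.util
import shutil


def missing_dependencies(modules, commands):
    """Return the names of Python modules and shell commands that cannot be found."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for command in commands:
        # shutil.which returns None when the command is not on PATH
        if shutil.which(command) is None:
            missing.append(command)
    return missing


# An empty list means all four pieces were found
print(missing_dependencies(["pdf2image", "PIL", "cv2"], ["pdftoppm"]))
```

If the printed list is not empty, the listed names are the ones to reinstall.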

Rough flow

  1. Scan documents with a multifunction device
  2. Convert scan data (PDF) to image data
  3. Crop the converted image data
  4. Binarize the cropped data
  5. Apply to OCR

Scan with a multifunction device (PDF)

I use the multifunction device at the head office; when a document is scanned, the PDF is sent as an attachment to a registered e-mail address.

Convert from PDF to image data

Since the scanned data is in PDF format, it is converted to image data. The script below converts every PDF file stored in the specified directory. All it takes is to import pdf2image and pass the path of the file you want to convert to the convert_from_path function.

pdf2png.py


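from pathlib import Path

from pdf2image import convert_from_path

pdf_dir = Path('./img/pdf')
# Only pick up PDF files (os.listdir would also return files such as .DS_Store)
pdf_list = sorted(pdf_dir.glob('*.pdf'))
print(pdf_list)

for i, pdf_file_path in enumerate(pdf_list):
  # convert_from_path returns one PIL image per page
  images = convert_from_path(str(pdf_file_path))
  for image in images:
    # Every page is saved under the same name, so only the last page of a multi-page PDF remains
    image.save('./img/png/{}.png'.format(i), 'png')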
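
The script above writes every page of a PDF to the same file name, which is fine for single-page scans. If multi-page scans come in, one possible variation is to give each page its own file. This is a sketch under assumptions: the helper names `page_output_path` and `convert_pdf`, the `<pdf name>_p<page>.png` naming, and the choice of 300 dpi are all mine, not the original author's. `dpi` is an existing parameter of `convert_from_path`.

```python
from pathlib import Path


def page_output_path(out_dir, pdf_path, page_number):
    """Build a distinct PNG path for one page, e.g. img/png/note_p2.png."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / "{}_p{}.png".format(stem, page_number)


def convert_pdf(pdf_path, out_dir, dpi=300):
    """Convert one PDF and save each page as its own PNG; returns the saved paths."""
    # Imported here so the naming helper can be used without poppler installed
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    for page_number, image in enumerate(pages, start=1):
        target = page_output_path(out_dir, pdf_path, page_number)
        image.save(str(target), "PNG")
        saved.append(target)
    return saved
```

With this naming, the cropping step would need to open the new file names instead of `{i}.png`.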

Crop the converted data

This step is the heart of the whole OCR effort. Based on the reflection from last time, this is where I cut the necessary parts out of the complicated delivery note data and label them.

Prepare the configuration file in JSON

Since the delivery note format is basically the same for each supplier (with separate patterns for bicycles and for parts), I prepare one JSON configuration file per delivery note format, holding the coordinates needed for cropping.

The information needed from the delivery note is:

--Supplier name

The configuration file therefore holds the coordinates of the places where this information is printed.

JSON configuration file for each supplier

shiiresaki_setting.json


{
  "wholesaler_id": 2,
  "warehouse": {
    "x":10,
    "y":10,
    "height":50,
    "width":100
  },
  "date": {
    "x":20,
    "y":20,
    "height":50,
    "width":100
  },
  "product": {
    "x":30,
    "y":30,
    "height":150,
    "width":200
  },
  "figure": {
    "x":40,
    "y":40,
    "height":200,
    "width":250
  },
  "price": {
    "x":50,
    "y":50,
    "height":200,
    "width":250
  }
}
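
Each region in this file is stored as a top-left corner plus a size, while Pillow's `crop()` expects a `(left, upper, right, lower)` box. The conversion is simple arithmetic, sketched below with the "warehouse" values from the file above. `to_box` is a hypothetical helper name used only for illustration.

```python
def to_box(region):
    """Convert an {x, y, width, height} region into (left, upper, right, lower)."""
    left = region["x"]
    upper = region["y"]
    right = left + region["width"]
    lower = upper + region["height"]
    return (left, upper, right, lower)


warehouse = {"x": 10, "y": 10, "height": 50, "width": 100}
print(to_box(warehouse))  # (10, 10, 110, 60)
```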

Image cropping process using pillow

crop4image.py


from PIL import Image
import sys
from productsetting import ProductSetting

# Usage: python crop4image.py <wholesaler_id>
args = sys.argv
p = ProductSetting(args[1])
image = Image.open('img/png/{wholesaler_id}.png'.format(wholesaler_id=p.wholesaler_id))

# Pillow's crop() takes (left, upper, right, lower), so convert from x/y/width/height
rect = (
  p.warehouse['x'],
  p.warehouse['y'],
  p.warehouse['x'] + p.warehouse['width'],
  p.warehouse['y'] + p.warehouse['height']
)
print(rect)
cropped_image = image.crop(rect)
cropped_image.save('{wholesaler_id}.png'.format(wholesaler_id=p.wholesaler_id))
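
The script above crops a single region. Since the goal is labeled data, the same idea can be extended to crop every region in one pass and keep each result under its label. The following is a rough sketch, not the author's implementation: `crop_regions`, `save_crops`, and the `<out_dir>/<id>/<label>_<id>.png` naming are assumptions made for illustration.

```python
from pathlib import Path


def crop_regions(image, regions):
    """Return {label: cropped image} for every region in the mapping.

    `image` is expected to be a PIL.Image.Image; `regions` maps a label
    to a dict with x, y, width and height.
    """
    crops = {}
    for label, r in regions.items():
        box = (r["x"], r["y"], r["x"] + r["width"], r["y"] + r["height"])
        crops[label] = image.crop(box)
    return crops


def save_crops(crops, out_dir, wholesaler_id):
    """Save each crop as <out_dir>/<id>/<label>_<id>.png and return the paths."""
    target_dir = Path(out_dir) / str(wholesaler_id)
    target_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for label, cropped in crops.items():
        path = target_dir / "{}_{}.png".format(label, wholesaler_id)
        cropped.save(str(path))
        paths[label] = path
    return paths


# Example usage (paths are hypothetical):
#   from PIL import Image
#   image = Image.open("img/png/2.png")
#   crops = crop_regions(image, {"warehouse": p.warehouse, "date": p.date})
#   save_crops(crops, "result/png", 2)
```

Because the label is part of the file name, the text that OCR later returns for each file can be tied back to its field.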

Class to read the configuration file by JSON

productsetting.py


import json

class ProductSetting:
  CONFIG_SETTING_FILE_BASE_FORMAT = './settings/product/{wholesaler_id}.json'
  # Regions read from the configuration file; each becomes an attribute of the same name
  REGION_KEYS = ('warehouse', 'product', 'date', 'figure')
  RECT_KEYS = ('x', 'y', 'height', 'width')

  def __init__(self, wholesaler):
    config_file_path = self.CONFIG_SETTING_FILE_BASE_FORMAT.format(wholesaler_id=wholesaler)
    # Use a context manager so the file is closed once it has been loaded
    with open(config_file_path, 'r') as config_file:
      config = json.load(config_file)
    self.wholesaler_id = config['wholesaler_id']
    for key in self.REGION_KEYS:
      setattr(self, key, {k: config[key][k] for k in self.RECT_KEYS})

When this script is run on an image like this: スクリーンショット 2019-12-19 22.11.02.png

the region at the specified coordinates is cropped out like this: buyoption_1013.png

Binarize the cropped data

Next, to improve OCR accuracy on the cropped image, I binarize it so that the characters are easier to read.

The binarization program, written with `opencv`, is simply this:

deeply_character.py


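import cv2
# Load the cropped image as grayscale (the original flag 0 is cv2.IMREAD_GRAYSCALE)
img = cv2.imread('./result/png/1013/buyoption_1013.png', cv2.IMREAD_GRAYSCALE)
threshold = 100  # Pixels brighter than this become white (255); the rest become black (0)
ret, img_thresh = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
cv2.imwrite('./result/deeply/test/buyoption_1013.png', img_thresh)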
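
To make clear what `cv2.THRESH_BINARY` does to each pixel, here is the same rule written in plain Python on a tiny hand-made grid. It is only an illustration (the `binarize` name and the sample values are made up), not a replacement for the OpenCV call.

```python
def binarize(pixels, threshold, max_value=255):
    """Set every value strictly above the threshold to max_value, the rest to 0."""
    return [
        [max_value if value > threshold else 0 for value in row]
        for row in pixels
    ]


sample = [
    [30, 100, 101],
    [200, 99, 255],
]
print(binarize(sample, 100))  # [[0, 0, 255], [255, 0, 255]]
```

Note that a pixel exactly equal to the threshold (100 here) ends up black, because the comparison is strictly "greater than".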

This is the image I cropped: buyoption_1013.png

It was binarized like this: buyoption_1013.png

The sample is not a good one, so the benefit is hard to see.

Let me try an image that looks harder to read: sample.png

After adjusting the threshold and binarizing it: buyoption_1013.png

Look at that! The image is much clearer.
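
The threshold here was adjusted by hand. One common way to choose it automatically is Otsu's method, which picks the value that best separates dark and bright pixels. The sketch below is a plain-Python illustration of that idea and was not used in this post; `otsu_threshold` is a hypothetical name, and whether it works well on these delivery notes is untested. OpenCV offers the same approach through the `cv2.THRESH_OTSU` flag of `cv2.threshold`.

```python
def otsu_threshold(pixels):
    """Pick the threshold that maximizes between-class variance (Otsu's method).

    `pixels` is a flat iterable of integer grayscale values in 0..255.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        bright_count = total - dark_count
        if dark_count == 0 or bright_count == 0:
            continue  # one class is empty, so there is nothing to separate
        dark_mean = dark_sum / dark_count
        bright_mean = (weighted_total - dark_sum) / bright_count
        variance = dark_count * bright_count * (dark_mean - bright_mean) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


# Two clear groups of pixels: dark text around 50-60, bright paper around 200-210
print(otsu_threshold([50, 60, 200, 210]))  # 60
```

The returned value can be used as the `threshold` in the binarization step, so that pixels above it become white.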

Now let's feed this image to the GCP OCR I built last time. The result: スクリーンショット 2019-12-19 20.24.52.png

It was transcribed like this. On reflection, the "delivery date" label itself is also noise, so it could have been left out of the crop. Still, with this level of accuracy, the transcribed data looks reusable for the check.

Summary

This time, I looked at how to create favorable conditions for OCR through the following preprocessing:

--Cutting out only the necessary parts
--Sharpening the cropped image by binarizing it

When this effort to make paperwork more efficient came up within the division, it was easy to say "I'll give it a try!", but once I saw the actual delivery notes I worried whether it could be automated at all. Now that noise is removed by cropping and the text is sharpened by binarization before OCR is applied, I feel the accuracy can be improved and automation has become realistic.

Over two Advent Calendar posts, I have started tackling the automation of Cyma's delivery note checks with OCR-centered technology. Going forward, I would like to proceed with the system implementation and work toward putting it into operation together with the factories.

Finally

How was day 21 of the Ateam cyma Advent Calendar 2019? On day 22, Cyma's designer @ryo_cy will talk about CSS design using BEM, so stay tuned!

Ateam Co., Ltd. is looking for colleagues with a strong spirit of challenge to work with.

If you are an engineer and are interested, please see cyma's Qiita Jobs.

For other occupations, see Ateam Group Recruitment Site.
