Introduction

Have you tried OCR (Optical Character Recognition)? This technology for reading text from images is showing up in more and more places, and GCP has made it easy enough for ordinary users to try.

I wanted to read the text in a PDF using GCP's Cloud Vision API, but I found the official documentation a little hard to follow, so I'm summarizing the steps here as a memo for myself.

Detect text in file (PDF / TIFF)

I felt that the document above leaves out several important points, which made it a little hard for me to follow as a beginner.

Environment

macOS Mojave
Python 3.7

Cost

I haven't seen the bill for April yet, but I expect it to be low. I'm also using the free credits given to new accounts, so I'll update this section once I know the actual cost.

Enable Cloud Vision API

Enable the Cloud Vision API.

Screen Shot 2020-04-25 at 18.14.17.png

Select Library under APIs & Services, then search for the Cloud Vision API and enable it.

Screen Shot 2020-04-25 at 18.13.40.png

Create a JSON key file

Screen Shot 2020-04-25 at 19.18.08.png

Select Service Accounts under IAM & Admin and create a new service account.

You can then create a JSON key file from the Create service account screen below.

Screen Shot 2020-04-25 at 19.18.18.png

This generates a key file containing the service account's private key and related credentials. You will move this key file into your working directory later.
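Before moving on, it can help to confirm that the downloaded file really is a service account key. The following is a minimal sketch using only the standard library. `looks_like_service_account_key` is a hypothetical helper name, and the list of required fields is an assumption based on the usual layout of service account key files.

```python
import json

# Fields normally present in a service account key file (assumed layout).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def looks_like_service_account_key(path):
    """Return True if the JSON file at `path` has the expected key fields."""
    with open(path, encoding="utf-8") as f:
        key = json.load(f)
    # Every required field must be present and non-empty.
    has_fields = all(key.get(name) for name in REQUIRED_FIELDS)
    # A service account key declares its type explicitly.
    return has_fields and key.get("type") == "service_account"
```

Calling it with the path of your own key file should return `True`. If it returns `False`, the download was probably a different kind of credential file.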

Cloud Storage preparation

Screen Shot 2020-04-25 at 18.14.33.png

Select Browser under Storage. This takes you to the Storage Browser, where you click Create Bucket.

Screen Shot 2020-04-25 at 18.17.32.png

Create a new bucket and upload the PDF file you want to OCR to it. This time my bucket is named `environment-engineering-pdf-bucket-1`, and I uploaded `scan-001.pdf`.

Screen Shot 2020-04-25 at 19.24.23.png

Create a second bucket to store the text read from the PDF file. I named it `ocr-result-bucket-qiita`.
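The upload can also be done from Python instead of the console. Below is a rough sketch: `upload_pdf` is a hypothetical helper, and it assumes you pass in a `google.cloud.storage.Client` created with your own credentials and that the bucket already exists. It relies on the client's `bucket()` method and on `blob()` and `upload_from_filename()`, and returns the `gs://` URI that the Vision API needs later.

```python
def upload_pdf(client, bucket_name, local_path, blob_name):
    """Upload a local PDF to an existing bucket and return its gs:// URI.

    `client` is expected to be a google.cloud.storage.Client.
    """
    bucket = client.bucket(bucket_name)    # reference to the bucket
    blob = bucket.blob(blob_name)          # reference to the object to create
    blob.upload_from_filename(local_path)  # performs the actual upload
    return f"gs://{bucket_name}/{blob_name}"
```

With the names used in this article, the call would look like `upload_pdf(storage.Client(), "environment-engineering-pdf-bucket-1", "scan-001.pdf", "scan-001.pdf")` after `from google.cloud import storage`, once the `GOOGLE_APPLICATION_CREDENTIALS` environment variable points at your key file.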

Install the required modules

The following three packages are required, so let's install them. You can also do this inside a virtualenv.

pip install google-cloud-vision
pip install google-cloud-storage
pip install protobuf

https://pypi.org/project/google-cloud-storage/
https://pypi.org/project/google-cloud-vision/
https://pypi.org/project/protobuf/
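One thing to watch out for: the code in this article uses `vision.types` and `vision.enums`, which exist only in google-cloud-vision releases before 2.0. As a small sketch, the hypothetical helper below checks whether a given version string is such a release. On Python 3.8 or later you could pass it the result of `importlib.metadata.version("google-cloud-vision")`.

```python
def supports_vision_types(version):
    """Return True if this google-cloud-vision version still has vision.types.

    vision.types and vision.enums were removed in version 2.0.0.
    """
    # Only the major version number matters for this check.
    major = int(version.split(".")[0])
    return major < 2
```

If this returns `False` for your installed version, the script in this article will fail with an `AttributeError` on `vision.types` unless you install an older release or adapt the calls.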

The actual Python code

import os
from google.cloud import vision
from google.cloud import storage
from google.protobuf import json_format

# Change these URIs to point to your own buckets and PDF file
gcs_source_uri = "gs://environment-engineering-pdf-bucket-1/scan-001.pdf"
gcs_destination_uri = "gs://ocr-result-bucket-qiita"

# Change this to the name of your own result bucket
bucket_name = "ocr-result-bucket-qiita"

# Change this to the name of your own key file
# Don't forget to put the JSON key file in the same directory as this script!
credential_path = 'engaged-symbol-274611-192d61800d05.json'

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

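mime_type = 'application/pdf'
# Number of pages grouped into each output JSON file
batch_size = 2
client = vision.ImageAnnotatorClient()

# Note: vision.types and vision.enums exist only in google-cloud-vision
# releases before 2.0.
feature = vision.types.Feature(
    type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)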

gcs_source = vision.types.GcsSource(uri=gcs_source_uri)

input_config = vision.types.InputConfig(
    gcs_source=gcs_source, mime_type=mime_type)

gcs_destination = vision.types.GcsDestination(uri=f"{gcs_destination_uri}/")
output_config = vision.types.OutputConfig(
    gcs_destination=gcs_destination, batch_size=batch_size)

async_request = vision.types.AsyncAnnotateFileRequest(
    features=[feature], input_config=input_config,
    output_config=output_config)

operation = client.async_batch_annotate_files(
    requests=[async_request])

print('Waiting for the operation to finish.')
operation.result(timeout=180)


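# Read the OCR results back from the destination bucket
storage_client = storage.Client()

bucket = storage_client.get_bucket(bucket_name)

# List the output files that the Vision API wrote to the bucket
blob_list = list(bucket.list_blobs())

# Process the first output file. Since batch_size = 2, it contains
# the responses for the first two pages of the PDF.
output = blob_list[0]

json_string = output.download_as_string()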
response = json_format.Parse(
    json_string, vision.types.AnnotateFileResponse())

# The actual response for the first page of the input file.
first_page_response = response.responses[0]
annotation = first_page_response.full_text_annotation

print(u'Full text:\n{}'.format(
    annotation.text))
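The code above prints only the first page of the first output file. As a rough sketch of how all pages could be collected, the helpers below work on the raw JSON text of the output files using only the standard library. `first_page_of` and `extract_page_texts` are hypothetical names. The file naming pattern `output-1-to-2.json` and the camelCase keys (`responses`, `fullTextAnnotation`, `context`, `pageNumber`) are assumptions about the output format, so check them against your own result files.

```python
import json
import re


def first_page_of(blob_name):
    """Return the first page number in a name like 'output-3-to-4.json'."""
    match = re.search(r"output-(\d+)-to-(\d+)\.json$", blob_name)
    # Names that do not match the assumed pattern sort last.
    return int(match.group(1)) if match else float("inf")


def extract_page_texts(json_string):
    """Return a list of (page_number, text) pairs from one output file."""
    data = json.loads(json_string)
    pages = []
    for response in data.get("responses", []):
        page_number = response.get("context", {}).get("pageNumber")
        # Pages without recognized text have no fullTextAnnotation.
        text = response.get("fullTextAnnotation", {}).get("text", "")
        pages.append((page_number, text))
    return pages
```

Sorting the blobs in the result bucket with `key=lambda b: first_page_of(b.name)` keeps the files in page order even when there are more than nine pages, and `extract_page_texts(blob.download_as_string())` then yields the text of every page in that file.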

Example results

Successful example

I ran OCR on the following scanned PDF page.

Screen Shot 2020-04-25 at 19.43.40.png

The title was then printed to the terminal as follows.

Gentosha Bunko
Chinese food in Kyoto
Naomi Kang

Example of failure

However, it fails on pages with cursive characters like the following. Gyoza is misrecognized as "Kamako", and the garlic entry appears to lose its surrounding brackets.

Screen Shot 2020-04-25 at 19.45.09.png

output:

table of contents
《Kamako》
"dance"
Kashinnosu
Garlic
Bag child 4
Chapter fish(Marutamachi Nanahommatsu)|
34
Three-sided fish wing
Buan(Shimogamo)
Of sesame skin
Water 篮子
Numbers(Jodo-ji Temple)
04
Like a parent-child valve
Phoenix egg
Fuyoen(Kawaramachi Kajo)
Person
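To spot pages like this one automatically, one option is to look at the confidence scores that the API returns alongside the text. The sketch below is not taken from the official sample: `low_confidence_words` is a hypothetical helper that walks the `fullTextAnnotation` of one page (as a dict loaded from the output JSON). It assumes the `pages` / `blocks` / `paragraphs` / `words` / `symbols` nesting with a `confidence` value on each word, and the 0.8 threshold is an arbitrary choice.

```python
def low_confidence_words(annotation, threshold=0.8):
    """Return (word, confidence) pairs whose confidence is below threshold."""
    flagged = []
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A missing confidence value is treated as 0.0.
                    confidence = word.get("confidence", 0.0)
                    if confidence < threshold:
                        # A word is stored as a list of single-character symbols.
                        symbols = word.get("symbols", [])
                        text = "".join(s.get("text", "") for s in symbols)
                        flagged.append((text, confidence))
    return flagged
```

A page that produces a long list here is a good candidate for manual proofreading, although a high score does not guarantee that the text is correct.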

Conclusion

If you can convert this output to a format such as `epub`, you can then convert it to mobi and read it on a Kindle! I don't know how much it will cost, though...

[PYTHON] Flow of extracting text in PDF with Cloud Vision API