[PYTHON] Extracting text from a PDF with the Cloud Vision API

Introduction

Do you use OCR (Optical Character Recognition)? It is the technology for reading text out of images, and it is showing up in more and more places. GCP has also made OCR easy enough that ordinary people can use it.

I tried reading the text in a PDF with GCP's Cloud Vision API, but I found the official documentation a little hard to follow, so I am summarizing the steps here as a memo to myself.

Detect text in file (PDF / TIFF)

I felt that the document above leaves out several important points, which made it a little hard going for a beginner like me.

Environment

macOS Mojave
Python 3.7

Cost

I don't know the bill for April yet, but I expect it to be small. I am also using the free credits given to first-time users, so I will update this section once I know the actual cost.

Enable Cloud Vision API

In the GCP console, open the Library page under APIs and Services, search for the Cloud Vision API, and enable it.

Create json key file

Open Service Accounts under IAM and Administration and create a new service account.

You can create the JSON key file from the Create service account screen.

This gives you a key file that contains the service account's private key and related credentials. You will move this key file into your working directory later.
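
Before moving on, it can help to confirm that the downloaded file really is a service account key. This is a minimal sketch, assuming the usual fields of such a key file (`type`, `project_id`, `private_key`, `client_email`). `check_key_file` is a hypothetical helper name and is not part of any Google library.

```python
import json

# Fields normally present in a service account key file (assumed, see above).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_file(path):
    """Return a list of problems found in the key file (empty if none)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    # Report every expected field that is absent.
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data
    ]

    # A service account key declares its type explicitly.
    if "type" in data and data["type"] != "service_account":
        problems.append("type is not 'service_account'")
    return problems
```

For a valid key, `check_key_file('engaged-symbol-274611-192d61800d05.json')` should return an empty list.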

Cloud Storage preparation

Select Browser under Storage. This takes you to the Storage Browser, where you click Create Bucket.

Create a new bucket and upload the PDF file you want to OCR. My bucket this time is named `environment-engineering-pdf-bucket-1`, and I uploaded `scan-001.pdf` to it.

Also create a second bucket to hold the text read from the PDF file. I named it `ocr-result-bucket-qiita`.
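
The script later refers to these buckets both by plain name and as `gs://` URIs. As a small sketch of how the two forms relate, the hypothetical helper below (`split_gcs_uri` is my own name, not a library function) splits a URI into a bucket name and an object path using only the standard library.

```python
import re


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_path).

    The object path is an empty string when the URI names only a bucket.
    """
    match = re.match(r"gs://([^/]+)/?(.*)", uri)
    if match is None:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return match.group(1), match.group(2)


print(split_gcs_uri("gs://environment-engineering-pdf-bucket-1/scan-001.pdf"))
print(split_gcs_uri("gs://ocr-result-bucket-qiita"))
```

Deriving the bucket name from the URI this way avoids having to keep two separate settings in sync.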

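Install required packages

The following three packages are required, so install them. You can also do this inside a virtualenv. Note that the code in this article uses `vision.types` and `vision.enums`, which exist only in the 1.x releases of google-cloud-vision, so pin that package below 2.0 if a newer version gets installed.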

pip install google-cloud-vision
pip install google-cloud-storage
pip install protobuf

https://pypi.org/project/google-cloud-storage/
https://pypi.org/project/google-cloud-vision/
https://pypi.org/project/protobuf/
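
To check that the installation worked before running the main script, one option is to ask Python whether the modules can be found. This is only a sketch. `missing_modules` is a hypothetical helper, and the module names are the ones the script imports.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package (e.g. 'google.cloud') is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Prints an empty list when all three packages are installed.
print(missing_modules(
    ["google.cloud.vision", "google.cloud.storage", "google.protobuf"]))
```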

The actual Python code

import os
import json
import re
from google.cloud import vision
from google.cloud import storage
from google.protobuf import json_format

# Change these to your own source and destination URIs
gcs_source_uri = "gs://environment-engineering-pdf-bucket-1/scan-001.pdf"
gcs_destination_uri = "gs://ocr-result-bucket-qiita"

# Change this to the name of your own result bucket
bucket_name = "ocr-result-bucket-qiita"

# Change this to the name of your own key file
# Don't forget to put the JSON key file in the same directory as this script!
credential_path = 'engaged-symbol-274611-192d61800d05.json'

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

mime_type = 'application/pdf'
# Number of PDF pages grouped into each output JSON file
batch_size = 2
client = vision.ImageAnnotatorClient()

# DOCUMENT_TEXT_DETECTION is the OCR feature intended for dense text such as documents
feature = vision.types.Feature(
    type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

# Input: the PDF stored in Cloud Storage
gcs_source = vision.types.GcsSource(uri=gcs_source_uri)

input_config = vision.types.InputConfig(
    gcs_source=gcs_source, mime_type=mime_type)

# Output: JSON result files written under the destination URI
gcs_destination = vision.types.GcsDestination(uri=f"{gcs_destination_uri}/")
output_config = vision.types.OutputConfig(
    gcs_destination=gcs_destination, batch_size=batch_size)

async_request = vision.types.AsyncAnnotateFileRequest(
    features=[feature], input_config=input_config,
    output_config=output_config)

# Start the asynchronous OCR job
operation = client.async_batch_annotate_files(
    requests=[async_request])

# Wait up to 180 seconds for the job to finish
print('Waiting for the operation to finish.')
operation.result(timeout=180)


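storage_client = storage.Client()

bucket = storage_client.get_bucket(bucket_name)

# List the result files that the API wrote to the destination bucket
blob_list = list(bucket.list_blobs())

# Use the first result file
output = blob_list[0]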

json_string = output.download_as_string()
response = json_format.Parse(
    json_string, vision.types.AnnotateFileResponse())

# The actual response for the first page of the input file.
first_page_response = response.responses[0]
annotation = first_page_response.full_text_annotation

print(u'Full text:\n{}'.format(
    annotation.text))
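
The script above prints only the first page of the first result file. As a rough sketch of how to get every page, the result files can also be read as plain JSON. This assumes the files use the camelCase field names `responses`, `fullTextAnnotation`, `text`, and `context.pageNumber`. `collect_page_texts` is a hypothetical helper of my own.

```python
import json
from typing import Iterable, List, Tuple


def collect_page_texts(json_documents: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (page_number, text) pairs from result files, sorted by page."""
    pages = []
    for document in json_documents:
        data = json.loads(document)
        for response in data.get("responses", []):
            # Pages without recognized text have no fullTextAnnotation entry.
            page_number = response.get("context", {}).get("pageNumber", 0)
            text = response.get("fullTextAnnotation", {}).get("text", "")
            pages.append((page_number, text))
    # Sort explicitly, since files are not guaranteed to arrive in page order.
    return sorted(pages, key=lambda page: page[0])
```

With the objects from the script, something like `collect_page_texts(blob.download_as_string() for blob in bucket.list_blobs())` should work, provided the bucket holds only result files.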

Finally, some examples

Successful example

I ran OCR on a scanned PDF page showing a book's title page.

The title was then printed in the terminal as follows.

Gentosha Bunko
Chinese food in Kyoto
Naomi Kang

Example of failure

However, it fails on pages that use cursive-style characters, such as the table of contents. Gyoza is recognized as Kamako, and the Garlic entry loses its surrounding quotation marks.

output:

table of contents
《Kamako》
"dance"
Kashinnosu
Garlic
Bag child 4
Chapter fish(Marutamachi Nanahommatsu)|
34
Three-sided fish wing
Buan(Shimogamo)
Of sesame skin
Water 篮子
Numbers(Jodo-ji Temple)
04
Like a parent-child valve
Phoenix egg
Fuyoen(Kawaramachi Kajo)
Person

Finally

If you can output this to a format such as `epub`, you can also convert it to mobi format and read it on a Kindle! I don't know how much that would cost, though...
