Introduction

This article describes how to run OCR on PDF files (by converting them to Google Docs) from Python in a Google Colab environment.

Google Drive can convert a PDF into a Google Docs file by OCR. This article shows how to drive that feature from Python code.

1. Extract text from PDF
2. Use the OCR function of Google Drive for text extraction
3. Convert to Google Docs by OCR and extract the text
4. Work around the problem that alphabetic characters in the file name become full-width when converted to Google Docs

In particular, I could not find any information about the full-width file name problem in item 4, so I am sharing it here for anyone struggling with the same issue.

Technical elements

- Google Colaboratory (Colab)
- Python 3.x
- googleapis / google-api-python-client (hereafter, Google API client)

Source code

This is the final source code. Processing follows the flow below.

1. Authenticate and get the Drive Service
2. Check already-processed PDF files for duplicates by file name and exclude them from the targets
3. Create a list of PDFs to convert
4. Convert the target PDF files

Details are described later.

def full_to_half(val):
    """
    Convert full-width characters to half-width.
    * Works around the problem that alphabetic characters in the file name become full-width after OCR
    """
    # Map U+FF01..U+FF5E (full-width ASCII variants) onto U+0021..U+007E
    return val.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))

import os
import glob
from google.colab import auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# Authentication
auth.authenticate_user()
# Get the Service object used to operate Drive
drive_service = build('drive', 'v3')

# Local paths on the Drive mounted in Colab
input_path = 'drive/My Drive/PDF/INPUT'    # Input (PDF) directory path
output_path = 'drive/My Drive/PDF/OUTPUT'  # Output destination directory path

#####
# Check processed PDF files for duplicates by file name and exclude them from the targets
####
# Walk the output directory recursively
exist_filenames = ['']
for root, dirs, files_o in os.walk(output_path):
    for filename in files_o:
        # Convert full-width to half-width and remove the .gdoc extension
        exist_filename = full_to_half(filename).replace('.gdoc', '')
        # Add the existing file name
        exist_filenames.append(exist_filename)

#####
# Create a list of PDFs to convert
####
# Walk the input directory recursively
pdf_infos = []
for root, dirs, files in os.walk(input_path):
    for filename in files:
        # Skip file names that have already been converted
        if full_to_half(filename) in exist_filenames:
            continue
        # Only handle PDF files (the extension check is case-insensitive)
        if filename.lower().endswith('.pdf'):
            filepath = os.path.join(root, filename)  # Local file path on Colab
            pdf_infos.append({
                'path': filepath,
                'name': filename
            })

#print('number of files: ' + str(len(pdf_infos)))

# MIME type of a Google Docs file (the conversion target)
MIME_TYPE = 'application/vnd.google-apps.document'

#####
# Convert the target PDF files
####
for pdf_info in pdf_infos:
    pdf_path = pdf_info['path']

    # File name after OCR: convert full-width alphabetic characters to half-width
    pdf_filename = full_to_half(pdf_info['name'])

    body = {
        'name': pdf_filename,
        'mimeType': MIME_TYPE,
        'parents': ['Output destination Drive directory ID']
    }
    try:
        # mimetype here describes the uploaded content (a PDF), not the target type
        media_body = MediaFileUpload(pdf_path, mimetype='application/pdf', resumable=True)

        drive_service.files().create(
            body=body,
            media_body=media_body,
        ).execute()
    except Exception as e:
        # Report the failure and continue with the next file
        print('error: Failed to create Documents file.')
        print(pdf_path)
        print(e)
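
The 'parents' entry in the code above expects a Drive folder ID, not a path. The following is a minimal sketch of one way to look that ID up by folder name through the Drive v3 `files().list()` call. The helper names `build_folder_query` and `find_folder_ids` are made up for this illustration and are not part of the original script; the sketch also assumes the folder name is unique enough that the first hit is the one you want.

```python
def build_folder_query(folder_name):
    """Build a Drive v3 search query for a non-trashed folder with this exact name."""
    # Backslashes and single quotes must be escaped inside the quoted value.
    escaped = folder_name.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"name = '{escaped}' "
        "and mimeType = 'application/vnd.google-apps.folder' "
        "and trashed = false"
    )


def find_folder_ids(service, folder_name):
    """Return the IDs of all folders named folder_name that the user can see.

    service is assumed to be a Drive v3 Service object such as drive_service.
    """
    response = service.files().list(
        q=build_folder_query(folder_name),
        fields='files(id, name)',
    ).execute()
    return [f['id'] for f in response.get('files', [])]
```

With the article's layout, something like `find_folder_ids(drive_service, 'OUTPUT')` would return candidate IDs to put into 'parents'. If several folders share the name, the list has several entries and you have to pick one yourself.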

Preparation

Make some preparations before running the above code.

Google Drive mount

Colab has a mount feature that lets you treat Google Drive as if it were a local file system. You can also operate on Drive through the Google API client, but every operation then goes through the Web API, which is slower. To keep processing fast, do as much of the work as possible on the mounted path.

To mount Drive on Colab, connect to the runtime and press the Drive mount icon in the Files panel.

The following code is then inserted; run it.

from google.colab import drive
drive.mount('/content/drive')

Open the displayed URL in your browser, copy the verification code shown there, and paste it into the text box.

This completes the mount.
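
Before running the main script it can help to confirm that the mounted directories are actually visible. The sketch below is only an illustration: `missing_directories` is a hypothetical helper, and the relative paths assume that the notebook's working directory is /content (Colab's default) and that Drive was mounted at /content/drive as shown above.

```python
import os


def missing_directories(paths):
    """Return the subset of paths that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]


# Paths used in this article, relative to the assumed working directory /content.
required = ['drive/My Drive/PDF/INPUT', 'drive/My Drive/PDF/OUTPUT']
missing = missing_directories(required)
if missing:
    print('Not found (is Drive mounted?):', missing)
```

An empty result means both directories exist and the script can walk them with os.walk.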

Install Google API client

Install the Google API client for Python.

!pip install google-api-python-client

Implementation

I will explain the implementation of the source code mentioned above.

1. Authenticate, get Drive Service

Authenticate using Colab's auth module, then build the Service object that the Google API client uses to work with Drive.

from google.colab import auth
from googleapiclient.discovery import build

# Authentication
auth.authenticate_user()
# Get the Service object used to operate Drive
drive_service = build('drive', 'v3')
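
As a quick sanity check that authentication worked, you can ask Drive which account the Service object is acting as. This is a small sketch using the Drive v3 `about().get()` call, which requires an explicit `fields` argument; `current_user_email` is a name chosen for this example and does not appear in the original script.

```python
def current_user_email(service):
    """Return the e-mail address of the account behind a Drive v3 Service object."""
    about = service.about().get(fields='user(emailAddress)').execute()
    return about['user']['emailAddress']
```

Calling `current_user_email(drive_service)` right after `build('drive', 'v3')` should return the address you logged in with; an exception at this point usually means the authentication step did not complete.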

2. Check processed PDF files for duplicates by file name and exclude them from the targets

In this setup, the converted files are all stored in one place. A duplicate check is performed so that the script can be re-run after it was interrupted midway or after new PDFs have been added.

The code recursively walks the output directory on the mounted (virtually local) file system and appends each existing file name to the list exist_filenames.

# Walk the output directory recursively
exist_filenames = ['']
for root, dirs, files_o in os.walk(output_path):
    for filename in files_o:
        # Convert full-width to half-width and remove the .gdoc extension
        exist_filename = full_to_half(filename).replace('.gdoc', '')
        # Add the existing file name
        exist_filenames.append(exist_filename)
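
A possible variation on this step, sketched under my own assumptions rather than taken from the original script: collect the names into a set (membership tests are faster than on a list) and strip '.gdoc' only when it is really the trailing extension, since `str.replace` would also remove a '.gdoc' that happens to appear in the middle of a name. `strip_suffix` and `collect_converted_names` are hypothetical helpers; pass the article's `full_to_half` as `to_half` to get the same normalization.

```python
import os


def strip_suffix(name, suffix):
    """Remove suffix only when it is at the very end of name."""
    return name[:-len(suffix)] if suffix and name.endswith(suffix) else name


def collect_converted_names(output_dir, to_half=lambda s: s):
    """Walk output_dir and return the set of converted names without '.gdoc'.

    to_half is the normalization applied to each file name; it defaults to
    the identity function so this sketch runs on its own.
    """
    names = set()
    for _root, _dirs, filenames in os.walk(output_dir):
        for filename in filenames:
            names.add(strip_suffix(to_half(filename), '.gdoc'))
    return names
```

In the script this would be used as `exist_filenames = collect_converted_names(output_path, full_to_half)`.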

3. Create a list of PDFs to convert

Create the list of PDFs to convert at run time. A file whose name matches one of the already-converted names collected in step 2 is skipped. A PDF that does not match is new, so it is added to the list pdf_infos as a PDF to process.

# Walk the input directory recursively
pdf_infos = []
for root, dirs, files in os.walk(input_path):
    for filename in files:
        # Skip file names that have already been converted
        if full_to_half(filename) in exist_filenames:
            continue
        # Only handle PDF files (the extension check is case-insensitive)
        if filename.lower().endswith('.pdf'):
            filepath = os.path.join(root, filename)  # Local file path on Colab
            pdf_infos.append({
                'path': filepath,
                'name': filename
            })
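
The same selection can also be written with pathlib. The following is a sketch, not the article's code: `find_pending_pdfs` is a hypothetical helper that returns entries in the same {'path', 'name'} shape as pdf_infos, and `to_half` stands in for `full_to_half` so the example is self-contained.

```python
from pathlib import Path


def find_pending_pdfs(input_dir, done_names, to_half=lambda s: s):
    """Return [{'path', 'name'}] for PDFs under input_dir not listed in done_names."""
    pending = []
    for path in sorted(Path(input_dir).rglob('*')):
        # Keep regular files with a .pdf extension, ignoring case.
        if not path.is_file() or path.suffix.lower() != '.pdf':
            continue
        # Skip names that were already converted in an earlier run.
        if to_half(path.name) in done_names:
            continue
        pending.append({'path': str(path), 'name': path.name})
    return pending
```

Sorting the paths makes the processing order repeatable between runs, which is convenient when a run is interrupted and restarted.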

4. Convert the target PDF file

Convert the PDF files in the list built up through step 3.

Create a new file in Drive by calling files().create().execute() on the Drive Service object. If you specify the Google Docs MIME type at that point, the upload is automatically converted into an OCR-processed Google Docs file.

Specify the converted file name, the MIME type, and the parent directory ID in the body parameter of create(). For the media_body parameter, pass a MediaFileUpload object for the PDF file to upload.

for pdf_info in pdf_infos:
    pdf_path = pdf_info['path']

    # File name after OCR: convert full-width alphabetic characters to half-width
    pdf_filename = full_to_half(pdf_info['name'])

    body = {
        'name': pdf_filename,
        'mimeType': MIME_TYPE,
        'parents': ['Output destination Drive directory ID']
    }
    try:
        # mimetype here describes the uploaded content (a PDF), not the target type
        media_body = MediaFileUpload(pdf_path, mimetype='application/pdf', resumable=True)

        drive_service.files().create(
            body=body,
            media_body=media_body,
        ).execute()
    except Exception as e:
        # Report the failure and continue with the next file
        print('error: Failed to create Documents file.')
        print(pdf_path)
        print(e)
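
The loop above stops once the Google Docs file exists. To actually get the OCR text out, one option is the Drive v3 `files().export()` call with the 'text/plain' MIME type. The sketch below assumes you have the ID of the converted file, for example from the `id` field of the response returned by `create(...).execute()`; `extract_text` is a name invented for this example.

```python
def extract_text(service, file_id):
    """Export a Google Docs file as plain text and return it as a str.

    service is assumed to be a Drive v3 Service object such as drive_service.
    """
    data = service.files().export(
        fileId=file_id,
        mimeType='text/plain',
    ).execute()
    # The exported bytes may begin with a UTF-8 byte order mark;
    # decoding with 'utf-8-sig' drops it when present.
    return data.decode('utf-8-sig')
```

A usage along the lines of `text = extract_text(drive_service, created['id'])` would then give the recognized text, where `created` is the hypothetical variable holding the create() response.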

Addressing the problem that the alphabetic characters in the converted file name become full-width

Google Docs files created by OCR conversion of PDF files end up with full-width alphabetic characters in their file names. I investigated this with the following code.

chars = [
  'ｍ',  # Character copied from the Documents file name
  'm'  # Character entered by direct typing
]

# Full-width (file name after conversion)
print(hex(ord(chars[0])))
# Half-width
print(hex(ord(chars[1])))

# Convert the full-width alphabetic character to half-width
print(hex(ord(chars[0].translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)})))))

Execution result

0xff4d
0x6d
0x6d

The results show that the converted file name is full-width and that it can be converted to half-width.

For the conversion, I referred to this article: [Python] Convert full-width and half-width characters to each other in one line (alphabet + number + symbol) - Qiita
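
To make the effect on a whole file name concrete, here is a small self-contained sketch. It rebuilds the same translation table as `full_to_half` and compares it with `unicodedata.normalize('NFKC', ...)` from the standard library, which is an alternative I am adding for comparison and not something the original script uses. The sample file name is made up.

```python
import unicodedata

# Same table as the article's full_to_half(): U+FF01..U+FF5E -> U+0021..U+007E.
FULL_TO_HALF = str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)})


def full_to_half(val):
    """Convert full-width ASCII variants in val to their half-width forms."""
    return val.translate(FULL_TO_HALF)


name = 'ｒｅｐｏｒｔ２０２０．ｐｄｆ'  # hypothetical full-width file name
print(full_to_half(name))                   # report2020.pdf
print(unicodedata.normalize('NFKC', name))  # report2020.pdf

# The table does not cover the ideographic space U+3000, so it is left as-is,
# whereas NFKC maps it to an ordinary space.
print(full_to_half('ａ\u3000ｂ') == 'a\u3000b')                   # True
print(unicodedata.normalize('NFKC', 'ａ\u3000ｂ') == 'a b')       # True
```

NFKC is broader than the table: it also rewrites other compatibility characters (for example half-width katakana become full-width), so the explicit table is the more conservative choice when only the ASCII range should change.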

In conclusion

This completes the implementation of OCR conversion of PDF files. I hope you find it a useful reference.

[PYTHON] Convert PDF to Documents by OCR