Apply OCR to a PDF from the command line on Linux (Ubuntu)

I wanted to apply OCR to a PDF file from the command line, so I wrote a bash script named ocrize.

Before using ocrize, run the following command to install the required packages.

$ sudo apt install tesseract-ocr-jpn imagemagick
# To apply OCR to vertically written Japanese PDFs, tesseract-ocr-jpn-vert is also required.
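
The steps below call three commands: convert (ImageMagick), tesseract, and pdfunite. On Ubuntu, pdfunite is provided by the poppler-utils package. As a minimal sketch, the following checks that each command is on the PATH before you start; missing_tools is a hypothetical helper name and is not part of ocrize.

```shell
#!/bin/bash
# Print each required command that is not found on the PATH.
# missing_tools is a hypothetical helper, not part of ocrize.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "${tool}" > /dev/null 2>&1 || echo "${tool}"
    done
}

# Prints nothing when all three commands are installed.
missing_tools convert tesseract pdfunite
```

If a name is printed, install the package that provides it before running ocrize.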

Run ocrize as follows (make it executable with chmod beforehand).

$ ./ocrize input.pdf
# or
$ ocrize input.pdf # For example, when ocrize is placed in a directory on the PATH, such as /usr/local/bin.

The rest of this article walks through the steps for applying OCR to the PDF file input.pdf on the command line.

Step.0 Preparation

Before applying OCR to input.pdf, store the file name in a variable and create a temporary working directory.

$ stem="input" # File name without the extension.
$ dir="${stem}.temp"
$ mkdir ${dir} # The input.temp directory will be created.
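
Instead of typing the stem by hand, it can be derived from the file name with bash parameter expansion. This is a minimal sketch; split_name is a hypothetical helper, and dropping leading directories is an assumption so that the temporary directory is named after the file only.

```shell
#!/bin/bash
# Derive the stem and a temporary directory name from a PDF path.
# split_name is a hypothetical helper, not part of ocrize.
split_name() {
    local file=$1
    local base=${file##*/}    # Drop any leading directories.
    local stem=${base%.*}     # Drop the last extension.
    printf '%s %s\n' "${stem}" "${stem}.temp"
}

split_name input.pdf    # prints: input input.temp
```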

Step.1 Divide the PDF file into image files

Use ImageMagick's convert command to split input.pdf into image files.

$ convert -density 300 -geometry 1000 ${stem}.pdf ${dir}/${stem}.png

The higher the value of the -density option, the better the image quality, but a higher value may require raising ImageMagick's resource limits (see Error.2 below).

When the above command is executed, input.pdf is split into pages, and each page is written as a PNG file under the input.temp directory, as shown below.

$ ls ${dir}
input-0.png   input-1.png   input-2.png  input-3.png

If the convert command fails with one of the following errors, modify ImageMagick's policy.xml (in my environment, policy.xml was directly under /etc/ImageMagick-6).

Error.1 convert: not authorized

Change the rights attribute of the policy tag whose domain is coder and whose pattern is PDF so that PDF files can be read and written, as follows.

<policy domain="coder" rights="read|write" pattern="PDF" />
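
To see which rights are currently set before editing the file, a short sketch like the following can help. pdf_rights is a hypothetical helper, the path is the one from my environment, and the sketch assumes the policy element sits on a single line and is not commented out.

```shell
#!/bin/bash
# Print the rights of the coder policy for PDF in the given policy file.
# pdf_rights is a hypothetical helper; it assumes the element is on one line.
pdf_rights() {
    grep 'domain="coder"' "$1" | grep 'pattern="PDF"' |
        sed -E 's/.*rights="([^"]*)".*/\1/'
}

policy=/etc/ImageMagick-6/policy.xml    # Path from my environment; yours may differ.
if [ -f "${policy}" ]; then
    pdf_rights "${policy}"
fi
```

If the output is none, convert is not allowed to read PDF files, which matches the not authorized error.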

Error.2 convert-im6.q16: cache resources exhausted

ImageMagick has hit its resource limits, so raise the limits in the policy tags whose domain is resource. In my case, I changed the entries whose name is memory and disk.

<policy domain="resource" name="memory" value="1GiB"/>
<policy domain="resource" name="disk" value="2GiB"/>

You can check the current resource limits with the following command.

$  identify -list resource
# or
$  identify -list Resource

Step.2 Create a PDF with OCR applied from the image file

Apply OCR to each image file under the input.temp directory. N is the index of the last page (pages are numbered from 0).

$ N=$(ls ${dir} | grep -c '' | awk '{printf "%d", $1-1}')
$ for n in $(seq 0 ${N}); do tesseract -l jpn+eng ${dir}/${stem}-${n}.png ${dir}/${stem}-${n} pdf; done

The recognition languages are specified with the -l option. To recognize Japanese first and then English, specify jpn+eng.
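
The ls-based count above counts every entry in the directory, so an unrelated file there would inflate N. As a hedged alternative, the sketch below counts only the files that match the page pattern; last_page is a hypothetical helper name.

```shell
#!/bin/bash
# Print the index of the last page image (pages are numbered from 0).
# last_page is a hypothetical helper; it prints -1 when no page exists.
last_page() {
    local dir=$1 stem=$2
    local pages=("${dir}/${stem}"-*.png)
    # An unmatched glob stays as the literal pattern, so test the first entry.
    [ -e "${pages[0]}" ] || pages=()
    echo $(( ${#pages[@]} - 1 ))
}

# Demonstration with three empty page files in a scratch directory.
demo=$(mktemp -d)
touch "${demo}/input-0.png" "${demo}/input-1.png" "${demo}/input-2.png"
last_page "${demo}" input    # prints: 2
rm -r "${demo}"
```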

Step.3 Combine PDF files with OCR into one

Use the pdfunite command to combine multiple PDF files into one.

$ pdfunite $(for n in $(seq 0 ${N}); do echo ${dir}/${stem}-${n}.pdf; done) ocrized-${stem}.pdf

The for loop lists the PDF files by page number so that the pages are merged in the correct order. After execution, ocrized-input.pdf, which is input.pdf with OCR applied, is created in the current directory.
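
The page order matters because a plain glob such as input-*.pdf sorts names as text, which puts input-10.pdf before input-2.pdf. The sketch below builds the list by number instead; page_pdfs is a hypothetical helper, and reading its output into an array (shown in the comments) is one way to keep file names with spaces intact.

```shell
#!/bin/bash
# Print the per-page PDF paths in page order, one per line.
# page_pdfs is a hypothetical helper; N is the index of the last page.
page_pdfs() {
    local dir=$1 stem=$2 N=$3 n
    for n in $(seq 0 "${N}"); do
        echo "${dir}/${stem}-${n}.pdf"
    done
}

page_pdfs input.temp input 2
# prints:
# input.temp/input-0.pdf
# input.temp/input-1.pdf
# input.temp/input-2.pdf

# Possible use with pdfunite (not run here):
#   mapfile -t pdfs < <(page_pdfs "${dir}" "${stem}" "${N}")
#   pdfunite "${pdfs[@]}" "ocrized-${stem}.pdf"
```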

Finally, delete the input.temp directory and you're done!

$ rm -r ${dir}
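
If a step fails partway, the temporary directory is left behind and has to be removed by hand. One way around this, sketched here under the assumption that a randomly named directory from mktemp is acceptable, is to wrap the work in a helper that always cleans up. with_temp_dir is a hypothetical name and is not used by ocrize.

```shell
#!/bin/bash
# Run a command with a fresh temporary directory appended as its last
# argument, then remove the directory whether or not the command succeeded.
# with_temp_dir is a hypothetical helper, not part of ocrize.
with_temp_dir() {
    local dir status=0
    dir=$(mktemp -d) || return 1
    "$@" "${dir}" || status=$?
    rm -r "${dir}"
    return "${status}"
}

# Prints the path of the temporary directory, which is removed afterwards.
with_temp_dir echo
```

The helper does not handle interruption by a signal such as Ctrl-C; it only covers normal success and failure of the wrapped command.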

ocrize

The following bash script combines the steps above. Save it under the file name ocrize. Note that the script names its temporary directory after the full file name (input.pdf.temp).

#!/bin/bash
# ocrize: apply OCR to a PDF file and write the result to ocrized-<input>.

if [ $# -eq 1 ]; then
    file=$1
    ext=${file##*.}     # Extension (text after the last dot).
    dir=${file}.temp    # Temporary working directory.
    if [ -d "${dir}" ]; then
        echo "${dir} already exists. Please remove this directory."
        exit 1
    fi
    if [ "${ext}" = "pdf" ] || [ "${ext}" = "PDF" ]; then
        if [ ! -f "${file}" ]; then
            echo "${file} does not exist."
            exit 1
        fi
        stem=${file%.*}    # File name without the extension.
        mkdir "${dir}"
        echo "1: Converting PDF to PNG."
        convert -density 300 -geometry 1000 "${file}" "${dir}/${stem}.png"
        echo "1: Finished."
        # Index of the last page; pages are numbered from 0.
        N=$(ls "${dir}" | grep -c '' | awk '{printf "%d", $1-1}')
        echo "2: OCRizing."
        pdfs=()    # Per-page PDF files, collected in page order.
        for n in $(seq 0 "${N}"); do
            # Percentage of pages processed so far, with one decimal place.
            p=$(echo "${n} ${N}" | awk '{printf "%5.1f", ($1+1)/($2+1)*100}')
            echo -ne "Progress: [${p} %]\r"
            tesseract -l jpn+eng "${dir}/${stem}-${n}.png" "${dir}/${stem}-${n}" pdf >& /dev/null
            rm "${dir}/${stem}-${n}.png"
            pdfs+=("${dir}/${stem}-${n}.pdf")
        done
        echo "2: Finished.        "
        echo "3: Merging PDF files."
        pdfunite "${pdfs[@]}" "ocrized-${file}"
        echo "3: Finished."
        rm -r "${dir}"
    else
        echo "Extension must be pdf or PDF."
        exit 1
    fi
else
    echo "Usage: $ ocrize input.pdf"
    exit 1
fi

When you run ocrize, the output looks like this.

$ ./ocrize input.pdf
1: Converting PDF to PNG.
convert-im6.q16: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `input.pdf.temp/input.png' @ warning/png.c/MagickPNGWarningHandler/1654.
1: Finished.
2: OCRizing.
Progress: [ 84.4 %]

A few seconds later,

1: Converting PDF to PNG.
convert-im6.q16: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `input.pdf.temp/input.png' @ warning/png.c/MagickPNGWarningHandler/1654.
1: Finished.
2: OCRizing.
2: Finished.        
3: Merging PDF files.
3: Finished.
