I wanted to apply OCR to PDF files from the command line, so I wrote a bash script named ocrize.

Before using ocrize, install the required packages with the following command. The pdfunite command used in Step.3 comes from the poppler-utils package, so install that as well if it is missing.

$ sudo apt install tesseract-ocr-jpn imagemagick
# To OCR vertically written Japanese PDFs, tesseract-ocr-jpn-vert is also required.

Execute ocrize as follows (make it executable with chmod in advance).

$ ./ocrize input.pdf
# or
$ ocrize input.pdf # If ocrize is placed in a directory on your PATH, such as /usr/local/bin.

The following describes the procedure for applying OCR to the PDF file input.pdf on the command line.

Step.0 Preparation

Before applying OCR to input.pdf, store the file name in a variable and create a temporary working directory.

$ stem="input" # File name without the extension.
$ dir="${stem}.temp"
$ mkdir ${dir} # The input.temp directory is created.
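As a minimal sketch of this preparation step, the stem and the working directory name can also be derived from the PDF name with shell parameter expansion instead of being typed by hand. The helper names stem_of and workdir_of are hypothetical and are not part of ocrize.

```shell
# Hypothetical helpers (not part of ocrize): derive names from a PDF file name.

# Print the file name without its directory part and without the last extension.
stem_of() {
    name=${1##*/}                  # docs/input.pdf -> input.pdf
    printf '%s\n' "${name%.*}"     # input.pdf -> input
}

# Print the working directory name used in this article (<stem>.temp).
workdir_of() {
    printf '%s.temp\n' "$(stem_of "$1")"
}

stem=$(stem_of "input.pdf")
dir=$(workdir_of "input.pdf")
echo "${stem} ${dir}"              # prints: input input.temp
```

Only the last extension is removed, so a name such as my.report.pdf keeps the stem my.report.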

Step.1 Divide the PDF file into image files

Use ImageMagick's convert command to split input.pdf into image files.

$ convert -density 300 -geometry 1000 ${stem}.pdf ${dir}/${stem}.png

A higher -density value gives better image quality, but it also makes ImageMagick use more memory, so for large values you may need to raise ImageMagick's resource limits (see Error.2 below).
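To get a feel for why the density matters, here is a rough arithmetic sketch. It assumes the pages are A4 (210 mm x 297 mm), which the article does not state, and a4_pixels is a hypothetical helper. The density is the resolution in dots per inch at which each page is rasterized, so the pixel count grows with the square of the density.

```shell
# Hypothetical helper: pixel size of an A4 page (210 mm x 297 mm, an assumption)
# rasterized at the given density in dots per inch (1 inch = 25.4 mm).
a4_pixels() {
    awk -v dpi="$1" 'BEGIN { printf "%.0f %.0f\n", 210/25.4*dpi, 297/25.4*dpi }'
}

a4_pixels 72     # prints: 595 842
a4_pixels 300    # prints: 2480 3508
```

Doubling the density from 300 to 600 quadruples the number of pixels per page, which is when the resource limits tend to be hit.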

Running the above command splits input.pdf into pages, converts each page to PNG, and writes the files to the input.temp directory as shown below.

$ ls ${dir}
input-0.png   input-1.png   input-2.png  input-3.png

If the convert command fails with one of the following errors, edit ImageMagick's policy.xml (in my environment, policy.xml was directly under /etc/ImageMagick-6).

Error.1 convert: not authorized

Find the policy tag whose domain is coder and whose pattern is PDF, and change its rights attribute as follows so that PDF files can be read and written.

<policy domain="coder" rights="read|write" pattern="PDF" />
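Before editing, it can help to see what the current policy says. The following is a minimal sketch that assumes each policy tag in policy.xml sits on a single line, as in the example above; pdf_policy_rights is a hypothetical helper, and commented-out tags are reported too.

```shell
# Hypothetical helper: print the rights attribute of the PDF coder policy.
# Assumes each <policy ... /> tag is written on one line.
pdf_policy_rights() {
    grep 'domain="coder"' "$1" | grep 'pattern="PDF"' |
        sed -E 's/.*rights="([^"]*)".*/\1/'
}

# Path from my environment; it may differ on other systems.
policy=/etc/ImageMagick-6/policy.xml
if [ -r "$policy" ]; then
    pdf_policy_rights "$policy"   # "none" means convert is not allowed to read PDFs
fi
```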

Error.2 convert-im6.q16: cache resources exhausted

This error means ImageMagick has hit its resource limits, so raise the values in the policy tags whose domain is resource. In my case, I changed the entries whose name is memory and disk.

<policy domain="resource" name="memory" value="1GiB"/>
<policy domain="resource" name="disk" value="2GiB"/>

You can check the current resource limits with either of the following commands.

$ identify -list resource
# or
$ identify -list Resource

Step.2 Create a PDF with OCR applied from the image file

OCR is applied to each image file under the input.temp directory.

$ N=$(ls ${dir} | grep -c '' | awk '{printf "%d", $1-1}')
$ for n in $(seq 0 ${N}); do tesseract -l jpn+eng ${dir}/${stem}-${n}.png ${dir}/${stem}-${n} pdf; done

The recognition language is specified with the -l option. To recognize Japanese first and then English, specify jpn+eng.
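The page count above works because the working directory contains only the PNG files at that point. As an alternative sketch, the pages can be counted with a glob that matches only the page images, which stays correct even when other files are present. last_page_index is a hypothetical helper, not part of ocrize.

```shell
# Hypothetical helper: print the index of the last page image,
# i.e. the number of files named <dir>/<stem>-<n>.png minus one.
last_page_index() {
    set -- "$1"/"$2"-*.png
    # If nothing matched, the unexpanded pattern is left in $1 and does not exist.
    [ -e "$1" ] || { echo "no page images found" >&2; return 1; }
    echo $(( $# - 1 ))
}

# Demonstration with empty stand-in files instead of real page images.
demo=$(mktemp -d)
touch "$demo/input-0.png" "$demo/input-1.png" "$demo/input-2.png"
last_page_index "$demo" input     # prints: 2
rm -r "$demo"
```

With the variables from this article, it would be called as N=$(last_page_index "${dir}" "${stem}").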

Step.3 Combine PDF files with OCR into one

Use the pdfunite command to combine multiple PDF files into one.

$ pdfunite $(for n in $(seq 0 ${N}); do echo ${dir}/${stem}-${n}.pdf; done) ocrized-${stem}.pdf

The for loop lists the PDF files by page number so that the page order is preserved. After execution, ocrized-input.pdf, which is input.pdf with OCR applied, is created in the current directory.
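The reason for generating the names with seq rather than a glob such as input-*.pdf is that glob expansion sorts names as strings. The sketch below uses empty stand-in files to show the difference; page_pdfs is a hypothetical helper that wraps the same loop as above.

```shell
# Hypothetical helper: print the page PDFs 0..N in numeric page order.
# Arguments: directory, stem, index of the last page.
page_pdfs() {
    for n in $(seq 0 "$3"); do
        printf '%s\n' "$1/$2-$n.pdf"
    done
}

# Demonstration with 12 empty stand-in files.
demo=$(mktemp -d)
for n in $(seq 0 11); do touch "$demo/input-$n.pdf"; done

echo "glob order:"
ls "$demo"/input-*.pdf    # input-10.pdf is listed before input-2.pdf

echo "numeric order:"
page_pdfs "$demo" input 11

rm -r "$demo"
```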

Finally, delete the input.temp directory and you're done!

$ rm -r ${dir}
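If a step fails halfway, the working directory is left behind and has to be removed by hand. One possible variation, which is not what ocrize does, is to create the directory with mktemp -d and remove it in an EXIT trap. In this sketch run_with_cleanup is a hypothetical name and the touch line is only a stand-in for the real convert and tesseract work.

```shell
# Sketch: a working directory that is removed automatically, even on failure.
# The parentheses make the function body run in a subshell,
# so the EXIT trap fires as soon as the function finishes.
run_with_cleanup() (
    dir=$(mktemp -d) || exit 1
    trap 'rm -rf "$dir"' EXIT
    echo "$dir"                 # report the directory name for the demonstration
    touch "$dir/input-0.png"    # stand-in for the real convert/tesseract work
)

created=$(run_with_cleanup)
[ -d "$created" ] || echo "working directory was removed"
```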

ocrize

The following is a bash script that puts the above steps together. Save it with the file name ocrize and use it.

#! /bin/bash

if [ $# -eq 1 ]; then
    file=$1
    ext=$(echo ${file} | rev | cut -d '.' -f 1 | rev)
    dir=${file}.temp
    if [ -d ${dir} ]; then
        echo "${dir} already exists. Please remove this directory."
        exit 1
    fi
    if [ ${ext} = "pdf" -o ${ext} = "PDF" ]; then
        if [ ! -f ${file} ]; then
            echo "${file} does not exist."
            exit 1
        fi
        stem=$(echo ${file} | rev | cut -c 5- | rev)
        mkdir ${dir}
        echo "1: Converting PDF to PNG."
        convert -density 300 -geometry 1000 ${file} ${dir}/${stem}.png
        echo "1: Finished."
        N=$(ls ${dir} | grep -c '' | awk '{printf "%d", $1-1}')
        echo "2: OCRizing."
        for n in $(seq 0 ${N}); do
            p=$(echo "${n} ${N}" | awk '{printf "%5.1f", ($1+1)/($2+1)*100}')
            echo -ne "Progress: [${p} %]\r"
            tesseract -l jpn+eng ${dir}/${stem}-${n}.png ${dir}/${stem}-${n} pdf >& /dev/null
            rm ${dir}/${stem}-${n}.png
        done
        echo "2: Finished.        "
        echo "3: Merging PDF files."
        #pdftk $(for n in $(seq 0 ${N}); do echo ${dir}/${stem}-${n}.pdf; done) output ocrized-${file}
        pdfunite $(for n in $(seq 0 ${N}); do echo ${dir}/${stem}-${n}.pdf; done) ocrized-${file}
        echo "3: Finished."
        rm -r ${dir}
    else
        echo "Extension must be pdf or PDF."
        exit 1
    fi
else
    echo "Usage: $ ocrize input.pdf"
    exit 1
fi
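The script extracts the extension with rev and cut and then compares it with pdf and PDF. As an alternative sketch, a case pattern can do the same check in one place. Note that this version is slightly more permissive than the script, because it also accepts mixed case such as .Pdf; is_pdf_name is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the name ends in .pdf, ignoring letter case.
is_pdf_name() {
    case $1 in
        *.[pP][dD][fF]) return 0 ;;
        *)              return 1 ;;
    esac
}

for f in input.pdf scan.PDF notes.txt; do
    if is_pdf_name "$f"; then
        echo "$f: accepted"
    else
        echo "$f: extension must be pdf or PDF"
    fi
done
```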

When you run ocrize, the output looks like this.

$ ./ocrize input.pdf
1: Converting PDF to PNG.
convert-im6.q16: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `input.pdf.temp/input.png' @ warning/png.c/MagickPNGWarningHandler/1654.
1: Finished.
2: OCRizing.
Progress: [ 84.4 %]

A few seconds later,

1: Converting PDF to PNG.
convert-im6.q16: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `input.pdf.temp/input.png' @ warning/png.c/MagickPNGWarningHandler/1654.
1: Finished.
2: OCRizing.
2: Finished.        
3: Merging PDF files.
3: Finished.
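The percentage in the Progress line comes from the awk expression in the script: the zero-based page index n plus one, divided by the number of pages N plus one. The sketch below wraps that same formula in a hypothetical function, progress_percent, so it can be tried on its own. The page numbers are only an illustration; the article does not say how many pages the sample file had.

```shell
# Same formula as in ocrize: (n + 1) / (N + 1) * 100,
# where n is the current page index and N is the last page index.
progress_percent() {
    echo "$1 $2" | awk '{printf "%5.1f", ($1+1)/($2+1)*100}'
}

# Page index 26 of a file whose last index is 31 (27 of 32 pages done).
echo "Progress: [$(progress_percent 26 31) %]"   # prints: Progress: [ 84.4 %]
```

The %5.1f format pads the number to five characters, which keeps the line the same width while it is rewritten in place with \r.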
