I ran GhostScript from Python, split a PDF into pages, and converted each page to a JPEG image.

Introduction

Hello! I recently had the chance to use GhostScript from Python, so I would like to summarize what I learned. Specifically, I use GhostScript to convert PDFs to JPEG images (PDF ⇒ JPEG conversion). The conversion in this article handles more than single-page PDFs: a multi-page PDF is split into individual pages, and each page is saved as a JPEG image.

Why GhostScript?

GhostScript is an interpreter for the PostScript language and PDF, as you can see on the official website.

Quoted from the official website

An interpreter for the PostScript language and for PDF.

When I searched Google for what I wanted to do in this article, pdf2image and PyPDF2 seemed to be the common solutions, but my reason for using GhostScript is simple.

There was a PDF that could not be converted with PyPDF2 but could be converted with GhostScript lol

For the GhostScript operations, I referred to the following site. (It does not use Python.) http://michigawi.blogspot.com/2011/12/ghostscriptpdfjpg.html

Development environment

Windows 10
Python 3.6.9
GhostScript 9.50

Preparation (GhostScript download)

Download GhostScript from the official site: https://www.ghostscript.com/

Choose the package that matches your environment. For this article, I selected the AGPL release for Windows (64 bit).

(Screenshot: GhostScript download page)

Run the downloaded file to install GhostScript.

(Screenshots: GhostScript setup screens, steps 1 to 4)
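After installation, it can help to confirm where the GhostScript console executable ended up before hard-coding its path. The following is only a minimal sketch: the function name `find_ghostscript` is hypothetical, and the fallback pattern assumes the default Windows install location used later in this article (`C:\Program Files\gs\gs9.50\bin\gswin64c.exe`).

```python
import glob
import shutil


def find_ghostscript(candidates=("gswin64c", "gswin32c", "gs"),
                     install_glob=r"C:\Program Files\gs\gs*\bin\gswin64c.exe"):
    """Return the path of a GhostScript console executable, or None."""
    # First look for each candidate name on PATH
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    # Then fall back to the assumed default Windows install location
    # and pick the last match in sorted (string) order
    matches = sorted(glob.glob(install_glob))
    return matches[-1] if matches else None
```

Calling `find_ghostscript()` returns a path string when an executable is found and `None` otherwise, so the script can stop early with a clear message instead of failing inside the conversion loop.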

What I want to do ①: Split the PDF into pages

As explained in the Introduction, this step is needed only for PDFs with multiple pages; a PDF that already has a single page does not need it. Because the full program is long, only the part related to splitting is shown here. If you are short on time, see the entire source code in a later section.

As a prerequisite, create an "input" folder in the same directory as the source code and place the input PDF files there. (Strictly speaking, the code uses the current working directory, so run the script from that directory.) A directory called "output_tmp" is created automatically, and the per-page PDF files are saved in it.
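The folder layout described above can be sketched with `os.path.join` and `os.makedirs`. This is an illustrative variant rather than the code used in this article: the helper name `prepare_dirs` is hypothetical, and it assumes the "input" folder is prepared by hand while the other folders are created on demand.

```python
import os


def prepare_dirs(base_dir):
    """Create the working folders under base_dir and return their paths."""
    dirs = {
        "input": os.path.join(base_dir, "input"),
        "input_error": os.path.join(base_dir, "input_error"),
        "output": os.path.join(base_dir, "output"),
        "output_tmp": os.path.join(base_dir, "output_tmp"),
    }
    for key, path in dirs.items():
        if key != "input":  # the input folder is prepared by hand
            # exist_ok=True makes repeated runs safe
            os.makedirs(path, exist_ok=True)
    return dirs
```

Using `os.path.join` avoids hand-written `"\\"` separators, so the same sketch should also work outside Windows.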


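#Imports used by this excerpt
import os
import glob
from shutil import copyfile
from PyPDF2 import PdfFileReader
from PyPDF2 import PdfFileWriter

if __name__ == "__main__":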

    
    #Get the current directory
    current_dir = os.getcwd()
    #Get input directory
    indir = current_dir + "\\input\\"
    indir_error = current_dir + "\\input_error\\"
    
    #Get output directory
    outdir = current_dir + "\\output\\"
    outdir_tmp = current_dir + "\\output_tmp\\"
    
    #Create the output folders if they do not exist
    if not os.path.isdir(outdir):
        os.mkdir(outdir)

    if not os.path.isdir(outdir_tmp):
        os.mkdir(outdir_tmp)
    
    if not os.path.isdir(indir_error):
        os.mkdir(indir_error)
    
    #Collect every file in the input directory
    all_images = glob.glob(indir + "*")
   
    #Process each input file
    for img in all_images:
        
        #Get the file name of the input PDF (with extension)
        basename = os.path.basename(img)
        print(basename)
        
        #Separate the extension from the file name
        name,ext = os.path.splitext(basename)
        
        try:
            #images = convert_from_path(img)
            with open(img,"rb") as pdf_file: #The variable is called img, but it holds the path of an input PDF file
                source = PdfFileReader(pdf_file)
            
                num_pages = source.getNumPages()
                
                #A single-page PDF is copied to output_tmp as is; move on to the next file
                if num_pages == 1:
                    copyfile(img,outdir_tmp + basename )
                    continue
            
            
                #Write each page to its own file named <name>_<page number>.pdf
                for i in range(0,num_pages):
                    file_object = source.getPage(i)
                    pdf_file_name = outdir_tmp + name + "_" + str(i+1) + ".pdf"

                    pdf_file_writer = PdfFileWriter()

                
                    with open(pdf_file_name, 'wb') as f:

                        pdf_file_writer.addPage(file_object)
                        pdf_file_writer.write(f)                    

        #If the PDF cannot be read or split, copy it to input_error
        except Exception:
            print(basename + " *** error ***")
            copyfile(img,indir_error + basename )
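The splitting step above can also be packaged as a small function, which makes the naming rule `<name>_<page number>.pdf` easy to check on its own. This is a sketch under assumptions, not a replacement for the code above: the names `page_pdf_path` and `split_pdf` are hypothetical, and it assumes a PyPDF2 release earlier than 3.0, where `PdfFileReader` and `PdfFileWriter` are still available.

```python
import os
from shutil import copyfile


def page_pdf_path(out_dir, pdf_path, page_number):
    """Build the output path <name>_<page_number>.pdf inside out_dir."""
    name, _ext = os.path.splitext(os.path.basename(pdf_path))
    return os.path.join(out_dir, "{}_{}.pdf".format(name, page_number))


def split_pdf(pdf_path, out_dir):
    """Split pdf_path into one-page PDFs in out_dir and return their paths."""
    # Imported here so the naming helper works without PyPDF2 installed.
    # Assumes PyPDF2 earlier than 3.0, which still provides these names.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    written = []
    with open(pdf_path, "rb") as pdf_file:
        source = PdfFileReader(pdf_file)
        num_pages = source.getNumPages()
        if num_pages == 1:
            # A single-page PDF is copied unchanged
            target = os.path.join(out_dir, os.path.basename(pdf_path))
            copyfile(pdf_path, target)
            return [target]
        for i in range(num_pages):
            # One writer per page, so each output file holds exactly one page
            writer = PdfFileWriter()
            writer.addPage(source.getPage(i))
            target = page_pdf_path(out_dir, pdf_path, i + 1)
            with open(target, "wb") as f:
                writer.write(f)
            written.append(target)
    return written
```

Returning the list of written files lets the caller pass exactly those files to the conversion step instead of globbing the whole folder again.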



What I want to do ②: Convert the PDFs to JPEG images

This step is done by GhostScript. The PDF files that were split into pages in step ① are picked up one by one and converted to JPEG. The flow is to build the command first and then run it with the subprocess module.


    print("***** pdf to jpg *****")

    ######################################
    #
    # Settings for PDF-to-JPEG conversion (GhostScript)
    #
    ######################################
        
    exe_name = r'C:\Program Files\gs\gs9.50\bin\gswin64c.exe'
    exe_option = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=100 -dQFactor=1.0 -dDisplayFormat=16 -r600" 
    exe_outfile = "-sOutputFile="   

    #Collect the per-page PDFs in output_tmp
    all_pdfs = glob.glob(outdir_tmp + "*")
    
    for pdf in all_pdfs:      
    
        #Get the file name of the page PDF (with extension)
        basename = os.path.basename(pdf)
        print(basename)

        #Separate the extension from the file name
        name,ext = os.path.splitext(basename)

        #No trailing space after ".jpg": it would become part of the file name
        outputfile = outdir + name + ".jpg"

        cmd_list = list()
        cmd_list.append(exe_name)
        cmd_list.extend(exe_option.split()) #extend, not append: each option becomes its own element
        cmd_list.append(exe_outfile + outputfile)
        cmd_list.append(pdf)
        
        subprocess.call(cmd_list)
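The "build the command, then run it" flow can be separated into two small functions, so the argument list can be inspected without launching GhostScript. This is a minimal sketch: the function names and the `dpi` and `quality` parameters are hypothetical, and only a subset of the options from the code above is included.

```python
import subprocess


def build_gs_command(exe_path, pdf_path, jpg_path, dpi=600, quality=100):
    """Return the GhostScript argument list for one PDF-to-JPEG conversion."""
    return [
        exe_path,
        "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=jpeg",
        "-dJPEGQ={}".format(quality),
        "-r{}".format(dpi),
        # One list element, so paths containing spaces need no extra quoting
        "-sOutputFile={}".format(jpg_path),
        pdf_path,
    ]


def convert_pdf_to_jpeg(exe_path, pdf_path, jpg_path):
    """Run GhostScript and return its exit code (0 means success)."""
    return subprocess.call(build_gs_command(exe_path, pdf_path, jpg_path))
```

Because the command is passed as a list, each element reaches GhostScript as a single argument. As a side note, GhostScript also accepts a printf-style pattern such as `page_%03d.jpg` in `-sOutputFile` to write one image per page, which may be worth trying as an alternative to splitting the PDF beforehand.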



The entire source code


import os
import glob

from PyPDF2 import PdfFileReader
from PyPDF2 import PdfFileWriter

import subprocess
from shutil import copyfile 

"""
Main
"""
if __name__ == "__main__":

    
    #Get the current directory
    current_dir = os.getcwd()
    #Get input directory
    indir = current_dir + "\\input\\"
    indir_error = current_dir + "\\input_error\\"
    
    #Get output directory
    outdir = current_dir + "\\output\\"
    outdir_tmp = current_dir + "\\output_tmp\\"
    
    #Create the output folders if they do not exist
    if not os.path.isdir(outdir):
        os.mkdir(outdir)

    if not os.path.isdir(outdir_tmp):
        os.mkdir(outdir_tmp)
    
    if not os.path.isdir(indir_error):
        os.mkdir(indir_error)
    
    #Collect every file in the input directory
    all_images = glob.glob(indir + "*")
   
    #Process each input file
    for img in all_images:
        
        #Get the file name of the input PDF (with extension)
        basename = os.path.basename(img)
        print(basename)
        
        #Separate the extension from the file name
        name,ext = os.path.splitext(basename)
        
        try:
            #images = convert_from_path(img)
            with open(img,"rb") as pdf_file: #The variable is called img, but it holds the path of an input PDF file
                source = PdfFileReader(pdf_file)
            
                num_pages = source.getNumPages()
                
                #A single-page PDF is copied to output_tmp as is; move on to the next file
                if num_pages == 1:
                    copyfile(img,outdir_tmp + basename )
                    continue
            
            
                #Write each page to its own file named <name>_<page number>.pdf
                for i in range(0,num_pages):
                    file_object = source.getPage(i)
                    pdf_file_name = outdir_tmp + name + "_" + str(i+1) + ".pdf"

                    pdf_file_writer = PdfFileWriter()

                
                    with open(pdf_file_name, 'wb') as f:

                        pdf_file_writer.addPage(file_object)
                        pdf_file_writer.write(f)                    

        #If the PDF cannot be read or split, copy it to input_error
        except Exception:
            print(basename + " *** error ***")
            copyfile(img,indir_error + basename )

    print("***** pdf to jpg *****")

    ######################################
    #
    # Settings for PDF-to-JPEG conversion (GhostScript)
    #
    ######################################
        
    exe_name = r'C:\Program Files\gs\gs9.50\bin\gswin64c.exe'
    exe_option = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=100 -dQFactor=1.0 -dDisplayFormat=16 -r600" 
    exe_outfile = "-sOutputFile="   

    #Collect the per-page PDFs in output_tmp
    all_pdfs = glob.glob(outdir_tmp + "*")
    
    for pdf in all_pdfs:      
    
        #Get the file name of the page PDF (with extension)
        basename = os.path.basename(pdf)
        print(basename)

        #Separate the extension from the file name
        name,ext = os.path.splitext(basename)

        #No trailing space after ".jpg": it would become part of the file name
        outputfile = outdir + name + ".jpg"

        cmd_list = list()
        cmd_list.append(exe_name)
        cmd_list.extend(exe_option.split()) #extend, not append: each option becomes its own element
        cmd_list.append(exe_outfile + outputfile)
        cmd_list.append(pdf)
        
        subprocess.call(cmd_list)



In closing

Because I wrote the source code quickly just to get done what I wanted, it turned out to be quite messy lol (it's no laughing matter)

In addition, some PDFs could not be converted to JPEG properly even with GhostScript. The cause is not yet clear, but after a little research it seems that the fonts are not supported. Depending on the response to this article, I will investigate a little further.
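One way to start investigating such failures is to record GhostScript's exit code and console messages for each file. The sketch below is generic and not specific to GhostScript: the function name `run_and_capture` is hypothetical, and what the messages actually say about fonts is not something verified here.

```python
import subprocess


def run_and_capture(cmd_list):
    """Run a command and return (exit_code, combined stdout and stderr text)."""
    result = subprocess.run(
        cmd_list,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge error messages into stdout
        universal_newlines=True,    # decode bytes to str (works on Python 3.6)
    )
    return result.returncode, result.stdout
```

Used in place of `subprocess.call(cmd_list)`, this returns the same exit code plus the text that GhostScript printed, which can be written to a log next to the PDFs that failed.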

If you have any questions, please leave a comment.
