Convert PDFs to images in bulk with Python

Goal of this article

Convert PDF files to images (PNG), one image file per page. The intended use is as a pre-processing step for verifying business form output, so multiple PDF files are processed in one run.

Things to prepare

- An environment where Python 3 runs. This article uses Python 3.8.1 (Windows 64-bit).
- Poppler: open-source command-line tools for working with PDFs.
- pdf2image: a wrapper module that makes Poppler available from Python.

Install Poppler

Official site (source): https://poppler.freedesktop.org/

Binaries for Windows are available here: http://blog.alivate.com.au/poppler-windows/

The installation procedure is summarized on this site: http://pdf-file.nnn2.com/?p=863
Be sure to also install the language files covered in the latter half of that explanation; without them, Japanese file names will be garbled.

Added on January 6, 2020

cv2.imread() and cv2.imwrite() cannot handle non-ASCII file names, so if you post-process the images with OpenCV-Python, the file names must be changed to ASCII characters. URL-encoding the name, as in base = urllib.parse.quote(pdf_file.stem), works, but the result is not human-readable.

If renaming the original data is difficult, the following article describes another workaround: "About dealing with problems when handling file paths including Japanese in Python OpenCV cv2.imread and cv2.imwrite" https://qiita.com/SKYS/items/cbde3775e2143cad745
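The URL-encoding workaround mentioned above can be sketched as follows. This is a minimal illustration using only the standard library; the helper names ascii_stem and original_stem are hypothetical and do not appear in the original script. Decoding with urllib.parse.unquote recovers the original name whenever a readable label is needed.

```python
import pathlib
import urllib.parse


def ascii_stem(pdf_file: pathlib.Path) -> str:
    """Return the file name without extension, URL-encoded to ASCII only."""
    # Spaces and non-ASCII characters are replaced by %XX escapes.
    return urllib.parse.quote(pdf_file.stem)


def original_stem(encoded: str) -> str:
    """Reverse ascii_stem() to get the human-readable name back."""
    return urllib.parse.unquote(encoded)


# Hypothetical input file with a Japanese name.
pdf_file = pathlib.Path('in_pdf') / '帳票.pdf'
base = ascii_stem(pdf_file)
print(base)                 # ASCII-only name, safe to pass to OpenCV
print(original_stem(base))  # the original name, for display
```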

Install pdf2image

pip install pdf2image

GitHub repository: https://github.com/Belval/pdf2image

Python code

pdf2img.py


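import pathlib

import pdf2image

# Input PDFs are read from ./in_pdf; images are written to ./out_img.
pdf_files = pathlib.Path('in_pdf').glob('*.pdf')
img_dir = pathlib.Path('out_img')
# Create the output folder if it is missing; image.save() does not create it.
img_dir.mkdir(exist_ok=True)

for pdf_file in pdf_files:
    base = pdf_file.stem  # file name without the .pdf extension
    # convert_from_path returns one PIL image per page.
    images = pdf2image.convert_from_path(pdf_file, grayscale=True, size=640)
    for index, image in enumerate(images):
        # Pages are numbered from 1: {PDF filename}-{page}.png
        image.save(img_dir/pathlib.Path(base + '-{}.png'.format(index + 1)),
                   'png')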

The script is simple: it reads each PDF file in the in_pdf folder under the current directory and writes {PDF filename}-{page}.png to the out_img folder.

Example: Some form.pdf → Some form-1.png, Some form-2.png
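To preview the names before running a conversion, the naming rule can be reproduced with pathlib alone. This is only a sketch: output_paths is a hypothetical helper, and the page count is passed in by hand rather than read from the PDF.

```python
import pathlib


def output_paths(pdf_file, page_count, img_dir=pathlib.Path('out_img')):
    """List the image paths that would be written for one PDF."""
    base = pathlib.Path(pdf_file).stem
    # Page numbers start at 1, matching index + 1 in the script.
    return [img_dir / '{}-{}.png'.format(base, page)
            for page in range(1, page_count + 1)]


for path in output_paths('in_pdf/Some form.pdf', 2):
    print(path.name)
```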

The image conversion parameters are set in this call:

images = pdf2image.convert_from_path(pdf_file, grayscale=True, size=640)

- grayscale=True outputs grayscale images. With grayscale=False, or with the argument omitted, the output is in color.
- size=n scales the output to fit within an n × n pixel square. If size is not specified, the image size is determined by the DPI value.
- dpi=n sets the DPI value (the default is 200 DPI). If size is also specified, size takes priority.

Many other settings are available, but these are enough to get started.
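To get a feel for how dpi and size affect the output, the arithmetic can be worked through for an A4 page. The sketch below is an approximation, not pdf2image code: the function names are hypothetical, the A4 dimensions are rounded to whole points, and Poppler's own rounding may differ by a pixel.

```python
def pixels_at_dpi(width_pt, height_pt, dpi=200):
    """Approximate pixel size of a page rendered at the given DPI.

    PDF page sizes are measured in points (1 point = 1/72 inch).
    """
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


def fit_in_square(width_px, height_px, n):
    """Scale (width, height) so that the longer side becomes n pixels."""
    scale = n / max(width_px, height_px)
    return round(width_px * scale), round(height_px * scale)


A4_PT = (595, 842)  # A4 page size in points, rounded

full = pixels_at_dpi(*A4_PT, dpi=200)
print(full)                       # size with the default dpi and no size
print(fit_in_square(*full, 640))  # size when size=640 is given
```

With these assumptions, an A4 page is roughly 1653 × 2339 pixels at 200 DPI and roughly 452 × 640 pixels with size=640, which is why the size setting produces much smaller files.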

To change the image format, replace

image.save(img_dir/pathlib.Path(base + '-{}.png'.format(index + 1)), 'png')

with

image.save(img_dir/pathlib.Path(base + '-{}.jpg'.format(index + 1)), 'jpeg')

and the pages will be output in JPEG format.
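If the format needs to be switched often, the extension and the format name can be kept together so they never get out of sync. This is a sketch only: FORMATS and save_page are hypothetical names, and the image argument is assumed to be a PIL image such as those returned by pdf2image.convert_from_path.

```python
import pathlib

# File extension and Pillow format name for each supported choice.
FORMATS = {
    'png': ('.png', 'PNG'),
    'jpeg': ('.jpg', 'JPEG'),
}


def save_page(image, img_dir, base, page, fmt='png'):
    """Save one page image as {base}-{page}{ext} and return its path."""
    ext, pil_format = FORMATS[fmt]
    out_path = pathlib.Path(img_dir) / '{}-{}{}'.format(base, page, ext)
    # image is assumed to provide save(path, format), as PIL images do.
    image.save(out_path, pil_format)
    return out_path
```

Inside the page loop of the script, the call would be save_page(image, img_dir, base, index + 1, fmt='jpeg').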
