[PYTHON] Convert garbled scanned images to PDF with Pillow and PyPDF

Introduction

I sometimes zip up image data scanned at the office, and when I take it home and unzip it, the file names come out garbled for some reason... Has that ever happened to you? **I have! (half angry)** Oddly enough, only the serial number survives without being garbled, so the order of the files can usually still be worked out (for example, æ–‡å—000.jpg and æ–‡ås001.jpg). If I could strip just the garbled part from each file name and leave clean serial-numbered files, they would look tidier and be easier to work with later. With about 100 pages of data, fixing the names by hand would be a hassle, so I decided to write a quick Python script. While I was at it, I also made it possible to convert the image data into a single PDF in one step.

For that, I used **Pillow** for image processing and **PyPDF2** for PDF conversion, so I would like to explain those here.

The script is published on GitHub. Please see the README for how to use it.

Introduction of functions

It's a messy script because I wrote it in a hurry, but it consists of the following functions.

I will briefly introduce them below. Feel free to skip this part, since it is somewhat redundant.

name2number(folderpass, digits, extension): Takes the path of the folder containing the image data as an argument and, for the images in it (only those with the specified extension), renames each file to a serial number taken from the non-garbled part of its name (the 000 in æ–‡å—000.jpg). The os module is used to rename the files, and the re module handles the regular expressions. The return value is a list of the file names for which renaming was skipped.
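The core of this step, pulling the surviving serial number out of a garbled name, could look roughly like the sketch below. It is an illustration rather than the script's actual code: the helper name serial_name is hypothetical, and it assumes the serial number is a run of exactly `digits` digits at the end of the name, just before the extension.

```python
import os
import re


def serial_name(filename, digits, extension):
    """Return '<serial><extension>' for a garbled name, or None if it does not match.

    Hypothetical helper for illustration; not part of the original script.
    """
    stem, ext = os.path.splitext(filename)
    if ext.lower() != extension.lower():
        return None
    # Accept only a run of exactly `digits` digits at the end of the stem.
    match = re.search(r"(?<!\d)(\d{%d})$" % digits, stem)
    if match is None:
        return None
    return match.group(1) + ext.lower()


garbled = ["æ–‡å—000.jpg", "æ–‡ås001.jpg", "notes.txt"]
renamed = [serial_name(name, 3, ".jpg") for name in garbled]
print(renamed)
```

Passing each old name and its result to os.rename() would perform the actual rename. Names for which the helper returns None, or whose target name already exists, are the kind of entries that would end up in a list of skipped files.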

changeNameHand(existfiles): name2number returns a list of the files whose renaming was skipped because the serial-number name was already taken. This function receives that list as an argument, interactively asks for a new name for each file via standard input/output, and renames it. The os module is used to rename the files, and the re module handles the regular expressions. In addition, the Pillow (PIL) module is used to display each image so that it can be checked.

addstr(before, after, digits, extension): Attaches the string given in the before argument in front of the serial number and the string given in the after argument behind it. I use it when a bare serial number feels too plain. The os module is used to rename the files, and the re module handles the regular expressions.
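As a rough sketch of that idea, the new name can be computed with a single regular expression. The helper name add_affixes is hypothetical, and the sketch assumes the file is already named as a bare serial number plus extension.

```python
import re


def add_affixes(filename, before, after, digits, extension):
    """Wrap the serial number of '<serial><extension>' with a prefix and a suffix.

    Hypothetical helper for illustration; returns None for non-matching names.
    """
    # re.escape keeps the dot in the extension from acting as a wildcard.
    pattern = r"(\d{%d})%s" % (digits, re.escape(extension))
    match = re.fullmatch(pattern, filename)
    if match is None:
        return None
    return before + match.group(1) + after + extension


print(add_affixes("000.jpg", "report_", "_scan", 3, ".jpg"))
```

This prints report_000_scan.jpg. The actual rename would again be a call to os.rename() with the old and new names.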

image2pdf(filename, digits, extension, removeflug): Uses Pillow and PyPDF2 to combine the image data in the current directory into a single PDF. I will explain it in detail later.

makeZip(filename, flug): Zips the data in the current directory, using the zipfile module.
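Zipping a folder with the zipfile module might look like the sketch below. The helper name zip_directory is hypothetical. Unlike the original function it takes the directory explicitly instead of using the current directory, and the meaning of the flug argument is not reproduced here.

```python
import os
import tempfile
import zipfile


def zip_directory(directory, zip_path):
    """Zip the regular files directly inside `directory` (no subfolders).

    Hypothetical helper for illustration; returns the archived file names.
    """
    names = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            # arcname keeps only the file name, so the archive has no folder prefix.
            archive.write(os.path.join(directory, name), arcname=name)
    return names


# Demonstrate on a throwaway directory so that no real files are touched.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "scans")
    os.mkdir(source)
    for name in ("000.jpg", "001.jpg"):
        with open(os.path.join(source, name), "wb") as f:
            f.write(b"dummy image bytes")
    target = os.path.join(workdir, "scans.zip")
    archived = zip_directory(source, target)
    with zipfile.ZipFile(target) as archive:
        listed = sorted(archive.namelist())
    print(archived, listed)
```

Writing the archive outside the folder being zipped keeps the zip file from trying to include itself.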

main(): Gets the command-line arguments, using the argparse module.
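A command-line interface for a script like this could be set up with argparse roughly as follows. All option names here are assumptions made for illustration; the real script's options are described in its README and may differ.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical options; the real script's flags may differ."""
    parser = argparse.ArgumentParser(
        description="Rename scanned images by serial number and bundle them into a PDF."
    )
    parser.add_argument("folder", help="folder that contains the scanned images")
    parser.add_argument("-d", "--digits", type=int, default=3,
                        help="number of digits in the serial number")
    parser.add_argument("-e", "--extension", default=".jpg",
                        help="image extension to process")
    parser.add_argument("-p", "--pdf", metavar="FILENAME",
                        help="also write the images to this PDF file")
    parser.add_argument("-r", "--remove", action="store_true",
                        help="delete the source images after conversion")
    return parser


# Parse an explicit list instead of sys.argv so the example runs anywhere.
args = build_parser().parse_args(["scans", "-d", "4", "--pdf", "out.pdf"])
print(args.folder, args.digits, args.extension, args.pdf, args.remove)
```

In a real main(), calling parse_args() with no arguments reads sys.argv, and the parsed values would then be handed to functions such as name2number and image2pdf.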

Convert images to PDF with Pillow and PyPDF

Excerpt of image2pdf() from the script

fileName2SerialNumber.py


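import os
import re
import sys

import PIL.Image
# Legacy PyPDF2 class names, as used when this script was written
from PyPDF2 import PdfFileReader, PdfFileWriter


def image2pdf(filename, digits, extension, removeflug):
    u"""
    Convert the image files in the current directory into a single PDF.
    """
    if extension not in (".jpg", ".png", ".gif"):  # Reject anything that is not an image
        print("Unsupported image file! Please use jpg, png, or gif")
        sys.exit(1)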
    
    fileWriter = PdfFileWriter()
    files = os.listdir()
    ext = re.compile(re.escape(extension) + "$")  # Match the extension only at the end of the name
    files.sort()
    count = 1
    removefiles = []
    for file in files:
        num = re.search('\\d{' + str(digits) +'}', file)
        if (num is None):
            pass
        else:
            if (ext.search(file) and num):
                image = PIL.Image.open(file)
                pdfFile = str(file).replace(extension, ".pdf")
                image.save(pdfFile, "PDF", resolution = 100.0)
                with open(pdfFile, "rb") as f:
                    fileReader = PdfFileReader(f)
                    pageNum = fileReader.getNumPages()
                    for i in range(pageNum):
                        fileWriter.addPage(fileReader.getPage(i))
                        print("Wrote %s to page %d" % (str(file), count))
                        count += 1
                    removefiles.append(pdfFile)
                    if (removeflug):
                        removefiles.append(file)
                    with open(filename, "wb") as outputs:
                        fileWriter.write(outputs)
    print("-------------------------------------------------------")
    print("Finished writing! File name: %s \n" % filename)
    
    # Delete the intermediate PDFs (and the source images if requested)
    for file in removefiles:
        os.remove(file)
        print("Deleted file %s" % str(file))
    
    print("--------------------------------------------------------")
    print("Done! Output: %s" % filename)
    return None

Working with Pillow (PIL)

Pillow is a module for image processing in Python. You can install it with pip.

To open an image file:

python


image = PIL.Image.open(file)

This time I want to save the image as a PDF, so I write it out in PDF format:

python


image.save(pdfFile, "PDF", resolution = 100.0)

This saves the image in PDF format. Changing the "PDF" part to another format name saves the image in that format instead, and the resolution (DPI) can be set to a value other than 100. The file name passed to save() should carry the .pdf extension, so the script derives it from the image file name beforehand:

python


pdfFile = str(file).replace(extension, ".pdf")
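The two steps above, deriving the .pdf name and saving, can be put together in a short sketch. The helper name image_to_pdf is hypothetical. It uses os.path.splitext instead of str.replace so that only the final extension is changed, and it converts the image to RGB first, on the assumption that losing transparency is acceptable, because Pillow's PDF writer rejects some modes such as RGBA.

```python
import os
import tempfile

from PIL import Image


def image_to_pdf(image_path, resolution=100.0):
    """Save one image as a single-page PDF next to it and return the PDF path.

    Hypothetical helper for illustration; not part of the original script.
    """
    pdf_path = os.path.splitext(image_path)[0] + ".pdf"
    with Image.open(image_path) as image:
        # Assumption: converting to RGB is acceptable. Pillow's PDF writer
        # rejects some modes (for example RGBA), so this avoids that error.
        image.convert("RGB").save(pdf_path, "PDF", resolution=resolution)
    return pdf_path


# Demonstrate on a generated image in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "000.png")
    Image.new("RGBA", (200, 100), (255, 0, 0, 255)).save(source)
    result = image_to_pdf(source)
    with open(result, "rb") as f:
        header = f.read(4)
    print(os.path.basename(result), header)
```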

Also, although it is not used in this function, an image opened with PIL.Image.open() can be displayed with:

python


image.show()

Working with PyPDF2

PyPDF2 is a convenient module that can combine multiple PDF files into one and extract text data from PDFs. You can install it with pip.

First, create an instance to write the file.

python


fileWriter = PdfFileWriter()

The instance for reading a PDF file is created with:

python


fileReader = PdfFileReader(open(pdfFile, "rb"))

Since this involves opening a file, I think it is less trouble to use a with statement:

python


with open(pdfFile, "rb") as f:
    fileReader = PdfFileReader(f)

To check the total number of pages in the PDF that was read:

python


pageNum = fileReader.getNumPages()

The number of pages is stored in pageNum.

To add a PDF page:

python


fileWriter.addPage(fileReader.getPage(pageNumber))

getPage(pageNumber) returns the page object for the given page number (counting from 0), and addPage() appends that page object to the end of the output.

When the editing is finished, export the result as a PDF file:

python


fileWriter.write(open(filename, "wb"))

Since this uses the open function, I think it is less trouble to use a with statement:

python


with open(filename, "wb") as outputs:
    fileWriter.write(outputs)
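As an aside, Pillow on its own can also write several images into one multi-page PDF, which would avoid the intermediate per-image PDF files. This is a different technique from the PyPDF2 approach described in this article, shown only as a sketch. The in-memory images stand in for the scanned files, and the output goes to a BytesIO buffer instead of a file.

```python
import io

from PIL import Image

# Stand-ins for the scanned pages; real code would use Image.open() on each file.
pages = [Image.new("RGB", (200, 100), color) for color in ("red", "green", "blue")]

buffer = io.BytesIO()
# save_all=True writes the first image plus everything in append_images,
# one image per PDF page.
pages[0].save(buffer, format="PDF", resolution=100.0,
              save_all=True, append_images=pages[1:])

pdf_bytes = buffer.getvalue()
print(pdf_bytes[:4], len(pdf_bytes) > 0)
```

PyPDF2 is still the better fit when existing PDF files have to be merged, since Pillow only writes PDFs from images.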

Conclusion

Use Pillow and PyPDF2 to improve work efficiency!
