Introduction

Have you ever zipped up image data scanned at the office, taken it home, unzipped it, and found the file names garbled for some reason? **I have!** (And I was half fuming about it.) Surprisingly, though, the serial number in each name usually survives, so the order of the files can still be worked out (for example, æ–‡å—000.jpg, æ–‡ås001.jpg). If the garbled part could be stripped from each file name to leave a clean serial number, the files would look tidier and be easier to work with later. I had about 100 scans, too many to rename by hand, so I wrote a quick Python script. While I was at it, I also made it possible to combine the images into a single PDF in one step.

For this I used **Pillow** for image processing and **PyPDF2** for PDF conversion, so this article explains how I used them.

The script is available on GitHub. Please see the README for how to use it.

Introduction of functions

I wrote the script in a hurry, so it is messy, but it consists of the following functions:

name2number(folderpass, digits, extension)
changeNameHand(existfiles)
addstr(before, after, digits, extension)
image2pdf(filename, digits, extension, removeflug)
makeZip(filename, flug)
main()

I will briefly introduce each of them below. Feel free to skip this part, as it is somewhat redundant.

name2number(folderpass, digits, extension)
Takes the path of the folder that contains the image data and renames the images in it (only those with the specified extension) to serial numbers taken from the non-garbled part of each name (the 000 in æ–‡å—000.jpg). It uses the os module to rename files and the re module for regular expressions. The return value is an array of the file names whose renaming was skipped.
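The renaming step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `rename_to_serial`, the sample file names, and the rule "skip when the target name already exists" are assumptions made for the example.

```python
import os
import re
import tempfile


def rename_to_serial(folder, digits, extension):
    """Rename files such as 'xx000.jpg' to '000.jpg' and return the skipped names."""
    # Serial number of exactly `digits` digits directly before the extension
    pattern = re.compile(r"(\d{%d})%s$" % (digits, re.escape(extension)))
    skipped = []
    for name in sorted(os.listdir(folder)):
        match = pattern.search(name)
        if match is None:
            continue  # not a target file
        new_name = match.group(1) + extension
        if new_name == name:
            continue  # already a clean serial number
        if os.path.exists(os.path.join(folder, new_name)):
            skipped.append(name)  # serial number collides, leave it for manual fixing
            continue
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
    return skipped


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        for name in ("scan_a000.jpg", "scan_b001.jpg", "other_001.jpg", "notes.txt"):
            open(os.path.join(folder, name), "w").close()
        print(rename_to_serial(folder, 3, ".jpg"))  # ['scan_b001.jpg']
        print(sorted(os.listdir(folder)))
```

Returning the skipped names instead of raising an error keeps the batch going, and leaves the collisions to be resolved by hand afterwards.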

changeNameHand(existfiles)
name2number returns an array of the file names whose renaming was skipped because their serial number collided with another file. This function takes that array as its argument, asks interactively on standard input/output for a new name for each file, and renames it. It uses the os module to rename files and the re module for regular expressions. It also uses the Pillow (PIL) module to display each image so that you can check it.

addstr(before, after, digits, extension)
Adds the string given in the before argument in front of the serial number and the string given in the after argument behind it. I use it when a bare serial number looks too plain. It uses the os module to rename files and the re module for regular expressions.
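As a rough sketch of the idea, the new name can be built with a regular expression like this. The helper `add_affixes` is hypothetical and only computes the new name; the actual renaming would be done by passing the result to `os.rename`.

```python
import re


def add_affixes(name, before, after, digits, extension):
    """Return `name` with `before` and `after` wrapped around its serial number."""
    pattern = re.compile(r"^(\d{%d})%s$" % (digits, re.escape(extension)))
    match = pattern.match(name)
    if match is None:
        return name  # not a plain serial-number file, leave it unchanged
    return before + match.group(1) + after + extension


print(add_affixes("007.jpg", "scan_", "_v1", 3, ".jpg"))  # scan_007_v1.jpg
print(add_affixes("notes.txt", "scan_", "_v1", 3, ".jpg"))  # notes.txt
```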

image2pdf(filename, digits, extension, removeflug)
Uses Pillow and PyPDF2 to combine the image data in the current directory into a single PDF. It is explained in detail later.

makeZip(filename, flug)
Zips the data in the current directory. It uses the zipfile module.
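Zipping a folder with the zipfile module might look like the sketch below. The function name `zip_directory` and its parameters are assumptions for illustration; the role of the flug argument is not described here, so it is left out.

```python
import os
import tempfile
import zipfile


def zip_directory(folder, zip_path):
    """Write every regular file directly inside `folder` into `zip_path`."""
    target = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(folder)):
            full_path = os.path.join(folder, name)
            # Skip subfolders and the archive itself if it lives in the same folder
            if os.path.isfile(full_path) and os.path.abspath(full_path) != target:
                archive.write(full_path, arcname=name)
    return zip_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        for name in ("000.jpg", "001.jpg"):
            with open(os.path.join(folder, name), "w") as f:
                f.write("dummy")
        archive_path = zip_directory(folder, os.path.join(folder, "scans.zip"))
        with zipfile.ZipFile(archive_path) as archive:
            print(archive.namelist())  # ['000.jpg', '001.jpg']
```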

main()
Parses the command line arguments, using the argparse module.
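For readers who have not used argparse, a command line interface for a tool like this could be declared roughly as follows. All option names here are made up for the example; the script's real options are the ones described in its README.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical options for a rename-and-convert tool."""
    parser = argparse.ArgumentParser(description="Rename scanned images to serial numbers")
    parser.add_argument("folder", help="folder that contains the images")
    parser.add_argument("-d", "--digits", type=int, default=3, help="digits in the serial number")
    parser.add_argument("-e", "--extension", default=".jpg", help="image extension to process")
    parser.add_argument("--pdf", metavar="FILE", help="combine the images into this PDF")
    return parser


# Passing a list to parse_args() makes the example independent of the real command line
args = build_parser().parse_args(["scans", "-d", "4", "--pdf", "out.pdf"])
print(args.folder, args.digits, args.extension, args.pdf)  # scans 4 .jpg out.pdf
```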

Convert images to PDF with Pillow and PyPDF

image2pdf() script excerpt

`fileName2SerialNumber.py`


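# Imports needed by this excerpt
import os
import re
import sys

import PIL.Image
from PyPDF2 import PdfFileReader, PdfFileWriter


def image2pdf(filename, digits, extension, removeflug):
    """
    Function to convert image files to PDF
    """
    if extension not in (".jpg", ".png", ".gif"):  # Reject anything that is not an image
        print("Unsupported image format! Please use jpg, png, or gif")
        sys.exit(1)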
    
    fileWriter = PdfFileWriter()
    files = sorted(os.listdir())
    # Escape the extension so that "." is literal, and match it only at the end of the name
    ext = re.compile(re.escape(extension) + "$")
    count = 1
    removefiles = []
    for file in files:
        # Only process files that contain a serial number and have the target extension
        num = re.search(r"\d{" + str(digits) + "}", file)
        if num is None or not ext.search(file):
            continue
        # Convert the image to a one-page PDF next to it
        image = PIL.Image.open(file)
        pdfFile = str(file).replace(extension, ".pdf")
        image.save(pdfFile, "PDF", resolution=100.0)
        with open(pdfFile, "rb") as f:
            fileReader = PdfFileReader(f)
            pageNum = fileReader.getNumPages()
            for i in range(pageNum):
                fileWriter.addPage(fileReader.getPage(i))
                print("Wrote %s to page %d" % (str(file), count))
                count += 1
            removefiles.append(pdfFile)
            if removeflug:
                removefiles.append(file)
            # Write the output while the source PDF is still open
            with open(filename, "wb") as outputs:
                fileWriter.write(outputs)
    print("-------------------------------------------------------")
    print("Finished writing! File name: %s \n" % filename)

    for file in removefiles:
        os.remove(file)
        print("Deleted file %s" % str(file))

    print("--------------------------------------------------------")
    print("Done! Output: %s" % filename)
    return None
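When checking an extension against several allowed values, a membership test is one straightforward option. The short sketch below (the names `ALLOWED_EXTENSIONS` and `is_supported` are illustrative) also shows why a chained expression such as `extension != ".jpg" and ".png" and ".gif"` does not do the same thing: only the first literal is compared, and the remaining non-empty strings are simply truthy.

```python
ALLOWED_EXTENSIONS = (".jpg", ".png", ".gif")


def is_supported(extension):
    """Return True when `extension` is one of the allowed image extensions."""
    return extension.lower() in ALLOWED_EXTENSIONS


print(is_supported(".png"))  # True
print(is_supported(".bmp"))  # False

# The chained form compares only against ".jpg"; the other literals are just truthy values
print(".png" != ".jpg" and ".png" and ".gif")  # .gif  (truthy, so ".png" would be rejected)
print(".jpg" != ".jpg" and ".png" and ".gif")  # False (only ".jpg" passes the check)
```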

Pillow (PIL) work

Pillow is a module for image processing in Python. You can install it with pip.

To open an image file, write:

`python`


image = PIL.Image.open(file)

This time I want to save the image as a PDF. To do that, write:

`python`


image.save(pdfFile, "PDF", resolution = 100.0)

This saves the image in PDF format. You can also save in another image format by changing "PDF" to a different format name, or adjust the resolution by setting resolution to a value other than 100. One point to note: when converting to PDF, it seems that the file name has to be given the .pdf extension first, so the script builds that name from the image file name like this:

`python`


pdfFile = str(file).replace(extension, ".pdf")
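One thing to be aware of is that `str.replace` swaps every occurrence of the extension text, not just the one at the end of the name. For the serial-number names this script produces that makes no difference, but a variant based on `os.path.splitext` only touches the final extension. The helper name `to_pdf_name` is made up for this sketch.

```python
import os


def to_pdf_name(image_name):
    """Swap only the final extension of `image_name` for '.pdf'."""
    stem, _extension = os.path.splitext(image_name)
    return stem + ".pdf"


print(to_pdf_name("000.jpg"))                      # 000.pdf
print(to_pdf_name("scan.jpg.000.jpg"))             # scan.jpg.000.pdf
print("scan.jpg.000.jpg".replace(".jpg", ".pdf"))  # scan.pdf.000.pdf
```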

Although this function does not use it, you can also display an image. After opening it with PIL.Image.open(), write:

`python`


image.show()
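As an aside, Pillow on its own can also write several images into one multi-page PDF through the `save_all` and `append_images` options of `save`. This is a different technique from the one this script uses (one PDF per image, merged with PyPDF2), shown only as a sketch for comparison; the function name `images_to_pdf` and the generated sample images are assumptions for the example.

```python
import os
import tempfile

from PIL import Image


def images_to_pdf(image_paths, pdf_path):
    """Save the images at `image_paths` as the pages of one PDF, in the given order."""
    pages = []
    for path in image_paths:
        with Image.open(path) as image:
            # convert() loads the pixel data and returns an independent RGB copy
            pages.append(image.convert("RGB"))
    pages[0].save(pdf_path, "PDF", resolution=100.0,
                  save_all=True, append_images=pages[1:])
    return pdf_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        paths = []
        for index, color in enumerate(("white", "gray")):
            path = os.path.join(folder, "%03d.png" % index)
            Image.new("RGB", (50, 50), color).save(path)
            paths.append(path)
        result = images_to_pdf(paths, os.path.join(folder, "out.pdf"))
        print(os.path.getsize(result) > 0)  # True
```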

Working with PyPDF2

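PyPDF2 is a convenient module that can combine multiple PDF files into one and extract text data from PDFs. You can install it with pip. The class and method names used below (PdfFileReader, PdfFileWriter, getNumPages, getPage, addPage) are those of older PyPDF2 releases; PyPDF2 3.x renamed them (PdfReader, PdfWriter, pages, add_page).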

First, create an instance to write the file.

`python`


fileWriter = PdfFileWriter()

The instance for reading a PDF file is created with:

`python`


fileReader = PdfFileReader(open(pdffile, "rb"))

Since this involves opening a file, I think it is less trouble to use a with statement:

`python`


with open(pdfFile, "rb") as f:
    fileReader = PdfFileReader(f)

To check the total number of pages, write:

`python`


pageNum = fileReader.getNumPages()

The number of pages is stored in pageNum.

To add a PDF page, write:

`python`


fileWriter.addPage(fileReader.getPage(pageNumber))

getPage(pageNumber) returns the page object for the given page number (counting from 0). addPage takes that page object as its argument and appends it to the end of the output.

When the editing is finished, the result is exported as a PDF file at the end.

To export, write:

`python`


fileWriter.write(open(filename, "wb"))

Since this uses the open function, I think it is less trouble to use a with statement:

`python`


with open(filename, "wb") as outputs:
    fileWriter.write(outputs)

In conclusion

Use Pillow and PyPDF2 to improve work efficiency!

[PYTHON] Convert garbled scanned images to PDF with Pillow and PyPDF
