Extract images and tables from PDF with Python to reduce the burden of writing reports

Background

When writing reports, I found it tedious to crop and save images (circuit diagrams and so on) from documents sent to me as PDFs, and to copy tables by hand. A quick search did not turn up a handy app or script for this, so I decided to write one. Table extraction did not work perfectly, but being able to pull out the values still reduced the workload (;^_^A

The code is on GitHub if you would like to use it: https://github.com/kzrn3318/create_img_excel_from_pdf

Environment

Installation of required libraries


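pip install pypdf2
pip install pillow
pip install PyMuPDF
pip install pandas
pip install camelot-py[cv]

The fitz module imported in the code is provided by PyMuPDF, so it does not need a separate install; the package named fitz on PyPI is an unrelated project, and installing it alongside PyMuPDF can break the import. If Ghostscript is not installed, you may get an error at runtime, in which case install Ghostscript. The code uses the method names that were current when it was written (for example PdfFileReader in PyPDF2 and getPageImageList in PyMuPDF); later releases of both libraries renamed them, so with recent versions you may need to adjust the names or install older releases. The code was written on Windows 10 and has not been tested on other operating systems. If it does not work elsewhere because of how the paths are written, rewrite the paths in the code to suit your environment.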
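
Whether Ghostscript is available can be checked roughly before running the script. The following is only a sketch and not part of the original code: the helper name find_ghostscript is made up for illustration, and it assumes the usual executable names (gs on Linux/macOS, gswin64c or gswin32c on Windows) are on PATH. A missing result does not prove that table extraction will fail, since a library may locate Ghostscript in other ways.

```python
import shutil


def find_ghostscript():
    """Return the path of a Ghostscript executable found on PATH, or None."""
    # Assumed executable names: "gs" (Linux/macOS), "gswin64c"/"gswin32c" (Windows).
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    if found is None:
        print("Ghostscript was not found on PATH; table extraction may fail.")
    else:
        print("Ghostscript found at:", found)
```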

Code body

Below is the code

main.py


import PyPDF2
from PIL import Image
import sys,os
import glob
import fitz
import camelot
import pandas as pd

def create_dir(img_dir , pdf_dir , excel_dir):
    img_dir_glob = glob.glob(str(img_dir))
    pdf_dir_glob = glob.glob(str(pdf_dir))
    excel_dir_glob = glob.glob(str(excel_dir))
    
    if len(pdf_dir_glob) > 0:
        pass
    else:
        os.mkdir(str(pdf_dir))
       
    if len(img_dir_glob) > 0:
        pass
    else:
        os.mkdir(str(img_dir))
    
    if len(excel_dir_glob) > 0:
        pass
    else:
        os.mkdir(str(excel_dir))
        

def create_page_pdf(pdf,page_count,pdf_dir):
    pdf_writer = PyPDF2.PdfFileWriter()
    pdf_writer.addPage(pdf.getPage(page_count))
    
    # os.path.join avoids the invalid "\p" escape and the hard-coded Windows separator.
    with open(os.path.join(str(pdf_dir), "pdf{}.pdf".format(page_count)), "wb") as f:
        pdf_writer.write(f)
    

def create_png(pdf_path,page_count,img_dir):
    pdf  = fitz.open(pdf_path)
    for num in range(len(pdf)):
        num_count = 0
        for image in pdf.getPageImageList(num):
            num_count += 1
            xref = image[0]
            pix = fitz.Pixmap(pdf,xref)
            
            # More than 3 color components (e.g. CMYK): convert to RGB before saving as PNG.
            if pix.n - pix.alpha > 3:
                pix = fitz.Pixmap(fitz.csRGB, pix)
            pix.writePNG(os.path.join(str(img_dir), "img{}_{}.png".format(page_count, num_count)))
            
            pix = None
    
    pdf.close()
    
    
def create_excel(pdf_path,excel_dir,data_count):
    
    # Read every table camelot detects in the PDF.
    datas = camelot.read_pdf(pdf_path, split_text=True)
    for data in datas:
        data_count += 1
        df = data.df
        # Write each table to its own workbook, numbered consecutively across pages.
        excel_path = os.path.join(str(excel_dir), "from_pdf_{}.xlsx".format(data_count))
        with pd.ExcelWriter(excel_path) as file:
            df.to_excel(file, sheet_name="sheet1", index=False, header=False)
    return data_count


if __name__ == "__main__":
    args = sys.argv
    print([i for i in args])
    if len(args) >= 5:
        print("Received an argument.")
        pdf_file = args[1]
        pdf_dir = args[2]
        img_dir = args[3]
        excel_dir = args[4]
    else:
        try:
            pdf_file = args[1]
            print("No output directories were specified, so the default values will be used.")
        except IndexError:
            raise ValueError("At least one pdf file must be given as an argument. To specify the output directories, give four arguments.")
        pdf_dir ="pdf_list"
        img_dir="img_list"
        excel_dir="excel_data"
    
    pdf = PyPDF2.PdfFileReader(pdf_file)
    
    print("Image directory:"+str(img_dir))
    print("pdf Directory of each page:"+str(pdf_dir))
    print("Excel data directory:"+str(excel_dir))
    
    create_dir(img_dir,pdf_dir,excel_dir)
    
    page_count = 0
    for page in pdf.pages:
        create_page_pdf(pdf,page_count,pdf_dir)
        page_count += 1
        
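    # Build the per-page paths in page order; page_count now holds the number of pages.
    # (glob may return pdf10.pdf before pdf2.pdf, so image numbers would not match pages.)
    path_list = [os.path.join(pdf_dir, "pdf{}.pdf".format(i)) for i in range(page_count)]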
    page_count = 0
    data_count = 0
    for path in path_list:
        page_count += 1
        create_png(path,page_count,img_dir)
        data_count = create_excel(path,excel_dir,data_count)
        
    print("Processing Exit\n")

Run it as follows, from the same directory as the target PDF. On execution, three directories are created: one for the per-page PDFs, one for the images extracted from the PDF, and one for the tables extracted from the PDF. You can specify their names with command-line arguments; if you pass only the PDF file, the defaults pdf_list, img_list and excel_data are used.

python main.py (target.pdf) (directory for per-page PDFs) (directory for extracted images) (directory for extracted tables)

Example


python main.py train1.pdf pdf_dir img_dir excel_dir

In the above example, the PDF is split into pages and each page is saved directly under pdf_dir, the extracted images are saved in img_dir, and the extracted tables, converted to Excel files, are saved in excel_dir.

Partial explanation of the code

import PyPDF2
from PIL import Image
import sys,os
import glob
import fitz
import camelot
import pandas as pd

Nothing unusual here for anyone who writes Python regularly: this imports each package.

def create_dir(img_dir , pdf_dir , excel_dir):
    img_dir_glob = glob.glob(str(img_dir))
    pdf_dir_glob = glob.glob(str(pdf_dir))
    excel_dir_glob = glob.glob(str(excel_dir))

    if len(pdf_dir_glob) > 0:
        pass
    else:
        os.mkdir(str(pdf_dir))

    if len(img_dir_glob) > 0:
        pass
    else:
        os.mkdir(str(img_dir))

    if len(excel_dir_glob) > 0:
        pass
    else:
        os.mkdir(str(excel_dir))

This function creates the directories. It checks whether each directory passed as an argument already exists and creates it if it does not.
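
The same "create only if missing" behavior can also be written more compactly. This is a sketch rather than the author's code: the helper name ensure_dirs is hypothetical, and it relies on the standard library's os.makedirs with exist_ok=True instead of checking with glob first.

```python
import os
import tempfile


def ensure_dirs(*directories):
    """Create each directory if it does not exist yet; existing ones are left untouched."""
    for directory in directories:
        os.makedirs(directory, exist_ok=True)


if __name__ == "__main__":
    # Try it in a throwaway location so nothing is left behind.
    with tempfile.TemporaryDirectory() as base:
        targets = [os.path.join(base, name) for name in ("pdf_list", "img_list", "excel_data")]
        ensure_dirs(*targets)
        ensure_dirs(*targets)  # a second call is harmless
        print(sorted(os.listdir(base)))
```

Unlike os.mkdir, os.makedirs also creates missing parent directories, so a nested output path would work as well.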

def create_page_pdf(pdf,page_count,pdf_dir):
    pdf_writer = PyPDF2.PdfFileWriter()
    pdf_writer.addPage(pdf.getPage(page_count))

    # os.path.join avoids the invalid "\p" escape and the hard-coded Windows separator.
    with open(os.path.join(str(pdf_dir), "pdf{}.pdf".format(page_count)), "wb") as f:
        pdf_writer.write(f)

This function splits the original PDF into pages and saves each one separately. It creates pdf(page number).pdf directly under pdf_dir, with page numbers starting from 0.
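
Since the author notes that the paths were written with Windows in mind, here is a small sketch of building the same file name without hard-coding a separator. It uses pathlib from the standard library; the helper name page_pdf_path is an assumption made for this example and does not appear in the original code.

```python
from pathlib import Path


def page_pdf_path(pdf_dir, page_count):
    """Build the output path for one page, e.g. pdf_list/pdf0.pdf."""
    # Path's "/" operator inserts the correct separator for the current OS.
    return Path(pdf_dir) / "pdf{}.pdf".format(page_count)


if __name__ == "__main__":
    for page_count in range(3):
        print(page_pdf_path("pdf_list", page_count))
```

The returned Path object can be passed directly to open(), so the rest of the function would not need to change.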

def create_png(pdf_path,page_count,img_dir):
    pdf  = fitz.open(pdf_path)
    for num in range(len(pdf)):
        num_count = 0
        for image in pdf.getPageImageList(num):
            num_count += 1
            xref = image[0]
            pix = fitz.Pixmap(pdf,xref)

            # More than 3 color components (e.g. CMYK): convert to RGB before saving as PNG.
            if pix.n - pix.alpha > 3:
                pix = fitz.Pixmap(fitz.csRGB, pix)
            pix.writePNG(os.path.join(str(img_dir), "img{}_{}.png".format(page_count, num_count)))

            pix = None

    pdf.close()

This function saves the extracted images in PNG format directly under img_dir. The file name is img(page number)_(image number on the page).png.
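
Because the numbers in these file names are not zero-padded, sorting the names as plain strings does not give page order once there are ten or more pages. The following sketch shows one way to sort them numerically; the function name natural_key is made up for illustration, and it assumes all names follow the same pattern.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that numbers compare numerically."""
    # Assumes all names share one pattern; mixing patterns could compare int with str.
    return [int(chunk) if chunk.isdigit() else chunk for chunk in re.split(r"(\d+)", name)]


if __name__ == "__main__":
    names = ["img10_1.png", "img2_2.png", "img2_1.png", "img1_1.png"]
    print(sorted(names))                   # plain string order
    print(sorted(names, key=natural_key))  # numeric order by page, then image
```

The same key works for the per-page PDF names (pdf0.pdf, pdf1.pdf, and so on).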

def create_excel(pdf_path,excel_dir,data_count):

    # Read every table camelot detects in the PDF.
    datas = camelot.read_pdf(pdf_path, split_text=True)
    for data in datas:
        data_count += 1
        df = data.df
        # Write each table to its own workbook, numbered consecutively across pages.
        excel_path = os.path.join(str(excel_dir), "from_pdf_{}.xlsx".format(data_count))
        with pd.ExcelWriter(excel_path) as file:
            df.to_excel(file, sheet_name="sheet1", index=False, header=False)
    return data_count

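This function saves each extracted table as an Excel file directly under excel_dir. The files are named from_pdf_(number).xlsx, numbered consecutively across all pages.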

if __name__ == "__main__":
    args = sys.argv
    print([i for i in args])
    if len(args) >= 5:
        print("Received an argument.")
        pdf_file = args[1]
        pdf_dir = args[2]
        img_dir = args[3]
        excel_dir = args[4]
    else:
        try:
            pdf_file = args[1]
            print("No output directories were specified, so the default values will be used.")
        except IndexError:
            raise ValueError("At least one pdf file must be given as an argument. To specify the output directories, give four arguments.")
        pdf_dir ="pdf_list"
        img_dir="img_list"
        excel_dir="excel_data"

    pdf = PyPDF2.PdfFileReader(pdf_file)

    print("Image directory:"+str(img_dir))
    print("pdf Directory of each page:"+str(pdf_dir))
    print("Excel data directory:"+str(excel_dir))

    create_dir(img_dir,pdf_dir,excel_dir)

    page_count = 0
    for page in pdf.pages:
        create_page_pdf(pdf,page_count,pdf_dir)
        page_count += 1

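    # Build the per-page paths in page order; page_count now holds the number of pages.
    # (glob may return pdf10.pdf before pdf2.pdf, so image numbers would not match pages.)
    path_list = [os.path.join(pdf_dir, "pdf{}.pdf".format(i)) for i in range(page_count)]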
    page_count = 0
    data_count = 0
    for path in path_list:
        page_count += 1
        create_png(path,page_count,img_dir)
        data_count = create_excel(path,excel_dir,data_count)

    print("Processing Exit\n")

This is the execution part of main.py. It takes the PDF file and each directory name from the command-line arguments and runs the functions above.
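
As an alternative to reading sys.argv by hand, the same arguments could be parsed with argparse from the standard library. This is a sketch under stated assumptions, not the author's implementation: the function name parse_args and the help texts are made up, and the defaults are the ones used in main.py. One difference in behavior: here each directory can be given on its own, whereas main.py uses the defaults unless all three directories are given.

```python
import argparse


def parse_args(argv=None):
    """Parse the PDF file and the three optional output directories."""
    parser = argparse.ArgumentParser(description="Extract images and tables from a PDF.")
    parser.add_argument("pdf_file", help="target PDF file")
    # nargs="?" makes a positional argument optional, falling back to its default.
    parser.add_argument("pdf_dir", nargs="?", default="pdf_list", help="directory for per-page PDFs")
    parser.add_argument("img_dir", nargs="?", default="img_list", help="directory for extracted images")
    parser.add_argument("excel_dir", nargs="?", default="excel_data", help="directory for extracted tables")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["train1.pdf"])
    print(args.pdf_file, args.pdf_dir, args.img_dir, args.excel_dir)
    args = parse_args(["train1.pdf", "pdf_dir", "img_dir", "excel_dir"])
    print(args.pdf_file, args.pdf_dir, args.img_dir, args.excel_dir)
```

With this approach, python main.py --help would also print a usage message listing the four arguments.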

Summary

Until now I had been cropping images out of PDFs by hand, so I think this has made things much easier. There were very few articles about PyPDF2 and camelot, which made it hard going (-_-;) Table extraction still has room for improvement, but depending on the structure and style of the PDF it seems difficult to extract tables exactly as they appear. This code assumes PDFs that were exported from an application. Please note that I have not tested whether it works on PDFs scanned with Adobe Scan or similar tools.
