[Python] Continued: Convert PDF text to CSV page by page

After the last post I felt the script needed fixing, so this is a short follow-up.

How it started

Writing each PDF page out as CSV worked, but as I said last time, the data came out in a terrible state. Specifically, subheadings ended up in the middle of the body text. It's a small thing, but it hurts.

While searching for similar cases, I found the following article: Analyzing the list of black companies of the Ministry of Health, Labor and Welfare with Python (PDFMiner.six)

I learned that I had a comrade, and that the coordinates would let me work around the problem. So I'll give it a try.

Verification: Preparation

Reference: Select PDFMiner to extract text information from PDF

It seems that pdfminer can also return the coordinate information of the layout. Until now I extracted only the text with TextConverter, but PDFPageAggregator appears to return both coordinates and text, so I'll use that.

For now, let's check what kind of coordinates come out. Sorry, I couldn't prepare a sample PDF...

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(p_d_f):

    #Open the PDF in binary mode; the with block closes it at the end
    with open(p_d_f, 'rb') as fp:
        for page in PDFPage.get_pages(fp):

            rsrcmgr = PDFResourceManager()
            laparams = LAParams()
            laparams.detect_vertical = True
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)

            #Get coordinates and character data from PDF
            interpreter.process_page(page)
            layout = device.get_result()

            #Display of coordinates and characters (input() pauses after each one)
            for node in layout:
                if isinstance(node, (LTTextBox, LTTextLine)):
                    print(node.get_text())   #text
                    input(node.bbox)         #coordinates (x0, y0, x1, y1)
            input("---page end---")

It's an inefficient approach: you step through the output one item at a time at the command prompt.

To be honest, I don't really understand the isinstance check against LTTextBox and friends; I put it in as a magic incantation. I should look into it properly.

Verification: Results

This is an excerpt of the output. The text is dummy data.

---page end---
About popcorn machines

(68.28, 765.90036, 337.2, 779.9403599999999)
It is a machine that pops and makes popcorn.

(67.8, 697.71564, 410.4000000000001, 718.47564)
Please be careful when using it.

(67.8, 665.29564, 339.8400000000002, 686.05564)
The usage is as follows.

(67.8, 643.69564, 279.3600000000001, 653.65564)
Description

(67.8, 730.11564, 87.96000000000001, 740.07564)

The tuples are the coordinates, in the order (x0, y0, x1, y1). For details, see the reference site! Put simply, y1 is the position of the top edge of a block of text, measured from the bottom of the page. In other words, if the y1 values within a page are in descending order, the text is arranged from the top of the page down, which is the correct order (in this case).

Looking at this output, the y1 of the last entry is the second largest, so the entries are out of order if all you want is top-to-bottom. Maybe they are ordered by x0 or something else; I really don't know. The coordinates themselves look fine, so I'll do something with this y1.
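As a quick check of that idea, here is a minimal sketch that sorts the five entries from the excerpt above by y1 in descending order. The `boxes` list is typed in by hand from the dummy output with the coordinates rounded (it is not produced by pdfminer here), and the variable names are my own.

```python
# (text, bbox) pairs copied by hand from the excerpt above, values rounded.
# bbox is (x0, y0, x1, y1).
boxes = [
    ("About popcorn machines", (68.28, 765.90, 337.20, 779.94)),
    ("It is a machine that pops and makes popcorn.", (67.80, 697.72, 410.40, 718.48)),
    ("Please be careful when using it.", (67.80, 665.30, 339.84, 686.06)),
    ("The usage is as follows.", (67.80, 643.70, 279.36, 653.66)),
    ("Description", (67.80, 730.12, 87.96, 740.08)),
]

# Sort by y1 (index 3 of the bbox), largest first: top of the page downwards.
top_down = sorted(boxes, key=lambda item: item[1][3], reverse=True)

for text, bbox in top_down:
    print(bbox[3], text)
```

With this ordering, "Description" moves up to second place, directly under the title, which is where a subheading should be.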

Proposed solution

① Make a dictionary
② Sort the dictionary (keys in descending order)
③ Make it a character string
④ Clean up line breaks

This should work. If you're impatient, skip ahead to the finished product.

① Make a dictionary

d = {}
for node in layout:
    if isinstance(node, (LTTextBox, LTTextLine)):
        y1 = node.bbox[3]
        #In a table, several cells share the same y1, so concatenate their text
        if y1 in d:
            d[y1] += "|" + node.get_text()
        else:
            d[y1] = node.get_text()

This quickly builds a dictionary of coordinates and text. I also added a loose workaround for tables.
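To see what the table workaround does, here is a small sketch that runs the same logic on hand-made data instead of a real layout. The `nodes` list of (y1, text) pairs and the `group_by_y1` helper are hypothetical stand-ins for the pdfminer objects, not part of the original script.

```python
def group_by_y1(nodes):
    """Build {y1: text}; texts that share the same y1 are joined with '|'."""
    d = {}
    for y1, text in nodes:
        if y1 in d:
            d[y1] += "|" + text
        else:
            d[y1] = text
    return d

# Hypothetical data: a heading plus two table cells on the same row (same y1).
# Each text ends with a line break, as in the output excerpt shown earlier.
nodes = [
    (700.0, "Heading\n"),
    (650.0, "Item\n"),
    (650.0, "Price\n"),
]

d = group_by_y1(nodes)
print(d)
```

The joined row contains the sequence line break followed by "|", because each text keeps its trailing line break. Note also that the grouping only triggers when two y1 values are exactly equal.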

To be honest, though, this table workaround is full of holes and is mostly wasted effort. It looks as though the text is picked up line by line, but the mechanism actually uses margin settings to group nearby characters into a "block" (I believe).

Put plainly, if you set nothing, the default margins apply, and text with tight line spacing or a finely ruled table gets recognized as a single block spanning several lines. Once several lines of text come back under one set of coordinates, the pseudo-table trick falls apart.

The proper answer would be to set the margins correctly, but I don't need that much this time, so I'm leaving them at the defaults. If a table shows up, I'll just say "sorry!" and move on.

② Dictionary sort (key descending order)

Reference: Summary of Python sort (list, dictionary type, Series, DataFrame)

d2 = sorted(d.items(), key=lambda x: -x[0])

I did it! My first lambda! Note that sorted() returns a list of (key, value) tuples rather than a dictionary. I don't mind, as long as it's sorted.
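The following small sketch shows the shape of the result on a made-up dictionary (the keys and values are hypothetical). It also shows `reverse=True`, which gives the same order without negating the key.

```python
# Hypothetical {y1: text} dictionary, shaped like the one built in step ①
d = {653.66: "c\n", 779.94: "a\n", 740.08: "b\n"}

# Negating the key sorts the numeric keys largest-first
d2 = sorted(d.items(), key=lambda x: -x[0])

# reverse=True gives the same order without negating the key
d2_alt = sorted(d.items(), key=lambda x: x[0], reverse=True)

print(type(d2).__name__)  # list
print(d2)
```

Each element of `d2` is a (y1, text) tuple, which is why the next step reads the text with index 1.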

③ Make it a character string

text = ""
for d0 in d2:
    text += d0[1]

Just loop over the list and concatenate the text.

④ Clean up line breaks

Reference: Split comma-separated strings with Python, split, remove whitespace and list. I am always indebted to this site.

import re

space = re.compile("[ \u3000]+")   #half-width and full-width spaces
text = re.sub(space, "", text)
l_text = [a for a in text.splitlines() if a != '']
text = '\n'.join(l_text).replace('\n|', '|')

The extracted text contains a lot of spaces and line breaks, and this is the fix. It deletes the spaces, then splits the text into a list of lines and drops the empty ones. It also deletes the line break in front of the "|" that was added as a marker for table rows, so each row ends up on one line.
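Here is a minimal sketch of the cleanup on a made-up string. The `clean` helper and the `raw` sample are hypothetical, and the full-width space (U+3000) in the pattern is my assumption about what the character class is meant to cover.

```python
import re

# Matches runs of half-width (ASCII) and full-width (U+3000) spaces
space = re.compile("[ \u3000]+")

def clean(text):
    text = re.sub(space, "", text)                     # delete all spaces
    lines = [a for a in text.splitlines() if a != ""]  # drop empty lines
    return "\n".join(lines).replace("\n|", "|")        # rejoin table rows

# Hypothetical page text: a heading with a trailing space, a blank line,
# and a table row whose cells were joined with '|' in step ①
raw = "Heading \n\nItem\n|Price\n"
print(clean(raw))
```

Note that this deletes every space, including the ones between words. That suits Japanese text, but it would run English words together.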

Finished product


from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter, PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox, LTTextLine, LTChar
from pdfminer.pdfpage import PDFPage

import csv,re,datetime
import pandas as pd

class converter(object):
    def convert_pdf_to_txt(self,p_d_f):
        print("system: reading pdf [" + p_d_f + "]")
        
        df = pd.DataFrame(columns=["Update date and time","Sentence","page number"])
        
        cnt = 1
        space = re.compile("[ \u3000]+")   #half-width and full-width spaces
        fp = open(p_d_f, 'rb')
       
        #Extract coordinates and character data from pdf
        for page in PDFPage.get_pages(fp):
            rsrcmgr = PDFResourceManager()
            laparams = LAParams()
            laparams.detect_vertical = True
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            #Get coordinates and character data from PDF
            interpreter.process_page(page)
            layout = device.get_result() 
            
            #Create a dictionary of coordinates and data
            d={}
            for node in layout:
                if isinstance(node, LTTextBox) or isinstance(node, LTTextLine):
                    y1 = node.bbox[3]
                    #If it is a table, the coordinates of y1 are duplicated, so string concatenation
                    if y1 in d:
                       d[y1] += "|" + node.get_text()
                    else:
                       d.update({y1 : node.get_text()})
            
            #Sort by coordinates
            d2 = sorted(d.items(), key=lambda x: -x[0])
            
            #Concatenate into a single string
            text = ""
            for d0 in d2:
                text += d0[1]
            
            #Remove spaces and blank lines
            text = re.sub(space, "", text)
            l_text = [a for a in text.splitlines() if a != '']
            text = '\n'.join(l_text).replace('\n|', '|')     
            
            df.loc[cnt,["Sentence","page number"]] = [text,cnt]
            cnt += 1
            
        fp.close()
        device.close()
         
        now = datetime.datetime.now()
        df["Update date and time"] = now

        csv_path = p_d_f.replace('.pdf', '.csv')
        with open(csv_path, mode='w', encoding='cp932', errors='ignore', newline='\n') as f:
             df.to_csv(f,index=False)

if __name__ == "__main__":

    p_d_f = "Somehow.pdf"
    con = converter()
    #The method is named convert_pdf_to_txt; it writes the CSV and returns nothing
    con.convert_pdf_to_txt(p_d_f)

I haven't tested this thoroughly, since I built it by adding to and trimming the previous version, but something very close to it worked. If you get an error, please fix it yourself.
