Parsing PDFs that contain text is easy with Python... There was a time when I thought so.

Preface

Parsing PDFs that contain text is easy with Python...


If a PDF contains text information, you can extract the text and tables from it and easily build a web service on top of that data. The following is what came of that naive idea.

**I tried using PDF data on online medical care based on the spread of the new coronavirus infection** https://qiita.com/mima_ita/items/c0f28323f330c5f59ed8

The most important things I took away from that work are the lesson **"Don't read PDF data on a computer, it's something that humans read"** and a few techniques for handling PDFs with Python.

In this article I will explain those few techniques for handling PDFs with Python. The test environment is 64-bit Python 3.7.5 on Windows 10.

PDF parsing

Operands and operators

All text and graphics in a PDF are expressed with operands and operators, which are defined in the following specification. https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf

There are various Python libraries for reading PDFs, but here I will use PyPDF2. Because this library is written entirely in Python, it lets you inspect a PDF at the level of individual operands and operators.

Let's check which operands and operators the following simple PDF actually consists of. http://needtec.sakura.ne.jp/doc/hello.pdf

The code below enumerates the operands and operators.

import PyPDF2
from PyPDF2.pdf import ContentStream

with open("hello.pdf", "rb") as fp:
    pdf = PyPDF2.PdfFileReader(fp)
    for page_no in range(pdf.numPages):
        page = pdf.getPage(page_no)
        content = page['/Contents'].getObject()
        if not isinstance(content, ContentStream):
            content = ContentStream(content, pdf)
        for operands, operator in content.operations:
            print(operands, operator)

The result of doing this with the simple PDF above is as follows.

[1, 0, 0, 1, 0, 0] b'cm'
[] b'BT'
['/F1', 12] b'Tf'
[14.4] b'TL'
[] b'ET'
[] b'n'
[10, 10, 200, 200] b're'
[] b'S'
[] b'BT'
[1, 0, 0, 1, 100, 50] b'Tm'
['Hello'] b'Tj'
[] b'T*'
[] b'ET'

Reading this output against Annex A (Operator Summary) of the specification tells us the following.

| Operands | Operator | Description | Page |
|---|---|---|---|
| x y width height | re | Add a rectangle to the path, with (x, y) as its lower-left corner | 133 |
| - | S | Stroke (draw a line along) the current path | 135 |
| a b c d e f | Tm | Set the text matrix, which determines the position of the text, to [a b 0; c d 0; e f 1] | 249 |
| string | Tj | Show a text string | 250 |

In other words, with the origin at the lower left, the page is drawn as follows.
(1) Draw a rectangle 200 wide and 200 high whose lower-left corner is at (10, 10).
(2) Write the string "Hello" starting at (100, 50).
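To make that reading concrete, here is a minimal sketch that interprets the operand/operator pairs printed above. It is not part of PyPDF2: the function name `describe_operations` and the hard-coded `OPERATIONS` list are assumptions made for illustration, only the four operators from the table (plus `n`) are handled, and the `cm` matrix and any scaling in `Tm` are ignored.

```python
# Hypothetical helper: interprets only a few operators from the output above.
OPERATIONS = [
    ([1, 0, 0, 1, 0, 0], b'cm'),
    ([], b'BT'),
    (['/F1', 12], b'Tf'),
    ([14.4], b'TL'),
    ([], b'ET'),
    ([], b'n'),
    ([10, 10, 200, 200], b're'),
    ([], b'S'),
    ([], b'BT'),
    ([1, 0, 0, 1, 100, 50], b'Tm'),
    (['Hello'], b'Tj'),
    ([], b'T*'),
    ([], b'ET'),
]


def describe_operations(operations):
    """Return human-readable descriptions for re, S, Tm and Tj only."""
    described = []
    pending_rect = None   # rectangle added to the path by `re`, not yet stroked
    text_origin = (0, 0)  # translation part (e, f) of the last `Tm`
    for operands, operator in operations:
        if operator == b're':
            pending_rect = tuple(operands)
        elif operator == b'S' and pending_rect is not None:
            x, y, w, h = pending_rect
            described.append(f'rectangle at ({x}, {y}) size {w}x{h}')
            pending_rect = None
        elif operator == b'n':
            # `n` ends the path without painting it
            pending_rect = None
        elif operator == b'Tm':
            text_origin = (operands[4], operands[5])
        elif operator == b'Tj':
            described.append(f'text {operands[0]!r} at {text_origin}')
    return described


for line in describe_operations(OPERATIONS):
    print(line)
```

Even this toy version has to track state (the current path, the current text matrix) between operators, which hints at why real text extraction becomes complicated.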

This example was simple enough to read by hand, but text drawing is very complicated. Unless you understand how the text-positioning operators and text-showing operators behave, you cannot extract the characters from a PDF or determine their positions and sizes.

For example, there is the following PDF. http://needtec.sakura.ne.jp/doc/hello2.pdf

Visually, it only adds some Japanese text and a table, but it is difficult to read in the same way.

PyPDF2 also has a function, page.extractText(), that extracts the text of a page, but it is a painful experience for users outside the English-speaking world. https://github.com/mstamy2/PyPDF2/issues

Parse PDF characters with PDFMiner

PDFMiner makes it easy to extract the characters in the PDF.

The following is a sample to extract the characters in the PDF.

from pdfminer.high_level import extract_text
print(extract_text('hello2.pdf'))

The real value of PDFMiner, however, is that it can obtain not only the characters but also the coordinates and size of the drawn text. The following sample program extracts the text of a given PDF together with its coordinate information.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import (
    LAParams,
    LTContainer,
    LTTextLine,
)

def get_objs(layout, results):
    if not isinstance(layout, LTContainer):
        return
    for obj in layout:
        if isinstance(obj, LTTextLine):
            results.append({'bbox': obj.bbox, 'text' : obj.get_text(), 'type' : type(obj)})
        get_objs(obj, results)

def main(path):
    with open(path, "rb") as f:
        parser = PDFParser(f)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        # https://pdfminersix.readthedocs.io/en/latest/api/composable.html#
        laparams = LAParams(
            all_texts=True,
        )
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            layout = device.get_result()
            results = []
            print('objs-------------------------')
            get_objs(layout, results)
            for r in results:
                print(r)


main('hello2.pdf')


The result of executing this program using PDF with mixed Japanese is as follows.

objs-------------------------
{'bbox': (90.744, 728.1928, 142.2056, 738.7528), 'text': 'Hello world\n', 'type': <class 'pdfminer.layout.LTTextLineHorizontal'>}
{'bbox': (168.5, 728.1928, 223.8356, 738.7528), 'text': 'The cat rang\n', 'type': <class 'pdfminer.layout.LTTextLineHorizontal'>}
{'bbox': (168.5, 709.7128, 202.8356, 720.2728), 'text': 'I am God\n', 'type': <class 'pdfminer.layout.LTTextLineHorizontal'>}
{'bbox': (90.744, 691.1128, 146.0456, 701.6728), 'text': 'God is dead\n', 'type': <class 'pdfminer.layout.LTTextLineHorizontal'>}
{'bbox': (168.5, 691.1128, 171.2456, 701.6728), 'text': ' \n', 'type': <class 'pdfminer.layout.LTTextLineHorizontal'>}
{'bbox': (168.5, 672.6328, 255.2756, 683.1928), 'text': 'Aw Neo sf\n', 'type': <class 'pdfminer.layout.LTTextLineHorizontal'>}
{'bbox': (90.744, 709.7128, 93.4896, 720.2728), 'text': ' \n', 'type': <class 'pdfminer.layout.LTTextLineHorizontal'>}
{'bbox': (90.744, 672.6328, 93.4896, 683.1928), 'text': ' \n', 'type': <class 'pdfminer.layout.LTTextLineHorizontal'>}
{'bbox': (85.104, 654.1528, 87.8496, 664.7128), 'text': ' \n', 'type': <class 'pdfminer.layout.LTTextLineHorizontal'>}

You can confirm that not only the contents of the characters in the PDF but also the coordinates are acquired.
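As a small illustration of what the coordinates are good for, the sketch below sorts entries shaped like the `results` above into reading order. The function name `reading_order` is an assumption of mine, not a PDFMiner API, and it assumes that lines in the same row share exactly the same top coordinate, as they do in this sample.

```python
# Entries copied from the output above (shuffled, 'type' omitted for brevity).
RESULTS = [
    {'bbox': (168.5, 672.6328, 255.2756, 683.1928), 'text': 'Aw Neo sf\n'},
    {'bbox': (168.5, 728.1928, 223.8356, 738.7528), 'text': 'The cat rang\n'},
    {'bbox': (168.5, 691.1128, 171.2456, 701.6728), 'text': ' \n'},
    {'bbox': (90.744, 691.1128, 146.0456, 701.6728), 'text': 'God is dead\n'},
    {'bbox': (90.744, 728.1928, 142.2056, 738.7528), 'text': 'Hello world\n'},
    {'bbox': (168.5, 709.7128, 202.8356, 720.2728), 'text': 'I am God\n'},
]


def reading_order(results):
    """Return non-blank texts sorted top-to-bottom, then left-to-right."""
    lines = [r for r in results if r['text'].strip()]
    # bbox is (x0, y0, x1, y1) with the origin at the lower left,
    # so a larger y1 means the line is higher on the page.
    lines.sort(key=lambda r: (-r['bbox'][3], r['bbox'][0]))
    return [r['text'].strip() for r in lines]


print(reading_order(RESULTS))
```

For real documents you would probably need a tolerance when comparing the vertical positions, since baselines rarely line up this neatly.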

Parse the table in PDF

PDF has no operands or operators that represent a table. A table is merely expressed with the rectangle drawing and text drawing described so far. Therefore, a PDF table cannot be parsed as easily as an HTML table or an Excel sheet.
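To show what "just rectangles and text" means for a parser, here is a rough sketch that rebuilds a grid by dropping each text box into the cell that contains its centre. The ruled-line positions in `COL_EDGES` and `ROW_EDGES` are hypothetical values chosen for illustration (they are not read from any PDF), and `build_grid` is my own helper, not how any particular library works.

```python
from bisect import bisect_right

# Hypothetical ruled-line positions (assumed for illustration only).
COL_EDGES = [85, 163, 260]            # x of vertical rules, left to right
ROW_EDGES = [740, 722, 703, 685, 666]  # y of horizontal rules, top to bottom

# (bbox, text) pairs in the style of the PDFMiner output shown earlier.
TEXTS = [
    ((90.744, 728.1928, 142.2056, 738.7528), 'Hello world'),
    ((168.5, 728.1928, 223.8356, 738.7528), 'The cat rang'),
    ((168.5, 709.7128, 202.8356, 720.2728), 'I am God'),
    ((90.744, 691.1128, 146.0456, 701.6728), 'God is dead'),
    ((168.5, 672.6328, 255.2756, 683.1928), 'Aw Neo sf'),
]


def build_grid(texts, col_edges, row_edges):
    """Place each text into a (row, column) cell by the centre of its bbox."""
    n_rows, n_cols = len(row_edges) - 1, len(col_edges) - 1
    grid = [['' for _ in range(n_cols)] for _ in range(n_rows)]
    ascending_rows = row_edges[::-1]
    for (x0, y0, x1, y1), text in texts:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        col = bisect_right(col_edges, cx) - 1
        # Row 0 is the top row, but PDF y grows upward.
        row = n_rows - bisect_right(ascending_rows, cy)
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] = (grid[row][col] + ' ' + text).strip()
    return grid


for row in build_grid(TEXTS, COL_EDGES, ROW_EDGES):
    print(row)
```

The hard part in practice is everything this sketch takes for granted: finding the ruled lines in the first place, and coping with text that does not fit neatly inside one cell.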

Several Python libraries attempt to parse PDF tables. Here I will use camelot, which is implemented entirely in Python.

See below for a comparison of camelot with other libraries. https://github.com/atlanhq/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools

The result of my own comparison with tabula-py is as follows. **[Convert PDF of Ministry of Health, Labor and Welfare to CSV or JSON](https://needtec.sakura.ne.jp/wod07672/2020/04/29/%e5%8e%9a%e7%94%9f%e5%8a%b4%e5%83%8d%e7%9c%81%e3%81%aepdf%e3%82%92csv%e3%82%84json%e3%81%ab%e5%a4%89%e6%8f%9b%e3%81%99%e3%82%8b/)**

A simple camelot sample

Let's use camelot to extract the table from the PDF used earlier.

import camelot
tables = camelot.read_pdf('hello2.pdf')

for ix in tables[0].df.index:
    print(ix, tables[0].df.loc[ix][0], '|', tables[0].df.loc[ix][1])

Result

0 Hello world|The cat rang
1  |I am God
2 God is dead|
3  |Aw Neo sf

You can also open a PDF directly from a URL and export the result to CSV or JSON. With this, tables can be extracted from some PDFs.

If table extraction does not work as expected

In most cases the default settings work, but with real-world PDFs camelot may behave unexpectedly. In that case, I recommend reading the following document once.

Advanced Usage https://camelot-py.readthedocs.io/en/master/user/advanced.html

Adjusting the [parameters passed to read_pdf](https://camelot-py.readthedocs.io/en/master/api.html#main-interface) while checking what camelot recognizes with Visual debugging may make it work.

Adjusting parameters for PDFMiner

As mentioned in "Tweak layout generation", camelot uses PDFMiner internally. If the table cannot be extracted from the PDF by the method above, adjusting the parameters passed to PDFMiner may solve the problem.

For example, if you parse the following PDF with the same code as before, the second row is not extracted correctly. http://needtec.sakura.ne.jp/doc/hello4.pdf

**Output result**

0 1 |Ah ah
1  |2 good
2 3 |Uuu

This happens because "2" and "good" are so close together that they are recognized as a single string. To adjust this:

tables = camelot.read_pdf(
    'hello4.pdf',
    layout_kwargs = {
        'char_margin': 0.25
    }
)

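layout_kwargs is a dict of parameters passed to pdfminer.layout.LAParams in PDFMiner. char_margin is a threshold: two text chunks closer together than this value are considered contiguous. Lowering the value makes nearby chunks less likely to be treated as the same text. (pdfminer.six documents a default of 2.0 for LAParams.char_margin; the value actually in effect may differ depending on the camelot version.)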

The results before (top) and after (bottom) setting this parameter are as follows.

0 1 |Ah ah
1  |2 good
2 3 |Uuu
--------------------------
0 1 |Ah ah
1 2 |Good
2 3 |Uuu

When including a dotted line

When processing a table containing a dotted line with camelot, the dotted line is not recognized.

Detect dotted line #370 https://github.com/atlanhq/camelot/issues/370

For example, the following PDFs are such cases.

① Vertical dotted line https://github.com/atlanhq/camelot/files/3565115/Test.pdf

② Horizontal dotted line https://github.com/mima3/yakusyopdf/blob/master/20200502/%E5%85%B5%E5%BA%AB%E7%9C%8C.pdf

This problem can be handled with the method described in the following article.

・ **[Process the dotted line as a solid line with camelot](https://needtec.sakura.ne.jp/wod07672/2020/05/03/camelot%e3%81%a7%e7%82%b9%e7%b7%9a%e3%82%92%e5%ae%9f%e7%b7%9a%e3%81%a8%e3%81%97%e3%81%a6%e5%87%a6%e7%90%86%e3%81%99%e3%82%8b/)**

Put simply, the image data that camelot is processing is forcibly modified so that dotted lines are replaced with solid lines, and processing then continues.
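The linked article works on image data. As a much simpler geometric analogue of the same idea (not the article's method), the sketch below joins short collinear dashes into one solid segment when the gaps between them are small. The function `join_dashes` and the sample coordinates are hypothetical.

```python
def join_dashes(segments, max_gap):
    """Merge horizontal segments on the same y whose gap is <= max_gap.

    segments: list of (x0, x1, y) tuples.
    Returns the merged segments sorted by y, then x0.
    """
    merged = []
    for x0, x1, y in sorted(segments, key=lambda s: (s[2], s[0])):
        if merged and merged[-1][2] == y and x0 - merged[-1][1] <= max_gap:
            # Close enough to the previous dash: extend it.
            px0, px1, py = merged[-1]
            merged[-1] = (px0, max(px1, x1), py)
        else:
            merged.append((x0, x1, y))
    return merged


# A dotted rule at y=50 made of three dashes, and a solid rule at y=80.
DASHES = [(16, 20, 50), (10, 14, 50), (22, 26, 50), (10, 26, 80)]
print(join_dashes(DASHES, max_gap=3))
```

Choosing `max_gap` is the delicate part: too small and the dotted line stays broken, too large and separate lines get fused together.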

If you can't do anything

Depending on how the PDF was created, there are cases where nothing can be done. One example is [data containing long text that extends beyond the cell](https://qiita.com/mima_ita/items/c0f28323f330c5f59ed8#pdf%E3%81%8B%E3%82%89%E3%83%86%E3%83%BC%E3%83%96%E3%83%AB%E3%82%92%E6%8A%BD%E5%87%BA%E3%81%99%E3%82%8B%E9%9A%9B%E3%81%AE%E5%95%8F%E9%A1%8C%E7%82%B9).

Also, if whoever created the PDF forgot to draw the ruled lines in the first place, extraction will not work properly.

PDF update

The method of reading PDF has been explained up to the previous section. Next, let's briefly consider updating the PDF.

Create a new PDF

You can use reportlab to output the contents of drawing text and figures to PDF.


from io import BytesIO
from reportlab.pdfgen import canvas

with open('hello.pdf', 'wb') as output_stream:
    buffer = BytesIO()
    c = canvas.Canvas(buffer, pagesize=(300, 300))
    c.rect(10, 10, 200, 200, fill=0)
    c.drawString(100, 50, 'Hello')
    c.showPage()
    c.save()
    buffer.seek(0)
    output_stream.write(buffer.getvalue())

This output will be the PDF used earlier. http://needtec.sakura.ne.jp/doc/hello.pdf

Rewrite existing PDF page

I investigated various ways to read an existing PDF and rewrite the graphic information and text on the page, but honestly it seemed difficult. The method introduced here is to add new shapes and text to the PDF page.

・ **[Replace the dotted line of PDF with a solid line (PyPDF2 + reportlab)](https://needtec.sakura.ne.jp/wod07672/2020/05/04/pdf%e3%81%ae%e7%82%b9%e7%b7%9a%e3%82%92%e5%ae%9f%e7%b7%9a%e3%81%ab%e3%81%8a%e3%81%8d%e3%81%8b%e3%81%88%e3%82%8b/)**
・ **[Replace the dotted line of PDF with a solid line (PyMuPDF)](https://needtec.sakura.ne.jp/wod07672/2020/05/04/pdf%e3%81%ae%e7%82%b9%e7%b7%9a%e3%82%92%e5%ae%9f%e7%b7%9a%e3%81%ab%e3%81%8a%e3%81%8d%e3%81%8b%e3%81%88%e3%82%8bpymupdf/)**

Note also that the files PyPDF2 wrote for me were not compressed, so I think it is better to use another method to update files. In my environment, a 3 MB PDF grew to 440 MB.

Summary

After the explanation so far, some of you may think that using PDF data is easy. However, if you do not have the authority to modify the input PDF, you should basically expect a **thorny road**. For example, in Excel you can distinguish cells even if someone forgot to draw a ruled line, but in PDF you cannot.

  • You can draw ruled lines for every cell using the method for replacing dotted lines in a PDF with solid lines, but there is no way to tell whether a cell's ruled line was forgotten or deliberately left undrawn.

I would like to conclude this article by reiterating the most important things I mentioned at the beginning.

**"Don't read PDF data on a computer, it's something that humans read"**
