To extract textual information from PDF

Environment

`Dockerfile`


FROM python:3.6
ENV LC_ALL C.UTF-8
ENV LANG C.UTF-8  
RUN apt-get -y update && \
    apt-get install -y --fix-missing \
    build-essential \
    software-properties-common \
    poppler-utils && \
    apt-get clean && \
    rm -rf /tmp/* /var/tmp/* && \
    mkdir /api
WORKDIR /api
COPY requirements.txt /api/requirements.txt
RUN pip3 install --upgrade pip && \
    pip3 install --upgrade -r requirements.txt
EXPOSE 8888
ENTRYPOINT jupyter notebook --ip=0.0.0.0 --allow-root --no-browser

`requirements.txt`


pandas==0.24.2
pillow==7.0.0
opencv-python==3.4.2.16
pdfminer==20191125
jupyter==1.0.0

$ docker build -t pdfminer -f ./Dockerfile .
$ docker run -it -v `pwd`:/api -p 8888:8888 --name pdfminer pdfminer bash

Extract text information from PDF

Once the container has been created successfully, Jupyter starts automatically, so create a Python file. The code below is a minimal setup that extracts the text and saves it to a text file. For this example, a PDF published by the Financial Services Agency is saved as test.pdf: https://www.fsa.go.jp/news/30/wp/supervisory_approaches_revised.pdf

`test.py`


from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox, LTTextLine, LTChar
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdfminer_config(line_overlap, word_margin, char_margin, line_margin, detect_vertical):
    laparams = LAParams(line_overlap=line_overlap,
                        word_margin=word_margin,
                        char_margin=char_margin,
                        line_margin=line_margin,
                        detect_vertical=detect_vertical)
    resource_manager = PDFResourceManager()
    device = PDFPageAggregator(resource_manager, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)
    return (interpreter, device)

def find_textboxes(layout_obj):
    if isinstance(layout_obj, LTTextBox):
        return [layout_obj]
    if isinstance(layout_obj, LTContainer):
        boxes = []
        for child in layout_obj:
            boxes.extend(find_textboxes(child))
        return boxes
    return []

def find_textlines(layout_obj):
    if isinstance(layout_obj, LTTextLine):
        return [layout_obj]
    if isinstance(layout_obj, LTTextBox):
        lines = []
        for child in layout_obj:
            lines.extend(find_textlines(child))
        return lines
    return []

def find_characters(layout_obj):
    if isinstance(layout_obj, LTChar):
        return [layout_obj]
    if isinstance(layout_obj, LTTextLine):
        characters = []
        for child in layout_obj:
            characters.extend(find_characters(child))
        return characters
    return []

def write_text(text_file, text):
    # Write one text box per line so that adjacent boxes do not run together.
    text_file.write(text + '\n')

# Open both files with "with" so they are closed even if an error occurs.
with open('./test.pdf', 'rb') as f, open('output.txt', 'w', encoding='utf-8') as text_file:
    interpreter, device = pdfminer_config(line_overlap=0.5, word_margin=0.1, char_margin=2, line_margin=0.5, detect_vertical=True)
    for page in PDFPage.get_pages(f):
        interpreter.process_page(page)  # Process the page.
        layout = device.get_result()  # Get the LTPage object.
        boxes = find_textboxes(layout)
        for box in boxes:
            write_text(text_file, box.get_text().strip())

Adjustment by laparams

If you don't get the text you want, adjust the parameters passed to LAParams. Changing char_margin, word_margin, and line_margin changes how characters are grouped. Set detect_vertical to True if the document contains vertical text, as Japanese documents often do.

`test.py`


interpreter, device = pdfminer_config(line_overlap=0.5, word_margin=0.1, char_margin=2.0, line_margin=0.5, detect_vertical=False)
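
To get a feel for what char_margin does, here is a small toy sketch. It is not PDFMiner's actual layout algorithm, only a simplified illustration of the idea that a character joins its neighbor when the gap between them is small relative to the character width. The function name `group_into_lines` and the sample coordinates are made up for this example.

```python
def group_into_lines(chars, char_margin=2.0):
    """Toy illustration only; NOT PDFMiner's real grouping algorithm.

    chars: list of (x0, x1, text) tuples that sit on the same baseline.
    A character joins the current group when the horizontal gap to the
    previous character is smaller than char_margin * (its own width).
    """
    groups = []
    prev_x1 = None
    for x0, x1, text in sorted(chars):
        width = x1 - x0
        if prev_x1 is not None and x0 - prev_x1 < char_margin * width:
            groups[-1] += text      # close enough: extend the current group
        else:
            groups.append(text)     # too far away: start a new group
        prev_x1 = x1
    return groups


# Hypothetical characters, each 10 pt wide; "C" starts 15 pt after "B" ends.
chars = [(0, 10, "A"), (10, 20, "B"), (35, 45, "C")]
print(group_into_lines(chars, char_margin=2.0))  # ['ABC']
print(group_into_lines(chars, char_margin=1.0))  # ['AB', 'C']
```

A larger char_margin merges text that is further apart into one group, and a smaller one splits it, which is why tuning it changes the boxes you get back.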


Contents of boxes

The boxes obtained in the code above carry a lot of information:

- Text information
- Character position information (the unit is pt, so convert from pt to pixels before processing with OpenCV etc.)

print(boxes[0])
# >> <LTTextBoxHorizontal(0) 92.160,755.000,524.296,766.952 'However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.\n'>
print(boxes[0].get_text())
# >>However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.
print(boxes[0].bbox)
# >> (92.15997480600001, 754.9998879965001, 524.2961793060001, 766.9523361965001)
# >> The tuple is (x0, y0, x1, y1): (x0, y0) is the lower-left corner and (x1, y1) is the upper-right corner, in pt, measured from the bottom-left of the page.

[Figure: screenshot showing where x0, y0, x1, and y1 fall on a text box]
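
The pt-to-pixel conversion mentioned above could look roughly like the following sketch. It assumes the page was rendered to an image at a known DPI (for example with pdftoppm from poppler-utils and its -r option) and that you know the page height in pt (the LTPage object should expose it as `layout.height`). The function name `bbox_pt_to_px` and the sample numbers are made up for illustration. Image libraries such as OpenCV put the origin at the top-left, so the y axis has to be flipped.

```python
def bbox_pt_to_px(bbox, page_height_pt, dpi=72):
    """Convert a bbox (x0, y0, x1, y1) in pt with a bottom-left origin
    into pixel corners (left, top, right, bottom) with a top-left origin."""
    scale = dpi / 72.0  # 1 pt is 1/72 inch
    x0, y0, x1, y1 = bbox
    left = int(round(x0 * scale))
    right = int(round(x1 * scale))
    # Flip the y axis: the top edge of the box is page_height - y1.
    top = int(round((page_height_pt - y1) * scale))
    bottom = int(round((page_height_pt - y0) * scale))
    return left, top, right, bottom


# Hypothetical box on an 842 pt tall page rendered at 144 dpi.
print(bbox_pt_to_px((72, 700, 144, 720), page_height_pt=842, dpi=144))
# (144, 244, 288, 284)
```

The resulting corners can be passed to drawing functions that expect integer pixel coordinates, for example as the two corner points of a rectangle.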

Contents of lines

Each box contains LTTextLine objects. Let's get them with find_textlines, which was defined in the code above but not used there.

`test.py`


lines = find_textlines(boxes[0])
print(lines[0])
# >><LTTextLineHorizontal 92.160,755.000,524.296,766.952 'However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.\n'>
print(lines[0].get_text())
# >>However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.
print(lines[0].bbox)
# >> (92.15997480600001, 754.9998879965001, 524.2961793060001, 766.9523361965001)
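
Because every line carries a bbox, you can reorder extracted lines yourself if needed. The sketch below is one simple way to sort horizontal text into a rough reading order (top to bottom, then left to right). It works on plain (bbox, text) pairs so it runs without a PDF. The function name `sort_reading_order` and the sample data are hypothetical, and it makes no attempt to handle multi-column or vertical layouts.

```python
def sort_reading_order(items):
    """items: list of (bbox, text) pairs, bbox = (x0, y0, x1, y1) in pt
    with a bottom-left origin. Higher y1 means closer to the top of the page."""
    return sorted(items, key=lambda item: (-item[0][3], item[0][0]))


# Hypothetical lines; with real data you could build the pairs as
# [(line.bbox, line.get_text()) for line in lines].
items = [
    ((92, 700, 300, 712), "second"),
    ((92, 755, 524, 767), "first"),
    ((310, 700, 500, 712), "third"),
]
print([text for _, text in sort_reading_order(items)])
# ['first', 'second', 'third']
```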

Contents of characters

Each line in turn contains LTChar objects. In addition to the character itself and its position, each one also carries font information.

`test.py`


characters = find_characters(lines[0])
print(characters[0])
# >><LTChar 92.160,755.000,104.160,766.952 matrix=[12.00,0.00,0.00,12.00, (92.16,756.68)] font='AHTYXM+MS-PGothic' adv=1.0 text='Or'>
print(characters[0].get_text())
# >>Or
print(characters[0].bbox)
# >> (92.15997480600001, 754.9998879965001, 104.16042480600001, 766.9523361965001)
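
As one possible use of the font information, the sketch below counts how often each font appears in a list of characters. It assumes each character object exposes the font name as a `fontname` attribute (the value shown as font=... in the output above). To stay runnable without a PDF, it uses a stand-in class `FakeChar`. That class, the function name `count_fonts`, and the second font name are made up for this example.

```python
from collections import Counter, namedtuple

# Stand-in for LTChar so that the sketch runs without a PDF (hypothetical).
FakeChar = namedtuple("FakeChar", ["text", "fontname"])


def count_fonts(characters):
    """Count characters per font name."""
    return Counter(ch.fontname for ch in characters)


chars = [
    FakeChar("A", "AHTYXM+MS-PGothic"),
    FakeChar("B", "AHTYXM+MS-PGothic"),
    FakeChar("C", "Hypothetical-Bold"),
]
print(count_fonts(chars).most_common(1))
# [('AHTYXM+MS-PGothic', 2)]
```

With real data, you would pass the list returned by find_characters instead of the stand-in objects.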

If I have time, I would like to introduce how to change the color of the acquired part.
