Dockerfile
FROM python:3.6
ENV LC_ALL C.UTF-8
ENV LANG C.UTF-8
RUN apt-get -y update && \
    apt-get install -y --fix-missing \
        build-essential \
        software-properties-common \
        poppler-utils && \
    apt-get clean && \
    rm -rf /tmp/* /var/tmp/* && \
    mkdir /api
WORKDIR /api
COPY requirements.txt /api/requirements.txt
RUN pip3 install --upgrade pip && \
    pip3 install --upgrade -r requirements.txt
EXPOSE 8888
ENTRYPOINT jupyter notebook --ip=0.0.0.0 --allow-root --no-browser
requirements.txt
pandas==0.24.2
pillow==7.0.0
opencv-python==3.4.2.16
pdfminer==20191125
jupyter==1.0.0
$ docker build -t pdfminer -f ./Dockerfile .
$ docker run -it -v `pwd`:/api -p 8888:8888 --name pdfminer pdfminer bash
If the container is created successfully, Jupyter starts automatically, so create a Python file. The code below is a minimal setup that extracts the text and saves it to a text file. For this example, a PDF published by the Financial Services Agency is saved as test.pdf: https://www.fsa.go.jp/news/30/wp/supervisory_approaches_revised.pdf
test.py
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox, LTTextLine, LTChar
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdfminer_config(line_overlap, word_margin, char_margin, line_margin, detect_vertical):
    """Build the page interpreter and the layout-analysis device."""
    laparams = LAParams(line_overlap=line_overlap,
                        word_margin=word_margin,
                        char_margin=char_margin,
                        line_margin=line_margin,
                        detect_vertical=detect_vertical)
    resource_manager = PDFResourceManager()
    device = PDFPageAggregator(resource_manager, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)
    return (interpreter, device)


def find_textboxes(layout_obj):
    """Recursively collect every LTTextBox under layout_obj."""
    if isinstance(layout_obj, LTTextBox):
        return [layout_obj]
    if isinstance(layout_obj, LTContainer):
        boxes = []
        for child in layout_obj:
            boxes.extend(find_textboxes(child))
        return boxes
    return []


def find_textlines(layout_obj):
    """Collect every LTTextLine inside a text box."""
    if isinstance(layout_obj, LTTextLine):
        return [layout_obj]
    if isinstance(layout_obj, LTTextBox):
        lines = []
        for child in layout_obj:
            lines.extend(find_textlines(child))
        return lines
    return []


def find_characters(layout_obj):
    """Collect every LTChar inside a text line."""
    if isinstance(layout_obj, LTChar):
        return [layout_obj]
    if isinstance(layout_obj, LTTextLine):
        characters = []
        for child in layout_obj:
            characters.extend(find_characters(child))
        return characters
    return []


def write_text(text_file, text):
    text_file.write(text)


interpreter, device = pdfminer_config(line_overlap=0.5, word_margin=0.1, char_margin=2, line_margin=0.5, detect_vertical=True)

# Both files are closed automatically, even if parsing raises an error.
with open('./test.pdf', 'rb') as f, open('output.txt', 'w') as text_file:
    for page in PDFPage.get_pages(f):
        interpreter.process_page(page)  # Process the page.
        layout = device.get_result()  # Get the LTPage object.
        boxes = find_textboxes(layout)
        for box in boxes:
            write_text(text_file, box.get_text().strip())
If you don't get the text you want, adjust the parameters passed to LAParams. Changing char_margin, word_margin, and line_margin changes how characters are grouped. Set detect_vertical to True if the document contains vertical text, as Japanese documents often do.
test.py
interpreter, device = pdfminer_config(line_overlap=0.5, word_margin=0.1, char_margin=2.0, line_margin=0.5, detect_vertical=False)
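To get a feel for how char_margin changes the grouping, here is a deliberately simplified, one-dimensional model. It is not pdfminer's actual code: it only assumes that two neighbouring characters stay in the same group when the horizontal gap between them is smaller than char_margin times the width of the wider character. The function name group_by_char_margin and the (x0, x1) span representation are hypothetical, chosen only for illustration.

```python
def group_by_char_margin(spans, char_margin):
    """Group (x0, x1) character spans from left to right.

    Simplified model: a character joins the previous group when the gap to
    the previous character is smaller than char_margin times the wider of
    the two characters.
    """
    groups = []
    for span in sorted(spans):
        if groups:
            prev_x0, prev_x1 = groups[-1][-1]
            gap = span[0] - prev_x1
            widest = max(prev_x1 - prev_x0, span[1] - span[0])
            if gap < widest * char_margin:
                groups[-1].append(span)
                continue
        # Gap too large (or first character): start a new group.
        groups.append([span])
    return groups


# Three 10 pt wide characters separated by a 5 pt gap and a 30 pt gap.
spans = [(0, 10), (15, 25), (55, 65)]
print(len(group_by_char_margin(spans, char_margin=2.0)))  # 2
print(len(group_by_char_margin(spans, char_margin=4.0)))  # 1
```

In this model, raising char_margin merges text that is farther apart into one group, which is why a larger value tends to join columns while a smaller value tends to split them.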
The boxes obtained in the code above carry a lot of information:
- Text information
- Character position information (the unit is pt, so converting from pt to pixels is required when processing with OpenCV etc.)
print(boxes[0])
# >> <LTTextBoxHorizontal(0) 92.160,755.000,524.296,766.952 'However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.\n'>
print(boxes[0].get_text())
# >>However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.
print(boxes[0].bbox)
# >> (92.15997480600001, 754.9998879965001, 524.2961793060001, 766.9523361965001)
# >> The tuple is (x0, y0, x1, y1): (x0, y0) is the lower-left corner and (x1, y1) is the upper-right corner, measured from the bottom-left of the page.
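The pt-to-pixel conversion mentioned above can be sketched as follows. This is a minimal sketch, assuming the page was rendered to an image at a known DPI (for example with pdftoppm from poppler-utils) and that the page height in pt is known, for example from the height of the LTPage object. The function name bbox_pt_to_px and its parameters are hypothetical names chosen for illustration.

```python
def bbox_pt_to_px(bbox, page_height_pt, dpi=72):
    """Convert a bbox (x0, y0, x1, y1) in pt to pixel coordinates.

    Returns (left, top, right, bottom) with the origin at the top-left of
    the image, which is what image libraries such as OpenCV expect.
    """
    scale = dpi / 72.0  # 1 pt is 1/72 inch
    x0, y0, x1, y1 = bbox
    left = round(x0 * scale)
    right = round(x1 * scale)
    # PDF y grows upward from the bottom of the page; image y grows
    # downward from the top, so the y axis has to be flipped.
    top = round((page_height_pt - y1) * scale)
    bottom = round((page_height_pt - y0) * scale)
    return left, top, right, bottom


# A 720 pt tall page rendered at 144 DPI (2 pixels per pt).
print(bbox_pt_to_px((72, 72, 144, 144), page_height_pt=720, dpi=144))
# (144, 1152, 288, 1296)
```

The returned tuple can be used directly for slicing an image array, e.g. image[top:bottom, left:right], as long as the image was rendered at the same DPI that is passed to the function.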
Each box contains LTTextLine objects. So let's get them with find_textlines, which was defined but not used in the code above.
test.py
lines = find_textlines(boxes[0])
print(lines[0])
# >><LTTextLineHorizontal 92.160,755.000,524.296,766.952 'However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.\n'>
print(lines[0].get_text())
# >>However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.
print(lines[0].bbox)
# >> (92.15997480600001, 754.9998879965001, 524.2961793060001, 766.9523361965001)
Likewise, each line contains LTChar objects. In addition to the character itself and its position, an LTChar also holds the font.
test.py
characters = find_characters(lines[0])
print(characters[0])
# >><LTChar 92.160,755.000,104.160,766.952 matrix=[12.00,0.00,0.00,12.00, (92.16,756.68)] font='AHTYXM+MS-PGothic' adv=1.0 text='Or'>
print(characters[0].get_text())
# >>Or
print(characters[0].bbox)
# >> (92.15997480600001, 754.9998879965001, 104.16042480600001, 766.9523361965001)
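As one way to use the font information, the sketch below counts how many characters use each font. It relies only on the fontname attribute of LTChar and assumes that subset fonts are named in the '<TAG>+<FontName>' form seen in the output above (e.g. 'AHTYXM+MS-PGothic'). The helper name font_usage is hypothetical.

```python
from collections import Counter


def font_usage(characters):
    """Count characters per font, ignoring the subset prefix (e.g. 'AHTYXM+')."""
    counts = Counter()
    for ch in characters:
        # Keep only the part after '+', if there is one.
        name = ch.fontname.split('+', 1)[-1]
        counts[name] += 1
    return counts
```

With the characters list obtained above, font_usage(characters).most_common(1) would give the dominant font of the line, which can help to tell headings from body text.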
If I have time, I would like to show how to change the color of the extracted parts.