Introduction

I needed to translate 20 PDF files written in English whose text cannot be copied, so I wanted to extract the text and feed it to Google Translate or a similar service.

Purpose

Extract text from a PDF file.

What was used

This time I used pdfminer: https://github.com/pdfminer/pdfminer.six

I also referred to the following article: https://qiita.com/mczkzk/items/894110558fb890c930b5

Process flow

1. At the prompt "Please input pdf path : ", enter the PDF file name.
2. Change the extension of the input file name to .txt and create a text file.
3. Write the result to that file.

It is a simple operation.
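Step 2 can be sketched on its own as follows. This is a minimal illustration, not part of the original program: the helper name `text_path_for` is hypothetical, and it assumes the output file should sit next to the input file with only the last extension replaced.

```python
from pathlib import Path


def text_path_for(pdf_path: str) -> Path:
    """Return the path of the .txt file that sits next to the given PDF."""
    # with_suffix replaces only the last extension, so "report.v2.pdf"
    # becomes "report.v2.txt" and dots in directory names are left alone.
    return Path(pdf_path).with_suffix(".txt")


if __name__ == "__main__":
    for name in ["sample.pdf", "report.v2.pdf", "docs.old/sample.pdf"]:
        print(name, "->", text_path_for(name))
```

Replacing only the final suffix keeps file names that contain extra dots intact, which a plain split on "." would not.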

Result

The result of running the program on the PDF file mentioned earlier is as follows.

Only a single arrow was output. That was strange, so to check with another PDF file, I ran it on the following PDF created with Word.

The result is as follows.

The arrow appears again, but both the English and the Japanese text are extracted correctly, so the program does not seem to be the problem. I suspected that the PDF's protection was the cause, so I tried to remove it with "Print to pdf", but again only a single arrow was output.

Consideration

Since pdfminer itself was confirmed to work, the problem most likely lies in the PDF file. The target PDF was probably scanned, so I think the cause is poor image quality.
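One way to test this guess is to check whether the extracted output contains any visible text at all. pdfminer's TextConverter writes a form feed character ("\f") at the end of every page, and some editors display it as an arrow, so the lone arrow is probably that page break rather than text from the PDF (an assumption, since the original output is not shown here). A scanned page is usually stored as an image with no text layer, in which case nothing else would be written. The helper name `has_extractable_text` below is hypothetical; this is a sketch, not part of the original program.

```python
def has_extractable_text(extracted: str) -> bool:
    """Return True if the extracted output contains any visible character."""
    # str.strip() removes spaces, newlines and form feeds ("\f"), so output
    # that consists only of page breaks counts as empty.
    return bool(extracted.strip())


if __name__ == "__main__":
    print(has_extractable_text("\f"))         # only a page break
    print(has_extractable_text("Hello\n\f"))  # text followed by a page break
```

If this returns False for the contents of the generated .txt file, the PDF most likely has no text layer, and an OCR step would be needed before translation.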

Program

Because pdfminer is so convenient, the program turned out to be very short.

`pdf2text.py`


import os

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

input_path = input("Please input pdf path : ")
# splitext removes only the last extension, so paths containing other dots still work
output_path = os.path.splitext(input_path)[0] + ".txt"

# Holds fonts and other resources shared between pages
manager = PDFResourceManager()

with open(output_path, "wb") as output:
    # Named pdf_file so the built-in input() is not shadowed
    with open(input_path, "rb") as pdf_file:
        # TextConverter writes the extracted text to the output file as UTF-8
        with TextConverter(manager, output, codec="utf-8", laparams=LAParams()) as conv:
            interpreter = PDFPageInterpreter(manager, conv)
            for page in PDFPage.get_pages(pdf_file):
                interpreter.process_page(page)

[PYTHON] Conversion from pdf to txt 1 [pdfminer]