[PYTHON] Conversion from pdf to txt 1 [pdfminer]

Introduction

I needed to translate 20 PDF files written in English whose text cannot be copied, like the one below, so I wanted to extract the text and feed it to Google Translate or a similar service.

1.PNG

Purpose

Extract the text from a PDF file.

What was used

This time I used pdfminer (pdfminer.six). https://github.com/pdfminer/pdfminer.six

I also referred to the following article. https://qiita.com/mczkzk/items/894110558fb890c930b5

Process flow

1. At the prompt "Please input pdf path : ", enter the PDF file name.
2. Change the extension of the input file name to .txt and create a text file.
3. Write the extracted text to that file.

It is as simple as that.
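Step 2 can be sketched on its own as follows. This is only an illustration: the helper name `text_path_for` is hypothetical, and it assumes that only the final extension should be swapped, so that file names containing more than one dot (or none at all) are still handled.

```python
from pathlib import Path


def text_path_for(pdf_path):
    """Return the path of the .txt file that sits next to `pdf_path`."""
    # with_suffix replaces only the last extension (or appends one if missing)
    return str(Path(pdf_path).with_suffix(".txt"))


print(text_path_for("sample.pdf"))     # -> sample.txt
print(text_path_for("report.v2.pdf"))  # -> report.v2.txt
```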

Result

Running the program on the PDF file shown earlier gave the following result.

2.PNG

Only a single arrow was output. That was strange, so to check with another PDF file, I specified the following PDF created with Word.

3.PNG

The result is as follows.

4.PNG

The arrow appears here as well, but both the English and the Japanese text were extracted correctly, so the program does not seem to be the problem. I suspected that the PDF's protection was the cause and tried to remove it with "Print to PDF", but again only a single arrow was output.

Consideration

Since pdfminer itself was confirmed to work, the problem appears to lie in the PDF file. I think the cause is poor image quality, probably because the target PDF file was scanned.
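One way to spot this case automatically is to check whether the extracted text contains anything at all. pdfminer reads text that is embedded in the PDF and does not perform OCR, so a scanned page without a text layer yields no characters. The sketch below rests on an assumption: that the lone arrow is the form feed character pdfminer's TextConverter writes at the end of each page, which some editors display as an arrow-like symbol. The function name `has_extractable_text` is hypothetical.

```python
def has_extractable_text(text):
    """Return True if `text` holds anything besides whitespace and form feeds."""
    # Assumption: "\f" (form feed) is the page separator written by pdfminer
    return bool(text.replace("\f", "").strip())


print(has_extractable_text("\f"))              # -> False
print(has_extractable_text("Hello world\n\f"))  # -> True
```

If this returns False for the output file, the PDF probably needs OCR rather than plain text extraction.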

Program

pdfminer is so convenient that the program turned out very short.

pdf2text.py


import os

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

input_path = input("Please input pdf path : ")
# splitext splits off only the final extension, so paths with other dots still work
output_path = os.path.splitext(input_path)[0] + ".txt"

# The resource manager caches shared resources such as fonts across pages
manager = PDFResourceManager()

with open(output_path, "wb") as out_file:
    with open(input_path, "rb") as pdf_file:
        # TextConverter writes the extracted text to out_file, encoded as UTF-8
        with TextConverter(manager, out_file, codec="utf-8", laparams=LAParams()) as conv:
            interpreter = PDFPageInterpreter(manager, conv)
            for page in PDFPage.get_pages(pdf_file):
                interpreter.process_page(page)
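
The introduction mentions 20 PDF files, so it may be convenient to run the conversion over a whole folder. The following is a minimal sketch and not part of the original program: it assumes the `extract_text` helper from `pdfminer.high_level` in pdfminer.six instead of the lower-level classes used above. `convert_all` and its `extract` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path


def convert_all(folder, extract=None):
    """Write a .txt file next to every .pdf in `folder`; return the new paths."""
    if extract is None:
        # Imported lazily so another extractor can be passed in instead
        from pdfminer.high_level import extract_text as extract
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # extract() takes the PDF path and returns its text as a str
        txt_path.write_text(extract(str(pdf_path)), encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling `convert_all("pdfs")`, where `pdfs` is a hypothetical folder name, would then produce one text file per PDF. Scanned PDFs would still come out empty with this approach.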
