Convert PDF to text on the command line. No knowledge of Python required. About pdf2txt.py, which ships with pdfminer, and its adjustment parameters.

This article is the Day 18 entry in the Saison Information Systems Advent Calendar 2020.

I will explain the command first.

-- For those who arrived here after researching various things, I will describe the parameters of pdf2txt.py first.
-- They are described in detail here: https://www.unixuser.org/~euske/python/pdfminer/

python.exe pdf2txt.py [options] -o [OutFilename] [InFilename] 

Example:

c:\python38\python.exe c:\python38\Scripts\pdf2txt.py -M 3.0 -o c:\work\OutFile.txt c:\work\InFile.pdf
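For readers who do want to call the same command from a script, here is a minimal Python sketch that assembles the argument list shown above. The function names build_pdf2txt_command and run_pdf2txt are hypothetical helpers made up for this illustration; the paths are the ones from the example, and only the -M, -W, -L, -V and -o options discussed in this article are used.

```python
import subprocess


def build_pdf2txt_command(python_exe, script, in_pdf, out_txt,
                          char_margin=None, word_margin=None,
                          line_margin=None, vertical=False):
    """Build the argument list for pdf2txt.py (hypothetical helper)."""
    cmd = [python_exe, script]
    # Only pass an option when a value was given, so that
    # pdf2txt.py falls back to its own defaults otherwise.
    if char_margin is not None:
        cmd += ["-M", str(char_margin)]
    if word_margin is not None:
        cmd += ["-W", str(word_margin)]
    if line_margin is not None:
        cmd += ["-L", str(line_margin)]
    if vertical:
        cmd.append("-V")
    cmd += ["-o", out_txt, in_pdf]
    return cmd


def run_pdf2txt(cmd):
    """Run the command; raises CalledProcessError on a nonzero exit code."""
    subprocess.run(cmd, check=True)


# Same values as the example command in this article.
cmd = build_pdf2txt_command(
    r"c:\python38\python.exe",
    r"c:\python38\Scripts\pdf2txt.py",
    r"c:\work\InFile.pdf",
    r"c:\work\OutFile.txt",
    char_margin=3.0,
)
print(cmd)
```

Passing the arguments as a list rather than one string avoids quoting problems when a path contains spaces. Calling run_pdf2txt(cmd) would actually perform the conversion, assuming the paths exist on your machine.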

Commentary

These are the parameters that mainly need adjusting to get the text read correctly.

(I saw a statement somewhere that the conversion result is often poor if the defaults are left as they are.)

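-- Set the adjustments in [options]: character spacing (-M), word spacing (-W), line spacing (-L), and vertical reading (-V).
-- If a parameter is not specified, its default value is used: M = 1.0, W = 0.2, L = 0.3, with horizontal reading. These are the values given on the linked page; the defaults can differ between pdfminer versions, so check the output of pdf2txt.py --help for the version you have installed.
-- If you want to read vertical text, specify -V.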

Below is the relevant passage from the linked URL (via Google Translate).

These are the parameters used for layout analysis. In an actual PDF file, the text may be split into several chunks partway through a run, depending on the authoring software, so text extraction needs to splice the chunks back together. In the figure (see the linked page), two text chunks that are closer than char_margin (shown as M) are considered continuous and are grouped into one. Likewise, two lines that are closer than line_margin (L) are grouped into a text box, a rectangular area that contains a "cluster" of text. Furthermore, a blank between words may not be represented as a space character but only indicated by the position of each word, so spaces may need to be inserted where the distance between two words is greater than word_margin (W). Each value is specified not as an actual length but as a ratio to the size of the characters in question. The default values are M = 1.0, L = 0.3, and W = 0.2, respectively.
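To make the roles of M and W more concrete, here is a toy Python model of a single line of text. It is only an illustration of the idea described above, not pdfminer's actual algorithm: the Chunk class, the join_line function, and the coordinates are all made up for this sketch, and line_margin (L) is left out because only one line is modelled.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    x0: float    # left edge of the chunk
    x1: float    # right edge of the chunk
    size: float  # character size, the unit the margins are ratios of


def join_line(chunks, char_margin=1.0, word_margin=0.2):
    """Toy model: group chunks on one line and insert spaces between words."""
    groups = []
    current = ""
    prev = None
    for chunk in sorted(chunks, key=lambda c: c.x0):
        if prev is None:
            current = chunk.text
        else:
            gap = chunk.x0 - prev.x1
            if gap >= char_margin * chunk.size:
                # Farther apart than M: treat as a separate group.
                groups.append(current)
                current = chunk.text
            elif gap > word_margin * chunk.size:
                # Same group, but wider than W: insert a space.
                current += " " + chunk.text
            else:
                # Close enough to be part of the same word.
                current += chunk.text
        prev = chunk
    if current:
        groups.append(current)
    return groups


# Hypothetical chunks with a character size of 10 units.
line = [
    Chunk("He", 0, 10, 10),
    Chunk("llo", 11, 25, 10),     # gap 1:  joined without a space
    Chunk("world", 30, 55, 10),   # gap 5:  joined with a space
    Chunk("Page", 100, 120, 10),  # gap 45: a separate group at M = 1.0
]
print(join_line(line))
print(join_line(line, char_margin=5.0))
```

In this toy model, raising char_margin pulls the distant "Page" chunk into the same group, which is the kind of effect the -M 3.0 setting in the example command is aiming for.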

There are other parameters such as image export, but they are omitted in this article. For details, please refer to the following. https://www.unixuser.org/~euske/python/pdfminer/


Now, here is the overview again.

-- pdfminer, a Python library, includes a script called pdf2txt.py that works simply by calling it.
-- You can convert PDF to text with pdf2txt.py, but the result may not be what you expect.
-- Even then, pdfminer has adjustment parameters, so the text can often be read correctly once they are tuned.
-- There are other write-ups about pdfminer, but when I looked for ones covering the convenient pdf2txt.py, I found no material in Japanese, so I am writing it down here.
-- Because it runs as a command, it is convenient to call from ETL tools such as DataSpider and from job controllers.


Procedure for using it as a Python executable module

No Python development environment is required; only a runtime environment is needed. (I had hoped to just link to an existing guide, but surprisingly I could not find one T_T)

-- Install the Python runtime environment (Windows)
https://www.python.org/downloads/
Select Windows, select the latest release, then select the package with the installer, download it, and install it. The installation destination varies from person to person, but I personally like pythonxxx (xxx is the version, e.g. 391 for 3.9.1) directly under the drive.

-- Install pdfminer.six with pip. At the command prompt:

pip install pdfminer.six

If there is a pdf2txt.py file under Scripts in the Python folder, it should work.

By the way, while writing this article I noticed that the author of the very convenient pdfminer appears to be Japanese: Yusuke Shinyama. Thank you very much.

That's all.

If there is any problem with this article, I would appreciate it if you could let me know. The content of this article is written in a personal capacity and is not an official statement of the company.
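If you would rather check this from Python than by browsing folders, a small sketch like the following can do it. The helper names is_installed and find_pdf2txt are hypothetical, and the sketch assumes that pip placed pdf2txt.py in the Scripts directory of the interpreter that runs it.

```python
import importlib.util
import sysconfig
from pathlib import Path


def is_installed(module_name):
    """Return True if the named top-level module can be imported."""
    return importlib.util.find_spec(module_name) is not None


def find_pdf2txt(scripts_dir=None):
    """Return the path to pdf2txt.py in the Scripts folder, or None."""
    if scripts_dir is None:
        # Scripts directory of the running interpreter
        # (e.g. c:\python38\Scripts on Windows).
        scripts_dir = sysconfig.get_path("scripts")
    candidate = Path(scripts_dir) / "pdf2txt.py"
    return candidate if candidate.is_file() else None


print("pdfminer importable:", is_installed("pdfminer"))
print("pdf2txt.py found at:", find_pdf2txt())
```

If the second line prints None, the script may have been installed for a different Python installation than the one running this check.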
