Introduction

Open data is a political movement that seeks to make government data freely processable, freely redistributable, and freely available for commercial use. It is currently attracting attention from the perspectives of political transparency and economic revitalization, and the Japanese government has in fact begun releasing data.

-> Reference site: Open DATA METI | Ministry of Economy, Trade and Industry's open data catalog site

However, one problem with open data in Japan is that much of it is released as 1-star (☆) open data. Open data is rated on a five-star scale according to its openness.

1-star open data, typically PDF, is considered the most closed because it is not structured data.

However, it is difficult to explain the importance of machine readability to civil servants who are not tech-savvy. Even if they understand it, it is doubtful whether they can allocate a budget for it. In practice, we have to deal with PDF head-on.

PDFMiner is an interesting tool for that.

PDFMiner is a Python library mainly for extracting and analyzing text information from PDF files. Judging from Google Trends, it seems to have been attracting attention since around 2011.

Applications that convert PDF to TXT / HTML already exist, but PDFMiner manages the components of a PDF page as a tree structure (e.g. LTPage -> LTTextBox -> LTTextLine -> LTChar, LTText), which allows finer adjustments.

For example, I think we could prepare programs that convert the standard PDF format of each ministry and agency to TXT.
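The tree structure mentioned above can be explored with a small recursive walker. The following is a minimal sketch, not PDFMiner's own API: `walk_layout` is a hypothetical helper, and it assumes only that container objects (such as LTPage, LTTextBox and LTTextLine) can be iterated over and that text objects expose a `get_text()` method. The Stub classes are stand-ins so that the example runs without a PDF.

```python
def walk_layout(node, depth=0):
    """Yield (depth, type name, text) for a node and all of its descendants."""
    get_text = getattr(node, "get_text", None)
    text = get_text() if callable(get_text) else None
    yield depth, type(node).__name__, text
    # Assumption: containers are iterable, leaf objects are not.
    try:
        children = iter(node)
    except TypeError:
        return
    for child in children:
        for row in walk_layout(child, depth + 1):
            yield row


# Stand-in objects for illustration only; these are NOT pdfminer classes.
class StubChar(object):
    def __init__(self, char):
        self._char = char

    def get_text(self):
        return self._char


class StubLine(list):
    """Plays the role of a text line: a container of characters."""


class StubBox(list):
    """Plays the role of a text box: a container of lines."""


class StubPage(list):
    """Plays the role of a page: a container of boxes."""


page = StubPage([StubBox([StubLine([StubChar("H"), StubChar("i")])])])
rows = list(walk_layout(page))
for depth, name, text in rows:
    # Indent each entry by its depth in the tree.
    print("%s%s %r" % ("  " * depth, name, text))
```

With a real PDF, the same function could be applied to each page object produced by PDFMiner's layout analysis, for example to filter by depth or type before writing text out.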

Installation procedure

Follow the steps below to install PDFMiner.

Install Python (2.4 <= version < 3.0).
Download and unzip the source.
Run setup.py in the console (terminal).

python setup.py install
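Because the supported range is 2.4 <= version < 3.0, it may be worth checking the interpreter before running setup.py. The following is a minimal sketch; `is_supported_python` is a hypothetical helper name, not part of PDFMiner, and it only encodes the version range stated in the steps above.

```python
import sys


def is_supported_python(version_info):
    """Return True if the version satisfies 2.4 <= version < 3.0."""
    major_minor = tuple(version_info[:2])
    return (2, 4) <= major_minor < (3, 0)


# Report whether the interpreter running this script is in range.
current = "%d.%d" % tuple(sys.version_info[:2])
if is_supported_python(sys.version_info):
    print("Python %s is within the supported range." % current)
else:
    print("Python %s is outside the range 2.4 <= version < 3.0." % current)
```

Comparing tuples keeps the check correct for two-digit minor versions, which a string comparison would get wrong.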

Check the operation after installation.

pdf2txt.py samples/simple1.pdf

# If "Hello World" is printed repeatedly after running the command, the installation is OK.
# -> The text was successfully extracted from the sample PDF.

Perform an additional installation step to handle CJK (Chinese, Japanese, Korean) characters.

make cmap
python setup.py install

Windows does not have make, so run the following commands instead.

mkdir pdfminer\cmap
python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
python setup.py install
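The four conv_cmap.py invocations above follow one pattern, so they can also be generated from a small table. The following is a sketch under the assumption that it is run from the root of the unzipped source tree; `CMAP_JOBS`, `build_cmap_commands` and `run_all` are hypothetical names, while the encodings, codecs and paths are copied from the commands above.

```python
import os
import subprocess

# (character collection, [(encoding name, Python codec), ...]),
# copied from the command lines above.
CMAP_JOBS = [
    ("Adobe-CNS1", [("B5", "cp950"), ("UniCNS-UTF8", "utf-8")]),
    ("Adobe-GB1", [("GBK-EUC", "cp936"), ("UniGB-UTF8", "utf-8")]),
    ("Adobe-Japan1", [("RKSJ", "cp932"), ("EUC", "euc-jp"),
                      ("UniJIS-UTF8", "utf-8")]),
    ("Adobe-Korea1", [("KSC-EUC", "euc-kr"), ("KSC-Johab", "johab"),
                      ("KSCms-UHC", "cp949"), ("UniKS-UTF8", "utf-8")]),
]

OUT_DIR = os.path.join("pdfminer", "cmap")


def build_cmap_commands(python="python"):
    """Return one argument list per conv_cmap.py invocation."""
    commands = []
    for collection, codecs in CMAP_JOBS:
        argv = [python, os.path.join("tools", "conv_cmap.py")]
        for name, codec in codecs:
            argv += ["-c", "%s=%s" % (name, codec)]
        # e.g. Adobe-Japan1 -> cid2code_Adobe_Japan1.txt
        source = "cid2code_%s.txt" % collection.replace("-", "_")
        argv += [OUT_DIR, collection, os.path.join("cmaprsrc", source)]
        commands.append(argv)
    return commands


def run_all():
    """Create the output directory and run every conversion in order."""
    if not os.path.isdir(OUT_DIR):
        os.makedirs(OUT_DIR)
    for argv in build_cmap_commands():
        subprocess.check_call(argv)
```

Calling `run_all()` and then running `python setup.py install` should be equivalent to the manual steps; `os.path.join` is used so that the same script produces the correct path separators on each platform.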

Command line tools

PDFMiner comes with two command line tools. One is pdf2txt.py, which converts the specified PDF to TXT / HTML.

pdf2txt.py  -o output.txt input.pdf
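When many PDFs have to be converted, the same command can be generated per file. The following is a minimal sketch that uses only the `-o` option shown above; `pdf2txt_command` and `batch_commands` are hypothetical helper names, and how pdf2txt.py is found on the PATH depends on the installation.

```python
import os


def pdf2txt_command(pdf_path, out_dir):
    """Build the pdf2txt.py argument list for a single PDF file."""
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    out_path = os.path.join(out_dir, base + ".txt")
    return ["pdf2txt.py", "-o", out_path, pdf_path]


def batch_commands(pdf_dir, out_dir):
    """Build one command per *.pdf file found directly in pdf_dir."""
    commands = []
    for name in sorted(os.listdir(pdf_dir)):
        if name.lower().endswith(".pdf"):
            commands.append(pdf2txt_command(os.path.join(pdf_dir, name), out_dir))
    return commands


# Example: show the command for one file without running it.
print(" ".join(pdf2txt_command("input.pdf", "out")))
```

Each returned list can be passed to `subprocess.check_call` to run the conversion; building the commands separately makes it easy to inspect them first.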

The other is dumppdf.py, a debugging tool that outputs the specified contents of a PDF in a pseudo-XML format. It can also be used to extract only specific elements such as images.

dumppdf.py -a foo.pdf

For detailed specifications, see the original article.

API: Reading the sample code after studying the tree diagram in the API explanatory material (English) will deepen your understanding. There also seems to be a page that introduces more detailed examples (English), and a Japanese blog article on the topic already exists.

To be continued

Give it a try now, and enjoy your mining life!

[PYTHON] Thoroughly tackling PDF open data: PDF text analysis starting with PDFMiner.
