Introduction

I needed to parse a PDF file and decided to do it in Python for now. A library called PDFMiner looked useful, so I tried it out.

PDFMiner

http://www.unixuser.org/~euske/python/pdfminer/index.html
https://github.com/euske/pdfminer/

Creating a Python 2 environment

I only had a Python 3 environment at hand, so I added a Python 2 environment with pyenv.

$ pyenv install 2.7.13
...(output omitted)...
$ pyenv local 2.7.13
$ python --version
Python 2.7.13

The Python 2 environment is ready.

PDF analysis with PDFMiner

Installation

$ git clone https://github.com/euske/pdfminer.git
Cloning into 'pdfminer'...
remote: Counting objects: 3164, done.
remote: Total 3164 (delta 0), reused 0 (delta 0), pack-reused 3164
Receiving objects: 100% (3164/3164), 6.01 MiB | 406.00 KiB/s, done.
Resolving deltas: 100% (2245/2245), done.
$ cd ./pdfminer
$ make cmap
...(output omitted)...
$ python ./setup.py install
...(output omitted)...
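
A quick import check is one way to confirm that the install step put the package where the active interpreter can find it. The sketch below is only an illustration: the helper name installed_version is hypothetical, and it assumes the package exposes a __version__ attribute, falling back to "unknown" if it does not.

```python
import importlib


def installed_version(module_name):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Not every package defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


print(installed_version("pdfminer"))
```

If this prints None, the interpreter selected by pyenv is probably not the one that setup.py installed into.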

Trial

I tried it on one of the sample PDFs bundled with the repository.

$ cat ./samples/simple1.pdf | head
%PDF-1.4
1 0 obj
<<
 /Type /Catalog
 /Outlines 2 0 R
 /Pages 3 0 R
>>
endobj
2 0 obj
<<
$ ./tools/pdf2txt.py ./samples/simple1.pdf
Hello

World

Hello

World

H e l l o

W o r l d

H e l l o

W o r l d

The pdf2txt.py tool appears to extract the text correctly.
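
The same extraction can presumably be done from Python code instead of the command line. The following is a minimal sketch modeled on the approach pdf2txt.py takes, assuming the pdfminer package installed above. The function name pdf_to_text is hypothetical, and the imports sit inside the function only so that the snippet can be loaded on a machine without PDFMiner.

```python
import io


def pdf_to_text(path):
    """Extract the text of every page in the PDF at `path` (sketch)."""
    # Imported lazily so this file can be loaded without PDFMiner installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    output = io.BytesIO()  # the converter writes UTF-8 encoded bytes here
    device = TextConverter(resource_manager, output, codec="utf-8",
                           laparams=LAParams())
    try:
        interpreter = PDFPageInterpreter(resource_manager, device)
        with open(path, "rb") as fp:
            # Feed each page through the interpreter; text lands in `output`.
            for page in PDFPage.get_pages(fp):
                interpreter.process_page(page)
    finally:
        device.close()
    return output.getvalue().decode("utf-8")
```

Calling pdf_to_text("./samples/simple1.pdf") from the repository root should return text comparable to the pdf2txt.py output above.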
