[PYTHON] Try using PDFMiner

Introduction

I needed to parse a PDF file, and I wanted to do it in Python for now. A library called PDFMiner looked useful, so I tried it out.

PDFMiner

http://www.unixuser.org/~euske/python/pdfminer/index.html
https://github.com/euske/pdfminer/

Creating a Python2 environment

I only had a Python 3 environment at hand, so I added a Python 2 environment with pyenv.

$ pyenv install 2.7.13
...(output omitted)...
$ pyenv local 2.7.13
$ python --version
Python 2.7.13

It's done.
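
The same check can be made from inside Python, which is handy at the top of a script that expects Python 2. This is only a minimal sketch; the helper name `is_python2` is made up for illustration.

```python
import sys


def is_python2(version_info=None):
    """Return True when the interpreter's major version is 2."""
    if version_info is None:
        version_info = sys.version_info
    return version_info[0] == 2


# Report which interpreter pyenv actually selected.
print("Python %d.%d.%d, Python 2: %s"
      % (tuple(sys.version_info[:3]) + (is_python2(),)))
```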

PDF analysis with PDFMiner

Installation

$ git clone https://github.com/euske/pdfminer.git
Cloning into 'pdfminer'...
remote: Counting objects: 3164, done.
remote: Total 3164 (delta 0), reused 0 (delta 0), pack-reused 3164
Receiving objects: 100% (3164/3164), 6.01 MiB | 406.00 KiB/s, done.
Resolving deltas: 100% (2245/2245), done.
$ cd ./pdfminer
$ make cmap
...(output omitted)...
$ python ./setup.py install
...(output omitted)...
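
To confirm that the install step made the package visible to the interpreter, a quick import check may help. This is a sketch; `is_importable` is a hypothetical helper, and `pdfminer` is assumed to be the package name installed by `setup.py`.

```python
def is_importable(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


# Assumption: the package installed above is importable as "pdfminer".
print("pdfminer importable: %s" % is_importable("pdfminer"))
```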

Trial

Let's give it a try.

$ cat ./samples/simple1.pdf | head
%PDF-1.4
1 0 obj
<<
 /Type /Catalog
 /Outlines 2 0 R
 /Pages 3 0 R
>>
endobj
2 0 obj
<<
$ ./tools/pdf2txt.py ./samples/simple1.pdf
Hello

World

Hello

World

H e l l o

W o r l d

H e l l o

W o r l d


The pdf2txt.py tool seems to work fine.
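
Since the goal was to parse PDFs from Python, here is a rough sketch of calling the library directly instead of going through the command-line tool. It follows the general shape of what pdf2txt.py appears to do, using the class names as I understand them from the PDFMiner source (`PDFResourceManager`, `TextConverter`, `PDFPageInterpreter`, `PDFPage`); I have not checked it against every PDFMiner version. The function name `pdf_to_text` is made up for illustration.

```python
import io


def pdf_to_text(pdf_path):
    """Extract the text of every page in `pdf_path` as a unicode string."""
    # Imported here so that this file still loads where PDFMiner is missing.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    output = io.BytesIO()
    # The converter writes UTF-8 encoded text into `output`.
    device = TextConverter(resource_manager, output,
                           codec="utf-8", laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    try:
        with open(pdf_path, "rb") as pdf_file:
            for page in PDFPage.get_pages(pdf_file):
                interpreter.process_page(page)
        return output.getvalue().decode("utf-8")
    finally:
        device.close()
```

Calling `pdf_to_text("./samples/simple1.pdf")` from the cloned directory should return text similar to the pdf2txt.py output above, though the exact whitespace may differ.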
