Read table data in PDF file with Python

PDF data

People seem to love PDF, and even those who say they hate it still have to deal with it. Still, it is natural to feel that spending hours on it is a waste of time. Sometimes table data is only available as a PDF, and a super convenient library called tabula-py turned out to be useful in exactly such cases, so I'm making a note of it.

https://github.com/chezou/tabula-py

About tabula

tabula is a Java library for extracting tables from PDF files, and tabula-py is a Python wrapper around it. Therefore, you need to install Java to use it.

After installing Java, you can use the Python library by doing the following.

$ pip install tabula-py
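Because tabula-py depends on Java, it can help to confirm that a Java launcher is reachable before reading any PDF. The following is only a minimal sketch: java_available is a hypothetical helper name, and it assumes Java is found through the PATH environment variable rather than some other mechanism.

```python
import shutil


def java_available(executable="java"):
    """Return True if a launcher with this name is found on PATH.

    Hypothetical helper, not part of tabula-py.
    """
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if java_available():
        print("Java found on PATH")
    else:
        print("Java not found on PATH; install it before using tabula-py")
```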

How to Use

It's easy to use: the read_pdf function reads the tables in a PDF file. As an example, I use the Ministry of Health, Labour and Welfare document listing the number of people who tested positive for the new coronavirus (excluding those who returned on charter flights) and the number of people who received PCR tests (https://www.mhlw.go.jp/content/10906000/000618483.pdf).


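from tabula import read_pdf

# read_pdf accepts a URL as well as a local path.
# In tabula-py 2.x the result is a list of DataFrames, one per detected table.
df = read_pdf("https://www.mhlw.go.jp/content/10906000/000618483.pdf")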

The result of reading the table is displayed as below.

read_pdf.png

The output looks like this because the PDF contains multiple tables. Next, specify which table to retrieve.
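One way to choose a table is to list the shape of each one and then index into the result. This is a sketch rather than the exact code behind the screenshots: read_tables and describe_tables are hypothetical helpers, and it assumes tabula-py 2.x, where read_pdf with multiple_tables=True returns a list of DataFrames.

```python
def read_tables(source):
    """Return every table tabula-py detects in the PDF as a list of DataFrames."""
    # Imported lazily because it needs tabula-py and a working Java install.
    from tabula import read_pdf

    return read_pdf(source, multiple_tables=True)


def describe_tables(tables):
    """Return (index, rows, columns) for each table so you can pick one."""
    return [(i, t.shape[0], t.shape[1]) for i, t in enumerate(tables)]


if __name__ == "__main__":
    tables = read_tables("https://www.mhlw.go.jp/content/10906000/000618483.pdf")
    for index, rows, cols in describe_tables(tables):
        print(index, rows, cols)
    first = tables[0]  # select a single table by its position in the list
    print(first.head())
```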

table1.png

As you can see above, the table comes back as a pandas DataFrame, which is super convenient. In this PDF file, the data is laid out in two side-by-side blocks of columns, so the two halves need to be joined into one table. Since they are DataFrames, you can use the pandas concat function for this.
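The joining step can be sketched as follows. The data and column names here ("name", "count", and their ".1" duplicates) are made up for illustration and are not the columns of the ministry PDF; the point is only the pattern of splitting the two blocks, aligning their column names, and stacking them with concat.

```python
import pandas as pd

# Hypothetical stand-in for a table whose rows continue in a second block of columns.
wide = pd.DataFrame(
    {
        "name": ["A", "B"],
        "count": [1, 2],
        "name.1": ["C", "D"],
        "count.1": [3, 4],
    }
)

# Split into the left and right blocks.
left = wide[["name", "count"]]
right = wide[["name.1", "count.1"]].rename(
    columns={"name.1": "name", "count.1": "count"}
)

# Stack the two blocks vertically and renumber the rows.
combined = pd.concat([left, right], ignore_index=True)
print(combined)
```

Renaming the right-hand columns first matters: concat aligns on column names, so mismatched names would produce extra columns filled with NaN instead of one long table.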

table2.png

Since it is a data frame, it is easy to visualize.
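For example, a bar chart takes a single call to the DataFrame's plot accessor. This sketch assumes matplotlib is installed and again uses made-up data and column names rather than the real table.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook

import pandas as pd

# Hypothetical data standing in for the combined table.
counts = pd.DataFrame({"name": ["A", "B", "C"], "count": [1, 2, 3]})

# One bar per row, labelled by the "name" column.
ax = counts.plot.bar(x="name", y="count", legend=False)
ax.set_ylabel("count")
```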

table3.png

And just like that, you can easily get PDF table data by using tabula-py!
