[Automation] Extract the table in PDF with Python

PDF is awkward

Working with PDF files in Python is hard.

In some cases, the table is embedded in the PDF.

For example

A lot of table data is embedded in 2018 Zeneiren Stress Check Service Implementation Result Report. I will.

For example, suppose you want to extract [Table 14 Percentage of people with high stress by industry] on page 40 of the file.

Select this table and copy and paste it into Excel. pdf_table_copy.png

Copy and paste into Excel.

pdf_copy2excel.png

Oh? It doesn't work. All the data is listed in one cell.

In fact, you can use Python to convert the tables in this PDF to csv or Excel relatively easily.

Convert PDF table to csv in Python

The procedure to convert a table in PDF to csv or excel with Python is 2 steps.

Step 1. Extract the table from PDF as a pandas DataFrame Step 2. Write DataFrame as csv or excel

Let's look at them in order.

Step 1. Extract the table from PDF as a pandas DataFrame

To extract a pdf table as a DataFrame, use a module called ** tabula **.

I think that there are many people who do not have tabula installed. pip install tabula-py Install with.

In addition, this tabula runs on Java, so you also need to install Java.

Now that tabula is ready, import pandas and tabula.

python



import pandas as pd
import tabula

To extract a table from PDF tabula.read_pdf("xxx.pdf", lattice=True, pages='xxxx') Use the function.

--In " xxx.pdf ", write the path of the PDF file you want to read. --lattice = True is an option to determine cells by table borders. If the table you want to extract is separated by a ruled line, specify lattice = True. --pages specifies the pages you want to load. --If you want to read only the 40th page, specify as pages = '40'. --If you want to load pages 40-45, specify as pages = '40 -45'. --If you want to load all pages, you can use pages ='all'.

--The function returns a `list of pandas.DataFrame``. If you have multiple tables, you can retrieve this list in order with a for statement.

Now, let's read the table on page 40 of the "2018 Zeneiren Stress Check Service Implementation Result Report".

python



# lattice=True to determine cells by table axis
dfs = tabula.read_pdf("2018 Zeneiren stress check service implementation result report.pdf", lattice=True, pages = '40')

for df in dfs:
    display(df)

tabula_df.png

It's strange that \ r is included because the column name was broken in the cell, but the contents of the table can be extracted well.

For the time being, let's correct the column names properly. You can change the column name with df.rename (columns = {'original column name':'changed column name'}). When changing multiple column names, it is OK if you connect the 'original column name':' the changed column name' by separating them with commas.

python


df = df.rename(columns={'High stress\Number of people': 'High stressス者数', 'High stressス\Percentage of r people': 'High stressス者の割合'})

Step 2. Write DataFrame as csv or excel

Originally, pandas has a function to write DataFrame as csv or Excel.

Save as csv: df.to_csv ("filename.csv", index = None) Save as Excel: df.to_excel ("filename.xlsx", index = None)

Summary

Finally, I will summarize this code.

pyhon


import pandas as pd
import tabula
 
# lattice=True to determine cells by table axis
dfs = tabula.read_pdf("2018 Zeneiren stress check service implementation result report.pdf", lattice=True, pages = '40')

#Check if you can get the PDF table properly
for df in dfs:
    display(df)

# csv/Save as Excel(This time dfs[0]only)
df = dfs[0].rename(columns={'High stress\Number of people': 'High stressス者数', 'High stressス\Percentage of r people': 'High stressス者の割合'})
df.to_csv("PDF table.csv", index=None) # csv
df.to_excel("PDF table.xlsx", index=None) # Excel

reference

Here are other automation series I wrote. If you are interested, please!

[Automation] Read a Word document with Python https://qiita.com/konitech913/items/c30236bdf47775535e2f

[Automation] Convert Python code into an exe file https://qiita.com/konitech913/items/6259f13e057bc25ebc23

[Automation] Send Outlook email with Python https://qiita.com/konitech913/items/51867dbe24a2a4272bb6

[Automation] Read Outlook emails with Python https://qiita.com/konitech913/items/8a285522b0c118d5f905

[Automation] Read mail (msg file) with Python https://qiita.com/konitech913/items/fa0cf66aad27d16258c0

[Automation] Operate the clipboard with Python and paste the table into Excel https://qiita.com/konitech913/items/83975332e395a387eace

Recommended Posts

[Automation] Extract the table in PDF with Python
Read table data in PDF file with Python
Extract the table of image files with OneDrive & Python
Convert the image in .zip to PDF with Python
Extract the xz file with python
[Automation] Extract Outlook appointments with Python
Display Python 3 in the browser with MAMP
Associate the table set in python models.py
[Python] Get the files in a folder with Python
Rasterize PDF in Python
UI Automation in Python
Load the network modeled with Rhinoceros in Python ②
Extract email attachments received in Thunderbird with Python
Load the network modeled with Rhinoceros in Python ①
Try translating with Python while maintaining the PDF layout
[Python] Get the numbers in the graph image with OCR
Crawl the URL contained in the twitter tweet with python
Get the result in dict format with Python psycopg2
Write letters in the card illustration with OpenCV python
[Python] How to rewrite the table style with python-pptx [python-pptx]
Extract the band information of raster data with python
Scraping with selenium in Python
Working with LibreOffice in Python
Download the file in Python
Find the difference in Python
Debugging with pdb in Python
OCR from PDF in Python
Working with sounds in Python
Scraping with Selenium in Python
Scraping with Tor in Python
Tweet with image in Python
Combined with permutations in Python
Integrate PDF files with Python
UI Automation Part 2 in Python
Call the API with python3.
Extract images and tables from pdf with python to reduce the burden of reporting
Try scraping the data of COVID-19 in Tokyo with Python
How to extract any appointment in Google Calendar with Python
Probably the easiest way to create a pdf with Python3
Manipulate the clipboard in Python and paste the table into Excel
[Homology] Count the number of holes in data with Python
Automatically generate frequency distribution table in one shot with Python
Extract multiple list duplicates in Python
Number recognition in images with Python
Testing with random numbers in Python
Getting the arXiv API in Python
Extract the maximum value with pandas.
GOTO in Python with Sublime Text 3
Working with LibreOffice in Python: import
Convert markdown to PDF in Python
Python in the browser: Brython's recommendation
Save the binary file in Python
[Automation with python! ] Part 1: Setting file
Scraping with Selenium in Python (Basic)
Hit the Sesami API in Python
CSS parsing with cssutils in Python
[Automation] Send Outlook email with Python
Numer0n with items made in Python
Get the weather with Python requests
Get the weather with Python requests 2
Open UTF-8 with BOM in Python