PDF is a convenient format for handing data to people or distributing it together with other material in a report, but because the layout is fixed, it is often troublesome when you want to reuse the data. I was the one who put a table of several thousand rows into an A4 report in the first place, but when I wanted to use the data later, the original was gone and I had to extract it from the PDF. That is why I wrote this.
Write the code below. tabula itself is a Java library, so Java needs to be installed separately. The Python module (tabula-py) is just a wrapper around it.
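Because tabula runs on Java, a missing Java runtime is a likely reason for the script to fail before it reads anything. The following is a minimal sketch, not part of the original script, that checks whether a `java` executable can be found on PATH using only the standard library. The helper name `find_java` is hypothetical and chosen only for illustration.

```python
import shutil


def find_java():
    """Return the path to the `java` executable, or None if it is not on PATH."""
    # shutil.which searches PATH the same way the shell would
    return shutil.which("java")


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("Java was not found on PATH; install it before using tabula.")
    else:
        print("Java found at:", java_path)
```

Once `find_java()` returns a path, the script itself should be able to start tabula.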
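import tabula
import PyPDF2
import pandas as pd

FILE_PATH = "./test.pdf"

# Count the pages (PyPDF2 older than 3.0 spells this PdfFileReader(f).getNumPages())
with open(FILE_PATH, mode="rb") as f:
    pages = len(PyPDF2.PdfReader(f).pages)

# tabula page numbers start at 1, so loop over 1..pages
frames = []
for i in range(1, pages + 1):
    # lattice=True replaces the old spreadsheet=True option; read_pdf returns a list of DataFrames
    frames.extend(tabula.read_pdf(FILE_PATH, pages=i, encoding="utf-8_sig", lattice=True))

# Join the tables from every page and save them as one CSV
df = pd.concat(frames, ignore_index=True)
df.to_csv("./test.csv", encoding="shift_jis")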
Just run the .py file above. Have a good PDF life.
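To confirm that the extraction worked, it can help to look at the first few rows of the CSV that was written. The sketch below is an addition, not part of the original script. It assumes the output is the Shift_JIS file `./test.csv` produced above, and the helper name `preview_csv` is hypothetical.

```python
import csv
import itertools


def preview_csv(path, encoding="shift_jis", limit=5):
    """Return the first `limit` rows of a CSV file as lists of strings."""
    # newline="" is what the csv module expects when reading files
    with open(path, encoding=encoding, newline="") as f:
        return list(itertools.islice(csv.reader(f), limit))
```

Calling `preview_csv("./test.csv")` after the script has run returns the header row and the first few data rows, which is usually enough to spot a table that was split or shifted during extraction.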