[PYTHON] Convert from PDF to CSV with pdfplumber

pdfplumber

Related article: Processing dotted lines as solid lines with camelot (Hough transform) https://qiita.com/barobaro/items/af850ac29dbc983eb39b

As noted there, camelot struggles with tables whose ruling lines are not solid. pdfplumber appears to extract such tables easily.

PDFs that could not be converted

Go To EAT campaign official site, Shiga Prefecture: pdfplumber does not recognize the characters, but the table can be extracted with camelot.

PDFs that could be converted

List of medical institutions that provide medical care using telephones and information and communication equipment

List of medical institutions that provide medical care using telephones and information and communication equipment (Hyogo Prefecture)

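wget https://www.mhlw.go.jp/content/000691131.pdf -O data.pdf
pip install pdfplumber

import pdfplumber
import pandas as pd

dfs = []

with pdfplumber.open("data.pdf") as pdf:
    for page in pdf.pages:
        data = page.extract_table()
        # extract_table() returns None when no table is found on the page
        if not data:
            continue
        # Row 0 is skipped, row 1 supplies the column names, data starts at row 2
        df_tmp = pd.DataFrame(data[2:], columns=data[1])
        dfs.append(df_tmp)

# Number the rows continuously across pages instead of restarting at 0
df = pd.concat(dfs, ignore_index=True)

# utf_8_sig writes a BOM so that Excel detects the encoding
df.to_csv("hyogo.csv", encoding="utf_8_sig")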
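
The loop above assumes that every page yields a table. pdfplumber's extract_table() returns None for a page where no table is detected, so a slightly more defensive version may be useful. The following is a minimal sketch, not the method used in this article: the helper name tables_to_frame and its header_row parameter are hypothetical, and it works on plain lists so it can be tried without a PDF.

```python
import pandas as pd


def tables_to_frame(tables, header_row=1):
    """Combine per-page tables into one DataFrame.

    `tables` holds one item per page, in the shape returned by
    page.extract_table(): a list of rows, or None if no table was found.
    `header_row` (hypothetical parameter) is the index of the row that
    holds the column names; rows above it are ignored.
    """
    frames = []
    for table in tables:
        # Skip pages without a table or without a header row
        if not table or len(table) <= header_row:
            continue
        frames.append(
            pd.DataFrame(table[header_row + 1:], columns=table[header_row])
        )
    if not frames:
        return pd.DataFrame()
    # ignore_index gives one continuous row numbering across pages
    return pd.concat(frames, ignore_index=True)
```

With pdfplumber it could be called as tables_to_frame([page.extract_table() for page in pdf.pages]) inside the with block.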

PDF of Go To EaT in Chiba Prefecture

https://www.chiba-gte.jp/downloads/store_list.pdf

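wget https://www.chiba-gte.jp/downloads/store_list.pdf -O data.pdf

import pdfplumber
import pandas as pd

dfs = []

with pdfplumber.open("data.pdf") as pdf:
    for page in pdf.pages:
        data = page.extract_table()
        # extract_table() returns None when no table is found on the page
        if not data:
            continue
        dfs.append(pd.DataFrame(data))

df = pd.concat(dfs)

# Turn empty strings into NaN, then keep only rows with at least 4 filled cells
df1 = df.mask(df.isna() | (df == "")).dropna(thresh=4)

# Drop the header row repeated on each page; the string must match the
# first header cell exactly as it appears in the PDF
df2 = df1[df1[0] != "paper"].reset_index(drop=True)

# Name the columns (set_axis returns a new DataFrame)
df2 = df2.set_axis(["paper", "Electronic", "Store name", "Street address", "TEL"], axis=1)

# Number the rows from 1
df2.index += 1

df2.to_csv("data.csv")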
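
The cleanup steps above (blank cells to NaN, dropping sparse rows, removing repeated header rows, naming the columns) can be checked on a small hand-made table before running them on a real PDF. This is a sketch under assumptions: the function name clean_rows, its parameters, and the sample data are all hypothetical and only mirror the logic of the code above.

```python
import pandas as pd


def clean_rows(df, header_marker, columns, min_filled=4):
    """Apply the same cleanup as above to a raw extracted table."""
    # Treat empty strings like missing values, then keep rows
    # that have at least `min_filled` non-empty cells
    out = df.mask(df.isna() | (df == "")).dropna(thresh=min_filled)
    # Remove header rows, identified by the text in the first column
    out = out[out[0] != header_marker].reset_index(drop=True)
    # Assign the column names and number the rows from 1
    out = out.set_axis(columns, axis=1)
    out.index += 1
    return out


# Made-up sample: a header row, a data row, a mostly empty row,
# a repeated header row, and another data row
raw = pd.DataFrame([
    ["H1", "H2", "H3", "H4", "H5"],
    ["o", "", "Cafe A", "Addr 1", "000"],
    ["", "", "note", "", ""],
    ["H1", "H2", "H3", "H4", "H5"],
    ["", "o", "Shop B", "Addr 2", "111"],
])
cleaned = clean_rows(
    raw, "H1", ["paper", "Electronic", "Store name", "Street address", "TEL"]
)
```

Keeping the steps in one function makes it easier to adjust the threshold or the header marker when a different prefecture's PDF uses another layout.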
