[Python] Converting the COCO'S Breakfast Buffet Store List PDF to CSV

Introduction

This post fetches the PDF of the COCO'S breakfast buffet store list and converts it to CSV. For reference, I extracted the tables from the PDF with camelot, cleansed the data with pandas, and wrote the result out as CSV.

Preparation

apt install python3-tk ghostscript
pip install "camelot-py[cv]"
pip install pandas
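
Before running the conversion, it can help to confirm that the prerequisites above are actually visible to Python. The following is a minimal sketch, not part of the original script: the helper name `check_prerequisites` is hypothetical, and it assumes the Ghostscript executable is called `gs`, as it usually is on Linux.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False."""
    return {
        # Assumption: Ghostscript is installed under the executable name "gs".
        "ghostscript (gs)": shutil.which("gs") is not None,
        # python3-tk provides the tkinter module.
        "tkinter": importlib.util.find_spec("tkinter") is not None,
        # camelot-py is imported under the name "camelot".
        "camelot": importlib.util.find_spec("camelot") is not None,
        "pandas": importlib.util.find_spec("pandas") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'OK' if found else 'missing'}")
```

Anything reported as missing points back to the corresponding install command.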

Data cleansing

import camelot
import pandas as pd

tables = camelot.read_pdf(
    "https://www.cocos-jpn.co.jp/menu_pdf/bvshoplist.pdf",
    pages="all",
    split_text=True,
    strip_text="\n",
    line_scale=40,
)

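# Build the column names by joining the two header rows of the first table
columns = ["".join(i) for i in zip(*(tables[0].df.head(2).values))]

# Skip the first three rows of every table and apply the column names
dfs = [table.df.iloc[3:].set_axis(columns, axis=1) for table in tables]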

# Concatenate the tables from all pages and renumber the rows starting at 1
df = pd.concat(dfs).reset_index(drop=True)
df.index += 1

# Replace empty strings with missing values (NaN)
df.mask(df == "", inplace=True)

# Where "Usage fee" holds a value (daily, weekdays, Saturdays and Sundays), move it into "Implementation date"
df["Implementation date"] = df["Implementation date"].where(df["Usage fee"].isnull(), df["Usage fee"])

# Forward-fill missing store information from the preceding row
# (fillna(method="ffill") is deprecated in recent pandas versions)
df.ffill(inplace=True)

# Drop the "Usage fee" column
df.drop("Usage fee", axis=1, inplace=True)

# Adult price: keep the tax-included amount (the second number in the cell)
adult = (
    df["grown up"]
    .str.extractall("([0-9]+)")
    .unstack()
    .rename(columns={0: "grown up_Tax excluded", 1: "grown up_tax included"}, level=1)
)
adult.columns = adult.columns.droplevel(level=0)
df["grown up"] = adult["grown up_tax included"].astype(int)

# Child price: keep the tax-included amount (the second number in the cell)
child = (
    df["Elementary school students and younger"]
    .str.extractall("([0-9]+)")
    .unstack()
    .rename(columns={0: "child_Tax excluded", 1: "child_tax included"}, level=1)
)
child.columns = child.columns.droplevel(level=0)
df["Elementary school students and younger"] = child["child_tax included"].astype(int)

# Rename the address column
df.rename(columns={"After the address": "Street address"}, inplace=True)

# Normalize addresses to Unicode NFKC and remove spaces
df["Street address"] = df["Street address"].str.normalize("NFKC").str.replace(" ", "")

df.to_csv("cocos.csv", encoding="utf_8_sig")
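
The price-extraction step is the least obvious part of the script, so here is a small sketch of it on its own. The sample strings are made up for illustration and are only an assumption about what a price cell might look like (a tax-excluded amount followed by a tax-included one); the column labels `tax_excluded` and `tax_included` are likewise hypothetical.

```python
import pandas as pd

# Hypothetical cell text: tax-excluded price followed by the tax-included price.
prices = pd.Series(["980 yen (1078 yen)", "580 yen (638 yen)"], name="grown up")

# extractall returns one row per regex match,
# indexed by (original row, match number).
matches = prices.str.extractall("([0-9]+)")

# unstack pivots the match number into columns:
# the first number found in each cell, then the second.
wide = matches.unstack()
wide.columns = ["tax_excluded", "tax_included"]

result = wide.astype(int)
print(result)
```

Note that the pattern `[0-9]+` treats every run of digits as a separate match, so a cell written with a thousands separator such as `1,078` would yield extra matches, and the column positions would shift accordingly.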
