[Python] Convert the Tokyo Metropolitan Health and Welfare Bureau's PDF on the status of people infected with the new coronavirus in Tokyo to CSV

The code below downloads the PDF "Status of people infected in Tokyo with the new coronavirus infection" published by the Tokyo Metropolitan Health and Welfare Bureau and converts it to CSV.

import pathlib
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import pandas as pd
import pdfplumber
import requests
from tqdm.notebook import tqdm

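def fetch_file(url, dest_dir="."):
    """Download url into dest_dir and return the path of the saved file."""

    r = requests.get(url)
    r.raise_for_status()

    # Use the last component of the URL as the file name
    p = pathlib.Path(dest_dir, pathlib.PurePath(url).name)
    p.parent.mkdir(parents=True, exist_ok=True)

    with p.open(mode="wb") as fw:
        fw.write(r.content)
    return p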
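
`fetch_file` takes the file name from the last component of the URL. That is enough for a plain link to a PDF, but a query string would end up in the file name. A minimal sketch of a stricter variant follows; `filename_from_url` is a hypothetical helper that is not part of the original script, and the URLs are made up for illustration.

```python
import pathlib
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, ignoring query and fragment.

    Hypothetical helper for illustration; the original script uses
    pathlib.PurePath(url).name directly.
    """
    path = urlsplit(url).path  # drops "?query" and "#fragment"
    return pathlib.PurePosixPath(unquote(path)).name


# Made-up URLs, used only to show the behavior
plain = "https://example.com/files/report.pdf"
with_query = "https://example.com/files/report.pdf?ver=2"

print(filename_from_url(plain))                # report.pdf
print(filename_from_url(with_query))           # report.pdf
print(pathlib.PurePosixPath(with_query).name)  # report.pdf?ver=2
```

For the link used in this post the two approaches should give the same name, so this only matters if the site starts appending parameters to its download links.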

url = "https://www.fukushihoken.metro.tokyo.lg.jp/iryo/kansen/todokedehcyouseisya.html"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
}

r = requests.get(url, headers=headers)
r.raise_for_status()

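soup = BeautifulSoup(r.content, "html.parser")

# Pick the first PDF link in the main content area
tag = soup.select_one("div#main p.filelink > a.pdf")

# Resolve the (possibly relative) href against the page URL
link = urljoin(url, tag.get("href"))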

path_pdf = fetch_file(link)
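
The `href` of the link may be relative to the listing page, which is why it goes through `urljoin` before the download. The sketch below shows how the resolution behaves for three kinds of `href`. The `href` values are invented for illustration and are not the real links on the page.

```python
from urllib.parse import urljoin

page_url = "https://www.fukushihoken.metro.tokyo.lg.jp/iryo/kansen/todokedehcyouseisya.html"

# Hypothetical href values (assumptions, not taken from the actual page)
relative_href = "todokedehcyouseisya.files/sample.pdf"
root_relative_href = "/iryo/kansen/sample.pdf"
absolute_href = "https://example.com/sample.pdf"

resolved = [
    urljoin(page_url, relative_href),       # relative to the page's directory
    urljoin(page_url, root_relative_href),  # relative to the site root
    urljoin(page_url, absolute_href),       # already absolute, returned unchanged
]

for link in resolved:
    print(link)
```

Note that `select_one` returns `None` when the selector matches nothing, so if the page layout changes, `tag.get("href")` fails with an `AttributeError` before `urljoin` is reached.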

dfs = []

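# Convert each page of the PDF to a DataFrame
with pdfplumber.open(path_pdf) as pdf:

    for page in tqdm(pdf.pages):

        table = page.extract_table()

        # extract_table() returns None when no table is found on the page
        if table is None:
            continue

        # The first row is the header, the rest are data rows
        df_tmp = pd.DataFrame(table[1:], columns=table[0])

        dfs.append(df_tmp)

# Combine all pages into one DataFrame with a continuous index
df = pd.concat(dfs, ignore_index=True)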

df.shape
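
Each page yields a list of rows whose first row is the header, and `extract_table()` returns `None` for a page on which no table is detected. The sketch below reproduces the combine step on plain lists so the logic can be checked without a PDF. `combine_page_tables` and the sample rows are hypothetical and only stand in for real pdfplumber output.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]


def combine_page_tables(tables: List[Optional[Table]]) -> List[dict]:
    """Turn per-page tables (header row + data rows) into one list of records."""
    records = []
    for table in tables:
        if not table:  # None or empty: no table found on this page
            continue
        header, *rows = table
        for row in rows:
            records.append(dict(zip(header, row)))
    return records


# Invented sample data standing in for page.extract_table() results
pages = [
    [["No", "Age"], ["1", "30s"], ["2", "40s"]],
    None,
    [["No", "Age"], ["3", "50s"]],
]

records = combine_page_tables(pages)
print(len(records))  # 3
print(records[-1])   # {'No': '3', 'Age': '50s'}
```

This assumes every page repeats the header row, as the script above also does when it uses `table[0]` for the column names.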

# Strip leading and trailing whitespace and apply NFKC normalization
for col in df.select_dtypes(include=object).columns:
    df[col] = df[col].str.strip().str.normalize("NFKC")
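
NFKC normalization folds full-width digits and half-width katakana into their standard forms, so later pattern matching sees uniform text. A small sketch with the standard library shows the effect on a single string. The sample values are invented and not taken from the PDF.

```python
import unicodedata

# Full-width digits with surrounding whitespace (the trailing one is U+3000)
raw_date = " １０月５日\u3000"
clean_date = unicodedata.normalize("NFKC", raw_date.strip())

# Half-width katakana becomes full-width katakana
raw_kana = "ｶﾅ"
clean_kana = unicodedata.normalize("NFKC", raw_kana)

print(clean_date)  # 10月5日
print(clean_kana)  # カナ
```

The pandas calls `.str.strip()` and `.str.normalize("NFKC")` apply the same two steps to every value in a column.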

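# Change the extension to .csv
path_csv = path_pdf.with_suffix(".csv")

# utf_8_sig writes a BOM so that Excel detects the file as UTF-8
df.to_csv(path_csv, encoding="utf_8_sig", index=False)

# Work on a copy so the extracted table stays untouched
df1 = df.copy()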

#Data wrangling

import datetime

dt_now = datetime.datetime.now()

# Fill in the current year and convert to a date; if the result lies in the future, move it back one year
def str2date(s: pd.Series) -> pd.Series:

    # Dates are written like "4月1日" (月 = month, 日 = day); values that do not match become month 0, day 0
    df = s.str.extract(r"(\d{1,2})月(\d{1,2})日").rename(columns={0: "month", 1: "day"}).fillna(0).astype(int)

    df["year"] = dt_now.year

    # Rows with month 0 or day 0 are invalid and are coerced to NaT
    tmp = pd.to_datetime(df, errors="coerce")

    # A date later than now must belong to the previous year
    df["year"] = df["year"].mask(tmp > dt_now, df["year"] - 1)

    return pd.to_datetime(df, errors="coerce")

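# NOTE: the column names here are English renderings; the PDF's own headers are in Japanese,
# so check df1.columns and adjust the names before running
df1["Release date YMD"] = str2date(df1["Release date"])
df1["Date of onset YMD"] = str2date(df1["Date of onset"])
df1["Confirmed date YMD"] = str2date(df1["Fixed date"])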

p = path_csv.with_name(path_csv.name.replace(".csv", "_c.csv"))

df1.to_csv(p, index=False, encoding="utf_8_sig")

# Download the CSV (works only on Google Colab)

from google.colab import files

files.download(str(p))
