[PYTHON] COVID-19 Hokkaido Data Edition (2) Toward open data + automatic update

Series

- COVID-19 Hokkaido Data Edition ① Initial data creation by scraping etc.
- COVID-19 Hokkaido Data Edition ② Toward open data + automatic update ← This article!
- COVID-19 Hokkaido Data Edition ③ Full automation

[Figure: スクリーンショット 2020-03-10 17.37.10.png — data-flow diagram showing the V0 pipeline and the target V1 pipeline]

Article ① covered V0 in the figure above. This article covers the next step: making use of external APIs such as open data portals.

Challenges in V0

In V0, data acquisition relied on (1) scraping the Hokkaido prefectural website and (2) static CSV files provided by Sapporo City. This approach has the following problems:

- We should use data that is properly published as open data, not scraped data.
- Scraping is fragile against changes in the table structure of a website.
- Every time the CSV file is updated, someone has to push it manually, which is an operational burden.

So what comes after V1 in the figure above? Things become much easier once the flow in the lower left is realized.

Currently, tabular data that is (probably) managed in Excel is converted to PDF for readability and published, or pasted directly into HTML — in other words, data that is not machine-readable. (I do understand the intent of producing material that anyone can read at a glance.) If even the positive-patient data had been provided only as PDF, collecting data for this site would have been very difficult (at least it was an HTML table, so it could be parsed by brute force). I hope many people come to realize how much becomes possible when this kind of information is also distributed as raw open data (CSV etc.).

If that wish came true, we would arrive at the shape in the lower left. I hoped it would happen in the near future.

As it turned out, it was realized in about two days.

Migration to open data

Hokkaido

Hokkaido's data is now published not only on the website but also on the Hokkaido Open Data Portal. The person in charge in Hokkaido uploads the latest CSV file to this portal as needed. As the name suggests, the data published there is open data: anyone can access the CSV files free of charge.

Sapporo

Sapporo City also operates an open data portal, the Sapporo City ICT utilization platform DATA SMART CITY SAPPORO, and the number of consultations received at the consultation desk was published there as open data. As in Hokkaido, creating the data is the work of the staff in charge. Thank you.

The speed was tremendous (I suspect the smooth transition was possible precisely because the groundwork of such a portal already existed). To keep up with this speed, which overturns the usual image of government, we modified the previous script to work with the open data portals.

https://github.com/codeforsapporo/covid19hokkaido_scraping/tree/f3923df7f6a3781e94ef5b514c9d9ec7fe5aa4b1

When main.py is executed, it iterates over the data sources registered in REMOTE_SOURCES in settings.py and outputs JSON files. The CSV files released this time can be located through the portal's API, but since the CSV itself has to be read in the end anyway, the script uses direct links to the resources rather than going through the API. It reads each CSV, converts it to a dict, and dumps it as JSON.
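The CSV → dict → JSON step can be sketched as follows. This is a minimal illustration, not the repository's actual code: the real REMOTE_SOURCES entries and CSV column names differ, and the sample data here is invented.

```python
import csv
import io
import json

# Hypothetical sample standing in for a CSV fetched from the open data portal.
SAMPLE_CSV = """date,positive
2020-03-01,3
2020-03-02,5
"""

def csv_to_records(text):
    """Parse CSV text into a list of dicts, one per row, keyed by the header."""
    return list(csv.DictReader(io.StringIO(text)))

records = csv_to_records(SAMPLE_CSV)
# Dump as JSON; ensure_ascii=False keeps any Japanese text readable.
print(json.dumps({"data": records}, ensure_ascii=False, indent=2))
```

In the real script the CSV text would come from an HTTP GET against the portal's direct resource link instead of a string literal.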

In this repository, main.py is scheduled to run every 15 minutes and writes the JSON files to the gh-pages branch. So whenever the data on either portal is updated, the JSON is refreshed within 15 minutes. These JSON files can (apparently) be accessed from outside without CORS restrictions. In other words, the branch functions as a pseudo API server. If the front end loads these JSON files asynchronously, data updates become fully automatic.
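A consumer of this pseudo API only needs a plain HTTP GET plus JSON parsing. The sketch below uses Python's standard library; the URL is hypothetical (the actual gh-pages path depends on the repository layout), and the payload shape is assumed to be a top-level "data" key as in the earlier conversion step.

```python
import json
import urllib.request

# Hypothetical URL: the real path under gh-pages may differ.
URL = "https://codeforsapporo.github.io/covid19hokkaido_scraping/patients.json"

def fetch_json(url):
    """Fetch a URL and decode its body as JSON (network access required)."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def parse_payload(raw):
    """Extract the record list from a raw JSON payload string."""
    payload = json.loads(raw)
    return payload.get("data", [])

# Offline demonstration of the parsing step with a tiny sample payload.
sample = '{"data": [{"date": "2020-03-01", "positive": "3"}]}'
print(parse_payload(sample))
```

On the actual site this role is played by asynchronous requests in the front-end JavaScript; the Python version just shows that the published JSON is an ordinary static resource.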

  • Currently there is no validation when the JSON is generated, so full automation carries a high risk and has not been enabled yet.
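As a rough idea of what such validation could look like, here is a minimal sanity check that could run before publishing the generated JSON. The column names ("date", "positive") are assumptions for illustration; the actual checks adopted in article ③ may differ.

```python
import datetime

def validate_records(records):
    """Return a list of error messages; an empty list means the data looks OK."""
    errors = []
    if not records:
        errors.append("no rows")
    for i, row in enumerate(records):
        # Dates must be ISO formatted (e.g. 2020-03-01).
        try:
            datetime.date.fromisoformat(row.get("date", ""))
        except ValueError:
            errors.append(f"row {i}: bad date {row.get('date')!r}")
        # Counts must be non-negative integers.
        if not str(row.get("positive", "")).isdigit():
            errors.append(f"row {i}: positive is not a number")
    return errors

print(validate_records([{"date": "2020-03-01", "positive": "3"}]))
```

If the error list is non-empty, the script could skip publishing and alert a maintainer instead of pushing broken JSON to gh-pages.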

In closing

Until last time, data was collected by scraping the website and from static CSV files, but thanks to Hokkaido's and Sapporo's release of open data at a speed that overturns conventional wisdom, the ideal shape shown in the figure at the top has been realized. The change was certainly triggered by the novel coronavirus, but without the portal sites the transition would not have been this smooth, and steady, unglamorous work must have gone into building them. I am grateful for the efforts of those who came before, and I hope to keep moving forward.
