[Python] I tried using the PDF data on online medical care published in response to the spread of the novel coronavirus

Introduction

The following page lists the medical institutions that support online medical care in response to the spread of the novel coronavirus infection.

https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/iryou/rinsyo/index_00014.html

Let's use this list to build a process for finding out what online medical care is available in your neighborhood. Along the way, we will also consider how practical it is to work with the PDF data the government provides.

Deliverables

https://needtec.sakura.ne.jp/yakusyopdf/
https://github.com/mima3/yakusyopdf

Enter a latitude and longitude and search. (screenshot omitted)

A list of nearby hospitals is displayed; click a row in the list. (screenshot omitted)

Detailed information for that hospital is then shown on the map. (screenshot omitted)

Description

Process flow

(1) Obtain the PDFs from the page listing the institutions that support online medical care based on the spread of the novel coronavirus infection.

(2) Extract the table information from each PDF and convert it to JSON. For the details of this step, see: [Convert PDFs from the Ministry of Health, Labour and Welfare to CSV or JSON](https://needtec.sakura.ne.jp/wod07672/2020/04/29/%e5%8e%9a%e7%94%9f%e5%8a%b4%e5%83%8d%e7%9c%81%e3%81%aepdf%e3%82%92csv%e3%82%84json%e3%81%ab%e5%a4%89%e6%8f%9b%e3%81%99%e3%82%8b/)

(3) Combine the JSON files for each prefecture into a single JSON file.

(4) Use the Yahoo! Geocoder API to calculate the latitude and longitude from each address in the JSON, and record them.

(5) Store the result in a database and display it on screen based on that information.
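Step (4) can be sketched as follows. The endpoint and the response layout (a `Feature` list whose `Geometry.Coordinates` holds a `"lon,lat"` string) are my assumptions about the Yahoo! Geocoder API, so check the official documentation before relying on them; the `appid` value is a placeholder for your own application ID.

```python
import json
import urllib.parse
import urllib.request

# Assumed Yahoo! JAPAN geocoder endpoint; verify against the official docs.
GEOCODER_URL = "https://map.yahooapis.jp/geocode/V1/geoCoder"

def parse_coordinates(response):
    """Pull (lat, lon) out of a geocoder response, assuming the API returns
    coordinates as a "lon,lat" comma-separated string."""
    feature = response["Feature"][0]
    lon, lat = feature["Geometry"]["Coordinates"].split(",")
    return float(lat), float(lon)

def geocode(address, appid):
    """Resolve one address to (lat, lon) via the geocoder API."""
    params = urllib.parse.urlencode(
        {"appid": appid, "query": address, "output": "json"})
    with urllib.request.urlopen(f"{GEOCODER_URL}?{params}") as resp:
        return parse_coordinates(json.load(resp))
```

Each record's address is passed through `geocode` once, and the resulting coordinates are written back into the JSON.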

Problems when extracting tables from PDFs

To be honest, extracting tabular data from a PDF is quite troublesome. Libraries such as tabula and camelot exist, but none of them is sufficient on its own. This section describes the problems I ran into with camelot.

Cells containing long text are not split correctly

For example, suppose some data overflows the bounds of its cell, as shown below. (screenshot omitted)

In this case, the phone number and the URL are detected as a single cell. As a workaround, whenever a required item is missing, I check whether the neighboring cells were merged and split them apart. However, if the affected column holds free-form data such as a postal code or a phone number, the original values cannot be restored.
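As a minimal sketch of that repair step (the helper name and column assumptions are hypothetical): when the URL column comes back empty, the neighboring cell probably contains the phone number and the URL glued together, and a URL's fixed `http` prefix gives a reliable point to split on.

```python
def split_phone_and_url(cell):
    """Split a merged cell like "03-1234-5678https://example.jp" back into
    (phone, url). Returns ("", url) for a URL-only cell and (text, "") when
    no URL is present."""
    idx = cell.find("http")
    if idx >= 0:
        return cell[:idx].strip(), cell[idx:].strip()
    return cell.strip(), ""
```

This only works because a URL has a recognizable prefix; as noted above, two free-form values such as a postal code and a phone number cannot be separated this way.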

Memory usage

camelot consumes a lot of memory, so it is better to process the PDF page by page. The following issue is helpful here: https://github.com/camelot-dev/camelot/issues/28
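A page-by-page loop in the spirit of that issue might look like this (a sketch, assuming `camelot-py` is installed; the function names are my own):

```python
def page_labels(total_pages):
    """camelot's `pages` argument takes strings such as "3"; generating one
    label per page lets each page's tables be freed before the next read."""
    return [str(page) for page in range(1, total_pages + 1)]

def extract_tables(pdf_path, total_pages):
    import camelot  # imported lazily so page_labels stays importable without it

    for page in page_labels(total_pages):
        # One read_pdf call per page keeps peak memory low.
        for table in camelot.read_pdf(pdf_path, pages=page):
            yield page, table.df
```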

Also, because of the paper sizes of the files handled this time, a 32-bit process runs out of memory even when processing page by page, so it is better to run a 64-bit process.

Dotted lines are not detected

As mentioned in the issue below, camelot does not recognize tables drawn with dotted lines.

Detect dotted line #370 https://github.com/atlanhq/camelot/issues/370

I solved this problem as described here: [Treat dotted lines as solid lines in camelot](https://needtec.sakura.ne.jp/wod07672/2020/05/03/camelot%e3%81%a7%e7%82%b9%e7%b7%9a%e3%82%92%e5%ae%9f%e7%b7%9a%e3%81%a8%e3%81%97%e3%81%a6%e5%87%a6%e7%90%86%e3%81%99%e3%82%8b/)
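The idea behind that fix can be illustrated in isolation: in a binary line profile of the page image, short gaps between ink pixels are filled so that a dotted rule becomes the solid line a table detector expects. The numpy sketch below is only an illustration of the idea, not the actual patch from the linked article.

```python
import numpy as np

def close_gaps(profile, max_gap):
    """Fill runs of at most `max_gap` background pixels (0) that sit between
    ink pixels (1), turning a dotted line into a solid one."""
    out = profile.copy()
    ink = np.flatnonzero(profile)
    for left, right in zip(ink[:-1], ink[1:]):
        if right - left - 1 <= max_gap:  # gap short enough to bridge
            out[left:right] = 1
    return out
```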

Problems with the PDFs of the list of supported institutions

These problems are extremely troublesome, and in my view they make fully automatic PDF analysis impossible.

Paper size

Some pages are about one meter wide, so processing takes a long time. The common workaround of converting the PDF to Word first is also impossible, because the page size is too large.

The PDF of the list of supported medical institutions is updated daily, and the URL changes as well

The PDF of the list of supported medical institutions is updated daily, and the URL appears to change with each update. For example, the URLs for Tokyo were as follows.

As of April 28, 2020 https://www.mhlw.go.jp/content/000625693.pdf

As of April 29, 2020 https://www.mhlw.go.jp/content/000626106.pdf

For this reason, the PDF URLs must be re-collected each time from the page on online medical care based on the spread of the novel coronavirus infection.
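Since the file names cannot be hard-coded, every run has to rediscover the PDF links from the list page. A sketch using only the standard library (the regex-based extraction is a simplification; a real page may call for an HTML parser):

```python
import re
from urllib.parse import urljoin

LIST_PAGE = ("https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/"
             "kenkou_iryou/iryou/rinsyo/index_00014.html")

def extract_pdf_links(html, base_url=LIST_PAGE):
    """Collect every href ending in .pdf and resolve it against the page URL,
    since the daily file names (e.g. 000625693.pdf) keep changing."""
    hrefs = re.findall(r'href="([^"]+\.pdf)"', html)
    return [urljoin(base_url, href) for href in hrefs]
```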

Presence or absence of a legend

The first row of data may or may not be preceded by a legend: Tokyo's PDF has one, Hokkaido's does not.

In other words, the row where the data starts on the first page must be adjusted for each prefecture.

Page headers are handled differently in each prefecture

For example, compare the PDFs for Tokyo and Ibaraki. Tokyo repeats the header on the second and subsequent pages; Ibaraki does not.

So the starting row of the data on the second and subsequent pages must also be adjusted per prefecture. Worse, even a single prefecture is not consistent over time.

In fact, until April, Hokkaido's PDF had header rows on the second and subsequent pages. Presumably the settings used when the file is exported vary from time to time.
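One way to absorb these differences is a per-prefecture table of skip counts stating how many non-data rows precede the data on the first page and on later pages. The structure and the numbers below are illustrative, not taken from the actual PDFs:

```python
# Hypothetical skip counts: rows to drop before the data starts.
PREF_LAYOUT = {
    "Tokyo":    {"first_page": 2, "later_pages": 1},  # legend + header; header repeats
    "Hokkaido": {"first_page": 1, "later_pages": 0},  # header only; no repeats
}

def data_rows(rows, prefecture, page_no, layout=PREF_LAYOUT):
    """Drop legend/header rows according to the prefecture's layout."""
    key = "first_page" if page_no == 1 else "later_pages"
    return rows[layout[prefecture][key]:]
```

Because even one prefecture's layout drifts over time, these settings have to be re-checked whenever extraction starts failing.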

The columns are handled differently in each prefecture

For example, compare Hokkaido with Aichi and Yamanashi prefectures. (screenshots of the **Hokkaido**, **Aichi**, and **Yamanashi** PDFs omitted)

The columns can vary from prefecture to prefecture, and even common columns appear in different positions, so their positions must be adjusted. Furthermore, even within one prefecture the columns are not always the same: until April, Yamanashi did not separate telephone consultation and online consultation into different rows.

Notation inconsistencies

One column records "whether medical treatment is provided by telephone etc. for a first visit". In most cases it contains 〇 or × (or is blank), but the notation is inconsistent. For example, annotations like the following appear:

○
* Scheduled to be done in the future

So simply taking the first character is not enough, and the range of expressions is wide. At the moment, at least the following variants exist.

**Expressions meaning "yes" for "medical treatment by telephone etc. for a first visit"**

| Character | UTF-8 (hex) |
| --- | --- |
| 〇 | E38087 |
| ○ | E2978B |
| ◯ | E297AF |
| △ | E296B3 |
| 可 ("yes") | E58FAF |
| ● | E2978F |
| ▲ | E296B2 |

**Expressions meaning "no" for "medical treatment by telephone etc. for a first visit"**

| Character | UTF-8 (hex) |
| --- | --- |
| (blank) | |
| × | C397 |
| ｘ | EFBD98 |
| ☓ | E29893 |
| ✕ | E29C95 |
| X | 58 |
| - | 2D |
| － | EFBC8D |
| Ｘ | EFBCB8 |
| ✖ | E29C96 |
| 否 ("no") | E590A6 |
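These variants can be folded into a boolean with lookup tables built from the characters above (the helper name and the `None` fallback for unrecognized text are my own choices):

```python
YES_MARKS = {"\u3007", "\u25cb", "\u25ef", "\u25b3", "\u53ef",  # 〇 ○ ◯ △ 可
             "\u25cf", "\u25b2"}                                # ● ▲
NO_MARKS = {"\u00d7", "\uff58", "\u2613", "\u2715", "X",        # × ｘ ☓ ✕ X
            "-", "\uff0d", "\uff38", "\u2716", "\u5426"}        # - － Ｘ ✖ 否

def normalize_mark(cell):
    """Map the inconsistent yes/no notations to True/False, treating a blank
    cell as "no" and flagging anything unrecognized with None."""
    mark = (cell or "").strip()
    if not mark:
        return False
    if mark[0] in YES_MARKS:  # annotations such as "* scheduled" may follow
        return True
    if mark[0] in NO_MARKS:
        return False
    return None  # unknown notation: leave for manual review
```

New variants will keep appearing, so anything that maps to `None` should be reviewed and added to one of the tables.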

Other problems

The actual conversion results are below. Some problems lie in the analysis program, but there are also errors in the PDFs themselves.

https://github.com/mima3/yakusyopdf/blob/master/20200503

Summary

This time, I converted the PDFs published by the Ministry of Health, Labour and Welfare into a form that a computer can easily process, and built a web application on top of it.

The data can be converted automatically up to a point, but as long as PDF is the source format, full automation is impossible. And even if you patch the results by hand, the data is updated so frequently that this is hard to sustain.

**If you need accurate data that is updated frequently, it is safer to avoid extracting it from PDFs as was done here.**

Also, to anyone in a position to publish such data, I would appreciate it if you could consider the following points.

- Publish the data in a format other than PDF.
  - When the data is meant to be used, even Excel is better. PDFs may look the same, but they are far harder to process.
- For data that may be updated, include the last-updated date.
- Avoid changing the format for each prefecture.
  - The per-prefecture layouts are presumably well-intentioned and easy for people to read, but they make machine processing harder.
- Unify the notation as much as possible.

That's all.
