[PYTHON] The story of verifying the open data of COVID-19

This article is for the final day (day 25) of the Civictech 1st year Advent Calendar 2020. (That said, it is only a digest of things I had written down, so it is not a very interesting article ...)

Hi, this is y-chan, a contributor to the Hyogo prefecture version of the coronavirus summary site. I keep myself busy actively contributing to the original Tokyo site, developing various things at SecHack365, and running CCC2020. This time, I did something that (as far as we could find) probably nobody else has done, namely verifying the open data of COVID-19, so I thought I should write a little about it. Some of this is based on things I have only heard, so there may be mistakes; I would appreciate it if you kindly corrected them with an edit request.

We are doing data wrangling

Let me start with a preface. The act of acquiring, shaping, and processing open data, which is what we do, is called data wrangling. The term itself seems to be a coinage, and "wrangling" apparently carries a meaning like "taming". Incidentally, I had never worked with data before, so I did not know the term until recently.

Validate open data? What are you talking about

As the heading says. Normally, verifying something like open data is a strange idea. Open data is published by national and local governments, and the data is presumably entered by humans. Naturally, human error can occur, and the data is rarely perfect. Users of open data deal with such errors, noise, and missing values themselves at the "preprocessing" stage, either by normalizing the erroneous data or by discarding it (treating it as None), and then use the result. This preprocessing is apparently considered part of data wrangling. In data wrangling, however, "verification" is not usually done and is not something one is expected to do.
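As a rough illustration of that kind of preprocessing, here is a minimal sketch. The values, the `AGE_ALIASES` table, the `VALID_AGES` set, and the `normalize_age` helper are all hypothetical and are not taken from the actual Hyogo data; they only show the idea of normalizing known notation variants and treating everything else as None.

```python
import unicodedata

# Hypothetical table of known notation variants -> canonical label.
AGE_ALIASES = {
    "under 10": "0s",
    "10 and under": "0s",
}

# Hypothetical set of canonical age-group labels.
VALID_AGES = {"0s", "10s", "20s", "30s", "40s", "50s", "60s", "70s", "80s", "90s"}


def normalize_age(raw):
    """Return a canonical age label, or None if the value cannot be normalized."""
    if raw is None:
        return None
    # NFKC folds full-width characters (for example full-width digits) into ASCII.
    text = unicodedata.normalize("NFKC", str(raw)).strip().lower()
    text = AGE_ALIASES.get(text, text)
    # Anything that is still not a known label is treated as missing.
    return text if text in VALID_AGES else None


rows = ["２０s", " 30s ", "under 10", "unknown", None]
cleaned = [normalize_age(r) for r in rows]
print(cleaned)
```

Note that this approach silently drops anything it does not recognize, which is exactly why preprocessing alone does not tell you whether the published data was wrong.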

I think many people will say "What?" on hearing that I verify the open data of COVID-19 as part of data wrangling. When I said "I'm building a verification mechanism" within the summary site team, one person responded with "What do you mean, verify? It should be preprocessed rather than verified." Preprocessing is also necessary, and we do some of it, but I believe many of the regional COVID-19 countermeasure sites cooperate with their local governments on open data. The Hyogo prefecture version has no such cooperation in particular, but I have previously pointed out to the prefecture, "Isn't this data wrong?" I pointed it out simply because it broke the data I was formatting, but partly also because I felt that sensitive data, such as the attributes of people who tested positive, ought to be correct. At that time there was little data and only a few human errors, but now the number of infected people is growing and the number of human errors is growing considerably with it. At the same time, the mistakes are becoming harder to find. So I decided to leave the discovery of mistakes to a program. That is why I tried verifying open data. That said, I do feel that pushing data corrections onto prefectural staff is not quite right ...

How to verify open data

Now, I would like to briefly describe how the open data is verified. It is called verification, but what I actually do is simple and clear:

- If the data is a character string, does it fit the expected format?
- If the data is numeric and is published in more than one form (daily, cumulative, etc.), are the numbers consistent? (Put simply: does the sum of the daily values match the cumulative value?)
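The two checks above can be sketched roughly as follows. This is not the actual script used for the Hyogo site; the column names, the `AGE_PATTERN` rule, and the helper functions `check_strings` and `check_cumulative` are hypothetical and exist only to illustrate the idea.

```python
import re

# Hypothetical format rule for an age-group column, e.g. "20s" or "under 10".
AGE_PATTERN = re.compile(r"(?:[1-9]0s|under 10)")


def check_strings(values, pattern, column):
    """Return a message for every value that does not fully match the pattern."""
    messages = []
    for row_no, value in enumerate(values, start=1):
        if not pattern.fullmatch(value):
            messages.append(f"{column} row {row_no}: unexpected value {value!r}")
    return messages


def check_cumulative(daily, cumulative, column):
    """Return a message for every day where the running sum of the daily
    values differs from the published cumulative value."""
    messages = []
    running_total = 0
    # zip stops at the shorter list, so a length mismatch is not reported here.
    for day_no, (d, c) in enumerate(zip(daily, cumulative), start=1):
        running_total += d
        if running_total != c:
            messages.append(
                f"{column} day {day_no}: sum of daily values is {running_total}, "
                f"but cumulative value is {c}"
            )
    return messages


# Made-up sample data containing one typo and one inconsistent total.
ages = ["20s", "30s", "3os", "under 10"]
daily = [3, 5, 2, 4]
cumulative = [3, 8, 11, 14]

results = check_strings(ages, AGE_PATTERN, "age")
results += check_cumulative(daily, cumulative, "positives")
for message in results:
    print(message)
```

Both functions only report suspicious rows and never modify the data, which mirrors the "verify, don't correct" approach described here.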

That is all I look at. In practice, mistakes such as typos can be corrected, but numerical errors cannot be corrected accurately on our side; the only options are to use the values as they are or to discard them. So in the end nothing is corrected on the summary site, and the data is posted almost as it is ... Also, I do not know whether there are any official rules for the character strings, so I set the criteria based on the data published so far, and as a result exceptional but legitimate values sometimes get caught by the verification. By the way, the verification runs in the same script that scrapes the data.

About the verification result

The verification results are public, so anyone can view them. When the verification finds data that may be incorrect, the script displays the relevant item as a message. However, because it was undeniably a rush job, there is still the problem that the messages are hard to understand ... Also, these are only verification results based on the criteria I set. A discrepancy in numerical values looks like a clear mistake, but whether a character string that does not fit the expected format is really a human error is a delicate question. This is another reason why open data is not usually verified.

Summary

  1. I did something odd: verifying open data from the user side
  2. Anyone can see the verification results
  3. It was too much of a rush job, so the verification results are hard to understand
  4. I haven't done it yet, but I want to be able to preprocess the data based on the verification results

It's not much of a summary, but I think it's a bit ridiculous that I haven't done item 4 yet. However, because the data is sensitive, I'm hesitant about how far I should go in changing numerical values or rewriting the attribute information of people who tested positive. I think that is what makes COVID-19 open data difficult.
