[PYTHON] Analyze the usage status of the contact confirmation application (COCOA) posted as an "image" / Tesseract

Construction of a program to analyze the usage status of COCOA announced by the Ministry of Health, Labor and Welfare as an "image"

Introduction

--I personally track the number of downloads and the number of positive registrations of the contact confirmation application (COCOA) in a graph. Every day around 18:00 I visited the official website, entered the data into Google Sheets, and created the graph.
--However, doing this by hand became a burden, so I wondered whether the simple, repetitive work could be automated.
--The information posted by the Ministry of Health, Labor and Welfare is an image of text. I thought that if this data could be read automatically, the whole process could be automated, so I built this as a trial.


Ministry of Health, Labor and Welfare COCOA special site (as of 8/11)

Key points this time

--The information is not posted as text data, so it was necessary to fetch the image and run character recognition (OCR) on it. "GCP Cloud Vision" and "Tesseract" were the candidate OCR tools.

--I read that Tesseract is easy to use from Python through the PyOCR library, so I adopted it this time and checked its recognition accuracy. In the future, I would like to try Cloud Vision as well and compare the recognition accuracy of the two.
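
For readers who have not used PyOCR, calling Tesseract through it looks roughly like the following. This is a minimal sketch, not the code used in this project: it assumes that Tesseract with the Japanese language data (`jpn`), PyOCR, and Pillow are installed, and the function name `ocr_image` and the layout setting are illustrative choices.

```python
def ocr_image(image_path, lang="jpn"):
    """Run OCR on one image file through PyOCR and return the recognized text."""
    # Imported inside the function so this module can still be loaded on a
    # machine where PyOCR or Pillow is not installed.
    import pyocr
    import pyocr.builders
    from PIL import Image

    # PyOCR looks for installed OCR engines; Tesseract is listed first when present.
    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found. Is Tesseract installed and on PATH?")
    tool = tools[0]

    # Layout 6 tells Tesseract to treat the image as one uniform block of text
    # (an assumption about the image; other layouts may work better).
    builder = pyocr.builders.TextBuilder(tesseract_layout=6)

    with Image.open(image_path) as img:
        return tool.image_to_string(img, lang=lang, builder=builder)
```

A call such as `ocr_image("cocoa_info_0810.png")` would then return the recognized text as a single string, with line breaks roughly following the lines in the image.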

Implemented functions (all built in Python)

--Scraping function
--OCR function
--Data extraction function
--Function to write data to Google Sheets
--Function to fetch the image of the graph created in Google Sheets (material for the tweet)
--Function to post the COCOA usage graph to Twitter (operation not yet confirmed, because API access has not been obtained)
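
As an illustration of the first item, the scraping step mostly comes down to finding the image URL in the page's HTML. The following is a minimal sketch using only the standard library, not the project's actual code. The class and function names are hypothetical, and the image path in the usage note is made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the absolute URL of every <img> tag that has a src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths such as "/content/x.png" against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def find_image_urls(html, base_url):
    """Return the absolute URLs of all images referenced in the given HTML."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.image_urls
```

For example, `find_image_urls('<img src="/content/sample.png">', "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/cocoa_00138.html")` resolves the hypothetical relative path against the site's host. Downloading the page and the image, and choosing which of the returned URLs is the usage-status image, are left out of the sketch.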

What is Tesseract

Tesseract is open source OCR software that runs on a variety of operating systems and is distributed under the Apache License 2.0. It consists of a character recognition library and a command line interface that uses it. Since version 4.0, in addition to the conventional recognition engine, it includes a recognition engine based on an LSTM neural network. Developer: Google --From Wikipedia

OCR results

--Before OCR processing (image obtained from the site): cocoa_info_0810.png
--After OCR processing:

Version "1.1.2" of the contact confirmation app is currently distributed for both iOS and Android.
If you are using an older version of the app, please go to the App Store or Google Play.
Please search for "Approved App" and update.

As of 17:00 on August 7, the number of downloads is about 12.05 million in total.

・ It is the total number of both iOS and Android.

・ If the app is deleted after downloading and then downloaded again, it may be counted multiple times.

As of 17:00 on August 7, the number of positive registrations is 165 in total.

OCR recognition accuracy is high and stable

The only recognition error is that "app" on the second line was recognized as "appli". This does not affect the extraction of the number of downloads or the number of positive registrations. I ran OCR on several images, and the results were fairly stable, with the data extracted accurately.
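
The extraction of the two numbers from the OCR text can be sketched with a regular expression. This is a minimal sketch under stated assumptions, not the project's actual code: it assumes the Japanese OCR text contains sentences shaped like the hypothetical samples in the comments, and the names `COUNT_PATTERN` and `extract_counts` are illustrative.

```python
import re

# Assumed sentence shapes in the Japanese OCR output (hypothetical samples):
#   ダウンロード数は、8月7日17:00現在、合計で約1,205万件です。
#   陽性登録件数は、8月7日17:00現在、合計で165件です。
COUNT_PATTERN = re.compile(
    r"(ダウンロード数|陽性登録件数)は、?(\d+)月(\d+)日.*?合計で約?([\d,.]+)(万)?件"
)


def extract_counts(ocr_text):
    """Return a dict with the date and count for downloads and positive registrations."""
    # OCR output can contain stray spaces and line breaks between characters,
    # so remove all whitespace before matching.
    compact = re.sub(r"\s+", "", ocr_text)

    result = {}
    for label, month, day, number, man in COUNT_PATTERN.findall(compact):
        value = float(number.replace(",", ""))
        if man:
            # 万 means "ten thousand", so 1,205万 is 12,050,000.
            value *= 10_000
        key = "downloads" if label == "ダウンロード数" else "positives"
        result[key] = {"month": int(month), "day": int(day), "count": int(round(value))}
    return result
```

Because the two sentences start with different labels, a misrecognized word elsewhere in the text, such as the "appli" error above, does not affect the match. If the site changes its wording, the pattern would need to be adjusted.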

Summary

--This system is operating normally except for the Twitter posting function, and automating everything from data extraction to graph creation has simplified the update work.
--Currently, only posting the tweet is manual.
--As a remaining task, I would like to enable the tweet function once Twitter API access is granted, so that the whole process from analysis to publishing is automated.


Graph of changes in the number of downloads and the number of positive registrations, automatically acquired from Google Sheets (sheet_date0810)

About the details of this project

Reference links

--Ministry of Health, Labor and Welfare special site "New Coronavirus Contact Confirmation Application (COCOA) COVID-19 Contact-Confirming Application" https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/cocoa_00138.html
--Introduction to and usage of Tesseract https://rightcode.co.jp/blog/information-technology/python-tesseract-image-processing-ocr
