[PYTHON] COVID-19 Hokkaido Data Edition ③ Full automation (validation / error detection)

Table of contents

COVID-19 Hokkaido Data Edition ① Initial data creation by scraping etc.
COVID-19 Hokkaido Data Edition ② Toward open data + automatic update
COVID-19 Hokkaido Data Edition ③ Full automation ← This article!

Towards full automation

Current status

- GitHub Actions already automates reading the original data and generating the json files.
- GitHub Pages hosts the automatically generated json files.

Issues

- Although the original data is now open data, the file is created manually by the person in charge at each local government, so the input values need to be validated.
- Some kind of alert is needed when json file generation fails because of inconsistent data or similar problems.

Validation

I build a dict in Python and write it to a json file with json.dump(). In other words, the dict is what needs to be validated. So what should be checked? For example, if a key that the front end references does not exist, that is a problem, so a key check is necessary. Likewise, if the date key somehow contains an integer value, that is also a problem, so a type check is required for each key.

Both checks could be implemented from scratch, but rather than reinvent the wheel I will use an existing tool: JSON Schema, through the jsonschema package.
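For comparison, a from-scratch version of the two checks could look roughly like the sketch below. The EXPECTED_TYPES mapping, the field names, and check_keys_and_types are hypothetical names chosen only for illustration. This is not the approach adopted in this article.

```python
# Hypothetical expected layout of one record: key name -> required Python type.
EXPECTED_TYPES = {
    "date": str,
    "count": int,
}


def check_keys_and_types(record, expected=EXPECTED_TYPES):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for key, expected_type in expected.items():
        if key not in record:
            # Key check: a missing key would break whatever reads the json.
            problems.append(f"missing key: {key}")
        elif type(record[key]) is not expected_type:
            # Type check: exact match, so True is not accepted as an int.
            actual = type(record[key]).__name__
            problems.append(f"{key}: expected {expected_type.__name__}, got {actual}")
    return problems
```

Even this small version needs its own error reporting and has no support for nested structures, which is the kind of work a schema library takes over.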

Reference site: https://medium.com/veltra-engineering/python-json-schema-validation-6936238f107d

Install jsonschema

pip install jsonschema

Validation with json schema

Check whether a given dict matches a JSON structure and set of types defined in advance. If it does not match, jsonschema.validate() raises a ValidationError. If it matches, it returns without printing anything.

Since usage is explained carefully on the reference site above, only the implementation is shown here.

SCHEMAS = {
    "patients": ...,  # schema definition (omitted)
    "contacts": ...,  # schema definition (omitted)
    # ... remaining keys omitted ...
}

In this way, one schema is defined per key and collected in SCHEMAS.
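To make the placeholder concrete, here is a sketch of what one entry might look like. Only the last_update property is known from this article (it appears in the traceback further down). The data array, its item fields, and the name PATIENTS_SCHEMA are assumptions for illustration, not the actual definitions.

```python
# Hypothetical schema for the "patients" entry. Only "last_update" is taken
# from this article; the "data" array and its item fields are assumed.
PATIENTS_SCHEMA = {
    "type": "object",
    "required": ["last_update", "data"],  # key check
    "properties": {
        "last_update": {"type": "string"},  # type check
        "data": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["date", "count"],
                "properties": {
                    "date": {"type": "string"},
                    "count": {"type": "integer"},
                },
            },
        },
    },
}

SCHEMAS = {
    "patients": PATIENTS_SCHEMA,
    # "contacts" and the remaining keys would be defined the same way.
}
```

In JSON Schema, "required" covers the key check and "type" covers the type check, so both requirements from the Validation section map directly onto schema keywords.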


import jsonschema

def validate(self):
    # Check each top-level entry against the schema registered under the same key.
    for key in self.data:
        jsonschema.validate(self.data[key], SCHEMAS[key])

The keys of self.data and SCHEMAS match, and self.data[key] is the dict that is written to json as-is. Therefore, if a key is missing because of a typo, or if a type does not match, an error is raised and the process stops, so no json is generated. No malformed json is produced, and the last json that was generated successfully stays in place. (A value of the wrong type entered where 5 belongs will be caught, but a wrong value of the right type, such as 6, will pass through. I think that kind of human error cannot be prevented in the first place.)
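The property that the previous good file survives a failed run comes from validating before writing. A minimal sketch of that ordering is shown below, assuming a hypothetical helper write_json_if_valid and a validate callable supplied by the caller (for example a wrapper around jsonschema.validate). The actual main.py may be organized differently.

```python
import json
import os
import tempfile


def write_json_if_valid(data, path, validate):
    """Validate first, then replace the file at `path` in a single step."""
    validate(data)  # raises on invalid data, so nothing below runs
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory and swap it in afterwards,
    # so a crash in the middle of writing cannot leave a truncated json file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)  # clean up the temporary file before re-raising
        raise
```

Because the exception propagates, the script still exits with a non-zero status, which is what lets the CI run be marked as failed.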

For example, if a date field such as last_update holds a string even though the schema says it must be an integer, the following error is raised and the process stops.


Traceback (most recent call last):
  File "main.py", line 231, in <module>
    dm.validate()
  File "main.py", line 90, in validate
    jsonschema.validate(self.data[key], SCHEMAS[key])
  File "/opt/hostedtoolcache/Python/3.8.2/x64/lib/python3.8/site-packages/jsonschema/validators.py", line 934, in validate
    raise error
jsonschema.exceptions.ValidationError: '2020-03-17T21:31:40.309090+09:00' is not of type 'integer'

Failed validating 'type' in schema['properties']['last_update']:
    {'default': '', 'type': 'integer'}

On instance['last_update']:
    '2020-03-17T21:31:40.309090+09:00'
##[error]Process completed with exit code 1.

Notify Slack on errors

Reference sites:
Qiita - Slack Webhook URL acquisition procedure
Qiita - Periodically execute GitHub Actions and notify Slack of the result

Notify Slack if data generation fails for any reason. I wrote the workflow yaml as follows, with reference to the sites above.


name: Python application

on:
  schedule:
    - cron:  '0 * * * *'

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v1
      with:
        python-version: 3.8
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Run script
      run: |
        python main.py
    - name: Slack Notification
      # Added from here
      if: failure()
      uses: rtCamp/action-slack-notify@master
      env:
        SLACK_MESSAGE: 'Error occurred! Please check a log!'
        SLACK_TITLE: ':fire: Data Update Error :fire:'
        SLACK_USERNAME: covid19hokkaido_scraping
        SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
      # Added up to here
    - name: deploy
      uses: peaceiris/actions-gh-pages@v3
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: ./data
        publish_branch: gh-pages

You need to add SLACK_WEBHOOK under Settings → Secrets (otherwise the webhook URL would be exposed in the public repository).

Values stored in Secrets are encrypted. You can reference them from the yaml as secrets.{name}. Names that start with GITHUB_ cannot be used.

With the above in place, an error results in a post to Slack like the following.

[Screenshot 2020-03-17 22.04.29.png]
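The workflow above delegates the notification to the rtCamp/action-slack-notify action. If the same alert were sent from Python instead, a minimal sketch could build the request for a Slack incoming webhook like this. The function name build_slack_request is hypothetical, and the script is assumed to receive the webhook URL through an environment variable rather than a hard-coded string.

```python
import json
import urllib.request


def build_slack_request(webhook_url, title, message):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    # Incoming webhooks accept a JSON body with a "text" field.
    payload = {"text": f"{title}\n{message}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending would then be a call to urllib.request.urlopen() with the returned request, with the URL read from something like os.environ["SLACK_WEBHOOK"] so that it never appears in the repository.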

In closing

This completes the automation of the whole flow, from data acquisition to json file generation, including validation. However, only key checks and type checks are in place at present, and there is no end to what could be validated, for example a mechanism to detect obviously abnormal values. In addition, while data generation has been automated, the asynchronous communication on the front-end side is still being implemented and debugged. There is room for improvement, but I think operation has become dramatically easier than the initial half-manual work. This concludes the three-part series. Thank you for reading.
