[Python] Scraping a published CSV with GitHub Actions and publishing it on GitHub Pages

Introduction

About this program

This program:

- regularly scrapes open data (CSV) released by Gifu Prefecture using GitHub Actions,
- outputs JSON files as simple dictionary arrays, without editing the data,
- pushes to the gh-pages branch if there is a difference, and
- makes the JSON files directly accessible on GitHub Pages.

Background of publication

This program was developed for the Gifu Prefecture coronavirus countermeasures site. Similar scrapers have been published for other prefectures, but they include site-specific processing in the CSV → JSON conversion, so many modifications were needed to reuse them. This program therefore keeps processing to a minimum and outputs the original CSV data as JSON as-is, which makes it easy for other developers to build on.

Product (GitHub): https://github.com/CODE-for-GIFU/covid19-scraping

JSON output on GitHub Pages

http://code-for-gifu.github.io/covid19-scraping/patients.json
http://code-for-gifu.github.io/covid19-scraping/testcount.json
http://code-for-gifu.github.io/covid19-scraping/callcenter.json
http://code-for-gifu.github.io/covid19-scraping/advicecenter.json
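Each endpoint returns an object with a `data` array (one dictionary per CSV row) and a `last_update` timestamp, as produced by the script described below. A minimal sketch of consuming such a response — the row field `"No"` here is a hypothetical placeholder, not an actual column name:

```python
import json

# Hypothetical response body in the shape the scraper publishes:
# {"data": [<one dict per CSV row>], "last_update": "<ISO 8601 JST timestamp>"}
body = '{"data": [{"No": "1"}], "last_update": "2020-04-01T12:00:00+09:00"}'

parsed = json.loads(body)
rows = parsed["data"]          # list of dicts, one per CSV row
stamp = parsed["last_update"]  # JST timestamp of the source CSV
print(len(rows), stamp)
```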

Reference CSV file

Gifu Prefecture Open Data https://data.gifu-opendata.pref.gifu.lg.jp/dataset/c11223-001

How to use

Run on GitHub

How to start

Once the repository is set up on GitHub, the workflow runs automatically on the schedule defined in main.yml.

How to stop

Remove or comment out the schedule trigger in main.yml.

main.yml


on:
  schedule:
    - cron: "*/10 * * * *"

Hosting on GitHub Pages

For details, refer to the official GitHub Actions documentation: https://help.github.com/ja/actions

Run in local environment

pip install -r requirements.txt
python3 main.py

The JSON files will be generated in the ./data folder.

Technical documentation

Python

Main

main.py


import os

from settings import REMOTE_SOURCES

os.makedirs('./data', exist_ok=True)  # make sure the output folder exists
for remote in REMOTE_SOURCES:
    data = import_csv_from(remote['url'])
    dumps_json(remote['jsonname'], data)

Data definition section

settings.py


# External resource definitions
REMOTE_SOURCES = [
    {
        'url': 'https://opendata-source.com/source1.csv',
        'jsonname': 'source1.json',
    },
    {
        'url': 'https://opendata-source.com/source2.csv',
        'jsonname': 'source2.json',
    },
    {
        'url': 'https://opendata-source.com/source3.csv',
        'jsonname': 'source3.json',
    },
    {
        'url': 'https://opendata-source.com/source4.csv',
        'jsonname': 'source4.json',
    }
]

CSV reading part

main.py


import os
import urllib.request

import dateutil.parser


def import_csv_from(csvurl):
    # JST is a UTC+9 timezone constant defined elsewhere in the project.
    request_file = urllib.request.urlopen(csvurl)
    if request_file.getcode() != 200:
        return

    f = decode_csv(request_file.read())
    filename = os.path.splitext(os.path.basename(csvurl))[0]
    datas = csvstr_to_dicts(f)
    timestamp = request_file.getheader('Last-Modified')

    return {
        'data': datas,
        'last_update': dateutil.parser.parse(timestamp).astimezone(JST).isoformat()
    }
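The `Last-Modified` HTTP header arrives in RFC 2822 form and is converted to a JST ISO 8601 string. A small sketch of the same conversion using only the standard library (the script itself uses `dateutil.parser`; the `JST` definition below is an assumption, since the constant is not shown in the article):

```python
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime

JST = timezone(timedelta(hours=9), 'JST')  # assumed definition of the JST constant

# Typical Last-Modified header value (example date)
timestamp = 'Wed, 01 Apr 2020 03:00:00 GMT'
iso = parsedate_to_datetime(timestamp).astimezone(JST).isoformat()
print(iso)  # 2020-04-01T12:00:00+09:00
```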

CSV decoding part

main.py


def decode_csv(csv_data):
    # Try each candidate codec until one decodes the raw bytes.
    # CODECS is a list of codec names defined elsewhere in the project.
    print('csv decoding')
    for codec in CODECS:
        try:
            csv_str = csv_data.decode(codec)
            print('ok:' + codec)
            return csv_str
        except UnicodeDecodeError:
            print('ng:' + codec)
            continue
    print('Appropriate codec is not found.')
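Japanese open-data CSVs are often served as Shift_JIS (cp932) rather than UTF-8, which is why the decoder tries candidate codecs in order. A sketch with an assumed `CODECS` list (the actual list is defined in the project, not shown in the article):

```python
CODECS = ['utf-8', 'cp932']  # assumed candidate order


def decode_with_fallback(data: bytes) -> str:
    # Try each codec until one decodes without error.
    for codec in CODECS:
        try:
            return data.decode(codec)
        except UnicodeDecodeError:
            continue
    raise UnicodeError('no suitable codec found')


raw = '岐阜県'.encode('cp932')   # these cp932 bytes are not valid UTF-8
text = decode_with_fallback(raw)
print(text)  # 岐阜県
```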

CSV → JSON data conversion part

main.py


import csv


def csvstr_to_dicts(csvstr):
    # Convert a CSV string into a list of dicts keyed by the header row.
    datas = []
    rows = [row for row in csv.reader(csvstr.splitlines())]
    header = rows[0]
    # Strip unwanted characters (UNUSE_CHARACTER, defined elsewhere) from headers
    for i in range(len(header)):
        for unuse in UNUSE_CHARACTER:
            header[i] = header[i].replace(unuse, '')

    maindatas = rows[1:]
    for d in maindatas:
        # Skip blank lines
        if d == []:
            continue
        data = {}
        for i in range(len(header)):
            data[header[i]] = d[i]
        datas.append(data)
    return datas
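To see what this conversion produces, here is a small worked example. The header characters to strip are an assumption (the real `UNUSE_CHARACTER` list is project-defined); a BOM is used as a plausible example:

```python
import csv

UNUSE_CHARACTER = ['\ufeff']  # assumed: strip a BOM from header cells

csvstr = '\ufeffName,Age\r\nYamada,30\r\nSuzuki,25\r\n'
rows = list(csv.reader(csvstr.splitlines()))
header = rows[0]
for i in range(len(header)):
    for unuse in UNUSE_CHARACTER:
        header[i] = header[i].replace(unuse, '')

# One dict per data row, keyed by the cleaned header
records = [dict(zip(header, row)) for row in rows[1:] if row]
print(records)  # [{'Name': 'Yamada', 'Age': '30'}, {'Name': 'Suzuki', 'Age': '25'}]
```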

JSON data output part

main.py


import codecs
import json
from typing import Dict


def dumps_json(file_name: str, json_data: Dict):
    # Write UTF-8 JSON without escaping non-ASCII characters.
    with codecs.open("./data/" + file_name, "w", "utf-8") as f:
        f.write(json.dumps(json_data, ensure_ascii=False,
                           indent=4, separators=(',', ': ')))
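`ensure_ascii=False` keeps Japanese text readable in the output instead of escaping it to `\uXXXX` sequences. A quick illustration of the serializer settings used above (the record fields are made-up examples):

```python
import json

record = {'県名': '岐阜県', '検査数': 10}
text = json.dumps(record, ensure_ascii=False, indent=4, separators=(',', ': '))
print(text)
# The Japanese characters appear literally rather than as \uXXXX escapes.
```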

GitHub Actions

The workflow is built with a YAML file (main.yml).

Schedule

main.yml


on:
  schedule:
    - cron: "*/10 * * * *"

Python script execution part

main.yml


    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v1
        with:
          python-version: 3.8
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run script
        run: |
          python main.py

Push to gh-pages

main.yml


      - name: deploy
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./data
          publish_branch: gh-pages

References

Hokkaido: Python script for scraping (covid19hokkaido_scraping): https://github.com/Kanahiro/covid19hokkaido_scraping/blob/master/main.py
