[Python] Scrape the published CSV with GitHub Actions and publish it on GitHub Pages


About this program

This program:

- regularly scrapes the open data (CSV) released by Gifu Prefecture using GitHub Actions,
- outputs JSON files as simple dictionary arrays, without editing the data,
- pushes to the gh-pages branch if there is a difference, and
- makes the JSON files directly accessible on GitHub Pages.

Background of publication

This program was developed for the Gifu Prefecture coronavirus countermeasure site. Similar scrapers have been published elsewhere, but they include extra processing in the CSV-to-JSON conversion, so many corrections were needed to reuse them. This program keeps processing to a minimum and outputs the original CSV data as-is in JSON, which makes it easy for other developers to build on.

Repository (GitHub): https://github.com/CODE-for-GIFU/covid19-scraping

JSON output on GitHub Pages

http://code-for-gifu.github.io/covid19-scraping/patients.json
http://code-for-gifu.github.io/covid19-scraping/testcount.json
http://code-for-gifu.github.io/covid19-scraping/callcenter.json
http://code-for-gifu.github.io/covid19-scraping/advicecenter.json
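Each published file has the same shape: a "data" array of row dictionaries plus a "last_update" timestamp. A minimal sketch of consuming such a payload (the sample content below is illustrative, not real data):

```python
import json

# Illustrative payload in the shape the scraper publishes:
# a "data" array of row dictionaries and a "last_update" timestamp.
payload = '{"data": [{"日付": "2020-04-01", "人数": "3"}], "last_update": "2020-04-01T12:00:00+09:00"}'

doc = json.loads(payload)
rows = doc['data']
print(len(rows), doc['last_update'])  # → 1 2020-04-01T12:00:00+09:00
```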

Reference CSV file

Gifu Prefecture Open Data https://data.gifu-opendata.pref.gifu.lg.jp/dataset/c11223-001

How to use

Running on GitHub

How to start

How to stop


    - cron: "*/10 * * * *"

Hosting on GitHub Pages

For details, refer to the official GitHub Actions documentation: https://help.github.com/ja/actions

Run in local environment

pip install -r requirements.txt
python3 main.py

The JSON files will be generated in the /data folder.

Technical documentation




# Create the output folder, then convert every remote CSV to JSON
os.makedirs('./data', exist_ok=True)
for remote in REMOTE_SOURCES:
    data = import_csv_from(remote['url'])
    dumps_json(remote['jsonname'], data)
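The snippets below use several modules and constants that are not shown. A plausible header for main.py follows; the exact values of JST, CODECS, and UNUSE_CHARACTER are assumptions inferred from how they are used, not the repository's actual definitions:

```python
import os
import csv
import json
import codecs
import urllib.request
from typing import Dict

import dateutil.parser          # third-party, installed via requirements.txt
from dateutil.tz import gettz

# Assumed constants, inferred from the snippets that use them
JST = gettz('Asia/Tokyo')                    # timezone for 'last_update'
CODECS = ['utf-8', 'cp932', 'utf-8-sig']     # candidate CSV encodings to try
UNUSE_CHARACTER = ['"', '\ufeff']            # characters stripped from CSV headers
```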

Data definition section


# External resource definitions
REMOTE_SOURCES = [
    {
        'url': 'https://opendata-source.com/source1.csv',
        'jsonname': 'source1.json',
    },
    {
        'url': 'https://opendata-source.com/source2.csv',
        'jsonname': 'source2.json',
    },
    {
        'url': 'https://opendata-source.com/source3.csv',
        'jsonname': 'source3.json',
    },
    {
        'url': 'https://opendata-source.com/source4.csv',
        'jsonname': 'source4.json',
    },
]

csv reading part


def import_csv_from(csvurl):
    request_file = urllib.request.urlopen(csvurl)
    if not request_file.getcode() == 200:
        # Abort on anything other than a successful response
        raise Exception('HTTP error: ' + str(request_file.getcode()))

    f = decode_csv(request_file.read())
    datas = csvstr_to_dicts(f)
    timestamp = request_file.getheader('Last-Modified')

    return {
        'data': datas,
        'last_update': dateutil.parser.parse(timestamp).astimezone(JST).isoformat()
    }
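The last_update field comes from the HTTP Last-Modified header, parsed and shifted to JST. A small sketch of that conversion (the header value is a made-up example):

```python
import dateutil.parser
from dateutil.tz import gettz

# Typical Last-Modified value (example); parse it and convert UTC -> JST (+9h)
timestamp = 'Wed, 01 Apr 2020 03:00:00 GMT'
jst = dateutil.parser.parse(timestamp).astimezone(gettz('Asia/Tokyo'))
print(jst.isoformat())  # → 2020-04-01T12:00:00+09:00
```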

csv decoding part


def decode_csv(csv_data):
    print('csv decoding')
    # Try each candidate codec in order; return the first successful decode
    for codec in CODECS:
        try:
            csv_str = csv_data.decode(codec)
            print('ok:' + codec)
            return csv_str
        except UnicodeDecodeError:
            print('ng:' + codec)
    print('Appropriate codec is not found.')
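The same try-each-codec pattern in isolation; the codec list and sample text here are illustrative:

```python
def decode_with_fallback(data: bytes, codec_list=('utf-8', 'cp932')):
    # Try each candidate codec in order and return the first successful decode
    for codec in codec_list:
        try:
            return data.decode(codec)
        except UnicodeDecodeError:
            continue
    raise UnicodeError('Appropriate codec is not found.')

# cp932 (Shift_JIS) kana bytes are not valid UTF-8, so the loop falls through to cp932
sample = 'あいう'.encode('cp932')
print(decode_with_fallback(sample))  # → あいう
```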

csv → json data conversion part


def csvstr_to_dicts(csvstr):
    datas = []
    rows = [row for row in csv.reader(csvstr.splitlines())]
    header = rows[0]
    # Strip unwanted characters from the header names
    for i in range(len(header)):
        for j in range(len(UNUSE_CHARACTER)):
            header[i] = header[i].replace(UNUSE_CHARACTER[j], '')

    maindatas = rows[1:]
    for d in maindatas:
        # Skip blank lines
        if d == []:
            continue
        data = {}
        for i in range(len(header)):
            data[header[i]] = d[i]
        datas.append(data)
    return datas
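The header row becomes the keys and every following row becomes one dictionary. The same idea as a compact, standalone sketch (the sample CSV is made up):

```python
import csv

def rows_to_dicts(csvstr: str):
    # First row is the header; zip each remaining non-empty row against it
    rows = list(csv.reader(csvstr.splitlines()))
    header = rows[0]
    return [dict(zip(header, row)) for row in rows[1:] if row]

sample = '日付,人数\n2020-04-01,3\n2020-04-02,5'
print(rows_to_dicts(sample))
# → [{'日付': '2020-04-01', '人数': '3'}, {'日付': '2020-04-02', '人数': '5'}]
```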

json data output section


def dumps_json(file_name: str, json_data: Dict):
    with codecs.open("./data/" + file_name, "w", "utf-8") as f:
        f.write(json.dumps(json_data, ensure_ascii=False,
                           indent=4, separators=(',', ': ')))
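ensure_ascii=False is what keeps Japanese column names readable in the output instead of \uXXXX escapes; a quick illustration:

```python
import json

record = {'県': '岐阜'}
print(json.dumps(record))                      # → {"\u770c": "\u5c90\u961c"}
print(json.dumps(record, ensure_ascii=False))  # → {"県": "岐阜"}
```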

GitHub Action

The action is built with a YAML workflow file.



on:
  schedule:
    - cron: "*/10 * * * *"

python script execution part


      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v1
        with:
          python-version: 3.8
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run script
        run: |
          python main.py

push to gh-pages


      - name: deploy
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./data
          publish_branch: gh-pages
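Putting the fragments above together, the whole workflow might look like the following. This is a sketch; the actual file name, workflow name, and job name in the repository may differ:

```yaml
# .github/workflows/main.yml (file and job names are assumptions)
name: scraping

on:
  schedule:
    - cron: "*/10 * * * *"

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v1
        with:
          python-version: 3.8
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run script
        run: |
          python main.py
      - name: deploy
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./data
          publish_branch: gh-pages
```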


Reference: Hokkaido's Python scraping script (covid19hokkaido_scraping): https://github.com/Kanahiro/covid19hokkaido_scraping/blob/master/main.py
