I built an environment with Docker that saves Qiita trend information. Basically, once you start the container, a scraping job runs every day and saves the trend information as JSON. This article is recommended for the following kinds of people:
- I want to analyze Qiita trends
- I want to study a little Python
- I want to try out Docker for a bit

Three types of JSON are saved:

- author (list of trending authors)
- list (list of trending articles)
- tag (list of tags attached to trending articles)
The contents of the JSON that is actually saved are as follows.
author: A list of the usernames of the trending authors.
[
"uhyo",
"suin",
"Yz_4230",
"atskimura",
"pineappledreams",
"Amanokawa",
"k_shibusawa",
"minakawa-daiki",
"morry_48",
"c60evaporator",
"takuya_tsurumi",
"TomoEndo",
"yhatt",
"CEML",
"moritalous",
"svfreerider",
"daisukeoda",
"karaage0703",
"tommy19970714",
"tyru",
"galileo15640215",
"keitah",
"mocapapa",
"akeome",
"ssssssssok1",
"yuno_miyako",
"katzueno",
"cometscome_phys",
"mpyw",
"akane_kato"
]
list: Get the list of articles trending on Qiita. The following information is output for each article:

- Article UUID (article ID)
- Article title
- Article URL
- Article author name
- Number of LGTMs
- Tags attached to the article and their URLs
[
{
"article_id":"e66cbca2f582e81d5b16",
"article_title":"Let'Proxy server that blocks web pages using s Encrypt",
"article_url":"https://qiita.com/uhyo/items/e66cbca2f582e81d5b16",
"author_name":"uhyo",
"likes":66,
"tag_list":[
{
"tag_link":"/tags/javascript",
"tag_name":"JavaScript"
},
{
"tag_link":"/tags/node.js",
"tag_name":"Node.js"
},
{
"tag_link":"/tags/proxy",
"tag_name":"proxy"
},
{
"tag_link":"/tags/https",
"tag_name":"HTTPS"
},
{
"tag_link":"/tags/letsencrypt",
"tag_name":"letsencrypt"
}
]
},
{
"article_id":"83ebaf96caa2c13c8b2f",
"article_title":"Create a macOS screensaver with HTML / CSS / JS(No Swift skills required)",
"article_url":"https://qiita.com/suin/items/83ebaf96caa2c13c8b2f",
"author_name":"suin",
"likes":60,
"tag_list":[
{
"tag_link":"/tags/html",
"tag_name":"HTML"
},
{
"tag_link":"/tags/css",
"tag_name":"CSS"
},
{
"tag_link":"/tags/javascript",
"tag_name":"JavaScript"
},
{
"tag_link":"/tags/macos",
"tag_name":"macos"
}
]
}
]
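If you just want to poke around in this data, a minimal sketch like the following loads one day's list JSON and sorts the articles by LGTM count (the file name and date here are assumptions; point it at a file you actually have under mnt/json/list/):

import json

# Hypothetical example file; replace the date with one that exists on your machine
path = "./mnt/json/list/2020-02-22.json"

with open(path, encoding="utf-8") as f:
    articles = json.load(f)

# Sort by LGTM count, descending, and print a simple ranking
for article in sorted(articles, key=lambda a: a["likes"], reverse=True):
    print(f'{article["likes"]:>4} LGTM  {article["article_title"]}')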
Qiita's trends are updated twice a day, at 5:00 and 17:00, but since the articles do not change that much, I only run the job once a day.
tag: Get the tags attached to articles trending on Qiita.
[
{
"tag_link":"/tags/python",
"tag_name":"Python"
},
{
"tag_link":"/tags/r",
"tag_name":"R"
},
{
"tag_link":"/tags/%e6%a9%9f%e6%a2%b0%e5%ad%a6%e7%bf%92",
"tag_name":"Machine learning"
}
]
Tags are also included in the article list above, but there each tag is tied to a single article, so the same tag appears more than once when it is attached to different articles. Therefore, duplicate tags are removed and only the unique trending tags are saved as a list.
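Because each tag is a dict, and dicts are unhashable, you cannot simply throw them into a set() to deduplicate. Here is a minimal sketch of the order-preserving deduplication idea (the same idea appears as get_unique_list in the script below):

# Order-preserving deduplication of a list of dicts.
# set() cannot be used here because dicts are unhashable, so seen items are tracked in a list.
tags = [
    {"tag_name": "Python", "tag_link": "/tags/python"},
    {"tag_name": "R", "tag_link": "/tags/r"},
    {"tag_name": "Python", "tag_link": "/tags/python"},  # duplicate
]

seen = []
unique_tags = [t for t in tags if t not in seen and not seen.append(t)]
print(unique_tags)  # the duplicate Python entry appears only once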
We will create a simple Docker environment. The directory structure looks like the following.
├── batch
│ └── py
│ └── article.py
├── docker
│ └── python
│ ├── Dockerfile
│ ├── etc
│ │ └── cron.d
│ │ └── qiita
│ └── requirements.txt
├── docker-compose.yml
└── mnt
└── json
├── author
├── list
└── tag
- batch directory: contains the Python file that actually performs the scraping.
- docker directory: contains what is needed inside the container, including the actual cron settings.
- mnt directory: mounted on the host; the scraping results are written here as JSON files.
batch directory
These are the contents of the actual file `article.py` in the batch directory. I wrote a similar article in the past, so the detailed method is explained there: >> Get Qiita trends (ranking) and send them to Slack. In this article I will stick to the program itself.
There are two differences from the program in that article. If you just want a list of articles, I think the article above is enough.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import json
import datetime
import os


def get_article_tags(detail_url):
    # Fetch an article page and collect the tags attached to it
    tag_list = []
    res = requests.get(detail_url, headers=headers)
    # Parse the HTML with Beautiful Soup
    soup = BeautifulSoup(res.text, "html.parser")
    tags = soup.find_all(class_="it-Tags_item")
    for tag in tags:
        tag_name = tag.get_text()
        tag_link = tag.get('href')
        tag_list.append({
            'tag_name': tag_name,
            'tag_link': tag_link
        })
    return tag_list


def write_json(json_list, path):
    with open(path, 'w') as f:
        f.write(json.dumps(json_list, ensure_ascii=False, indent=4, sort_keys=True, separators=(',', ':')))


def mkdir(path):
    os.makedirs(path, exist_ok=True)


def get_unique_list(seq):
    # Order-preserving deduplication (dicts are unhashable, so a list tracks seen items)
    seen = []
    return [x for x in seq if x not in seen and not seen.append(x)]


def get_unique_tag(tag_lists):
    # Flatten the per-article tag lists and drop duplicates
    tags = []
    for v in tag_lists:
        for i in v:
            tags.append(i)
    return get_unique_list(tags)


try:
    # Root URL
    url = "https://qiita.com/"
    headers = {
        "User-Agent" : "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1"
    }
    today_date = datetime.datetime.now().date()

    items = []
    item_json = []
    result = []

    res = requests.get(url, headers=headers)
    # Parse the HTML with Beautiful Soup
    soup = BeautifulSoup(res.text, "html.parser")

    try:
        # The trend data is embedded in the page as a "data-hyperapp-props" JSON attribute
        main_items = soup.find(class_="p-home_main")
        for main_items in soup.find_all():
            if "data-hyperapp-props" in main_items.attrs:
                item_json.append(main_items["data-hyperapp-props"])
        items = json.loads(item_json[1])
    except:
        raise Exception("Not Found Json Dom Info")

    if 'edges' not in items['trend']:
        raise Exception("The expected list does not exist")

    try:
        item_detail_list = []
        tags_list = []
        author_list = []
        for edges in items['trend']['edges']:
            uuid = edges['node']['uuid']
            title = edges['node']['title']
            likes = edges['node']['likesCount']
            article_url = url + edges['node']['author']['urlName'] + '/items/' + uuid
            author_name = edges['node']['author']['urlName']
            create_at = datetime.datetime.now().date()

            tag_list = get_article_tags(article_url)

            item = {
                'article_title' : title,
                'article_url' : article_url,
                'article_id' : edges['node']['uuid'],
                'likes' : likes,
                'uuid' : uuid,
                'author_name' : author_name,
                'tag_list' : tag_list,
            }
            item_detail_list.append(item)
            tags_list.append(tag_list)
            author_list.append(author_name)

        mkdir('/mnt/json/list/')
        mkdir('/mnt/json/tag/')
        mkdir('/mnt/json/author/')

        # Deduplicate the tags
        tags_list = get_unique_tag(tags_list)

        # Export the JSON files
        write_json(item_detail_list, f"/mnt/json/list/{today_date}.json")
        write_json(tags_list, f"/mnt/json/tag/{today_date}.json")
        write_json(author_list, f"/mnt/json/author/{today_date}.json")
    except:
        raise Exception("Can't Create Json")
except Exception as e:
    # Log the error if JSON file creation failed
    mkdir('/mnt/log/')
    with open(f'/mnt/log/{today_date}', 'w') as f:
        f.write(str(e))
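If you want to try the scraping step by hand before wiring it into cron (the /mnt paths above will not exist on an ordinary host), here is a rough interactive sketch that only prints the trending titles. It assumes the trend data is still embedded in a data-hyperapp-props attribute, which is simply how the page was built at the time of writing, and it picks whichever props blob contains a trend key instead of relying on a fixed index:

# Rough manual check of the scraping step (nothing is written to disk).
import json
import requests
from bs4 import BeautifulSoup

url = "https://qiita.com/"
headers = {"User-Agent": "Mozilla/5.0"}

soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
for node in soup.find_all():
    if "data-hyperapp-props" in node.attrs:
        props = json.loads(node["data-hyperapp-props"])
        if "trend" in props:
            for edge in props["trend"]["edges"]:
                print(edge["node"]["likesCount"], edge["node"]["title"])
            break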
Next, create an environment to execute the above files.
docker directory
Nothing special here. The volumes section mounts the directories on your PC into the container.
version: "3"
qiita_batch:
container_name: "qiita_batch"
build:
context: ./docker/python
tty: true
volumes:
- ./batch:/usr/src/app
- ./mnt:/mnt
Dockerfile
Forgive me, it's a bit messy... just a brief explanation ↓
- Set the time zone in the container (needed for cron)
- Register the cron settings
- Install the required modules from requirements.txt
If you want cron to run at a specified time in Japan time, setting the time zone is essential. I fumbled around with various things before finally getting it onto Japan time, but there must be a better way...
The cron settings are collected in etc/cron.d/qiita and registered with crontab later. I feel this is better because it is easier to manage. Just don't run crontab -r by mistake...!
FROM python:3
ARG project_dir=/usr/src/app
WORKDIR $project_dir
ADD requirements.txt $project_dir/py/
ADD /etc/cron.d/qiita /etc/cron.d/
ENV TZ=Asia/Tokyo
RUN apt-get update && \
apt-get install -y cron less vim tzdata && \
rm -rf /var/lib/apt/lists/* && \
echo "${TZ}" > /etc/timezone && \
rm /etc/localtime && \
ln -s /usr/share/zoneinfo/Asia/Tokyo /etc/localtime && \
dpkg-reconfigure -f noninteractive tzdata && \
chmod 0744 /etc/cron.d/* && \
touch /var/log/cron.log && \
crontab /etc/cron.d/qiita && \
pip install --upgrade pip && \
pip install -r $project_dir/py/requirements.txt
CMD ["cron", "-f"]
Since requirements.txt is just a dump of what I was using on my MacBook Pro, it contains quite a lot of extra packages. Feel free to strip out what you don't need. All you really need are beautifulsoup4 and requests (json is part of the Python standard library). If something turns out to be missing when you run it, just pip install whatever is lacking.
appdirs==1.4.3
beautifulsoup4==4.8.1
bs4==0.0.1
certifi==2019.9.11
chardet==3.0.4
Click==7.0
filelock==3.0.12
get==2019.4.13
gunicorn==20.0.4
idna==2.8
importlib-metadata==1.5.0
importlib-resources==1.0.2
itsdangerous==1.1.0
Jinja2==2.11.1
MarkupSafe==1.1.1
post==2019.4.13
public==2019.4.13
query-string==2019.4.13
request==2019.4.13
requests==2.22.0
six==1.14.0
soupsieve==1.9.5
urllib3==1.25.7
virtualenv==20.0.1
Werkzeug==1.0.0
zipp==2.2.0
The contents of /etc/cron.d/qiita
PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LANG=ja_JP.UTF-8
# Create Qiita JSON (every day AM:10:00)
0 10 * * * python /usr/src/app/py/article.py >> /var/log/cron.log 2>&1
Like this!
After that, just start it with docker-compose up -d and leave it alone; it will go scrape Qiita every day and create the JSON files. Recommended, since it all runs in a simple Docker environment!
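As a quick sanity check from the host (the paths assume the default ./mnt mount from docker-compose.yml), you can verify that today's three JSON files were actually created:

# Quick host-side check that today's author/list/tag JSON files exist.
import datetime
import os

today = datetime.datetime.now().date()
for kind in ("author", "list", "tag"):
    path = f"./mnt/json/{kind}/{today}.json"
    print(path, "OK" if os.path.exists(path) else "missing")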