I built an environment with Docker that saves Qiita trend information. Basically, once you start the container, a scraping job runs every day and saves the trend information as JSON. This article is recommended for the following kinds of people:
- I want to analyze Qiita trends
- I want to study a little Python
- I want to try out Docker for a bit

Three types of JSON are saved:

- author (list of trending authors)
- list (list of trending articles)
- tag (list of tags attached to trending articles)
The contents of the JSON that is actually saved are as follows.
author: A list of the usernames of the trending authors.
[
"uhyo",
"suin",
"Yz_4230",
"atskimura",
"pineappledreams",
"Amanokawa",
"k_shibusawa",
"minakawa-daiki",
"morry_48",
"c60evaporator",
"takuya_tsurumi",
"TomoEndo",
"yhatt",
"CEML",
"moritalous",
"svfreerider",
"daisukeoda",
"karaage0703",
"tommy19970714",
"tyru",
"galileo15640215",
"keitah",
"mocapapa",
"akeome",
"ssssssssok1",
"yuno_miyako",
"katzueno",
"cometscome_phys",
"mpyw",
"akane_kato"
]
list: Get the list of articles trending on Qiita. The following information is output for each article:

- Article UUID (article ID)
- Article title
- Article URL
- Article author name
- Number of LGTMs
- Tags attached to the article and their URLs
[
{
"article_id":"e66cbca2f582e81d5b16",
"article_title":"Let'Proxy server that blocks web pages using s Encrypt",
"article_url":"https://qiita.com/uhyo/items/e66cbca2f582e81d5b16",
"author_name":"uhyo",
"likes":66,
"tag_list":[
{
"tag_link":"/tags/javascript",
"tag_name":"JavaScript"
},
{
"tag_link":"/tags/node.js",
"tag_name":"Node.js"
},
{
"tag_link":"/tags/proxy",
"tag_name":"proxy"
},
{
"tag_link":"/tags/https",
"tag_name":"HTTPS"
},
{
"tag_link":"/tags/letsencrypt",
"tag_name":"letsencrypt"
}
]
},
{
"article_id":"83ebaf96caa2c13c8b2f",
"article_title":"Create a macOS screensaver with HTML / CSS / JS(No Swift skills required)",
"article_url":"https://qiita.com/suin/items/83ebaf96caa2c13c8b2f",
"author_name":"suin",
"likes":60,
"tag_list":[
{
"tag_link":"/tags/html",
"tag_name":"HTML"
},
{
"tag_link":"/tags/css",
"tag_name":"CSS"
},
{
"tag_link":"/tags/javascript",
"tag_name":"JavaScript"
},
{
"tag_link":"/tags/macos",
"tag_name":"macos"
}
]
}
]
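If you just want to poke around in this data, a minimal sketch like the following loads one day's list JSON and sorts the articles by LGTM count (the file name and date here are assumptions; point it at a file you actually have under mnt/json/list/):

import json

# Hypothetical example file; replace the date with one that exists on your machine
path = "./mnt/json/list/2020-02-22.json"

with open(path, encoding="utf-8") as f:
    articles = json.load(f)

# Sort by LGTM count, descending, and print a simple ranking
for article in sorted(articles, key=lambda a: a["likes"], reverse=True):
    print(f'{article["likes"]:>4} LGTM  {article["article_title"]}')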
Qiita's trends are updated twice a day, at 5:00 and 17:00, but since the articles do not change that much, I only run the job once a day.
tag: Get the tags attached to articles trending on Qiita.
[
{
"tag_link":"/tags/python",
"tag_name":"Python"
},
{
"tag_link":"/tags/r",
"tag_name":"R"
},
{
"tag_link":"/tags/%e6%a9%9f%e6%a2%b0%e5%ad%a6%e7%bf%92",
"tag_name":"Machine learning"
}
]
Tags are also included in the article list above, but there each tag is tied to a single article, so the same tag appears more than once when it is attached to different articles. Therefore, duplicate tags are removed and only the unique trending tags are saved as a list.
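Because each tag is a dict, and dicts are unhashable, you cannot simply throw them into a set() to deduplicate. Here is a minimal sketch of the order-preserving deduplication idea (the same idea appears as get_unique_list in the script below):

# Order-preserving deduplication of a list of dicts.
# set() cannot be used here because dicts are unhashable, so seen items are tracked in a list.
tags = [
    {"tag_name": "Python", "tag_link": "/tags/python"},
    {"tag_name": "R", "tag_link": "/tags/r"},
    {"tag_name": "Python", "tag_link": "/tags/python"},  # duplicate
]

seen = []
unique_tags = [t for t in tags if t not in seen and not seen.append(t)]
print(unique_tags)  # the duplicate Python entry appears only once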
We will create a simple Docker environment. The directory structure looks like the following.
├── batch
│ └── py
│ └── article.py
├── docker
│ └── python
│ ├── Dockerfile
│ ├── etc
│ │ └── cron.d
│ │ └── qiita
│ └── requirements.txt
├── docker-compose.yml
└── mnt
└── json
├── author
├── list
└── tag
- batch directory: contains the Python file that actually performs the scraping.
- docker directory: contains what is needed inside the container, including the actual cron settings.
- mnt directory: mounted on the host; the scraping results are written here as JSON files.
batch directory
These are the contents of the actual file `article.py` in the batch directory. I wrote a similar article in the past, so the detailed method is explained there: >> Get Qiita trends (ranking) and send them to Slack. In this article I will stick to the program itself.
There are two differences from the program in that article. If you just want a list of articles, I think the article above is enough.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import json
import datetime
import os


def get_article_tags(detail_url):
    # Fetch an article page and collect the tags attached to it
    tag_list = []
    res = requests.get(detail_url, headers=headers)
    # Parse the HTML with Beautiful Soup
    soup = BeautifulSoup(res.text, "html.parser")
    tags = soup.find_all(class_="it-Tags_item")
    for tag in tags:
        tag_name = tag.get_text()
        tag_link = tag.get('href')
        tag_list.append({
            'tag_name': tag_name,
            'tag_link': tag_link
        })
    return tag_list


def write_json(json_list, path):
    with open(path, 'w') as f:
        f.write(json.dumps(json_list, ensure_ascii=False, indent=4, sort_keys=True, separators=(',', ':')))


def mkdir(path):
    os.makedirs(path, exist_ok=True)


def get_unique_list(seq):
    # Order-preserving deduplication (dicts are unhashable, so a list tracks seen items)
    seen = []
    return [x for x in seq if x not in seen and not seen.append(x)]


def get_unique_tag(tag_lists):
    # Flatten the per-article tag lists and drop duplicates
    tags = []
    for v in tag_lists:
        for i in v:
            tags.append(i)
    return get_unique_list(tags)


try:
    # Root URL
    url = "https://qiita.com/"
    headers = {
        "User-Agent" : "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1"
    }
    today_date = datetime.datetime.now().date()

    items = []
    item_json = []
    result = []

    res = requests.get(url, headers=headers)
    # Parse the HTML with Beautiful Soup
    soup = BeautifulSoup(res.text, "html.parser")

    try:
        # The trend data is embedded in the page as a "data-hyperapp-props" JSON attribute
        main_items = soup.find(class_="p-home_main")
        for main_items in soup.find_all():
            if "data-hyperapp-props" in main_items.attrs:
                item_json.append(main_items["data-hyperapp-props"])
        items = json.loads(item_json[1])
    except:
        raise Exception("Not Found Json Dom Info")

    if 'edges' not in items['trend']:
        raise Exception("The expected list does not exist")

    try:
        item_detail_list = []
        tags_list = []
        author_list = []
        for edges in items['trend']['edges']:
            uuid = edges['node']['uuid']
            title = edges['node']['title']
            likes = edges['node']['likesCount']
            article_url = url + edges['node']['author']['urlName'] + '/items/' + uuid
            author_name = edges['node']['author']['urlName']
            create_at = datetime.datetime.now().date()

            tag_list = get_article_tags(article_url)

            item = {
                'article_title' : title,
                'article_url' : article_url,
                'article_id' : edges['node']['uuid'],
                'likes' : likes,
                'uuid' : uuid,
                'author_name' : author_name,
                'tag_list' : tag_list,
            }
            item_detail_list.append(item)
            tags_list.append(tag_list)
            author_list.append(author_name)

        mkdir('/mnt/json/list/')
        mkdir('/mnt/json/tag/')
        mkdir('/mnt/json/author/')

        # Deduplicate the tags
        tags_list = get_unique_tag(tags_list)

        # Export the JSON files
        write_json(item_detail_list, f"/mnt/json/list/{today_date}.json")
        write_json(tags_list, f"/mnt/json/tag/{today_date}.json")
        write_json(author_list, f"/mnt/json/author/{today_date}.json")
    except:
        raise Exception("Can't Create Json")
except Exception as e:
    # Log the error if JSON file creation failed
    mkdir('/mnt/log/')
    with open(f'/mnt/log/{today_date}', 'w') as f:
        f.write(str(e))
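If you want to try the scraping step by hand before wiring it into cron (the /mnt paths above will not exist on an ordinary host), here is a rough interactive sketch that only prints the trending titles. It assumes the trend data is still embedded in a data-hyperapp-props attribute, which is simply how the page was built at the time of writing, and it picks whichever props blob contains a trend key instead of relying on a fixed index:

# Rough manual check of the scraping step (nothing is written to disk).
import json
import requests
from bs4 import BeautifulSoup

url = "https://qiita.com/"
headers = {"User-Agent": "Mozilla/5.0"}

soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
for node in soup.find_all():
    if "data-hyperapp-props" in node.attrs:
        props = json.loads(node["data-hyperapp-props"])
        if "trend" in props:
            for edge in props["trend"]["edges"]:
                print(edge["node"]["likesCount"], edge["node"]["title"])
            break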
Next, create an environment to execute the above files.
docker directory
Nothing special here. The volumes section mounts the directories on your PC into the container.
version: "3"
qiita_batch:
container_name: "qiita_batch"
build:
context: ./docker/python
tty: true
volumes:
- ./batch:/usr/src/app
- ./mnt:/mnt
Dockerfile
Forgive me, it's a bit messy... just a brief explanation ↓
- Set the time zone in the container (needed for cron)
- Register the cron settings
- Install the required modules from requirements.txt
If you want cron to run at a specified time in Japan time, setting the time zone is essential. I fumbled around with various things before finally getting it onto Japan time, but there must be a better way...
The cron settings are collected in etc/cron.d/qiita and registered with crontab later. I feel this is better because it is easier to manage. Just don't run crontab -r by mistake...!
FROM python:3
ARG project_dir=/usr/src/app
WORKDIR $project_dir
ADD requirements.txt $project_dir/py/
ADD /etc/cron.d/qiita /etc/cron.d/
ENV TZ=Asia/Tokyo
RUN apt-get update && \
apt-get install -y cron less vim tzdata && \
rm -rf /var/lib/apt/lists/* && \
echo "${TZ}" > /etc/timezone && \
rm /etc/localtime && \
ln -s /usr/share/zoneinfo/Asia/Tokyo /etc/localtime && \
dpkg-reconfigure -f noninteractive tzdata && \
chmod 0744 /etc/cron.d/* && \
touch /var/log/cron.log && \
crontab /etc/cron.d/qiita && \
pip install --upgrade pip && \
pip install -r $project_dir/py/requirements.txt
CMD ["cron", "-f"]
Since requirements.txt is just a dump of what I was using on my MacBook Pro, it contains quite a lot of extra packages. Feel free to strip out what you don't need. All you really need are beautifulsoup4 and requests (json is part of the Python standard library). If something turns out to be missing when you run it, just pip install whatever is lacking.
appdirs==1.4.3
beautifulsoup4==4.8.1
bs4==0.0.1
certifi==2019.9.11
chardet==3.0.4
Click==7.0
filelock==3.0.12
get==2019.4.13
gunicorn==20.0.4
idna==2.8
importlib-metadata==1.5.0
importlib-resources==1.0.2
itsdangerous==1.1.0
Jinja2==2.11.1
MarkupSafe==1.1.1
post==2019.4.13
public==2019.4.13
query-string==2019.4.13
request==2019.4.13
requests==2.22.0
six==1.14.0
soupsieve==1.9.5
urllib3==1.25.7
virtualenv==20.0.1
Werkzeug==1.0.0
zipp==2.2.0
The contents of /etc/cron.d/qiita
PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LANG=ja_JP.UTF-8
# Create Qiita JSON (every day AM:10:00)
0 10 * * * python /usr/src/app/py/article.py >> /var/log/cron.log 2>&1
Like this!
After that, just start it with docker-compose up -d and leave it alone; it will go scrape Qiita every day and create the JSON files. Recommended, since it all runs in a simple Docker environment!
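As a quick sanity check from the host (the paths assume the default ./mnt mount from docker-compose.yml), you can verify that today's three JSON files were actually created:

# Quick host-side check that today's author/list/tag JSON files exist.
import datetime
import os

today = datetime.datetime.now().date()
for kind in ("author", "list", "tag"):
    path = f"./mnt/json/{kind}/{today}.json"
    print(path, "OK" if os.path.exists(path) else "missing")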