As a retrospective on the Hatena blog posts I wrote in 2019, I decided to create a WordCloud from my blog articles[^1]. After a little research, I found an article that calls the blog's API from JavaScript, so I tried to do the same thing in Python. This article summarizes what I learned by actually trying it out.
[^1]: Link to be updated after publication
The scripts covered in this article do the following:
- Get a list of the articles I wrote on my Hatena blog
- Save those articles to my computer, one file per article
My environment:
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.4
BuildVersion: 18E226
$ python -V
Python 3.7.3  # working in a virtual environment created with the venv module
$ pip list | grep requests
requests 2.22.0
Preparation consists of two steps:

- Learn how to parse XML in Python
- Call the API using my credentials
Hatena Blog AtomPub | http://developer.hatena.ne.jp/ja/documents/blog/apis/atom
An API using the Atom Publishing Protocol (AtomPub) is open to the public.
In order to use Hatena Blog AtomPub, the client needs to perform OAuth authentication, WSSE authentication, or Basic authentication.
This time, I implemented it with **Basic authentication**.
As described in "Getting a list of blog entries" in the document above, **the response is returned in XML**.
I used the standard-library ElementTree XML API (https://docs.python.org/ja/3/library/xml.etree.elementtree.html).
The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data.
Let's look at XML manipulation with a simple example; I performed the same operations on the Hatena Blog AtomPub response.
practice.py
import xml.etree.ElementTree as ET
sample_xml_as_string = """<?xml version="1.0"?>
<data>
<member name="Kumiko">
<grade>2</grade>
<instrument>euphonium</instrument>
</member>
<member name="Kanade">
<grade>1</grade>
<instrument>euphonium</instrument>
</member>
<member name="Mizore">
<grade>3</grade>
<instrument>oboe</instrument>
</member>
</data>"""
root = ET.fromstring(sample_xml_as_string)
Specify the `-i` option[^2] when running the script with the Python interpreter.
This option drops into interactive mode after the script has executed (the intent is to avoid having to type the XML string into interactive mode by hand).
$ python -i practice.py
>>> root
<Element 'data' at 0x108ac4f48>
>>> root.tag  # the tag of the <data> element
'data'
>>> root.attrib  # the <data> element has no attributes
{}
>>> for child in root:  # iterate over the <member> elements nested in <data>
... print(child.tag, child.attrib)
...
member {'name': 'Kumiko'}  # each member has a name attribute
member {'name': 'Kanade'}
member {'name': 'Mizore'}
In addition to the for statement, the nested structure of XML tags can be traversed with the `find` and `findall` methods[^3]:

- `find`: finds the first child element with the specified tag
- `findall`: finds all direct child elements with the specified tag
>>> # continued from above
>>> someone = root.find('member')
>>> print(someone.tag, someone.attrib)
member {'name': 'Kumiko'}  # the first <member> child element
>>> members = root.findall('member')
>>> for member in members:
... print(member.tag, member.attrib)
...
member {'name': 'Kumiko'}  # all <member> child elements are retrieved
member {'name': 'Kanade'}
member {'name': 'Mizore'}
>>> for member in members:
... instrument = member.find('instrument')
... print(instrument.text)  # the text enclosed by the tags
...
euphonium
euphonium
oboe
This time, I parsed the blog's AtomPub response with `find` and `findall`.
The information needed to call Hatena Blog AtomPub with Basic authentication can be found under Hatena Blog Settings > Advanced Settings > AtomPub.
For Basic authentication, I used the `auth` argument of the `requests` `get` method[^4].
import xml.etree.ElementTree as ET

import requests

blog_entries_url = "https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry"
user_pass_tuple = ("nikkie-ftnext", "the_api_key")
r = requests.get(blog_entries_url, auth=user_pass_tuple)
root = ET.fromstring(r.text)
Pass the XML string `r.text` to the `ElementTree.fromstring` method.
I proceeded with the XML analysis while comparing the actual response against "Getting a list of blog entries" at http://developer.hatena.ne.jp/ja/documents/blog/apis/atom.
In [20]: for child in root: #root is the same as in the example above
...: print(child.tag, child.attrib)
...:
{http://www.w3.org/2005/Atom}link {'rel': 'first', 'href': 'https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry'}
{http://www.w3.org/2005/Atom}link {'rel': 'next', 'href': 'https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry?page=1572227492'} #Next page of article list
{http://www.w3.org/2005/Atom}title {}
{http://www.w3.org/2005/Atom}subtitle {}
{http://www.w3.org/2005/Atom}link {'rel': 'alternate', 'href': 'https://nikkie-ftnext.hatenablog.com/'}
{http://www.w3.org/2005/Atom}updated {}
{http://www.w3.org/2005/Atom}author {}
{http://www.w3.org/2005/Atom}generator {'uri': 'https://blog.hatena.ne.jp/', 'version': '3977fa1b6c9f31b5eab4610099c62851'}
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}entry {} #Individual articles
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
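The link element with `rel="next"` is what lets us page through the article list. A minimal sketch of pulling it out, using a feed fragment abbreviated from the response above:

```python
import xml.etree.ElementTree as ET

# Abbreviated feed fragment modeled on the actual response
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="first" href="https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry"/>
  <link rel="next" href="https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry?page=1572227492"/>
  <title>sample</title>
</feed>"""

root = ET.fromstring(feed_xml)
# Tags come back namespace-qualified, so findall needs the {namespace}tag form
links = root.findall("{http://www.w3.org/2005/Atom}link")
next_uri = None
for link in links:
    if link.attrib["rel"] == "next":
        next_uri = link.attrib["href"]
print(next_uri)  # the ?page=... URI to request for the next page
```

Requesting `next_uri` in a loop walks backward through the whole article list.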
Let's also look at an individual article (the child elements of an entry tag).
>>> for item in child: # child still holds the last entry from the loop above
... print(item.tag, item.attrib)
...
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}link {'rel': 'edit', 'href': 'https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry/26006613457877510'}
{http://www.w3.org/2005/Atom}link {'rel': 'alternate', 'type': 'text/html', 'href': 'https://nikkie-ftnext.hatenablog.com/entry/rejectpy2019-plan-step0'}
{http://www.w3.org/2005/Atom}author {}
{http://www.w3.org/2005/Atom}title {} # title (a WordCloud target)
{http://www.w3.org/2005/Atom}updated {}
{http://www.w3.org/2005/Atom}published {} # note: present even for drafts
{http://www.w3.org/2007/app}edited {}
{http://www.w3.org/2005/Atom}summary {'type': 'text'}
{http://www.w3.org/2005/Atom}content {'type': 'text/x-markdown'} # the body text
{http://www.hatena.ne.jp/info/xmlns#}formatted-content {'type': 'text/html'}
{http://www.w3.org/2005/Atom}category {'term': 'Speaker report'}
{http://www.w3.org/2007/app}control {} # its child element tells whether the entry is a draft
Since I want to build the WordCloud from published articles only, I check the draft child element of {http://www.w3.org/2007/app}control (if its text is yes, the entry is an unpublished draft and is excluded).
For the WordCloud text, I concatenated each article's title and body.
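As a sketch of that draft check (on a made-up minimal entry; a real one carries many more child elements):

```python
import xml.etree.ElementTree as ET

APP = "{http://www.w3.org/2007/app}"

# Hypothetical draft entry, reduced to the elements the check needs
entry_xml = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:app="http://www.w3.org/2007/app">
  <title>WIP post</title>
  <app:control><app:draft>yes</app:draft></app:control>
</entry>"""

entry = ET.fromstring(entry_xml)
# Drill down: app:control -> app:draft, then read its text
draft = entry.find(APP + "control").find(APP + "draft").text
is_draft = draft == "yes"  # True -> unpublished draft, excluded from the WordCloud
print(is_draft)  # True
```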
The script is published in the following repository: https://github.com/ftnext/hatenablog-atompub-python
main.py
import argparse
from datetime import datetime, timedelta, timezone
import os
from pathlib import Path
import xml.etree.ElementTree as ET
import requests
def load_credentials(username):
    """Return the credentials needed for Hatena API access as a tuple"""
    auth_token = os.getenv("HATENA_BLOG_ATOMPUB_KEY")
    message = (
        "Set the AtomPub API key in the environment variable "
        "`HATENA_BLOG_ATOMPUB_KEY`"
    )
    assert auth_token, message
    return (username, auth_token)


def retrieve_hatena_blog_entries(blog_entries_uri, user_pass_tuple):
    """GET the Hatena Blog API and return the XML entry list as a string"""
    r = requests.get(blog_entries_uri, auth=user_pass_tuple)
    return r.text


def select_elements_of_tag(xml_root, tag):
    """Parse the returned XML and return all child elements with the specified tag"""
    return xml_root.findall(tag)


def return_next_entry_list_uri(links):
    """Return the endpoint of the next page of the blog entry list"""
    for link in links:
        if link.attrib["rel"] == "next":
            return link.attrib["href"]


def is_draft(entry):
    """Determine whether a blog entry is a draft"""
    draft_status = (
        entry.find("{http://www.w3.org/2007/app}control")
        .find("{http://www.w3.org/2007/app}draft")
        .text
    )
    return draft_status == "yes"


def return_published_date(entry):
    """Return the publication date of a blog entry

    Note: the API returns this element even for drafts
    """
    publish_date_str = entry.find(
        "{http://www.w3.org/2005/Atom}published"
    ).text
    return datetime.fromisoformat(publish_date_str)


def is_in_period(datetime_, start, end):
    """Determine whether the given datetime falls between start and end"""
    return start <= datetime_ < end


def return_id(entry):
    """Return the ID part contained in the entry's URI"""
    link = entry.find("{http://www.w3.org/2005/Atom}link")
    uri = link.attrib["href"]
    return uri.split("/")[-1]


def return_contents(entry):
    """Return the entry's title and body, concatenated"""
    title = entry.find("{http://www.w3.org/2005/Atom}title").text
    content = entry.find("{http://www.w3.org/2005/Atom}content").text
    return f"{title}。\n\n{content}"


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("hatena_id")
    parser.add_argument("blog_domain")
    parser.add_argument("target_year", type=int)
    parser.add_argument("--output", type=Path)
    args = parser.parse_args()

    hatena_id = args.hatena_id
    blog_domain = args.blog_domain
    target_year = args.target_year
    output_path = args.output if args.output else Path("output")

    user_pass_tuple = load_credentials(hatena_id)
    blog_entries_uri = (
        f"https://blog.hatena.ne.jp/{hatena_id}/{blog_domain}/atom/entry"
    )

    jst_tz = timezone(timedelta(seconds=9 * 60 * 60))
    date_range_start = datetime(target_year, 1, 1, tzinfo=jst_tz)
    date_range_end = datetime(target_year + 1, 1, 1, tzinfo=jst_tz)

    oldest_published_date = datetime.now(jst_tz)
    target_entries = []
    while date_range_start <= oldest_published_date:
        entries_xml = retrieve_hatena_blog_entries(
            blog_entries_uri, user_pass_tuple
        )
        root = ET.fromstring(entries_xml)
        links = select_elements_of_tag(
            root, "{http://www.w3.org/2005/Atom}link"
        )
        blog_entries_uri = return_next_entry_list_uri(links)
        entries = select_elements_of_tag(
            root, "{http://www.w3.org/2005/Atom}entry"
        )
        for entry in entries:
            if is_draft(entry):
                continue
            oldest_published_date = return_published_date(entry)
            if is_in_period(
                oldest_published_date, date_range_start, date_range_end
            ):
                target_entries.append(entry)
        print(
            f"Fetched articles up to {oldest_published_date} "
            f"({len(target_entries)} in total)"
        )

    output_path.mkdir(parents=True, exist_ok=True)
    for entry in target_entries:
        id_ = return_id(entry)
        file_path = output_path / f"{id_}.txt"
        contents = return_contents(entry)
        with open(file_path, "w") as fout:
            fout.write(contents)
$ python main.py nikkie-ftnext nikkie-ftnext.hatenablog.com 2019 --output output/2019
Fetched articles up to 2019-10-30 11:25:23+09:00 (9 in total)
Fetched articles up to 2019-06-13 10:18:36+09:00 (18 in total)
Fetched articles up to 2019-03-30 13:52:19+09:00 (27 in total)
Fetched articles up to 2018-12-23 10:24:06+09:00 (32 in total)
# -> 32 text files are created under output/2019 (each containing a blog post I wrote)
In this way, I was able to retrieve the blog articles I wrote via Hatena Blog AtomPub!
When I do this again, I would like to investigate and incorporate the following:

- It seems AtomPub only lets you fetch your own blog
    - To target someone else's blog, you would probably need to become a member of that blog
    - Or scrape it, after checking robots.txt?
- Parsing XML with namespaces
    - https://docs.python.org/ja/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
    - There was no need to spell out tags like {http://www.w3.org/2005/Atom}link (it can be written DRY as in the documentation)
- beautifulsoup4 could also be tried for XML parsing
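For reference, the namespace-dict style from the documentation, sketched on a made-up feed:

```python
import xml.etree.ElementTree as ET

feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>"""

root = ET.fromstring(feed_xml)
# Map a prefix of our choosing to the Atom namespace once...
ns = {"atom": "http://www.w3.org/2005/Atom"}
# ...then use it instead of spelling out {http://www.w3.org/2005/Atom} each time
titles = [e.find("atom:title", ns).text
          for e in root.findall("atom:entry", ns)]
print(titles)  # ['First post', 'Second post']
```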
That's all.