As a retrospective on the Hatena blog posts I wrote in 2019, I decided to create a WordCloud from my blog articles[^1]. After a little research, I found an article that calls the blog's API from JavaScript, so I tried to do the same thing in Python. This article summarizes what I learned by actually trying it out.
[^1]: Link to be updated after publication
The scripts covered in this article do the following:
- Get a list of the articles I wrote on my Hatena blog
- Save those articles to my computer, one file per article
My environment:
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.4
BuildVersion: 18E226
$ python -V
Python 3.7.3  # working in a virtual environment created with the venv module
$ pip list | grep requests
requests 2.22.0
Preparation consists of two steps:

- Learn how to parse XML in Python
- Call the API using my credentials
Hatena Blog AtomPub | http://developer.hatena.ne.jp/ja/documents/blog/apis/atom
An API using the Atom Publishing Protocol (AtomPub) is open to the public.
In order to use Hatena Blog AtomPub, the client needs to perform OAuth authentication, WSSE authentication, or Basic authentication.
This time, I implemented it with **Basic authentication**.
As described in "Getting a list of blog entries" in the document above, **the response is returned in XML**.
I used the standard-library ElementTree XML API (https://docs.python.org/ja/3/library/xml.etree.elementtree.html).
The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data.
Let's look at XML manipulation with a simple example; I performed the same operations on the Hatena Blog AtomPub response.
practice.py
import xml.etree.ElementTree as ET
sample_xml_as_string = """<?xml version="1.0"?>
<data>
<member name="Kumiko">
<grade>2</grade>
<instrument>euphonium</instrument>
</member>
<member name="Kanade">
<grade>1</grade>
<instrument>euphonium</instrument>
</member>
<member name="Mizore">
<grade>3</grade>
<instrument>oboe</instrument>
</member>
</data>"""
root = ET.fromstring(sample_xml_as_string)
Specify the `-i` option[^2] when running the script with the Python interpreter.
This option drops into interactive mode after the script has executed (the intent is to avoid having to type the XML string into interactive mode by hand).
$ python -i practice.py
>>> root
<Element 'data' at 0x108ac4f48>
>>> root.tag  # the tag of the <data> element
'data'
>>> root.attrib  # the <data> element has no attributes
{}
>>> for child in root:  # iterate over the <member> elements nested in <data>
... print(child.tag, child.attrib)
...
member {'name': 'Kumiko'}  # each member has a name attribute
member {'name': 'Kanade'}
member {'name': 'Mizore'}
In addition to the for statement, the nested structure of XML tags can be traversed with the `find` and `findall` methods[^3]:

- `find`: finds the first child element with the specified tag
- `findall`: finds all direct child elements with the specified tag
>>> # continued from above
>>> someone = root.find('member')
>>> print(someone.tag, someone.attrib)
member {'name': 'Kumiko'}  # the first <member> child element
>>> members = root.findall('member')
>>> for member in members:
... print(member.tag, member.attrib)
...
member {'name': 'Kumiko'}  # all <member> child elements are retrieved
member {'name': 'Kanade'}
member {'name': 'Mizore'}
>>> for member in members:
... instrument = member.find('instrument')
... print(instrument.text)  # the text enclosed by the tags
...
euphonium
euphonium
oboe
This time, I parsed the blog's AtomPub response with `find` and `findall`.
The information needed to call Hatena Blog AtomPub with Basic authentication can be found under Hatena Blog Settings > Advanced Settings > AtomPub.
For Basic authentication, I used the `auth` argument of the `requests` `get` method[^4].
import xml.etree.ElementTree as ET

import requests

blog_entries_url = "https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry"
user_pass_tuple = ("nikkie-ftnext", "the_api_key")
r = requests.get(blog_entries_url, auth=user_pass_tuple)
root = ET.fromstring(r.text)
Pass the XML string `r.text` to the `ElementTree.fromstring` method.
I proceeded with the XML analysis while comparing the actual response against "Getting a list of blog entries" at http://developer.hatena.ne.jp/ja/documents/blog/apis/atom.
In [20]: for child in root: #root is the same as in the example above
...: print(child.tag, child.attrib)
...:
{http://www.w3.org/2005/Atom}link {'rel': 'first', 'href': 'https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry'}
{http://www.w3.org/2005/Atom}link {'rel': 'next', 'href': 'https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry?page=1572227492'} #Next page of article list
{http://www.w3.org/2005/Atom}title {}
{http://www.w3.org/2005/Atom}subtitle {}
{http://www.w3.org/2005/Atom}link {'rel': 'alternate', 'href': 'https://nikkie-ftnext.hatenablog.com/'}
{http://www.w3.org/2005/Atom}updated {}
{http://www.w3.org/2005/Atom}author {}
{http://www.w3.org/2005/Atom}generator {'uri': 'https://blog.hatena.ne.jp/', 'version': '3977fa1b6c9f31b5eab4610099c62851'}
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}entry {} #Individual articles
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
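The link element with `rel="next"` is what lets us page through the article list. A minimal sketch of pulling it out, using a feed fragment abbreviated from the response above:

```python
import xml.etree.ElementTree as ET

# Abbreviated feed fragment modeled on the actual response
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="first" href="https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry"/>
  <link rel="next" href="https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry?page=1572227492"/>
  <title>sample</title>
</feed>"""

root = ET.fromstring(feed_xml)
# Tags come back namespace-qualified, so findall needs the {namespace}tag form
links = root.findall("{http://www.w3.org/2005/Atom}link")
next_uri = None
for link in links:
    if link.attrib["rel"] == "next":
        next_uri = link.attrib["href"]
print(next_uri)  # the ?page=... URI to request for the next page
```

Requesting `next_uri` in a loop walks backward through the whole article list.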
Let's also look at an individual article (the child elements of an entry tag).
>>> for item in child: # child still holds the last entry from the loop above
... print(item.tag, item.attrib)
...
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}link {'rel': 'edit', 'href': 'https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry/26006613457877510'}
{http://www.w3.org/2005/Atom}link {'rel': 'alternate', 'type': 'text/html', 'href': 'https://nikkie-ftnext.hatenablog.com/entry/rejectpy2019-plan-step0'}
{http://www.w3.org/2005/Atom}author {}
{http://www.w3.org/2005/Atom}title {} # title (a WordCloud target)
{http://www.w3.org/2005/Atom}updated {}
{http://www.w3.org/2005/Atom}published {} # note: present even for drafts
{http://www.w3.org/2007/app}edited {}
{http://www.w3.org/2005/Atom}summary {'type': 'text'}
{http://www.w3.org/2005/Atom}content {'type': 'text/x-markdown'} # the body text
{http://www.hatena.ne.jp/info/xmlns#}formatted-content {'type': 'text/html'}
{http://www.w3.org/2005/Atom}category {'term': 'Speaker report'}
{http://www.w3.org/2007/app}control {} # its child element tells whether the entry is a draft
Since I want to build the WordCloud from published articles only, I check the draft child element of {http://www.w3.org/2007/app}control (if its text is yes, the entry is an unpublished draft and is excluded).
For the WordCloud text, I concatenated each article's title and body.
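As a sketch of that draft check (on a made-up minimal entry; a real one carries many more child elements):

```python
import xml.etree.ElementTree as ET

APP = "{http://www.w3.org/2007/app}"

# Hypothetical draft entry, reduced to the elements the check needs
entry_xml = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:app="http://www.w3.org/2007/app">
  <title>WIP post</title>
  <app:control><app:draft>yes</app:draft></app:control>
</entry>"""

entry = ET.fromstring(entry_xml)
# Drill down: app:control -> app:draft, then read its text
draft = entry.find(APP + "control").find(APP + "draft").text
is_draft = draft == "yes"  # True -> unpublished draft, excluded from the WordCloud
print(is_draft)  # True
```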
The script is published in the following repository: https://github.com/ftnext/hatenablog-atompub-python
main.py
import argparse
from datetime import datetime, timedelta, timezone
import os
from pathlib import Path
import xml.etree.ElementTree as ET
import requests
def load_credentials(username):
    """Return the credentials needed for Hatena API access as a tuple"""
    auth_token = os.getenv("HATENA_BLOG_ATOMPUB_KEY")
    message = (
        "Set the AtomPub API key in the environment variable "
        "`HATENA_BLOG_ATOMPUB_KEY`"
    )
    assert auth_token, message
    return (username, auth_token)


def retrieve_hatena_blog_entries(blog_entries_uri, user_pass_tuple):
    """GET the Hatena Blog API and return the XML entry list as a string"""
    r = requests.get(blog_entries_uri, auth=user_pass_tuple)
    return r.text


def select_elements_of_tag(xml_root, tag):
    """Parse the returned XML and return all child elements with the specified tag"""
    return xml_root.findall(tag)


def return_next_entry_list_uri(links):
    """Return the endpoint of the next page of the blog entry list"""
    for link in links:
        if link.attrib["rel"] == "next":
            return link.attrib["href"]


def is_draft(entry):
    """Determine whether a blog entry is a draft"""
    draft_status = (
        entry.find("{http://www.w3.org/2007/app}control")
        .find("{http://www.w3.org/2007/app}draft")
        .text
    )
    return draft_status == "yes"


def return_published_date(entry):
    """Return the publication date of a blog entry

    Note: the API returns this element even for drafts
    """
    publish_date_str = entry.find(
        "{http://www.w3.org/2005/Atom}published"
    ).text
    return datetime.fromisoformat(publish_date_str)


def is_in_period(datetime_, start, end):
    """Determine whether the given datetime falls between start and end"""
    return start <= datetime_ < end


def return_id(entry):
    """Return the ID part contained in the entry's URI"""
    link = entry.find("{http://www.w3.org/2005/Atom}link")
    uri = link.attrib["href"]
    return uri.split("/")[-1]


def return_contents(entry):
    """Return the entry's title and body, concatenated"""
    title = entry.find("{http://www.w3.org/2005/Atom}title").text
    content = entry.find("{http://www.w3.org/2005/Atom}content").text
    return f"{title}。\n\n{content}"


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("hatena_id")
    parser.add_argument("blog_domain")
    parser.add_argument("target_year", type=int)
    parser.add_argument("--output", type=Path)
    args = parser.parse_args()

    hatena_id = args.hatena_id
    blog_domain = args.blog_domain
    target_year = args.target_year
    output_path = args.output if args.output else Path("output")

    user_pass_tuple = load_credentials(hatena_id)
    blog_entries_uri = (
        f"https://blog.hatena.ne.jp/{hatena_id}/{blog_domain}/atom/entry"
    )

    jst_tz = timezone(timedelta(seconds=9 * 60 * 60))
    date_range_start = datetime(target_year, 1, 1, tzinfo=jst_tz)
    date_range_end = datetime(target_year + 1, 1, 1, tzinfo=jst_tz)

    oldest_published_date = datetime.now(jst_tz)
    target_entries = []
    while date_range_start <= oldest_published_date:
        entries_xml = retrieve_hatena_blog_entries(
            blog_entries_uri, user_pass_tuple
        )
        root = ET.fromstring(entries_xml)
        links = select_elements_of_tag(
            root, "{http://www.w3.org/2005/Atom}link"
        )
        blog_entries_uri = return_next_entry_list_uri(links)
        entries = select_elements_of_tag(
            root, "{http://www.w3.org/2005/Atom}entry"
        )
        for entry in entries:
            if is_draft(entry):
                continue
            oldest_published_date = return_published_date(entry)
            if is_in_period(
                oldest_published_date, date_range_start, date_range_end
            ):
                target_entries.append(entry)
        print(
            f"Fetched articles up to {oldest_published_date} "
            f"({len(target_entries)} in total)"
        )

    output_path.mkdir(parents=True, exist_ok=True)
    for entry in target_entries:
        id_ = return_id(entry)
        file_path = output_path / f"{id_}.txt"
        contents = return_contents(entry)
        with open(file_path, "w") as fout:
            fout.write(contents)
$ python main.py nikkie-ftnext nikkie-ftnext.hatenablog.com 2019 --output output/2019
Fetched articles up to 2019-10-30 11:25:23+09:00 (9 in total)
Fetched articles up to 2019-06-13 10:18:36+09:00 (18 in total)
Fetched articles up to 2019-03-30 13:52:19+09:00 (27 in total)
Fetched articles up to 2018-12-23 10:24:06+09:00 (32 in total)
# -> 32 text files are created under output/2019 (each containing a blog post I wrote)
In this way, I was able to retrieve the blog articles I wrote via Hatena Blog AtomPub!
When I do this again, I would like to investigate and incorporate the following:

- It seems AtomPub only lets you fetch your own blog
    - To target someone else's blog, you would probably need to become a member of that blog
    - Or scrape it, after checking robots.txt?
- Parsing XML with namespaces
    - https://docs.python.org/ja/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
    - There was no need to spell out tags like {http://www.w3.org/2005/Atom}link (it can be written DRY as in the documentation)
- beautifulsoup4 could also be tried for XML parsing
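For reference, the namespace-dict style from the documentation, sketched on a made-up feed:

```python
import xml.etree.ElementTree as ET

feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>"""

root = ET.fromstring(feed_xml)
# Map a prefix of our choosing to the Atom namespace once...
ns = {"atom": "http://www.w3.org/2005/Atom"}
# ...then use it instead of spelling out {http://www.w3.org/2005/Atom} each time
titles = [e.find("atom:title", ns).text
          for e in root.findall("atom:entry", ns)]
print(titles)  # ['First post', 'Second post']
```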
That's all.