Call the Hatena Blog API from Python and save your blog articles individually on your PC

Overview

As a look back at the Hatena Blog posts I wrote in 2019, I decided to create a WordCloud from my blog articles [^1]. After a little research, I found an article that calls the blog's API from JavaScript, so I tried to do the same in Python. This article summarizes what I learned through hands-on experimentation.

[^1]: Update the link after publication

The scripts covered in this article do the following:

- Get a list of the Hatena Blog articles you wrote
- Save each of your Hatena Blog articles to a separate file on your computer

There are two topics to discuss:

  1. Preparation before calling the API of Hatena Blog
  2. How to analyze the API response (XML) of Hatena Blog

Operating environment

$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.14.4
BuildVersion:	18E226
$ python -V
Python 3.7.3  #I am creating a virtual environment with the venv module
$ pip list | grep requests
requests    2.22.0

Preparation: Until the API of Hatena Blog is called

Preparation consists of two steps.

- Learn how to parse XML in Python
- Call the API using your credentials

First of all, about the API of Hatena Blog

Hatena Blog AtomPub | http://developer.hatena.ne.jp/ja/documents/blog/apis/atom

An API using the Atom Publishing Protocol (AtomPub) is open to the public.

To use Hatena Blog AtomPub, the client must perform OAuth authentication, WSSE authentication, or Basic authentication.

This time, I implemented it with **Basic authentication**.

As described in "Getting a list of blog entries" in the document above, the **response is returned as XML**.

Preparation 1. Know how to parse XML with Python

I used the standard module ElementTree XML API (https://docs.python.org/ja/3/library/xml.etree.elementtree.html).

The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data.

Let's look at XML manipulation with a simple example. I performed the same operations on the Hatena Blog AtomPub response.

practice.py


import xml.etree.ElementTree as ET

sample_xml_as_string = """<?xml version="1.0"?>
<data>
    <member name="Kumiko">
        <grade>2</grade>
        <instrument>euphonium</instrument>
    </member>
    <member name="Kanade">
        <grade>1</grade>
        <instrument>euphonium</instrument>
    </member>
    <member name="Mizore">
        <grade>3</grade>
        <instrument>oboe</instrument>
    </member>
</data>"""

root = ET.fromstring(sample_xml_as_string)

Specify the -i option [^2] when running the script with the Python interpreter. This drops into interactive mode after the script finishes (the intent is to avoid having to type the XML string in interactive mode).

$ python -i practice.py
>>> root
<Element 'data' at 0x108ac4f48>
>>> root.tag  # represents the <data> tag
'data'
>>> root.attrib  # the <data> tag has no attributes
{}
>>> for child in root:  # iterate over the <member> tags nested in <data>
...     print(child.tag, child.attrib)
...
member {'name': 'Kumiko'}  # each member has a 'name' attribute
member {'name': 'Kanade'}
member {'name': 'Mizore'}

In addition to iterating with a for statement, the nested structure of XML tags can be handled with the find and findall methods [^3].

- find: "Finds the first subelement matching the given tag"
- findall: "Finds all matching subelements among the direct children of the current element"

>>> # continued from above
>>> someone = root.find('member')
>>> print(someone.tag, someone.attrib)
member {'name': 'Kumiko'}  # the first matching child element (member)
>>> members = root.findall('member')
>>> for member in members:
...     print(member.tag, member.attrib)
...
member {'name': 'Kumiko'}  # all matching child elements (member) are retrieved
member {'name': 'Kanade'}
member {'name': 'Mizore'}
>>> for member in members:
...     instrument = member.find('instrument')
...     print(instrument.text)  # the text enclosed by the tags
...
euphonium
euphonium
oboe

This time, I parsed the Hatena Blog AtomPub response with find and findall.

Preparation 2. Call the API using your credentials

The information needed to call Hatena Blog AtomPub with Basic authentication can be found in Hatena Blog under Settings > Advanced Settings > AtomPub.

hatenablog_atompub_config.png

For Basic authentication, I used the `auth` argument of the `get` method of `requests` [^4].

import xml.etree.ElementTree as ET

import requests

blog_entries_url = "https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry"
user_pass_tuple = ("nikkie-ftnext", "the_api_key")
r = requests.get(blog_entries_url, auth=user_pass_tuple)
root = ET.fromstring(r.text)

Pass the XML string `r.text` to the `fromstring` function of `xml.etree.ElementTree`.

Analysis of API response of Hatena Blog

I proceeded with the XML analysis while comparing the actual response against "Getting a list of blog entries" at http://developer.hatena.ne.jp/ja/documents/blog/apis/atom.

In [20]: for child in root:  # root is the same as in the example above
    ...:     print(child.tag, child.attrib)
    ...:
{http://www.w3.org/2005/Atom}link {'rel': 'first', 'href': 'https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry'}
{http://www.w3.org/2005/Atom}link {'rel': 'next', 'href': 'https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry?page=1572227492'}  # next page of the article list
{http://www.w3.org/2005/Atom}title {} 
{http://www.w3.org/2005/Atom}subtitle {} 
{http://www.w3.org/2005/Atom}link {'rel': 'alternate', 'href': 'https://nikkie-ftnext.hatenablog.com/'} 
{http://www.w3.org/2005/Atom}updated {} 
{http://www.w3.org/2005/Atom}author {} 
{http://www.w3.org/2005/Atom}generator {'uri': 'https://blog.hatena.ne.jp/', 'version': '3977fa1b6c9f31b5eab4610099c62851'} 
{http://www.w3.org/2005/Atom}id {} 
{http://www.w3.org/2005/Atom}entry {}  # individual articles
{http://www.w3.org/2005/Atom}entry {} 
{http://www.w3.org/2005/Atom}entry {} 
{http://www.w3.org/2005/Atom}entry {} 
{http://www.w3.org/2005/Atom}entry {} 
{http://www.w3.org/2005/Atom}entry {} 
{http://www.w3.org/2005/Atom}entry {} 
{http://www.w3.org/2005/Atom}entry {} 
{http://www.w3.org/2005/Atom}entry {} 
{http://www.w3.org/2005/Atom}entry {} 
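The link element with rel="next" points at the next page of the article list, so following it repeatedly pages through all entries. A minimal sketch of extracting that URI (the feed string and example.com URLs here are made up for illustration):

```python
import xml.etree.ElementTree as ET

# hypothetical feed fragment: only the link elements matter here
feed = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<link rel="first" href="https://example.com/atom/entry"/>'
    '<link rel="next" href="https://example.com/atom/entry?page=1572227492"/>'
    "</feed>"
)

next_uri = None
for link in feed.findall("{http://www.w3.org/2005/Atom}link"):
    if link.attrib["rel"] == "next":
        next_uri = link.attrib["href"]  # URI of the next page of entries

print(next_uri)  # https://example.com/atom/entry?page=1572227492
```

The full script's return_next_entry_list_uri function uses the same logic.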

Let's also look at an individual article (the child elements of an entry tag).

>>> for item in child:  # child holds the last entry from the loop above
...     print(item.tag, item.attrib)
...
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}link {'rel': 'edit', 'href': 'https://blog.hatena.ne.jp/nikkie-ftnext/nikkie-ftnext.hatenablog.com/atom/entry/26006613457877510'}
{http://www.w3.org/2005/Atom}link {'rel': 'alternate', 'type': 'text/html', 'href': 'https://nikkie-ftnext.hatenablog.com/entry/rejectpy2019-plan-step0'}
{http://www.w3.org/2005/Atom}author {}
{http://www.w3.org/2005/Atom}title {}  # title (input for the WordCloud)
{http://www.w3.org/2005/Atom}updated {}
{http://www.w3.org/2005/Atom}published {}  # note: set even for drafts
{http://www.w3.org/2007/app}edited {}
{http://www.w3.org/2005/Atom}summary {'type': 'text'}
{http://www.w3.org/2005/Atom}content {'type': 'text/x-markdown'}  # body text
{http://www.hatena.ne.jp/info/xmlns#}formatted-content {'type': 'text/html'}
{http://www.w3.org/2005/Atom}category {'term': 'Speaker report'}
{http://www.w3.org/2007/app}control {}  # its child element indicates whether the entry is a draft

Since I want to build the WordCloud from published articles only, I check the draft child element of {http://www.w3.org/2007/app}control: if its text is yes, the entry is an unpublished draft and is skipped.
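As a sketch, the draft check looks like this (the entry XML below is a hypothetical stand-in; a real AtomPub entry has many more children):

```python
import xml.etree.ElementTree as ET

# hypothetical minimal entry, declaring the same namespaces as the real response
entry = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom" '
    'xmlns:app="http://www.w3.org/2007/app">'
    "<title>Sample post</title>"
    "<app:control><app:draft>yes</app:draft></app:control>"
    "</entry>"
)

draft_status = (
    entry.find("{http://www.w3.org/2007/app}control")
    .find("{http://www.w3.org/2007/app}draft")
    .text
)
print(draft_status)  # 'yes' -> unpublished draft, so skip it
```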

For the WordCloud text, I concatenated the title and the body.
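The published element holds an ISO 8601 timestamp with a +09:00 offset, which `datetime.fromisoformat` (Python 3.7+) parses directly; this is how the full script filters entries by year. The timestamp value here is made up for illustration:

```python
from datetime import datetime, timedelta, timezone

published = "2019-10-30T11:25:23+09:00"  # hypothetical value from a <published> element
dt = datetime.fromisoformat(published)

# check whether the entry falls within 2019, Japan Standard Time
jst = timezone(timedelta(hours=9))
start = datetime(2019, 1, 1, tzinfo=jst)
end = datetime(2020, 1, 1, tzinfo=jst)
print(start <= dt < end)  # True
```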

The whole code

Published in the following repositories: https://github.com/ftnext/hatenablog-atompub-python

main.py


import argparse
from datetime import datetime, timedelta, timezone
import os
from pathlib import Path
import xml.etree.ElementTree as ET

import requests


def load_credentials(username):
    """Returns the authentication information required for Hatena API access in tuple format"""
    auth_token = os.getenv("HATENA_BLOG_ATOMPUB_KEY")
    message = "Set the AtomPub API key in the environment variable `HATENA_BLOG_ATOMPUB_KEY`"
    assert auth_token, message
    return (username, auth_token)


def retrieve_hatena_blog_entries(blog_entries_uri, user_pass_tuple):
    """GET access to Hatena Blog API and return XML representing article list as a character string"""
    r = requests.get(blog_entries_uri, auth=user_pass_tuple)
    return r.text


def select_elements_of_tag(xml_root, tag):
    """Parses the return XML and returns all child elements with the specified tag"""
    return xml_root.findall(tag)


def return_next_entry_list_uri(links):
    """Returns the endpoint of the following blog article list"""
    for link in links:
        if link.attrib["rel"] == "next":
            return link.attrib["href"]


def is_draft(entry):
    """Determine if a blog post is a draft"""
    draft_status = (
        entry.find("{http://www.w3.org/2007/app}control")
        .find("{http://www.w3.org/2007/app}draft")
        .text
    )
    return draft_status == "yes"


def return_published_date(entry):
    """Returns the publication date of a blog post

It was a specification that was returned even in the case of draft
    """
    publish_date_str = entry.find(
        "{http://www.w3.org/2005/Atom}published"
    ).text
    return datetime.fromisoformat(publish_date_str)


def is_in_period(datetime_, start, end):
    """Determine if the specified date and time is included in the period from start to end"""
    return start <= datetime_ < end


def return_id(entry):
    """Returns the ID part contained in the URI of the blog"""
    link = entry.find("{http://www.w3.org/2005/Atom}link")
    uri = link.attrib["href"]
    return uri.split("/")[-1]


def return_contents(entry):
    """Connect and return the blog title and body"""
    title = entry.find("{http://www.w3.org/2005/Atom}title").text
    content = entry.find("{http://www.w3.org/2005/Atom}content").text
    return f"{title}。\n\n{content}"


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("hatena_id")
    parser.add_argument("blog_domain")
    parser.add_argument("target_year", type=int)
    parser.add_argument("--output", type=Path)
    args = parser.parse_args()

    hatena_id = args.hatena_id
    blog_domain = args.blog_domain
    target_year = args.target_year
    output_path = args.output if args.output else Path("output")

    user_pass_tuple = load_credentials(hatena_id)

    blog_entries_uri = (
        f"https://blog.hatena.ne.jp/{hatena_id}/{blog_domain}/atom/entry"
    )

    jst_tz = timezone(timedelta(seconds=9 * 60 * 60))
    date_range_start = datetime(target_year, 1, 1, tzinfo=jst_tz)
    date_range_end = datetime(target_year + 1, 1, 1, tzinfo=jst_tz)

    oldest_published_date = datetime.now(jst_tz)
    target_entries = []

    while date_range_start <= oldest_published_date:
        entries_xml = retrieve_hatena_blog_entries(
            blog_entries_uri, user_pass_tuple
        )
        root = ET.fromstring(entries_xml)

        links = select_elements_of_tag(
            root, "{http://www.w3.org/2005/Atom}link"
        )
        blog_entries_uri = return_next_entry_list_uri(links)

        entries = select_elements_of_tag(
            root, "{http://www.w3.org/2005/Atom}entry"
        )
        for entry in entries:
            if is_draft(entry):
                continue
            oldest_published_date = return_published_date(entry)
            if is_in_period(
                oldest_published_date, date_range_start, date_range_end
            ):
                target_entries.append(entry)
        print(f"{oldest_published_date}Get articles up to (all{len(target_entries)}Case)")

    output_path.mkdir(parents=True, exist_ok=True)

    for entry in target_entries:
        id_ = return_id(entry)
        file_path = output_path / f"{id_}.txt"
        contents = return_contents(entry)
        with open(file_path, "w") as fout:
            fout.write(contents)

Execution example

$ python main.py nikkie-ftnext nikkie-ftnext.hatenablog.com 2019 --output output/2019
Fetched articles up to 2019-10-30 11:25:23+09:00 (9 in total)
Fetched articles up to 2019-06-13 10:18:36+09:00 (18 in total)
Fetched articles up to 2019-03-30 13:52:19+09:00 (27 in total)
Fetched articles up to 2018-12-23 10:24:06+09:00 (32 in total)
# -> 32 text files are created under output/2019 (their contents are the blog posts I wrote)

In this way, I was able to retrieve the blog articles I wrote via Hatena Blog AtomPub!

Closing

If I do this again, I would like to investigate and incorporate the following:

- It seems that AtomPub only lets you fetch your own blog
  - To target someone else's blog, you probably need to become a member of that blog
  - Or scrape it after checking robots.txt?
- Parsing XML with namespaces
  - https://docs.python.org/ja/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
  - This would avoid spelling out prefixes like {http://www.w3.org/2005/Atom}link every time (it can be written DRY, as in the documentation)
- beautifulsoup4 could also be tried for XML parsing
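On the namespaces point: the ElementTree documentation shows a prefix-to-URI mapping that avoids spelling out the full URI in every find/findall call. A minimal sketch (the feed string here is made up):

```python
import xml.etree.ElementTree as ET

ns = {"atom": "http://www.w3.org/2005/Atom"}  # the prefix name is arbitrary

feed = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<link rel="next" href="https://example.com/atom/entry?page=2"/>'
    "<entry><title>post</title></entry>"
    "</feed>"
)

# 'atom:link' expands to '{http://www.w3.org/2005/Atom}link'
links = feed.findall("atom:link", ns)
print(links[0].attrib["rel"])  # next
```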

That's all.
