[PYTHON] Have Aozora Bunko read slowly: Code improvement

Last time, I managed to have a text read aloud slowly by registering its readings in the SofTalk dictionary. This time, I improve the previous code while studying the official documentation and related articles.

Getting the URLs from the book card

Improved code

aozora_urlsplit.py


import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def aozoraurl(base_url):
    page = requests.get(base_url)
    soup = BeautifulSoup(page.text, 'lxml')

    xhtml_relative_url = soup.select('table.download a[href*=".html"]')[0].get('href')
    zipdl_relative_url = soup.select('table.download a[href*=".zip"]')[0].get('href')

    xhtml_url = urljoin(base_url, xhtml_relative_url)
    zipdl_url = urljoin(base_url, zipdl_relative_url)

    return xhtml_url, zipdl_url

if __name__ == '__main__':
    print(aozoraurl("https://www.aozora.gr.jp/cards/000879/card85.html"))

Beautiful Soup 4

"Beautiful Soup is a Python library that extracts data from HTML and XML files." From the Beautiful Soup 4.2.0 documentation (Japanese translation)

Pass an HTML document to the BeautifulSoup constructor to parse its structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'lxml')

'lxml' specifies the parser. You can choose among html.parser from the Python standard library ("batteries included"), the fast lxml, and the very lenient html5lib. The official documentation recommends lxml:

"If you can, I recommend you install and use lxml for speed." From the Beautiful Soup 4.9.0 documentation
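As a quick sketch of what the parser argument looks like in practice: the parser name is just the second argument to the constructor, so swapping between html.parser and lxml is a one-word change (the HTML snippet below is made up for illustration).

```python
from bs4 import BeautifulSoup

html = "<p class='strikeout body'>example</p>"

# The second argument names the parser. 'html.parser' ships with
# Python itself; 'lxml' and 'html5lib' are separate installs.
soup = BeautifulSoup(html, "html.parser")
print(soup.p.get_text())  # → example
```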

Requests

"Requests is an Apache2 Licensed HTTP library, written in Python, designed to be user-friendly." From Requests: HTTP for Humans

Passing a URL to requests.get() returns a Response object. Use .text or .content to get the body of the Response.

From the official document


import requests
r = requests.get(url)  # url: the page to fetch
r.content  # binary data (bytes)
r.text     # decoded text (str)

I was not sure about the difference between binary and text, but according to this answer[^stack], use .text for HTML and XML, and .content for images and PDFs.
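The distinction can be illustrated without any network access: .content holds the raw bytes of the response body, while .text is those same bytes decoded into a str using the detected encoding. The snippet below mimics that relationship with a plain string (no real Response object involved).

```python
# Illustration of the bytes-vs-str distinction behind .content / .text:
raw = "青空文庫".encode("utf-8")   # what r.content would give: bytes
text = raw.decode("utf-8")         # what r.text would give: str

print(type(raw))   # <class 'bytes'>
print(type(text))  # <class 'str'>
```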

Getting the two URLs (XHTML / zip)

There are many ways to get elements, but I think CSS selectors are the most versatile.

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

From the official document


css_soup.select("p.strikeout.body") 
# [<p class="body strikeout"></p>]

Extract each URL from the Aozora Bunko book card page. Note that soup.select returns a list.

xhtml_url = soup.select('table.download a[href*=".html"]')[0].get('href')
zip_url   = soup.select('table.download a[href*=".zip"]')[0].get('href')
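Because select() always returns a list even for a single match, the [0] index above is needed. A minimal sketch with a stand-in for the download table (the real page's markup may differ) also shows select_one(), which returns the first match directly:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the book card's download table.
html = '''
<table class="download">
  <tr><td><a href="./files/85_224.html">XHTML</a></td></tr>
  <tr><td><a href="./files/85_ruby_224.zip">zip</a></td></tr>
</table>
'''
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of matches, hence the [0] in the main code.
links = soup.select('table.download a[href*=".html"]')
print(links[0].get("href"))  # → ./files/85_224.html

# select_one() returns the first match directly (or None if absent).
one = soup.select_one('table.download a[href*=".zip"]')
print(one.get("href"))       # → ./files/85_ruby_224.zip
```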

urljoin

Links from the Aozora Bunko book card page to each file are relative links. Use `urljoin` to combine them with the base URL and turn them into absolute links.

urllib.parse.urljoin(base, url, allow_fragments=True) "Construct a full ('absolute') URL by combining a 'base URL' (base) with another URL (url)." From urllib.parse --- Parse URLs into components

From the official document


>>> from urllib.parse import urljoin
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl/%7Eguido/FAQ.html'
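Applied to this script's case, a relative link from the book card resolves against the card's directory, and an already-absolute URL passes through unchanged (the file name below is hypothetical):

```python
from urllib.parse import urljoin

base = "https://www.aozora.gr.jp/cards/000879/card85.html"

# A relative link is resolved against the base URL's directory:
print(urljoin(base, "./files/85_224.html"))
# → https://www.aozora.gr.jp/cards/000879/files/85_224.html

# An already-absolute URL is returned unchanged:
print(urljoin(base, "https://example.com/a.zip"))
# → https://example.com/a.zip
```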

return a, b

To return multiple values (here, the two URLs), the simplest way seems to be to write them like this:

"In Python, simply return values separated by commas to return multiple strings or numbers at once." From How to return multiple return values from a Python function | note.nkmk.me

In this case, the return value will be a tuple.
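A minimal sketch of this packing-and-unpacking behavior (the function and its values are made up):

```python
def two_values():
    # Comma-separated values after return are packed into one tuple.
    return "xhtml.html", "ruby.zip"

result = two_values()
print(type(result))  # → <class 'tuple'>

# The caller can also unpack the tuple directly into two names:
xhtml_url, zipdl_url = two_values()
print(xhtml_url)     # → xhtml.html
```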

Extracting ruby from the XHTML page

Improved code

aozora_rubylist.py


from bs4 import BeautifulSoup
import requests
import jaconv

def aozoraruby(xhtml_url):
    page = requests.get(xhtml_url)
    soup = BeautifulSoup(page.content, 'lxml')

    _ruby = soup.select("ruby")

    _rubylist = [i.get_text().replace(")", "").split("(") for i in _ruby]

    for index, item in enumerate(_rubylist):
        _rubylist[index][1] = jaconv.kata2hira(item[1])
    
    return _rubylist

if __name__ == '__main__':
    print(aozoraruby("https://www.aozora.gr.jp/cards/000119/files/624_14544.html"))

Extraction of ruby

Extract the text under the ruby tags from the Aozora Bunko XHTML body. I think .select("ruby") is the easiest way.

Check the retrieved contents first. You can extract the string part with .get_text(), but it cannot be applied to the list as a whole, so loop over it with a for statement.

_ruby = soup.select("ruby")
for i in _ruby:
    print(i.get_text())

#Heavy
#Corpse
#Mariya
# ︙

The output has the form Kanji(ruby).

Creating a list of ruby

Turn the retrieved ruby into a list. A flat list like ["Kanji(ruby)", "Kanji(ruby)", ...] is hard to work with, so create a two-dimensional list of the form [["Kanji", "Reading"], ["Kanji", "Reading"], ...].

When building a list with a for statement, I used to create an empty list first and append elements to it. However, a **list comprehension** lets you do the same thing in a single expression[^naihou1][^naihou2].

x = [i.get_text().replace(")", "").split("(") for i in _ruby]
# [['Heavy', 'Umbrella'], ['Corpse', 'Challenging'], ['Mariya', 'Mary'],...]
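To make the equivalence concrete, here is the same transformation written both ways, using made-up .get_text() results in the Kanji(ruby) form:

```python
# Hypothetical strings in the Kanji(ruby) form returned by .get_text():
ruby_texts = ["山羊(やぎ)", "硝子(がらす)"]

# for-loop version: create an empty list first, then append to it.
pairs_loop = []
for s in ruby_texts:
    pairs_loop.append(s.replace(")", "").split("("))

# The same thing as a one-line list comprehension.
pairs_comp = [s.replace(")", "").split("(") for s in ruby_texts]

print(pairs_comp)  # → [['山羊', 'やぎ'], ['硝子', 'がらす']]
```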

Converting to hiragana

To register readings in the SofTalk dictionary, the katakana readings need to be converted to hiragana. According to this Qiita article, jaconv seems well suited for this.

"jaconv (Japanese Converter) performs fast conversion between hiragana, katakana, and full-width/half-width characters. Because it is implemented in pure Python, it can be used even in environments without a C compiler." From jaconv/README_JP.rst at master · ikegami-yukino/jaconv · GitHub

for index, item in enumerate(_rubylist):
    _rubylist[index][1] = jaconv.kata2hira(item[1])
# [['Heavy', 'Umbrella'], ['Corpse', 'Challenging'], ['Mariya', 'Mariya'],...]
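For intuition about what kata2hira does under the hood: in Unicode, the katakana block sits at a fixed offset above the hiragana block, so most characters convert by simple arithmetic. The sketch below illustrates only that principle and is not jaconv's actual implementation (jaconv also handles edge cases such as half-width katakana).

```python
def kata2hira_sketch(s: str) -> str:
    # Katakana ァ (U+30A1) through ヶ (U+30F6) lies exactly 0x60 code
    # points above the corresponding hiragana, so a fixed offset
    # converts those characters; everything else passes through.
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in s)

print(kata2hira_sketch("マリヤ"))  # → まりや
```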

[^naihou1]: Write list processing concisely with comprehensions - Qiita https://qiita.com/tag1216/items/040e482f9844805bce7f

[^naihou2]: How to use list comprehensions | note.nkmk.me https://note.nkmk.me/python-list-comprehension/
