Last time, I managed to read a text aloud while registering its readings in the SofTalk dictionary. This time, I improve the previous code while studying the official documentation and related articles.
aozora_urlsplit.py
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def aozoraurl(base_url):
    page = requests.get(base_url)
    soup = BeautifulSoup(page.text, 'lxml')
    xhtml_relative_url = soup.select('table.download a[href*=".html"]')[0].get('href')
    zipdl_relative_url = soup.select('table.download a[href*=".zip"]')[0].get('href')
    xhtml_url = urljoin(base_url, xhtml_relative_url)
    zipdl_url = urljoin(base_url, zipdl_relative_url)
    return xhtml_url, zipdl_url


if __name__ == '__main__':
    print(aozoraurl("https://www.aozora.gr.jp/cards/000879/card85.html"))
Beautiful Soup 4
Beautiful Soup is a Python library for pulling data out of HTML and XML files. (From the Beautiful Soup 4.2.0 documentation, Japanese translation)
Pass an HTML document to the BeautifulSoup constructor to parse its structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'lxml')
'lxml' specifies the parser. You can choose among html.parser from the Python standard library (batteries included), the fast lxml, and the very lenient html5lib.
The official documentation recommends lxml:
"If you can, I recommend you install and use lxml for speed." (From the Beautiful Soup 4.9.0 documentation)
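As a quick illustration of the parser choice (a sketch: only html.parser ships with Python; lxml and html5lib must be installed separately):

```python
from bs4 import BeautifulSoup

# Deliberately malformed HTML: a lenient parser still recovers the text
html = "<p><b>unclosed bold"

# 'html.parser' needs no extra install; swap in 'lxml' or 'html5lib' if available
soup = BeautifulSoup(html, "html.parser")
print(soup.p.get_text())  # -> unclosed bold
```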
Requests
Requests is an Apache2-licensed HTTP library, written in Python, designed to be friendly to humans. (From Requests: HTTP for Humans)
Passing a URL to requests.get() returns a Response object. You can use .text and .content to get the contents of the Response.
From the official document
import requests

r = requests.get(URL)
r.content  # Get binary data
r.text     # Get text data
I wasn't sure of the difference between binary and text, but according to this answer[^stack], use .text for HTML and XML, and .content for images and PDFs.
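In short, .content is the raw bytes of the response body and .text is those bytes decoded into a str using the response's encoding. A network-free sketch of the relationship (the sample string is my own):

```python
# Stand-in for r.content: the raw bytes of a response body
raw = "青空文庫".encode("utf-8")

# Stand-in for r.text: those bytes decoded with the response encoding
decoded = raw.decode("utf-8")

print(type(raw).__name__, type(decoded).__name__)  # -> bytes str
```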
There are many ways to get elements, but I think CSS selectors are the most versatile.
If you want to search for tags that match two or more CSS classes, you should use a CSS selector:
From the official document
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
Extract each URL from the Aozora Bunko library card page
Note that the return value of soup.select is a list
xhtml_url = soup.select('table.download a[href*=".html"]')[0].get('href')
zip_url = soup.select('table.download a[href*=".zip"]')[0].get('href')
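To see the list return value without hitting the network, here is a sketch against a minimal stand-in for the card page's download table (the HTML and file paths are made up for illustration):

```python
from bs4 import BeautifulSoup

# A minimal stand-in for the Aozora Bunko card page (structure assumed)
html = """
<table class="download">
  <tr><td><a href="./files/card85.html">XHTML</a></td></tr>
  <tr><td><a href="./files/85_ruby.zip">zip</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select('table.download a[href*=".zip"]')
print(isinstance(links, list))  # select() always returns a list (ResultSet)
print(links[0].get("href"))     # -> ./files/85_ruby.zip
```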
urljoin
Links from the Aozora Bunko library card page to each page are written as relative links. Use `urljoin` to combine them with the original URL and produce absolute links.
urllib.parse.urljoin(base, url, allow_fragments=True): Construct a full ('absolute') URL by combining a 'base URL' (base) with another URL (url). (From urllib.parse --- Parse URLs into components)
From the official document
>>> from urllib.parse import urljoin
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl/%7Eguido/FAQ.html'
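Applied to the card page, the combination looks like this (the relative path here is illustrative, not taken from the actual page):

```python
from urllib.parse import urljoin

base = "https://www.aozora.gr.jp/cards/000879/card85.html"
# A relative link of the kind found in the download table (path made up)
print(urljoin(base, "./files/85_ruby.zip"))
# -> https://www.aozora.gr.jp/cards/000879/files/85_ruby.zip
```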
return a, b
Returning multiple values (the two URLs in this case) seems simplest written like this:
In Python, you can simply return comma-separated values to return multiple strings or numbers at once. (From How to return multiple return values with a Python function | note.nkmk.me)
In this case, the return value will be a tuple.
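A minimal check that comma-separated returns really produce a tuple (dummy values stand in for the two URLs):

```python
def pair():
    # Comma-separated values are packed into a single tuple
    return "xhtml_url", "zipdl_url"

result = pair()
print(type(result))  # -> <class 'tuple'>

# The tuple can also be unpacked on assignment
xhtml_url, zipdl_url = pair()
print(xhtml_url, zipdl_url)
```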
aozora_rubylist.py
from bs4 import BeautifulSoup
import requests
import jaconv


def aozoraruby(xhtml_url):
    page = requests.get(xhtml_url)
    soup = BeautifulSoup(page.content, 'lxml')
    _ruby = soup.select("ruby")
    _rubylist = [i.get_text().replace(")", "").split("(") for i in _ruby]
    for index, item in enumerate(_rubylist):
        _rubylist[index][1] = jaconv.kata2hira(item[1])
    return _rubylist


if __name__ == '__main__':
    print(aozoraruby("https://www.aozora.gr.jp/cards/000119/files/624_14544.html"))
Extract the text inside the ruby tags from the Aozora Bunko XHTML body
I think .select("ruby") is the easiest way.
Check the acquired contents for now. You can extract the string part with .get_text(), but it cannot be applied to a list directly, so loop over it with a for statement.
_ruby = soup.select("ruby")
for i in _ruby:
    print(i.get_text())
# Heavy
# Corpse
# Mariya
# ︙
The output is in the form Kanji(ruby).
List the acquired ruby. A flat list like ["Kanji(ruby)", "Kanji(ruby)", ...] is hard to work with, so build a two-dimensional list of the form [["Kanji", "Yomigana"], ["Kanji", "Yomigana"], ...].
When building a list with a for statement, I used to create an empty list beforehand and append elements to it inside the loop. It turns out the whole thing can be written in one expression using a **list comprehension**[^naihou1][^naihou2].
x = [i.get_text().replace(")", "").split("(") for i in _ruby]
# [['Heavy', 'Umbrella'], ['Corpse', 'Challenging'], ['Mariya', 'Mary'],...]
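The same transform can be checked on plain strings, without fetching a page (the sample ruby texts are made up, using the same half-width parentheses the code above expects):

```python
# Sample <ruby> texts as .get_text() would return them: "Kanji(reading)"
texts = ["重(おも)", "屍体(したい)", "マリヤ(マリヤ)"]

rubylist = [t.replace(")", "").split("(") for t in texts]
print(rubylist)
# -> [['重', 'おも'], ['屍体', 'したい'], ['マリヤ', 'マリヤ']]
```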
Convert the katakana readings to hiragana before registering them in the SofTalk dictionary. According to this Qiita article, jaconv seems well suited to the job.
jaconv (Japanese Converter) performs fast conversion between hiragana, katakana, and full-width and half-width characters. Since it is implemented in pure Python, it can be used even in environments without a C compiler. (From jaconv/README_JP.rst at master · ikegami-yukino/jaconv · GitHub)
for index, item in enumerate(_rubylist):
_rubylist[index][1] = jaconv.kata2hira(item[1])
# [['Heavy', 'Umbrella'], ['Corpse', 'Challenging'], ['Mariya', 'Mariya'],...]
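If jaconv is unavailable, the core of kata2hira can be sketched with str.translate, since each katakana code point sits exactly 0x60 above its hiragana counterpart (a simplification; jaconv also handles edge cases and options this sketch ignores):

```python
# Katakana ァ (U+30A1) .. ヶ (U+30F6) map to hiragana by subtracting 0x60
KATA2HIRA = {code: code - 0x60 for code in range(0x30A1, 0x30F7)}

def kata2hira(text):
    # Characters outside the table (kanji, ASCII) pass through unchanged
    return text.translate(KATA2HIRA)

print(kata2hira("マリヤ"))  # -> まりや
```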
[^naihou1]: Writing list comprehensions concisely --Qiita https://qiita.com/tag1216/items/040e482f9844805bce7f
[^naihou2]: How to use list comprehensions | note.nkmk.me https://note.nkmk.me/python-list-comprehension/