Last time I succeeded in having Aozora Bunko texts read aloud in the Yukkuri voice (SofTalk) using Python. However, because I read the text without taking ruby (the reading glosses attached to kanji) into account, the reading accuracy was quite low. This time I will fix that.
What I want to do: **given an Aozora Bunko URL, register the ruby readings in the reading dictionary and then have the text read aloud in the Yukkuri voice**. To do that, I will build the following flow.
Each work in Aozora Bunko has a page called the book card, from which you can jump to the web page containing the text and download various files.
This time I want the zip and XHTML files among the ones enclosed in red, so I will use BeautifulSoup4 to get their URLs.
aozora_urlsplit.py
from bs4 import BeautifulSoup
import requests
import re


def aozoraurl(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Get the relative path of the XHTML file (first matching link in the download table)
    htmlurl_rel = soup.select_one('table.download a[href*=".html"]')["href"].replace("./", "/")
    # Get the relative path of the zip file
    zipurl_rel = soup.select_one('table.download a[href*=".zip"]')["href"].replace("./", "/")

    # Cut off the "/card<N>.html" part of the input URL and append the relative paths above
    urlsplit = re.sub(r"/card[0-9]*\.html", "", url)
    htmlurl = urlsplit + htmlurl_rel
    zipurl = urlsplit + zipurl_rel
    aozoraurllist = [htmlurl, zipurl]
    return aozoraurllist


if __name__ == '__main__':
    print(aozoraurl("https://www.aozora.gr.jp/cards/000879/card85.html"))
If you pass in the URL of a book card page, it returns the list [XHTML URL, zip URL].
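As an alternative to cutting the URL apart with a regular expression, the relative paths could probably be resolved with the standard library's `urljoin`. The following is only a minimal sketch: the function name `resolve_download_urls` is made up for illustration, and the zip file name in the sample is a hypothetical value in the `./files/...` form that the code above expects.

```python
from urllib.parse import urljoin


def resolve_download_urls(card_url, hrefs):
    """Resolve hrefs found on a book card page against the card's own URL."""
    return [urljoin(card_url, href) for href in hrefs]


card_url = "https://www.aozora.gr.jp/cards/000879/card3813.html"
# The zip name below is a hypothetical example, not taken from the real page.
hrefs = ["./files/3813_27308.html", "./files/3813_example.zip"]
print(resolve_download_urls(card_url, hrefs))
```

Because `urljoin` handles `./` and the trailing `card<N>.html` by itself, this approach would not need the `replace("./", "/")` step.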
Create a list of ruby from XHTML
aozora_rubylisting.py
from bs4 import BeautifulSoup
import requests
import jaconv


def aozoraruby(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    _ruby = soup.select("ruby")

    # Build [base text, reading] pairs from the <rb> and <rt> children of each <ruby>
    _rubylist = []
    for i in _ruby:
        rb, rt = i.find("rb"), i.find("rt")
        if rb is None or rt is None:
            continue  # skip ruby elements that lack a base or a reading
        _rubylist.append([rb.get_text(), rt.get_text()])

    # Convert katakana readings to hiragana
    for index, item in enumerate(_rubylist):
        _rubylist[index][1] = jaconv.kata2hira(str(item[1]))
    return _rubylist


if __name__ == '__main__':
    print(aozoraruby("https://www.aozora.gr.jp/cards/000879/files/3813_27308.html"))
Since SofTalk only accepts readings registered in hiragana, katakana readings are converted to hiragana.
This was my first time using `enumerate`, and I found it neat and handy.
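To try the idea offline, here is a dependency-free sketch of the same two steps: pulling [word, reading] pairs out of ruby markup and converting katakana to hiragana. It is not the code used in this article. The names `ruby_pairs` and `kata2hira` are defined here for illustration, and the sample markup is an assumption about what the XHTML looks like.

```python
import re

# Assumed shape: <ruby><rb>base</rb><rp>（</rp><rt>reading</rt><rp>）</rp></ruby>
RUBY_RE = re.compile(r"<ruby><rb>(.*?)</rb>.*?<rt>(.*?)</rt>.*?</ruby>", re.S)


def kata2hira(text):
    """Shift katakana U+30A1..U+30F6 down by 0x60 to the matching hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def ruby_pairs(html):
    """Return [word, hiragana reading] pairs for every ruby element found."""
    return [[base, kata2hira(reading)] for base, reading in RUBY_RE.findall(html)]


sample = (
    "<ruby><rb>下人</rb><rp>（</rp><rt>げにん</rt><rp>）</rp></ruby>と"
    "<ruby><rb>麦酒</rb><rp>（</rp><rt>ビール</rt><rp>）</rp></ruby>"
)
print(ruby_pairs(sample))
```

The long-vowel mark `ー` lies outside the shifted range, so it is left unchanged.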
As last time, I borrow this part from "I want to get the text from Aozora Bunko in Python" (AI Artificial Intelligence Technology).
If you pass the URL to the download() function and hand its result to the convert() function, the body text is returned.
Since the main() function is not used, put it under `if __name__ == '__main__':`.
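The borrowed code is not reproduced here. To give a rough idea of what this kind of clean-up involves, the sketch below strips the usual Aozora plain-text markup: ruby in `《...》`, annotator notes in `［＃...］`, and the `｜` marker that shows where a ruby base begins. The function name `strip_aozora_markup` is hypothetical, and this is not the implementation of `convert()`.

```python
import re


def strip_aozora_markup(text):
    """Remove ruby readings and annotator notes from Aozora-format plain text."""
    text = re.sub(r"《[^》]*》", "", text)     # ruby readings, e.g. 下人《げにん》
    text = re.sub(r"［＃[^］]*］", "", text)  # annotator notes, e.g. ［＃...］
    text = text.replace("｜", "")             # marker for the start of a ruby base
    return text


sample = "一人の下人《げにん》が、｜羅生門《らしょうもん》の下で［＃注記の例］待っていた。"
print(strip_aozora_markup(sample))
```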
aozora_yukkuri.py
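import subprocess
import os
from aozora_urlsplit import aozoraurl
from aozora_rubylisting import aozoraruby
from aozora_honbun import download, convert

# Run from the directory that contains this script
os.chdir(os.path.dirname(os.path.abspath(__file__)))

urllist = aozoraurl("https://www.aozora.gr.jp/cards/000879/card3813.html")
_ruby = aozoraruby(urllist[0])                        # [word, reading] pairs from the XHTML
_honbun = convert(download(urllist[1])).splitlines()  # body text from the zip, one item per line

_start = "start SofTalk.exe Path"  # placeholder: put the path to SofTalk.exe here
_pron = "/P:"      # dictionary registration, used as /P:reading,word,True
_speed = "/S:120"  # reading speed
_word = "/W:"      # text to read aloud

# Register every ruby reading in the dictionary
for j in _ruby:
    _command_ruby = [_start, _pron + j[1] + "," + j[0] + ",True"]
    print(_command_ruby)
    subprocess.run(' '.join(_command_ruby), shell=True)

# Read the body text aloud line by line
for i in _honbun:
    _command = [_start, _speed, _word + i]
    subprocess.run(' '.join(_command), shell=True)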
The text is split into a list at each line break, and the script handles everything from dictionary registration to reading the text aloud.
If you want to change the Yukkuri voice, add /M: or /MN: to _command (reference [^ dai1kai]).
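The order of operations (register all readings first, then read line by line) can be sketched without launching SofTalk by building the argument lists only. The helper names and the `SOFTALK` placeholder are made up for illustration, and skipping blank lines is my own addition rather than something the script above does. Only the `/P:`, `/S:` and `/W:` options already used above appear here.

```python
SOFTALK = "SofTalk.exe"  # placeholder, replace with the real path


def register_args(word, reading):
    """Arguments for one dictionary entry, in the /P:reading,word,True form."""
    return ["/P:" + reading + "," + word + ",True"]


def speak_args(line, speed=120):
    """Arguments for reading one line of text at the given speed."""
    return ["/S:" + str(speed), "/W:" + line]


def plan(ruby_pairs, lines):
    """Return every command in order: registrations first, then the text."""
    commands = [[SOFTALK] + register_args(w, r) for w, r in ruby_pairs]
    commands += [[SOFTALK] + speak_args(l) for l in lines if l.strip()]
    return commands


for cmd in plan([["下人", "げにん"]], ["ある日の暮方の事である。", ""]):
    print(cmd)
```

Keeping each command as a list rather than a joined string avoids quoting problems when a line contains spaces. I have not checked whether SofTalk behaves the same when it is launched directly instead of through `start`.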
The body text is taken from the txt (zip) version and the ruby from the XHTML version because I wanted to reuse existing code. In the future I think both should be taken from the XHTML version (because it seems to be quite difficult to remove ruby from the txt).
In Aozora Bunko, difficult kanji like the one in the image below appear to be displayed as images. These are hard to read aloud, so a good approach would be to skip dictionary registration for them and write them into the text as hiragana. I did not handle this case this time, because I was satisfied as long as most of the text was read correctly.
Kanji in question
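One way this could be handled is to replace any ruby element whose base is an image with its reading before the text is extracted. The sketch below assumes that such a kanji appears as an `<img>` tag inside `<rb>`. I have not checked this against the actual files, so the pattern and the sample markup are hypothetical, as is the function name `inline_image_readings`.

```python
import re

# Assumed shape: <ruby><rb><img ...></rb><rp>（</rp><rt>reading</rt><rp>）</rp></ruby>
IMAGE_RUBY_RE = re.compile(
    r"<ruby><rb>\s*<img[^>]*>\s*</rb>.*?<rt>(.*?)</rt>.*?</ruby>", re.S
)


def inline_image_readings(html):
    """Replace ruby elements whose base is an image with the hiragana reading."""
    return IMAGE_RUBY_RE.sub(lambda m: m.group(1), html)


sample = (
    'A<ruby><rb><img src="example.png" alt="" /></rb>'
    "<rp>（</rp><rt>かささぎ</rt><rp>）</rp></ruby>B"
    "<ruby><rb>下人</rb><rp>（</rp><rt>げにん</rt><rp>）</rp></ruby>"
)
print(inline_image_readings(sample))
```

Ruby elements with an ordinary text base are left untouched, so they can still go through dictionary registration as before.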
[^ dai1kai]: Let's make a voice slowly with Python https://qiita.com/Mechanetai/items/78b04ed553cce01fa081