[PYTHON] Have Yukkuri read Aozora Bunko aloud: more accurately

Last time I succeeded in having Yukkuri (SofTalk) read Aozora Bunko texts aloud using Python. However, because I ignored ruby (furigana) annotations, the reading accuracy was quite low. This time I fix that.

Overall flow

What I want is this: **given the URL of an Aozora Bunko book card, register the readings in the dictionary and then have Yukkuri read the text aloud**. To do that, I build the following flow.

Untitled Diagram.png

Get the file URLs from the book card URL

Each work on Aozora Bunko has a page called a book card (図書カード), which links to the web page containing the text and offers the various files for download.

tosyoka-do.PNG

Of the files outlined in red, I want the zip and the XHTML this time, so I use BeautifulSoup4 to get their URLs.

aozora_urlsplit.py


from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

def aozoraurl(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # First link in the download table whose href contains ".html" (XHTML version)
    html_link = soup.select_one('table.download a[href*=".html"]')
    # First link in the download table whose href contains ".zip" (text version)
    zip_link = soup.select_one('table.download a[href*=".zip"]')

    # The hrefs are relative to the book card page, so resolve them against its URL
    htmlurl = urljoin(url, html_link['href'])
    zipurl = urljoin(url, zip_link['href'])

    return [htmlurl, zipurl]

if __name__ == '__main__':
    print(aozoraurl("https://www.aozora.gr.jp/cards/000879/card85.html"))

Given the URL of a book card page, it returns the list [XHTML URL, zip URL].
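The URL assembly step can be checked offline. The sketch below assumes the download table's links are relative paths of the form `./files/...`; the helper name `resolve_by_concat` is made up for illustration. It compares cutting off the `/cardNNNN.html` part and concatenating by hand with the standard library's `urljoin`, which handles the same case.

```python
import re
from urllib.parse import urljoin

CARD_URL = "https://www.aozora.gr.jp/cards/000879/card3813.html"
HREF = "./files/3813_27308.html"  # assumed form of the relative link on the card


def resolve_by_concat(card_url, href):
    """Cut '/cardNNNN.html' off the card URL and append the relative path."""
    base = re.sub(r"/card[0-9]*\.html$", "", card_url)
    return base + href.replace("./", "/", 1)


# Both approaches should produce the same absolute URL
print(resolve_by_concat(CARD_URL, HREF))
print(urljoin(CARD_URL, HREF))
```

`urljoin` has the advantage that it also copes with hrefs that are already absolute or that use `../`.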

Create a list of ruby

Build a list of [word, reading] pairs from the ruby annotations in the XHTML.

aozora_rubylisting.py


from bs4 import BeautifulSoup
import requests
import re
import jaconv

def aozoraruby(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    _ruby = soup.select("ruby")

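    _rubylist = []
    for i in _ruby:
        # Strip attribute-less tags (<ruby>, <rb>, <rp>, <rt>), leaving "word（reading）",
        # then split on the full-width parentheses that Aozora Bunko puts in <rp>
        _ruby_split = re.sub("<[/a-z]*>", "", str(i)).replace("）", "").split("（")
        _rubylist.append(_ruby_split)

    # Readings must be hiragana, so convert any katakana
    for index, item in enumerate(_rubylist):
        _rubylist[index][1] = jaconv.kata2hira(str(item[1]))

    return _rubylist

if __name__ == '__main__':
    print(aozoraruby("https://www.aozora.gr.jp/cards/000879/files/3813_27308.html"))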

SofTalk only accepts readings written in hiragana when registering words, so katakana readings are converted to hiragana.

This was my first time using `enumerate`, and I like how neat it is.
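To see what kind of list this step produces without fetching a page, here is a standard-library-only sketch. The sample markup is an assumption modeled on the `<ruby><rb>…</rb><rp>（</rp><rt>…</rt><rp>）</rp></ruby>` pattern, and `kata2hira_simple` is a hypothetical stand-in for `jaconv.kata2hira` that only shifts the basic katakana block.

```python
import re

# Assumed sample of ruby markup; not taken from a real file
SAMPLE = (
    "<ruby><rb>下人</rb><rp>（</rp><rt>げにん</rt><rp>）</rp></ruby>が"
    "<ruby><rb>蟋蟀</rb><rp>（</rp><rt>キリギリス</rt><rp>）</rp></ruby>"
)


def kata2hira_simple(text):
    """Shift katakana U+30A1..U+30F6 down by 0x60 to the matching hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def ruby_pairs(xhtml):
    """Return [word, reading] pairs, with readings converted to hiragana."""
    pairs = []
    for word, reading in re.findall(r"<rb>(.*?)</rb>.*?<rt>(.*?)</rt>", xhtml):
        pairs.append([word, kata2hira_simple(reading)])
    return pairs


print(ruby_pairs(SAMPLE))
```

Characters outside the shifted range, such as the long vowel mark ー, are left unchanged.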

Create the body text from the txt file

As last time, I borrow the code from the article "I want to get the text from Aozora Bunko in Python" (AI Artificial Intelligence Technology).

Pass the zip URL to the download() function and its result to the convert() function, and the body text is returned. The main() function is not used, so move it under `if __name__ == '__main__':`.

Yukkuri dictionary registration and reading aloud

aozora_yukkuri.py


import subprocess
import os

from aozora_urlsplit import aozoraurl
from aozora_rubylisting import aozoraruby
from aozora_honbun import download, convert

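# Run from the directory that contains this script
os.chdir(os.path.dirname(os.path.abspath(__file__)))

urllist = aozoraurl("https://www.aozora.gr.jp/cards/000879/card3813.html")

# [[word, reading], ...] taken from the XHTML version
_ruby = aozoraruby(urllist[0])
# Body text taken from the zip, one list item per line
_honbun = convert(download(urllist[1])).splitlines()

# Placeholder: replace "SofTalk.exe Path" with the path to your SofTalk.exe
_start = "start SofTalk.exe Path"
_pron  = "/P:"      # dictionary registration, used as /P:reading,word,True
_speed = "/S:120"   # reading speed
_word  = "/W:"      # text to read aloud

# Register every ruby reading in the dictionary first
for j in _ruby:
    _command_ruby = [_start, _pron + j[1] + "," + j[0] + ",True"]
    print(_command_ruby)
    subprocess.run(' '.join(_command_ruby), shell=True)

# Then read the body aloud, one line at a time
for i in _honbun:
    _command = [_start, _speed, _word + i]
    subprocess.run(' '.join(_command), shell=True)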

The body text is split into a list at each line break, and the script then does everything from dictionary registration to reading the text aloud.

To change the Yukkuri voice, add /M: or /MN: to _command (see [^dai1kai]).
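Because the script joins everything into one string and runs it through the shell, a line containing spaces or characters such as `&` could be misread by the shell. As a hedged alternative, the sketch below builds each command as an argument list. The path in `SOFTALK_EXE` and the function names are hypothetical, and how SofTalk behaves when launched directly rather than through `start` is an assumption to check on your own machine.

```python
import subprocess

SOFTALK_EXE = r"C:\path\to\SofTalk.exe"  # hypothetical path, replace with your own


def register_command(word, reading, exe=SOFTALK_EXE):
    """Arguments for dictionary registration: /P:reading,word,True"""
    return [exe, f"/P:{reading},{word},True"]


def speak_command(line, speed=120, exe=SOFTALK_EXE):
    """Arguments for reading one line aloud at the given speed."""
    return [exe, f"/S:{speed}", f"/W:{line}"]


def run_all(ruby, lines, runner=subprocess.run):
    """Register every [word, reading] pair, then read the non-empty lines."""
    for word, reading in ruby:
        runner(register_command(word, reading))
    for line in lines:
        if line.strip():  # skip blank lines
            runner(speak_command(line))


# Only print the commands here; call run_all(...) to actually execute them
print(register_command("下人", "げにん"))
print(speak_command("ある日の暮方の事である。"))
```

Passing `runner=print` to `run_all` gives a dry run that shows every command without starting SofTalk.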

Known issues

The body text and ruby come from different sources

The body is taken from the txt (zip) version and the ruby from the XHTML version, because I wanted to reuse the existing code. In the future I think the body should also come from the XHTML version (extracting ruby from the txt version looks quite difficult).
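As a rough idea of what taking the body from XHTML could look like, here is a minimal sketch using only the standard library's `html.parser`. It is not the code used in this article: the class name, the function name, and the sample fragment are all assumptions. It keeps the base text, drops everything inside `<rt>` and `<rp>`, and turns `<br />` into a line break.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collect text while skipping ruby readings (<rt>) and parentheses (<rp>)."""

    SKIP = {"rt", "rp"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def body_text(xhtml):
    """Return the text of an XHTML fragment with ruby readings removed."""
    parser = BodyTextParser()
    parser.feed(xhtml)
    parser.close()
    return "".join(parser.parts)


# Assumed sample fragment; not taken from a real file
SAMPLE = (
    "ある日の<ruby><rb>暮方</rb><rp>（</rp><rt>くれがた</rt><rp>）</rp></ruby>"
    "の事である。<br />一人の下人が待っていた。"
)
print(body_text(SAMPLE).splitlines())
```

A real page would still need the header, footer, and bibliographic notes trimmed away before the text is read aloud.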

Kanji that fail to be registered in the dictionary

On Aozora Bunko, difficult kanji like the one in the image below appear to be displayed as images. These cannot be registered as words, so it would be better to skip dictionary registration for them and write them into the body text directly as hiragana. This time I left them unprocessed, because I decided the reading was good enough as it is.

kakuryaku.PNG Kanji in question
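One way to at least skip such entries is sketched below. It assumes the ruby list was built with the regular expression shown earlier, which strips only attribute-less tags, so an image-based kanji would leave an `<img ...>` tag behind in the word. The function name and the sample entry are hypothetical.

```python
def split_registrable(rubylist):
    """Separate entries that can be registered from those containing markup."""
    ok, skipped = [], []
    for word, reading in rubylist:
        # Leftover markup or an empty word means there is no real text to register
        if "<" in word or not word.strip():
            skipped.append([word, reading])
        else:
            ok.append([word, reading])
    return ok, skipped


# Hypothetical sample: the second entry stands for a kanji shown as an image
sample = [
    ["下人", "げにん"],
    ['<img alt="gaiji" src="example.png"/>', "かくりゃく"],
]
ok, skipped = split_registrable(sample)
print(ok)
print(len(skipped))
```

The skipped readings could then be substituted into the body text as hiragana, as suggested above.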

[^dai1kai]: "Let's make Yukkuri speak with Python" https://qiita.com/Mechanetai/items/78b04ed553cce01fa081
