[PYTHON] Enhanced vocabulary with Cloud Text-to-Speech

Background

One day my son asked me, his father, a question.

Son: "Dad, how do you memorize English words?"
Dad: "Well, you read aloud in class at school, don't you? Then don't you practice reciting?"
Son: "Well, the kids who do it, do it ..."
Dad: "First become able to translate the textbook, then as the next step become able to say it without looking. That kind of training is effective."
Son: "That ... I can't recite."
Dad: "What?! If you recite it many times, you can remember at least a little, if not all of it, right?"
Son: "A little, but I get tired after doing it many times, so I can't get motivated ..."
Dad: "(Seriously, this is bad. He hates studying. orz)"

背景_図.png

Requirements analysis

"Learning English words" is a steep path that everyone has to climb unless they are native speakers or returnees. It seems that my son has already hit the wall. There are many ways to "remember words", so let's sort them out a bit.

The basic approach is to "read an English word and train yourself to translate it into Japanese". By repeating this over and over, the words should feel familiar by the time you have memorized them.

The following are typical ways to memorize words. Since my son is not good at this, option 1, memorizing individual English words, looks like the right direction.

1. Remember as individual English words
2. Remember in example sentences
3. Remember in a passage of text

The following methods aim to ingrain the words in the body rather than the brain.

1. Write and remember
2. Say aloud and remember
3. Listen and remember with a sound source such as a CD

After a lot of research, option 2, saying the words aloud, seems to be the best in terms of efficiency and practicality. However, my son also says that he can't recite, so option 2 is unlikely to be feasible either.

Option 1, writing, takes too much time. What remains is option 3, using a sound source, and it would be best to make the words stick as much as possible by listening during spare moments. But how should the sound sources be prepared? Many of the CDs bundled with vocabulary books contain only the English audio. There are sound sources that repeat English and Japanese as a set, but my son says he doesn't like the rhythm. For an entry such as "what to look for", he even complained that the "what" was too insistent.

What a monster customer!

要求分析_図.png

Vocabulary strengthening strategy

If other companies' products and commercially available products cannot be used, should I edit the sound source myself? No, even a father who is a globe-trotting engineer doesn't have that much time.

Right, let's ask Professor Google.

The search turned up a method called Hyakushiki English words. A sound source that repeats English and Japanese: this looks good. Ah, but my son isn't a high school student. After all, is there no choice but to make the sound source myself?

It would be nice to be able to specify the words to memorize and adjust the speed of the audio, and in some cases to switch between English -> Japanese and Japanese -> English.

With that in mind, I thought of the following method.

大作戦_図.png

Preparation

Follow "Setting up authentication" in Google's "Text-to-Speech Client Libraries" article and obtain the JSON file that contains the service account key. Pass this file to the program as an argument. The service appears to be free for up to 1 million characters.

Implementation

Word definition file

The CSV file has the following format.

flag	Id3tag_artist	Id3tag_album	Id3tag_title	english	japanese	output	loop_count
y	Dad's english	part1	0001	begin	start	./mp3/kihon78/0001.mp3	2

flag
Specifies whether to generate the audio file. Use it to manage files that have already been created, for example by setting them to "n".
id3tag
The files are meant to be put on a mobile device such as an iPhone and listened to there, and they cannot be organized without tags, so the tags can be defined here. This is a point I was particular about for day-to-day operation.
loop_count
The repeat count can be adjusted per line, for example twice for words and once for commentary.
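
As an illustration of how a file in this format might be consumed, here is a minimal sketch that reads the rows and keeps only those whose flag is "y". It is not taken from the repository: `load_word_definitions` is a hypothetical helper name, the delimiter is assumed to be a comma (pass `delimiter="\t"` if the file is tab-separated), and the second sample row is made up to show the filtering.

```python
import csv
import io


def load_word_definitions(stream, delimiter=","):
    """Return the rows of a word definition file whose flag column is 'y'.

    The column names follow the header line shown above. The delimiter is
    an assumption; use delimiter="\t" for a tab-separated file.
    """
    reader = csv.DictReader(stream, delimiter=delimiter)
    return [row for row in reader if row["flag"].strip().lower() == "y"]


# The first data row is the example from the article; the second one is
# made up for illustration and is skipped because its flag is "n".
SAMPLE = (
    "flag,Id3tag_artist,Id3tag_album,Id3tag_title,"
    "english,japanese,output,loop_count\n"
    "y,Dad's english,part1,0001,begin,start,./mp3/kihon78/0001.mp3,2\n"
    "n,Dad's english,part1,0002,end,finish,./mp3/kihon78/0002.mp3,2\n"
)

if __name__ == "__main__":
    for row in load_word_definitions(io.StringIO(SAMPLE)):
        print(row["english"], row["japanese"], row["loop_count"])
```

With a real file, the stream would come from something like `open("words.csv", newline="", encoding="utf-8")`, where `words.csv` is a placeholder name. Note that `loop_count` is read as a string.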

Audio file generation image

Parameters

Values that should vary per word go into the word definition file, while values that can be shared across a whole file are passed as parameters (command-line arguments). Where each setting belongs has to be designed according to the scope it should apply to: system definition, application definition, instance definition, or changeable while running. It depends on the expected operation.

speaking_rate
The speaking speed can be specified separately for English and Japanese. Since familiarity with one's mother tongue and with a foreign language differs, I recommend speeding up the Japanese.
Time between
The pause between English and Japanese, and the pause between repetitions, can be adjusted. A longer pause turns the file into practice for instant English composition.
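
The command-line side could be sketched with `argparse` as below. The attribute names `servicekey_of_file`, `between_sentences`, `between_the_loop` and `japanese_top` are the ones used by the code in this article; everything else (the flag spellings, the two speaking-rate option names, and all default values) is an assumption made for illustration and may differ from the actual program.

```python
import argparse


def parse_options(argv=None):
    """Parse per-run settings; flag names and defaults are illustrative."""
    parser = argparse.ArgumentParser(
        description="Generate English/Japanese vocabulary audio files.")
    # Path to the service account key JSON file.
    parser.add_argument("servicekey_of_file")
    # Speaking rates per language (hypothetical option names).
    parser.add_argument("--english-speaking-rate", type=float, default=1.0)
    parser.add_argument("--japanese-speaking-rate", type=float, default=1.2)
    # Pauses in milliseconds (default values are assumptions).
    parser.add_argument("--between-sentences", type=int, default=500)
    parser.add_argument("--between-the-loop", type=int, default=1000)
    # Speak Japanese first, then English.
    parser.add_argument("--japanese-top", action="store_true")
    return parser.parse_args(argv)


if __name__ == "__main__":
    option = parse_options(["key.json", "--japanese-top"])
    print(option.servicekey_of_file, option.japanese_top,
          option.between_sentences, option.between_the_loop)
```

`argparse` turns the hyphens in a flag such as `--between-the-loop` into underscores, so the parsed object exposes `option.between_the_loop`.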

Audio file generation

Parameters such as which language to handle and which voice speaks are specified with presets such as 'en-US' and 'en-US-Wavenet-D'.
It seems that other output formats can be selected through AudioEncoding.


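# Note: texttospeech.types and texttospeech.enums belong to
# google-cloud-texttospeech releases before 2.0.
from google.cloud import texttospeech


def create_audio(
        output_path,
        text,
        params_language_code,
        params_name,
        params_speaking_rate):
    # `option` is the module-level object holding the command-line arguments.
    client = texttospeech.TextToSpeechClient.from_service_account_json(
        option.servicekey_of_file)
    # Text to be read aloud.
    s_input = texttospeech.types.SynthesisInput(text=text)
    # Language and voice preset, e.g. 'en-US' and 'en-US-Wavenet-D'.
    voice_params = texttospeech.types.VoiceSelectionParams(
        language_code=params_language_code, name=params_name)
    # MP3 output at the requested speaking rate.
    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.MP3,
        speaking_rate=params_speaking_rate)
    response = client.synthesize_speech(
        s_input, voice_params, audio_config)
    # Write the returned audio bytes to the output file.
    with open(output_path, 'wb') as out:
        out.write(response.audio_content)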
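
To show how the presets and the per-language speaking rates might be combined for one row of the word definition file, here is a small sketch that only prepares the arguments and does not call the API. `VOICE_PRESETS`, `clamp_rate` and `build_requests` are hypothetical names, the Japanese voice 'ja-JP-Wavenet-A' is an assumed choice, and the rate limits of 0.25 and 4.0 are assumptions that should be checked against the AudioConfig documentation.

```python
# Hypothetical preset table: column name -> (language_code, voice name).
VOICE_PRESETS = {
    "english": ("en-US", "en-US-Wavenet-D"),
    "japanese": ("ja-JP", "ja-JP-Wavenet-A"),  # assumed voice choice
}


def clamp_rate(rate, low=0.25, high=4.0):
    """Keep a speaking rate inside the assumed limits (low..high)."""
    return max(low, min(high, rate))


def build_requests(row, rates, tmp_dir="."):
    """Build one keyword-argument dict per language for create_audio.

    row   -- dict with "english" and "japanese" text columns
    rates -- dict with a speaking rate for each of those columns
    """
    requests = []
    for column, (language_code, voice_name) in VOICE_PRESETS.items():
        requests.append({
            # Intermediate per-language file, joined later.
            "output_path": f"{tmp_dir}/{column}.mp3",
            "text": row[column],
            "params_language_code": language_code,
            "params_name": voice_name,
            "params_speaking_rate": clamp_rate(rates[column]),
        })
    return requests


if __name__ == "__main__":
    row = {"english": "begin", "japanese": "start"}
    for request in build_requests(row, {"english": 1.0, "japanese": 1.2}):
        print(request)
```

Because the dictionary keys match the parameter names of `create_audio`, each entry could be passed on as `create_audio(**request)`.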

Combine audio files

With pydub's AudioSegment, audio files can be joined as easily as concatenating strings.

import os

from pydub import AudioSegment


def synthesize_audio(
        input_en_path,
        input_jp_path,
        loop_count,
        output_path,
        option):
    # loop_count arrives as a string from the word definition file.
    loop_max = int(loop_count.strip())

    audio_en = AudioSegment.from_mp3(input_en_path)
    audio_jp = AudioSegment.from_mp3(input_jp_path)

    # Silent segments; durations are in milliseconds.
    opening_margin = AudioSegment.silent(duration=100)
    between_sentences = AudioSegment.silent(duration=option.between_sentences)
    between_the_loop = AudioSegment.silent(duration=option.between_the_loop)

    # Decide which language is spoken first.
    if option.japanese_top:
        first, second = audio_jp, audio_en
    else:
        first, second = audio_en, audio_jp

    # One repetition: first language, pause, second language.
    pair = first + between_sentences + second
    audio = opening_margin + pair
    for _ in range(loop_max - 1):
        audio += between_the_loop + pair

    audio.export(output_path, format='mp3')
    # Remove the intermediate per-language files.
    os.remove(input_en_path)
    os.remove(input_jp_path)
Source code

The full implementation is available on GitHub; please refer to it if you are interested.

ear-studies

Future Work

Obtain a list of words that have not been memorized yet through the API of a vocabulary app such as Quizlet, and have the program generate audio files of those weak words. Develop a mechanism to distribute the files to the target people via LINE or Discord.
Personalize the list of weak words for each learner and drill them in.
List the words that may have been forgotten according to the forgetting curve, have Google Home read them out, and try to make them stick.
Dad will improve his own English with the power of software so as not to lose to his son!