[PYTHON] Enhanced vocabulary with Cloud Text-to-Speech

Background

One day my son asked me, his father, a question.

Son: "Dad, how do you memorize English words?"

Dad: "Well, you read aloud in class, don't you? And then you practice reciting?"

Son: "Well, the kids who do it, do it..."

Dad: "First get to the point where you can translate the textbook, then work toward saying it without looking. That kind of training works."

Son: "That... I can't recite."

Dad: "What?! If you chant it enough times, you'll remember at least some of it, won't you?"

Son: "A little, but I get tired after repeating it so many times, and I lose motivation..."

Dad: "(Seriously, this is bad. He hates studying. orz)"

[Figure: 背景_図.png (background)]

Requirements analysis

Memorizing English words is a steep path that everyone has to climb unless they are a native speaker or a returnee. My son seems to have hit the wall already. There are many ways to memorize words, so let's sort them out a little.

The basic approach is to read an English word and practice translating it into Japanese. With enough repetition, the word should become familiar and eventually stick.

Words are typically memorized in one of the following units. Since my son is not good at English, option 1, individual words, is the way to go.

  1. Memorize individual English words
  2. Memorize words in example sentences
  3. Memorize words in a longer text

There are also the following ways to make words sink into the body rather than just the brain.

  1. Memorize by writing
  2. Memorize by reading aloud
  3. Memorize by listening to a sound source such as a CD

After a lot of research, option 2, reading aloud, seems best in terms of efficiency and practicality. However, my son says he can't recite, so option 2 is unlikely to work either.

Option 1, writing, takes too much time. That leaves option 3, sound sources: listening during spare moments so the words stick as much as possible. But how should the sound sources be prepared? Many of the CDs bundled with vocabulary books contain only English audio. Some sound sources do repeat English and Japanese as a set, but my son says he doesn't like their rhythm. For the spoken entry "what to look for", he even complained that the "what" was too insistent.

What a monster customer!

[Figure: 要求分析_図.png (requirements analysis)]

Vocabulary strengthening strategy

If products from other companies and off-the-shelf products can't be used, should I edit the sound sources myself? No, even a globe-trotting engineer dad doesn't have that much time.

Right, let's ask Professor Google.

That turned up a method called Hyakushiki English words. It repeats English and Japanese audio, which looks good. Ah, but my son isn't a high school student. So is there no choice but to make the sound source myself after all?

Ideally I could choose which words to memorize and adjust the speed of the audio, and in some cases it would be nice to switch between English -> Japanese and Japanese -> English.

With that in mind, I thought of the following method.

[Figure: 大作戦_図.png (the grand plan)]

Preparation

Follow the "Setting up authentication" steps in Google's Text-to-Speech Client Libraries article and obtain the JSON file that contains the service account key. Pass this file to the program as an argument. The service seems to be free to use up to 1 million characters.
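
Before handing the key file to the program, a quick sanity check can save a confusing authentication error later. The sketch below is not part of the original program: the helper name `looks_like_service_account_key` is made up for illustration, and it only looks at a few fields that service account key files normally contain. It does not contact Google or validate the key itself.

```python
import json


def looks_like_service_account_key(path):
    """Return True if the JSON file at `path` resembles a service account key.

    Hypothetical helper for illustration: it only inspects a few
    well-known fields and never talks to Google.
    """
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON.
        return False
    if not isinstance(data, dict):
        return False
    required = ("private_key", "client_email")
    return data.get("type") == "service_account" and all(
        key in data for key in required)
```

A failed check only means the file is probably not the one downloaded from the service account page; a passing check does not guarantee that the key is still valid.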

Implementation

Word definition file

The CSV file has the following format.

| flag | Id3tag_artist | Id3tag_album | Id3tag_title | english | japanese | output | loop_count |
| y | Dad's english | part1 | 0001 | begin | start | ./mp3/kihon78/0001.mp3 | 2 |
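
To show how a file in this format might be consumed, here is a minimal sketch using the standard `csv` module. It is an illustration, not the code from the repository: the function name `load_word_rows`, the comma delimiter, and the rule that only rows with `flag` equal to `y` are processed are all assumptions, and the second sample row is made up.

```python
import csv
import io


def load_word_rows(csv_file, delimiter=","):
    """Yield one dict per enabled row of the word definition file.

    Assumptions for this sketch: the first line is the header shown
    above, fields are separated by `delimiter`, and flag == "y" marks
    the rows that should be turned into audio.
    """
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    for row in reader:
        if row["flag"].strip().lower() != "y":
            continue
        # loop_count is left as a string, because synthesize_audio
        # converts it with int(loop_count.strip()).
        yield row


# In-memory sample; the second row is invented to show the flag filter.
sample = io.StringIO(
    "flag,Id3tag_artist,Id3tag_album,Id3tag_title,"
    "english,japanese,output,loop_count\n"
    "y,Dad's english,part1,0001,begin,start,./mp3/kihon78/0001.mp3,2\n"
    "n,Dad's english,part1,0002,skip,me,./mp3/kihon78/0002.mp3,2\n"
)
rows = list(load_word_rows(sample))
print(rows[0]["english"], rows[0]["loop_count"])  # begin 2
```

With a real file, `open(path, newline="", encoding="utf-8")` would take the place of the `io.StringIO` sample.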

Audio file generation image

[Figure: 発声の構成図.png (structure of the generated audio)]

Parameters

Settings that should differ for each word go in the word definition file, and settings that should be the same for a whole file are passed as parameters (command-line arguments). Where each setting belongs depends on the scope it should apply to: a system definition, an application definition, an instance definition, or something that can be changed while running. In the end it depends on how the tool is expected to be operated.
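
As a rough illustration of the per-file side of that split, the sketch below parses command-line arguments with `argparse`. The attribute names `servicekey_of_file`, `between_sentences`, `between_the_loop`, and `japanese_top` are the ones the functions in this article read from `option`; the flag spellings, the positional arguments, the default values, and the function name `parse_options` are assumptions made for this example.

```python
import argparse


def parse_options(argv=None):
    """Parse the per-file settings (a sketch; flags and defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Build vocabulary audio from a word definition file")
    parser.add_argument("csv_path", help="word definition file")
    parser.add_argument("servicekey_of_file",
                        help="JSON file with the service account key")
    parser.add_argument("--between-sentences", dest="between_sentences",
                        type=int, default=500,
                        help="silence between the two languages, in ms")
    parser.add_argument("--between-the-loop", dest="between_the_loop",
                        type=int, default=1000,
                        help="silence between repetitions, in ms")
    parser.add_argument("--japanese-top", dest="japanese_top",
                        action="store_true",
                        help="speak Japanese first, then English")
    return parser.parse_args(argv)


option = parse_options(["words.csv", "key.json", "--japanese-top"])
print(option.between_sentences, option.japanese_top)  # 500 True
```

Per-word values such as `loop_count` stay in the word definition file, so they do not appear here.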

Audio file generation


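from google.cloud import texttospeech


def create_audio(
        output_path,
        text,
        params_language_code,
        params_name,
        params_speaking_rate):
    # Uses the `types` / `enums` modules of google-cloud-texttospeech
    # releases before 2.0; later releases removed them.
    # `option` is not a parameter here, so it must be a module-level
    # object holding the parsed command-line arguments.
    client = texttospeech.TextToSpeechClient.from_service_account_json(
        option.servicekey_of_file)
    # Text to speak, voice to use, and output encoding / speed.
    s_input = texttospeech.types.SynthesisInput(text=text)
    voice_params = texttospeech.types.VoiceSelectionParams(
        language_code=params_language_code, name=params_name)
    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.MP3,
        speaking_rate=params_speaking_rate)
    response = client.synthesize_speech(
        s_input, voice_params, audio_config)
    # The response carries the MP3 data as raw bytes.
    with open(output_path, 'wb') as out:
        out.write(response.audio_content)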

Combine audio files

import os

from pydub import AudioSegment


def synthesize_audio(
        input_en_path,
        input_jp_path,
        loop_count,
        output_path,
        option):
    # loop_count arrives as a string from the word definition file.
    loop_max = int(loop_count.strip())

    audio_en = AudioSegment.from_mp3(input_en_path)
    audio_jp = AudioSegment.from_mp3(input_jp_path)

    # Silence durations are given in milliseconds.
    opening_margin = AudioSegment.silent(duration=100)
    between_sentences = AudioSegment.silent(duration=option.between_sentences)
    between_the_loop = AudioSegment.silent(duration=option.between_the_loop)

    # Decide which language is spoken first.
    if option.japanese_top:
        first, second = audio_jp, audio_en
    else:
        first, second = audio_en, audio_jp
    pair = first + between_sentences + second

    # One pair is always emitted; further repetitions are separated
    # by the loop gap.
    audio = opening_margin + pair
    for _ in range(loop_max - 1):
        audio += between_the_loop + pair

    audio.export(output_path, format='mp3')
    # The per-language files are only intermediates, so delete them.
    os.remove(input_en_path)
    os.remove(input_jp_path)
Full source

The complete implementation is available on GitHub; please refer to it if you like.

ear-studies

Future Work
