[PYTHON] Japanese speech synthesis starting with Tacotron2

Introduction

I think what's most interesting is how good the results can get, so first, please listen to this sample.

This model is trained and used for inference with

--transfer learning from a pre-trained model
--about 1 hour of preprocessed data

I'll show you how to do it, aimed at people who are just starting out.

Here is a reference for Tacotron 2. Research and development of Japanese TTS (Text-to-Speech) using Tacotron2 [Summary]

What to prepare

Audio file

--22050 Hz, 16-bit, monaural wav
--split into one file per utterance (voiced section)

Exclude clips that are noisy, contain laughter, or are otherwise hard to transcribe. Clips that are too long can cause memory errors during training, so I only use clips of 10 seconds or less.
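As a minimal sketch of this preprocessing step, assuming librosa and soundfile are installed; the file paths and the way the 10-second limit is checked are my own illustration, not the article's script:

```python
import librosa
import soundfile as sf

SR = 22050
MAX_SECONDS = 10.0

# Load at 22050 Hz mono, then write 16-bit PCM wav (file names are placeholders).
audio, _ = librosa.load("raw/0001.wav", sr=SR, mono=True)

if len(audio) / SR <= MAX_SECONDS:  # skip clips longer than 10 s to avoid memory errors
    sf.write("wav/0001.wav", audio, SR, subtype="PCM_16")
```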

Text

Create train.txt and val.txt.

Following the format of ljs_audio_text_val_filelist.txt, each line is written as FILE PATH|TEXT. I split train and val at a 9:1 ratio. Phoneme balance is not taken into consideration.
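A minimal sketch of writing the two filelists with a 9:1 split; the file names, the example entries, and the shuffling are assumptions, not the article's actual script:

```python
import random

# Hypothetical (path, phoneme text) pairs; in practice these come from your own transcripts.
entries = [
    ("/wav/0001.wav", "konnnichiwa."),
    ("/wav/0002.wav", "ohayo-gozaimasu."),
    # ...
]

random.shuffle(entries)
split = int(len(entries) * 0.9)  # 9:1 train/val split

with open("train.txt", "w", encoding="utf-8") as f:
    f.writelines(f"{path}|{text}\n" for path, text in entries[:split])

with open("val.txt", "w", encoding="utf-8") as f:
    f.writelines(f"{path}|{text}\n" for path, text in entries[split:])
```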

Phoneme notation

TEXT is written in phonemes, with reference to the following: [wiki Japanese phonemes](https://ja.wikipedia.org/wiki/phonemes #Japanese phonemes) and the phoneme-balanced sentences of the Voice Actor Statistics Corpus.

Only symbols that appear in symbols.py can be used.

Note that if you enter koNnichiwa here, Tacotron2 internally converts it to ['k','o','n','n','i','c','h','i','w','a']. If you want ['k','o','N','n','i','ch','i','w','a'], the symbols must be enclosed in {}. However, only the elements of valid_symbols in cmudict.py can be used inside the braces, so you have to write it as ko{N}ni{CH}iwa.

I think a notation like k o {N} n i {CH} i w a (everything space-separated) would also work. I write it as konnnichiwa.
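As a quick way to check how the input is tokenized, you can call Tacotron2's own text frontend. This is a sketch that assumes you run it from the root of NVIDIA's tacotron2 repository; the cleaner list and the example strings are my own:

```python
# Compare plain and brace-enclosed notation with Tacotron2's text module.
from text import text_to_sequence, sequence_to_text

plain = text_to_sequence("koNnichiwa.", ["english_cleaners"])
braced = text_to_sequence("ko{N}ni{CH}iwa.", ["english_cleaners"])

print(sequence_to_text(plain))   # lowercased into per-character symbols: k o n n i c h i w a .
print(sequence_to_text(braced))  # {N} and {CH} survive as single ARPAbet symbols
```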

Add an EOS at the end of each sentence.

According to Model can not converge #254, this seems to speed up the convergence of attention during training.

Example

Mine looks like this.

train.txt


/wav/0126.wav|na&tanndesukedo-.
/wav/0022.wav|biyo-inndake-yoyakuwasimasita.
/wav/0149.wav|tasikani,ari!.
/wav/0092.wav|sositara-.
/wav/0063.wav|teyu-ne.
/wav/0202.wav|donndonn,tama&tekunndesuyo.

Settings

Edit hparams.py
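The article does not list the exact values. As a rough sketch, the fields you would typically touch when pointing hparams.py at your own filelists look like this; the field names follow NVIDIA's tacotron2, but the values are illustrative, not necessarily the article's settings:

```python
from hparams import create_hparams

# Illustrative overrides only.
hparams = create_hparams()
hparams.training_files = "filelists/train.txt"
hparams.validation_files = "filelists/val.txt"
hparams.text_cleaners = ["english_cleaners"]
hparams.batch_size = 32              # lower this if you hit GPU memory errors
hparams.iters_per_checkpoint = 2500  # checkpoints in this article appear every 2500 iterations
```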

I also added exponential learning rate decay to train.py, following Model can not converge #254.
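A minimal sketch of what such a schedule can look like; the decay constants and the place where it hooks into train.py's loop are assumptions, not the exact code from the issue:

```python
def exponential_decay(base_lr, iteration, decay_start=0, decay_steps=10000, decay_rate=0.5):
    """Exponentially decay the learning rate after decay_start iterations."""
    if iteration < decay_start:
        return base_lr
    return base_lr * decay_rate ** ((iteration - decay_start) / decay_steps)


# Inside train.py's training loop, before optimizer.step():
# learning_rate = exponential_decay(hparams.learning_rate, iteration)
# for param_group in optimizer.param_groups:
#     param_group["lr"] = learning_rate
```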

Learning

We train starting from the pre-trained model. The plots below show the result after 10k iterations, which took about six and a half hours on a Colab T4.

(grad.norm plot: grad.norm.png)

(training.loss plot: training.loss.png)
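Transfer learning here means warm-starting from a published pre-trained checkpoint. NVIDIA's train.py does this with its --warm_start option; the following is only a sketch of the idea, with a placeholder checkpoint path and the repo's default ignore_layers:

```python
import torch

def warm_start(model, checkpoint_path, ignore_layers=("embedding.weight",)):
    """Load pre-trained weights, skipping layers whose shapes depend on the symbol set."""
    state_dict = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
    state_dict = {k: v for k, v in state_dict.items() if k not in ignore_layers}
    model.load_state_dict(state_dict, strict=False)
    return model

# e.g. warm_start(model, "tacotron2_statedict.pt")
```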

Inference

Here are the results from each checkpoint, generated with sigma = 1 and without the denoiser.
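For reference, a minimal inference sketch in the style of NVIDIA's tacotron2/WaveGlow demo notebook; the checkpoint paths, the cleaner list, and the input string are assumptions:

```python
import sys
sys.path.append("waveglow/")  # needed so the WaveGlow checkpoint can be unpickled

import torch
from scipy.io.wavfile import write as write_wav

from hparams import create_hparams
from model import Tacotron2
from text import text_to_sequence

hparams = create_hparams()

# Tacotron2 checkpoint from this training run (path is a placeholder).
model = Tacotron2(hparams).cuda().eval()
model.load_state_dict(torch.load("outdir/checkpoint_10000")["state_dict"])

# Published WaveGlow vocoder checkpoint.
waveglow = torch.load("waveglow_256channels.pt")["model"].cuda().eval()

sequence = torch.LongTensor(
    text_to_sequence("ko{N}ni{CH}iwa.", ["english_cleaners"]))[None].cuda()

with torch.no_grad():
    _, mel_postnet, _, _ = model.inference(sequence)
    audio = waveglow.infer(mel_postnet, sigma=1.0)  # sigma = 1, denoiser not used

write_wav("out.wav", hparams.sampling_rate, audio[0].cpu().numpy())
```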

--"Also, as at Toji, it is often placed at the center of the Five Great Myoo (Godai Myoo)." (samples at 2500 / 5000 / 7500 / 10000 iterations)
--"New England style is a milk-based white cream soup, also known as Boston clam chowder." (samples at 2500 / 5000 / 7500 / 10000 iterations)
--"Category of people related to computer game makers, industry groups, etc." (samples at 2500 / 5000 / 7500 / 10000 iterations)
