[Python] [Failure] Developing Japanese TTS with Tacotron2 ~ Part 2 ~

Introduction

Mellotron, the successor to Tacotron 2, has been announced, and the speech synthesis field is finally getting a tailwind. I am still developing TTS with Taco2, though.

Mellotron is based on Taco2, but it is built for English, so I would need to read into it a bit more before adapting it to Japanese ──

──no, for now I am not planning to migrate.

While developing with Taco2, I kept seeing statements like this:

"**The amount and quality of data in a dataset are very important.**"

I generally agree, but it is too vague.

It is a question that cannot be pinned down any further, yet there is no doubt that quality hinges on the data.

Afterwards, I came across [Lento's blog](https://medium.com/@crosssceneofwindff/%E7%BE%8E%E5%B0%91%E5%A5%B3%E5%A3%B0%E3%81%B8%E3%81%AE%E5%A4%89%E6%8F%9B%E3%81%A8%E5%90%88%E6%88%90-fe251a8e6933) and thought:

"No matter how much noise there is, Even if there is little data If you manage to do this Can you make TTS? ""

Decided to try

Results

To get it out of the way: the result was not something I could call a success; or rather, I was not satisfied with it.

  • Sample audio used: 86.12 s in total (measured with librosa; see the snippet below)
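For reference, this is roughly how such a total can be measured with librosa; a minimal sketch, assuming the clips sit under a hypothetical `dataset/wavs/` directory:

```python
import glob
import librosa

# Sum the durations of all clips in the dataset.
# "dataset/wavs" is a hypothetical path, not from the article.
total = 0.0
for path in sorted(glob.glob("dataset/wavs/*.wav")):
    y, sr = librosa.load(path, sr=None)  # keep each file's native sample rate
    total += len(y) / sr

print(f"total audio: {total:.2f}s")  # e.g. 86.12s
```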

Taco2

  • 120k steps (four TensorBoard screenshots of the training curves)

  • target output (image)

  • inference output (image)

WaveGlow

  • training curve (TensorBoard screenshot)

Discussion

Taco2

Taco2's inference looks fine under qualitative evaluation.

As mentioned in the previous article, the inference output shows a clean gradation; this is the same result as with TOA-TTS.
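For anyone who wants to eyeball this gradation themselves, here is a minimal plotting sketch. It assumes the mel spectrogram comes from the NVIDIA Tacotron2 repo's inference output (`mel_outputs_postnet`); the output file name is made up:

```python
import matplotlib.pyplot as plt

def save_mel_plot(mel, path="inference_mel.png"):
    """Render a mel spectrogram (n_mels x frames) to an image.

    `mel` is assumed to be a NumPy array, e.g.
    mel_outputs_postnet[0].detach().cpu().numpy() from Taco2 inference.
    """
    plt.figure(figsize=(10, 4))
    plt.imshow(mel, aspect="auto", origin="lower", interpolation="none")
    plt.colorbar()
    plt.xlabel("frames")
    plt.ylabel("mel channels")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
```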

One more thing: my gut expectation from the previous development, "there is noise, but it doesn't matter that much," is borne out here. The noise is smoothed away by this gradation, so I conclude that it really doesn't matter much.

Taco2 training stopped at 121k steps; continuing as-is might improve the quality a little more.

WaveGlow

This training diverged spectacularly.

How many steps does WaveGlow training need? If I remember the GitHub issue correctly, **about 1 million** are necessary.

I tried synthesizing at 120k and 600k steps and had the impression that the noise was decreasing, so I kept the training going; the result came out as shown above.
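The synthesis step itself follows the NVIDIA waveglow repo's inference script. A sketch under that assumption, with hypothetical checkpoint and mel paths and an assumed 22,050 Hz sample rate:

```python
import torch
from scipy.io.wavfile import write

# Load a training checkpoint (the repo saves a dict with a 'model' key).
# "checkpoints/waveglow_600000" and "taco2_mel.pt" are hypothetical paths.
waveglow = torch.load("checkpoints/waveglow_600000")["model"]
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow.cuda().eval()

mel = torch.load("taco2_mel.pt")  # (1, n_mels, frames) from Taco2 inference

with torch.no_grad():
    audio = waveglow.infer(mel.cuda(), sigma=0.666)

# Scale [-1, 1] float audio to 16-bit PCM and write it out.
wav = (audio[0].cpu().numpy() * 32767).astype("int16")
write("sample_600k.wav", 22050, wav)
```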

This run is set up exactly the same as TOA-TTS (hyperparameters, dataset creation procedure, etc.), so the voice quality of the source data itself seems to be what matters here.

Even granting that Taco2's inference is fine, the amount of noise in the synthesized audio suggests the problem lies in this waveform-generation model.

Closing

I want a model that holds up all the way through waveform regeneration. Next, I will dig in a little more, with a focus on improving the accuracy of the synthesized speech.

Postscript: 19/12/13

Pairing this Taco2 model with the TOA-TTS WaveGlow model synthesized normally (in TOA's voice), which confirmed that the abnormality is in this WaveGlow model after all.

I am currently removing the noisy audio from the dataset and retraining.
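The article does not say how the noisy clips were picked out; one simple heuristic would be to estimate a rough per-clip SNR from frame energies and drop the worst offenders. A sketch under that assumption (the paths and the 20 dB threshold are made up):

```python
import glob
import os
import shutil
import numpy as np
import librosa

def rough_snr_db(path):
    """Crude SNR estimate: loudest frames ~ speech, quietest ~ noise floor."""
    y, sr = librosa.load(path, sr=None)
    rms = librosa.feature.rms(y=y)[0]       # frame-wise RMS energy
    noise = np.percentile(rms, 10) + 1e-9   # quietest 10% of frames
    speech = np.percentile(rms, 90) + 1e-9  # loudest 10% of frames
    return 20.0 * np.log10(speech / noise)

os.makedirs("dataset/rejected", exist_ok=True)
for path in sorted(glob.glob("dataset/wavs/*.wav")):
    if rough_snr_db(path) < 20.0:           # threshold tuned by ear
        shutil.move(path, "dataset/rejected/")
```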

It is at 350k steps now, and the evaluation value already looks better than in the previous run.

Postscript: 19/12/16

Sample audio update: WaveGlow was retrained on only the less-noisy data and used for synthesis.

Audio sample (Taco2: 121k steps, WaveGlow: 458k steps)

The voice became clear, but not in the way I expected: the synthesized voice quality came out almost identical to TOA-TTS.

I may have made some mistake in how I had the model learn the voice quality.