Mellotron, the successor to Tacotron 2, has been announced. The speech synthesis field is finally getting a tailwind these days. I am still developing TTS with Taco2, though.
Mellotron is based on Taco2, but it is built around English; I would have to read up a bit more before adapting it to Japanese ──
──no, for now I am not thinking about migrating.
While developing with Taco2, I kept seeing statements like this:
**"The amount and quality of the dataset are very important."**
I generally agree, but it is too vague. There is no sharper answer to be had, yet there is no doubt that output quality hinges on it.
Later, I came across [Lento's blog](https://medium.com/@crosssceneofwindff/%E7%BE%8E%E5%B0%91%E5%A5%B3%E5%A3%B0%E3%81%B8%E3%81%AE%E5%A4%89%E6%8F%9B%E3%81%A8%E5%90%88%E6%88%90-fe251a8e6933) and thought:
"However noisy the data, however little of it there is, if I handle it well, can I still build a TTS?"
So I decided to try.
To state the result upfront: I cannot say it went well. No, I should say I was not satisfied.
Taco2, 120k steps: target vs. inference mel spectrogram plots
WaveGlow: synthesized audio sample
Taco2: the inference looks fine in a qualitative evaluation. As mentioned in the previous article, the inferred mel spectrogram has a beautiful, smooth gradation; this was the same result as with TOA-TTS.
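As a rough sketch of that qualitative check, the target and inferred mel spectrograms can be put side by side like this (the two arrays are assumed to already be in hand as `(n_mel_channels, frames)` numpy arrays; all names here are illustrative):

```python
import matplotlib.pyplot as plt

def plot_mels(target_mel, inferred_mel):
    """Show target vs. inferred mel spectrograms side by side."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
    for ax, mel, title in zip(axes, (target_mel, inferred_mel),
                              ("target", "inference")):
        im = ax.imshow(mel, aspect="auto", origin="lower",
                       interpolation="none")
        ax.set_title(title)
        ax.set_xlabel("frames")
        ax.set_ylabel("mel channels")
        fig.colorbar(im, ax=ax)
    fig.tight_layout()
    plt.show()
```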
And one more thing: my gut expectation from the previous development, "there is noise, but it doesn't matter that much," also held up. **The noise gets washed out by this gradation, so in the end it really doesn't matter much.**
I stopped Taco2 training at 121k steps; if I keep going as-is, the quality may improve a little more.
WaveGlow: this is where the computation balloons spectacularly. How many steps does WaveGlow training need? **As I recall from the GitHub issue, about 1 million.**
I tried synthesizing at 120k and again at 600k steps, and since the noise seemed to be decreasing, I kept the training running. The result was like this.
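For reference, here is a minimal sketch of what that checkpoint-by-checkpoint synthesis looks like with the NVIDIA tacotron2/waveglow codebases. The checkpoint file names are placeholders, and `english_cleaners` stands in for whatever Japanese text frontend the dataset actually uses:

```python
# Minimal inference sketch based on the NVIDIA tacotron2/waveglow repos.
import numpy as np
import torch

from hparams import create_hparams   # from the tacotron2 repo
from train import load_model
from text import text_to_sequence


def synthesize(text, taco2_ckpt, waveglow_ckpt, sigma=0.666):
    hparams = create_hparams()

    # Load a Tacotron2 checkpoint saved by its train.py.
    model = load_model(hparams)
    model.load_state_dict(torch.load(taco2_ckpt)["state_dict"])
    model.cuda().eval()

    # Load a WaveGlow checkpoint (e.g. the 120k- or 600k-step one).
    waveglow = torch.load(waveglow_ckpt)["model"]
    waveglow.cuda().eval()

    # Text frontend: english_cleaners is a placeholder here; a Japanese
    # dataset needs its own cleaner / phoneme conversion.
    sequence = np.array(text_to_sequence(text, ["english_cleaners"]))[None, :]
    sequence = torch.from_numpy(sequence).cuda().long()

    # Taco2 predicts the mel spectrogram; WaveGlow turns it into a waveform.
    with torch.no_grad():
        _, mel_postnet, _, _ = model.inference(sequence)
        audio = waveglow.infer(mel_postnet, sigma=sigma)
    return audio.squeeze().cpu().numpy()


# Compare vocoder checkpoints with the same Taco2 model.
audio_120k = synthesize("input text here", "taco2_121k.pt", "waveglow_120k.pt")
audio_600k = synthesize("input text here", "taco2_121k.pt", "waveglow_600k.pt")
```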
The conditions here are exactly the same as for TOA-TTS (hyperparameters, dataset-creation procedure, and so on), so the quality of the source audio seems to be what matters. Given that Taco2's inference is fine, the heavy noise in the synthesized audio most likely points to a problem in this waveform-generation model.
I want a pipeline that holds up all the way through waveform reconstruction, so next I will keep at it a little longer, focusing on improving the accuracy of the synthesis.
When I paired this Taco2 model with the TOA-TTS WvGw (WaveGlow) model, it synthesized normally, in TOA's voice, which confirmed that the abnormality is indeed in this WvGw model.
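In code terms, that cross-check is just the `synthesize` sketch above with the vocoder checkpoint swapped out ("waveglow_toa.pt" is a placeholder file name):

```python
# Same Taco2 checkpoint, but TOA-TTS's trained WaveGlow as the vocoder.
audio = synthesize("input text here", "taco2_121k.pt", "waveglow_toa.pt")
# Synthesizes cleanly (in TOA's voice), so the fault is in the new WaveGlow.
```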
I am currently removing the noisy clips and retraining.
It is at 350k steps now, and the evaluation value looks better than in the previous run.
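For illustration, one automatic way to screen out noisy clips is to rank them by a crude SNR proxy and keep only the cleaner ones. The threshold, frame sizes, and paths below are all arbitrary assumptions, not necessarily the procedure used here:

```python
# Rough sketch: keep clips whose loud/quiet frame-energy gap suggests
# clean speech over a low noise floor.
import shutil
from pathlib import Path

import numpy as np
import librosa


def snr_proxy(path, frame=2048, hop=512):
    y, _ = librosa.load(path, sr=22050)
    # RMS energy per frame.
    rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
    rms = rms[rms > 0]
    # Treat the quietest 10% of frames as the noise floor and the loudest
    # 10% as speech; their ratio is a crude SNR estimate in dB.
    noise = np.percentile(rms, 10)
    speech = np.percentile(rms, 90)
    return 20 * np.log10(speech / noise)


clean_dir = Path("filelists/clean")
clean_dir.mkdir(parents=True, exist_ok=True)
for wav in Path("wavs").glob("*.wav"):
    if snr_proxy(wav) >= 25.0:  # arbitrary threshold, tune by listening
        shutil.copy(wav, clean_dir / wav.name)
```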
Sample audio update: WvGw was retrained on only the less-noisy data and used for synthesis.
Audio sample (taco2: 121k, wavglw: 458k)
The voice did become clear, but not in the way I expected: the synthesized voice quality came out almost identical to TOA-TTS.
I may have gone about learning the target voice quality the wrong way.