Hello. I'm Tomoki Hayashi (@kan-bayashi) from Human Dataware Lab. My favorite editor is Vim/Neovim.
Today, I would like to briefly introduce end-to-end text-to-speech synthesis using ESPnet2, whose development I am participating in.
The following content is based on ESPnet v.0.9.3 as of 09/15/2020. Details may change significantly with future version updates.
- ESPnet2 makes it easy to use state-of-the-art end-to-end text-to-speech models that [sound like this](https://drive.google.com/file/d/1MrEVDL7-COPIFVb1LIvIRfiIKkVZZeeP/view?usp=sharing).
- If you just want to run it right away, play with the ESPnet2 real-time TTS demo.
- We are always looking for developers, so feel free to talk to the development members.
ESPnet2 is a next-generation speech processing toolkit developed to overcome the weaknesses of ESPnet. The code itself is integrated into the [ESPnet repository](https://github.com/espnet/espnet). The basic structure is the same as ESPnet's, but the following extensions have been made to improve convenience and extensibility.
- **Task-Design**: Users can define new speech processing tasks (e.g. speech enhancement, voice conversion) following the approach of FairSeq.
- **Chainer-Free**: The parts that depended on Chainer have been rewritten, since Chainer's development has ended.
- **Kaldi-Free**: Feature extraction that relied on Kaldi has been replaced with Python libraries. This eliminates the need to compile Kaldi, which many users stumble on.
- **On-the-Fly**: Feature extraction and text preprocessing are integrated into the model and executed on the fly during training and inference.
- **Scalable**: CPU memory usage is optimized to enable training on huge datasets on the order of tens of thousands of hours. Multi-node multi-GPU distributed training is also supported.
The latest version, v.0.9.3 as of October 2020, supports automatic speech recognition (ASR), text-to-speech synthesis (TTS), and speech enhancement (SE) tasks. [More tasks are planned](https://github.com/espnet/espnet/issues/1795) (e.g. speech translation). In this article, I would like to briefly introduce the TTS part, which I mainly develop.
An E2E-TTS model generally consists of a Text2Mel model, which generates acoustic features (mel spectrograms) from text, and a Mel2Wav model, which generates waveforms from the acoustic features. ESPnet mainly covers the Text2Mel part.
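To make the two-stage structure concrete, here is a minimal sketch with dummy stand-ins. The `DummyText2Mel` and `DummyMel2Wav` classes, frame rate, and dimensions are hypothetical illustrations, not ESPnet code: the point is only that Text2Mel maps characters to mel frames, and Mel2Wav upsamples each frame to waveform samples.

```python
import numpy as np

N_MELS = 80     # typical mel-spectrogram dimension (illustrative)
HOP_SIZE = 256  # waveform samples generated per mel frame (illustrative)

class DummyText2Mel:
    """Toy Text2Mel: maps text to a (frames x n_mels) mel spectrogram."""
    def __call__(self, text):
        n_frames = 5 * len(text)  # toy duration model: 5 frames per character
        return np.random.randn(n_frames, N_MELS)

class DummyMel2Wav:
    """Toy Mel2Wav: maps a mel spectrogram to HOP_SIZE samples per frame."""
    def __call__(self, mel):
        return np.random.randn(mel.shape[0] * HOP_SIZE)

# Chain the two stages, as an E2E-TTS pipeline does
text2mel, mel2wav = DummyText2Mel(), DummyMel2Wav()
mel = text2mel("hello")  # (25, 80) mel spectrogram
wav = mel2wav(mel)       # 25 * 256 = 6400 waveform samples
```

Real Text2Mel and Mel2Wav stages are neural networks, but they compose in exactly this way: the mel spectrogram is the interface between the two.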
As of October 2020, the following Text2Mel models are supported.
As the Mel2Wav model, it can be combined with those in the repository I am developing. The following Mel2Wav models are supported.
ESPnet2 works with the research data sharing repository Zenodo, so you can easily try out various pretrained models. Beyond trying them out, any user can upload a pretrained model by registering with Zenodo.
```python
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.tts_inference import Text2Speech

# Create E2E-TTS model instance
d = ModelDownloader()
text2speech = Text2Speech(
    # Specify the model tag
    **d.download_and_unpack("kan-bayashi/jsut_fastspeech2")
)

# Synthesize speech from a given text
wav, feats, feats_denorm, *_ = text2speech(
    "I twisted all the reality toward myself."
)
```
wav, feats, and feats_denorm represent the generated waveform, the statistically normalized acoustic features, and the denormalized acoustic features, respectively. By default, waveform generation from the acoustic features is performed with Griffin-Lim, but it can also be combined with the Mel2Wav models introduced above.
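For intuition, Griffin-Lim recovers a waveform from a magnitude spectrogram by iteratively re-estimating the missing phase. Below is a minimal NumPy sketch of the idea. This is not ESPnet's implementation, and the window/FFT parameters are illustrative:

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Naive short-time Fourier transform with a Hann window."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.stack([np.fft.rfft(f) for f in frames])

def istft(S, n_fft=256, hop=64):
    """Inverse STFT via windowed overlap-add."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(S) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, spec in enumerate(S):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(spec) * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32, n_fft=256, hop=64):
    """Estimate phase for a magnitude spectrogram by fixed-point iteration."""
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        x = istft(mag * angles, n_fft, hop)              # back to time domain
        angles = np.exp(1j * np.angle(stft(x, n_fft, hop)))  # keep only phase
    return istft(mag * angles, n_fft, hop)

# Reconstruct a 0.5 s, 440 Hz sine (16 kHz) from its magnitude spectrogram
x = np.sin(2 * np.pi * 440 * np.arange(8000) / 16000)
y = griffin_lim(np.abs(stft(x)))
```

Because the phase is only estimated, Griffin-Lim output tends to sound buzzy compared to a neural Mel2Wav model, which is why combining with the vocoders above usually gives better quality.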
A list of publicly available pretrained models can be found in the ESPnet Model Zoo.
We have also released a Google Colab demo for those who find it difficult to set up an environment. You can easily experience cutting-edge speech synthesis in your browser, so please give it a try. You can freely generate sound like this.
With ESPnet2, you can also build your own model using recipes. I will not go into detail here; if you are interested, please refer to this page.
This article has outlined text-to-speech synthesis with the E2E speech processing toolkit ESPnet2. ESPnet is developed mainly by Japanese contributors, and we are always looking for enthusiastic developers. If you are interested, feel free to contact the development members or join the discussion on GitHub!