[PYTHON] End-to-End Text-to-Speech Synthesis Starting with ESPnet2


Hello. I'm Tomoki Hayashi (@kan-bayashi) from Human Dataware Lab. My favorite editor is Vim / Neovim.

Today, I would like to briefly introduce end-to-end text-to-speech synthesis using ESPnet2, a project I help develop.

The following is based on ESPnet v.0.9.3 as of 09/15/2020. The details may change significantly with future version updates.

A three-line summary for busy people

- ESPnet2 makes it easy to use state-of-the-art end-to-end text-to-speech synthesis models. [Listen to a sample](https://drive.google.com/file/d/1MrEVDL7-COPIFVb1LIvIRfiIKkVZZeeP/view?usp=sharing)
- If you just want to run it, try the ESPnet2 real-time TTS demo.
- We are always looking for developers, so feel free to talk to the development members.

What is ESPnet?

ESPnet is an open-source toolkit for E2E speech processing, developed to accelerate research on end-to-end (E2E) models. For details, please see this article.

What is ESPnet2?

ESPnet2 is a next-generation speech processing toolkit developed to overcome the weaknesses of ESPnet. The code itself is integrated into the ESPnet repository (https://github.com/espnet/espnet). The basic configuration is the same as ESPnet, but the following extensions have been made to improve convenience and extensibility.

- **Task-Design**: Following the approach of FairSeq, users can define any new speech processing task (e.g., speech enhancement, voice conversion).
- **Chainer-Free**: Because development of Chainer has ended, the parts that depended on Chainer have been reworked.
- **Kaldi-Free**: The feature extraction that relied on Kaldi has been integrated into the Python library. This removes the need to compile Kaldi, a step where many users stumble.
- **On-the-Fly**: Feature extraction and text preprocessing are integrated into the model, so they now run on the fly during training and inference.
- **Scalable**: CPU memory usage has been optimized to enable training on huge datasets on the order of tens of thousands of hours. Multi-node, multi-GPU distributed training is also supported.
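To make the On-the-Fly idea concrete, here is a minimal sketch in plain Python. It is not ESPnet2 code: the names `tokenize`, `frame_energy`, and `ToyOnTheFlyModel` are hypothetical, and the "features" are just per-frame energies. The only point is that raw text and a raw waveform go in, and preprocessing happens inside the model call instead of in a separate offline step.

```python
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Hypothetical text preprocessing: lowercase, drop spaces, split into characters."""
    return [ch for ch in text.lower() if not ch.isspace()]


def frame_energy(wave: List[float], frame_len: int, hop: int) -> List[float]:
    """Hypothetical feature extractor: mean energy of each frame."""
    feats = []
    for start in range(0, len(wave) - frame_len + 1, hop):
        frame = wave[start:start + frame_len]
        feats.append(sum(x * x for x in frame) / frame_len)
    return feats


class ToyOnTheFlyModel:
    """Takes raw inputs and runs preprocessing inside the call itself."""

    def __init__(self, frame_len: int = 4, hop: int = 2) -> None:
        self.frame_len = frame_len
        self.hop = hop

    def __call__(self, text: str, wave: List[float]) -> Tuple[List[str], List[float]]:
        tokens = tokenize(text)  # text preprocessing, on the fly
        feats = frame_energy(wave, self.frame_len, self.hop)  # feature extraction, on the fly
        # A real model would now compute a loss or run inference on (tokens, feats).
        return tokens, feats


model = ToyOnTheFlyModel()
tokens, feats = model("Hi there", [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0])
print(tokens)
print(feats)
```

Because nothing is dumped to disk beforehand, changing a preprocessing setting such as the frame length only requires changing the model configuration, not regenerating features.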

The latest version as of October 2020, v.0.9.3, supports speech recognition (ASR), text-to-speech synthesis (TTS), and speech enhancement (SE) tasks. In the future, more tasks (e.g., speech translation and voice conversion) will be supported (https://github.com/espnet/espnet/issues/1795). In this article, I would like to briefly introduce the TTS part, which I mainly develop.

TTS models supported by ESPnet2

The E2E-TTS model generally consists of a Text2Mel model that generates acoustic features (mel spectrograms) from text and a Mel2Wav model that generates waveforms from acoustic features. ESPnet can mainly build the Text2Mel part.
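The two-stage structure can be sketched as a pair of functions chained together. This is only an illustration of the interfaces, not ESPnet code: `text2mel`, `mel2wav`, and `synthesize` are hypothetical stand-ins that emit dummy values, and the 80 mel bins and hop length of 256 samples are assumed values chosen for the example.

```python
from typing import List

N_MELS = 80       # assumed number of mel bins per frame
HOP_LENGTH = 256  # assumed number of waveform samples per feature frame


def text2mel(text: str) -> List[List[float]]:
    """Stand-in Text2Mel: emit two dummy frames per input character."""
    n_frames = 2 * len(text)
    return [[0.0] * N_MELS for _ in range(n_frames)]


def mel2wav(mel: List[List[float]]) -> List[float]:
    """Stand-in Mel2Wav: emit HOP_LENGTH silent samples per feature frame."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> mel spectrogram -> waveform."""
    mel = text2mel(text)
    return mel2wav(mel)


mel = text2mel("hello")
wav = synthesize("hello")
print(len(mel), len(mel[0]))  # frames, mel bins
print(len(wav))               # waveform samples
```

Because the only contract between the stages is the mel spectrogram, either stage can be swapped independently, for example by replacing the vocoder while keeping the same Text2Mel model.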

As of October 2020, the following Text2Mel models are supported.

For the Mel2Wav part, you can combine these with the models in the repository I am developing. The following Mel2Wav models are supported.

Inference using a pre-trained model

ESPnet2 works with the research data sharing repository Zenodo, so you can easily try out various pre-trained models. Beyond trying them out, any user can upload a pre-trained model by registering with Zenodo.

Below is an example of Python code that performs inference with FastSpeech2, a TTS model, pre-trained on the JSUT corpus.

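from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.tts_inference import Text2Speech

# Assumption: this tag is not given in the text; substitute your model's tag.
tag = "kan-bayashi/jsut_fastspeech2"

# Create E2E-TTS model instance
d = ModelDownloader()
text2speech = Text2Speech(
    # Specify the tag; the downloader returns the config and model file paths
    **d.download_and_unpack(tag)
)

# Synthesis with a given text (a JSUT-trained model expects Japanese input)
wav, feats, feats_denorm, *_ = text2speech(
    "I twisted all the reality toward myself."
)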

Here, wav, feats, and feats_denorm are the generated waveform, the acoustic features normalized by statistics, and the denormalized acoustic features, respectively. By default, the waveform is generated from the acoustic features with Griffin-Lim, but you can also combine the model with the Mel2Wav models introduced above.
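Once you have a waveform, a natural next step is to write it to a file. The sketch below uses only the Python standard library, with a synthetic sine wave standing in for the `wav` returned above. The helper name `save_wav` and the 24 kHz sampling rate are assumptions for illustration; with a real model you would use the sampling rate the model was trained with and first convert the returned tensor to a sequence of floats.

```python
import math
import struct
import wave
from typing import Sequence


def save_wav(path: str, samples: Sequence[float], sample_rate: int) -> None:
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in for a synthesized waveform: 0.5 s of a 440 Hz tone.
SAMPLE_RATE = 24000  # assumed; use the rate your model was trained with
fake_wav = [
    0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 2)
]
save_wav("out.wav", fake_wav, SAMPLE_RATE)
```

Clipping before scaling matters because values outside [-1.0, 1.0] would otherwise overflow the 16-bit range and fail to pack.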

A list of publicly available pre-trained models can be found at ESPnet Model Zoo.

Colab demo

We have also released a demo on Google Colab for those who find it difficult to set up an environment. You can easily try cutting-edge speech synthesis in your browser, so please give it a try. You can freely generate speech like this sample.

Building a model using a recipe

With ESPnet2, you can also build your own model using recipes. I will not explain in detail here, so if you are interested, please refer to this page.


This article has outlined text-to-speech synthesis using the E2E speech processing toolkit ESPnet2. ESPnet is being developed mainly by Japanese developers, and we are always looking for enthusiastic contributors. If you are interested, feel free to contact the development members or join the discussion on GitHub!
