Audio samples of “FLOW-TTS: A NON-AUTOREGRESSIVE NETWORK FOR TEXT TO SPEECH BASED ON FLOW”

Abstract

In this work, we propose Flow-TTS, a non-autoregressive end-to-end neural TTS model based on generative flow. Unlike other non-autoregressive models, Flow-TTS achieves high-quality speech generation using a single feed-forward network. To our knowledge, Flow-TTS is the first TTS model to use flow in the spectrogram generation network and the first non-autoregressive model that jointly learns the alignment and spectrogram generation through a single network. Experiments on LJSpeech show that the speech quality of Flow-TTS closely approaches that of human speech and is even better than that of the autoregressive model Tacotron 2 (it outperforms Tacotron 2 by 0.09 MOS). Meanwhile, inference with Flow-TTS is about 23 times faster than with Tacotron 2, which is comparable to FastSpeech.
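For readers unfamiliar with the term, "generative flow" refers to models built from invertible transformations whose Jacobian determinant is cheap to compute. The snippet below is a minimal sketch of one common building block, an affine coupling layer, written with NumPy. It is not the Flow-TTS architecture: the function names, the toy `tanh` conditioner, and the parameters `w` and `b` are all assumptions made for illustration.

```python
import numpy as np

def coupling_forward(x, w, b):
    """Affine coupling: pass the first half through, scale and shift the second half."""
    xa, xb = np.split(x, 2, axis=-1)
    # Toy conditioner (hypothetical): a single linear map followed by tanh.
    h = np.tanh(xa @ w + b)
    log_s, t = np.split(h, 2, axis=-1)
    yb = xb * np.exp(log_s) + t
    # The Jacobian is triangular, so its log-determinant is the sum of log-scales.
    log_det = log_s.sum(axis=-1)
    return np.concatenate([xa, yb], axis=-1), log_det

def coupling_inverse(y, w, b):
    """Exact inverse: the untouched half lets us recompute the same scale and shift."""
    ya, yb = np.split(y, 2, axis=-1)
    h = np.tanh(ya @ w + b)
    log_s, t = np.split(h, 2, axis=-1)
    xb = (yb - t) * np.exp(-log_s)
    return np.concatenate([ya, xb], axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))   # 4 frames, 8 features (arbitrary sizes)
    w = rng.normal(size=(4, 8))   # maps the 4 kept features to 8 conditioner outputs
    b = rng.normal(size=8)
    y, log_det = coupling_forward(x, w, b)
    x_rec = coupling_inverse(y, w, b)
    print(np.allclose(x, x_rec))
```

Because the inverse is computed in a single pass with no sequential dependency between output positions, layers of this kind are a natural fit for non-autoregressive generation.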

We compare our method with Tacotron 2 (using the pretrained model from https://github.com/NVIDIA/tacotron2) and FastSpeech (using the pretrained model from https://github.com/espnet/espnet). All audio samples use WaveGlow as the vocoder.

There seems to be no reason why ordinary paper should not be better made.

GT	GT(WaveGlow)

Flow-TTS	Tacotron 2	FastSpeech

The various wards were all about eleven feet in height.

GT	GT(WaveGlow)

Flow-TTS	Tacotron 2	FastSpeech

The moral welfare of the inmates was as closely looked after as the physical.

GT	GT(WaveGlow)

Flow-TTS	Tacotron 2	FastSpeech

an incomplete and fallacious method of preventing contamination.

GT	GT(WaveGlow)

Flow-TTS	Tacotron 2	FastSpeech

except the forgery of wills and powers of attorney.

GT	GT(WaveGlow)

Flow-TTS	Tacotron 2	FastSpeech

by the blacksmith in the usual way.

GT	GT(WaveGlow)

Flow-TTS	Tacotron 2	FastSpeech
