"JETS: Jointly training FastSpeech2 and HiFi-GAN for end-to-end text-to-speech"

https://arxiv.org/abs/2203.16852

Submitted to INTERSPEECH 2022.

Authors

Abstract

In neural text-to-speech (TTS), two-stage systems, or cascades of separately learned models, have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text into a mel spectrogram, and HiFi-GAN then generates a raw waveform from the mel spectrogram; the two are called an acoustic feature generator and a neural vocoder, respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models. Specifically, our proposed model is a jointly trained FastSpeech2 and HiFi-GAN with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning. Furthermore, we remove the dependency on an external speech-text alignment tool by adopting an alignment learning objective in our joint training framework. Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS on subjective evaluation (MOS) and on some objective evaluations.
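The abstract does not spell out how the speech-text alignment is used. As a rough illustration of why FastSpeech2-style acoustic models depend on one, the sketch below expands token-level features to frame level by repeating each token for its aligned duration (often called a length regulator). This is a minimal sketch under stated assumptions, not the JETS implementation: the function name `length_regulate`, the toy feature values, and the integer durations are all hypothetical.

```python
from typing import List, Sequence


def length_regulate(
    token_features: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each token's feature vector durations[i] times.

    Hypothetical helper for illustration only; durations are assumed to be
    non-negative integers (frames per token) produced by some alignment.
    """
    if len(token_features) != len(durations):
        raise ValueError("one duration per token is required")
    frames: List[List[float]] = []
    for feat, dur in zip(token_features, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so each frame is an independent list.
        frames.extend(list(feat) for _ in range(dur))
    return frames


# Toy input: three tokens with 2-dimensional features (made-up values).
tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]  # frames assigned to each token by the alignment
frames = length_regulate(tokens, durations)
print(len(frames))  # number of frames equals sum(durations)
```

If the durations are inaccurate, the frame-level sequence no longer lines up with the target speech, which is one way to read the abstract's remark that a cascade needs an accurate alignment for optimal performance.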

Example of LJSpeech (English single speaker)

  • CF2 (joint-ft): Conformer-based FastSpeech2 + HiFi-GAN, both models were jointly fine-tuned.
  • CF2 (joint-tr): Conformer-based FastSpeech2 + HiFi-GAN, both models were jointly trained from scratch.
  • VITS: End-to-end text-to-waveform model, VITS.
  • JETS: End-to-end text-to-waveform model, JETS (proposed model).

    LJ050-0030: The Commission also recommends

    Ground truth, CF2 (joint-ft), CF2 (joint-tr), VITS, JETS

    LJ050-0040: and reports from other agencies which independently evaluate their information for potential sources of danger.

    Ground truth, CF2 (joint-ft), CF2 (joint-tr), VITS, JETS

    LJ050-0050: As a result of these studies, the planning document submitted by the Secretary of the Treasury to the Bureau of the Budget on August thirty-one,

    Ground truth, CF2 (joint-ft), CF2 (joint-tr), VITS, JETS

    LJ050-0152: The Secret Service and the Department of the Treasury now recognize this critical need.

    Ground truth, CF2 (joint-ft), CF2 (joint-tr), VITS, JETS

    LJ050-0200: The Secret Service should utilize the personnel of other Federal law enforcement offices

    Ground truth, CF2 (joint-ft), CF2 (joint-tr), VITS, JETS

Contact