🎧StableTTS: Efficient Denoising Acoustic Decoder with Consistency Flow Matching

Zhiyong Chen*, Xinnuo Li*, Shuhang Wu, Zhi Yang, Zhiqi Ai, and Shugong Xu.

For more information, visit our project page.

Paper (preview version): StableTTS

Introduction

Abstract: Current state-of-the-art text-to-speech (TTS) systems predominantly utilize denoising-based acoustic decoders with language models or non-autoregressive front-ends, known for their superior performance in generating high-fidelity spectra. In this study, we introduce an efficient TTS system, StableTTS, which incorporates Consistency Flow Matching (CFM) denoising training. This approach enhances training efficiency and operational performance in denoising-based acoustic decoders without additional training costs.

Description: This demo page gives an overview of StableTTS, a research project. It shows how Consistency Flow Matching (CFM) improves generation efficiency and quality in text-to-speech (TTS) synthesis. By comparing several methods and architectures, the demos illustrate the gains StableTTS brings in TTS quality and performance.

Demo 1: Method Comparison

Demo 1 compares different methods at the default NFE (number of function evaluations, i.e., inference sampling steps) and at reduced NFE. The samples include Ground Truth (GT), the reference audio for zero-shot cloning (Ref), GradTTS, MatchaTTS, ReflowTTS, and StableTTS. The purpose of this demo is to evaluate the generation quality and stability of each model under the same conditions. StableTTS introduces a denoising acoustic decoder that uses Consistency Flow Matching to improve generation efficiency and quality. Users can listen to the audio samples generated by each method to assess and compare their performance.

NFE: default
GT | Ref | GradTTS+ | MatchaTTS+ | VoiceFlow/Reflow+ | StableTTS/ConsistFM (ours)
NFE: 10
GT | Ref | GradTTS+ | MatchaTTS+ | StableTTS/ConsistFM (ours)
NFE: 6
GT | Ref | MatchaTTS+ | StableTTS/ConsistFM (ours)

Demo 2: Architecture Comparison

Demo 2 compares different architectures (e.g., MatchaTTS and StableTTS) across various NFE values (2, 4, 6, 10, 15, 20, 25, 30, 35, 40, 50, and 100). This comparison shows how generation quality changes as the number of sampling steps is reduced. StableTTS, with Consistency Flow Matching, is better able to generate high-quality audio at low NFE, which improves generation efficiency. In this demo, users can explore the audio samples at different NFE values to observe the performance of each architecture.

NFE Level (Steps) | MatchaTTS+ | StableTTS/ConsistFM
Rows: 2, 6, 10, 15, 20, 25, 30, 35, 40, 50, 100
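
To make the role of NFE concrete, the following is a minimal sketch of fixed-step Euler sampling, where each step costs one evaluation of a velocity field. It is not the StableTTS decoder: `euler_sample`, `toy_velocity`, `K`, and `TARGET` are hypothetical names, and the velocity field is a simple analytic stand-in for a trained network, chosen so that the exact endpoint is known.

```python
import numpy as np

def euler_sample(velocity, x0, nfe):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with `nfe` Euler steps."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / nfe
    for i in range(nfe):
        t = i * dt
        x = x + dt * velocity(x, t)  # one function evaluation per step
    return x

# Toy stand-in for a learned velocity field (assumption, not the paper's model).
K = 3.0
TARGET = np.array([1.0, -2.0])

def toy_velocity(x, t):
    return K * (TARGET - x)

def exact_endpoint(x0):
    # Closed-form solution of the toy ODE at t=1.
    return TARGET + (np.asarray(x0, dtype=float) - TARGET) * np.exp(-K)

x0 = np.zeros(2)
errors = {
    nfe: float(np.linalg.norm(euler_sample(toy_velocity, x0, nfe) - exact_endpoint(x0)))
    for nfe in (2, 6, 10, 100)
}
for nfe, err in errors.items():
    print(f"NFE={nfe:3d}  endpoint error={err:.4f}")
```

On this toy problem the endpoint error shrinks as NFE grows, which is the trade-off the table above lets listeners hear. A field whose trajectories are closer to straight lines would need fewer steps for the same error, which is the motivation usually given for flow-matching variants of this kind.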

Demo 3: CCFM Segment/Initialization Training/CCFM 2nd stage

Demo 3 showcases the performance of StableTTS under different segment initialization and consistency configurations, specifically with 6-segment, 4-segment, and 2-segment setups for both initialization and consistency phases. This demo allows users to listen to audio samples generated under each segment configuration, revealing the impact of segmental and consistency conditions on TTS generation quality. The study suggests that segment configuration significantly affects audio stability and clarity, with added consistency constraints further optimizing the generation results.

StableTTS-6Seg Init | StableTTS-6Seg Consist | StableTTS-4Seg Init | StableTTS-4Seg Consist
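
As a rough illustration of what a segment configuration could mean, the sketch below splits the time interval [0, 1] into equal segments and extrapolates a point to the end of its own segment along the current velocity. This is an assumption about how multi-segment consistency training is commonly set up, not the StableTTS implementation; `segment_bounds`, `segment_end`, and `segment_endpoint_estimate` are hypothetical helpers.

```python
import numpy as np

def segment_bounds(num_segments):
    """Boundaries of `num_segments` equal time segments on [0, 1]."""
    return np.linspace(0.0, 1.0, num_segments + 1)

def segment_end(t, num_segments):
    """End time of the segment containing t (t = 1.0 falls in the last segment)."""
    idx = min(int(t * num_segments), num_segments - 1)
    return (idx + 1) / num_segments

def segment_endpoint_estimate(x_t, t, velocity, num_segments):
    """Straight-line extrapolation of x_t to the end of its segment.

    A consistency-style objective would ask estimates taken at two nearby
    times in the same segment to agree (assumed setup, for illustration).
    """
    x_t = np.asarray(x_t, dtype=float)
    return x_t + (segment_end(t, num_segments) - t) * velocity(x_t, t)

# Usage with a constant toy velocity field.
const_velocity = lambda x, t: np.full_like(x, 2.0)
for k in (6, 4, 2):
    print(k, "segments, bounds:", segment_bounds(k))
print(segment_endpoint_estimate(np.array([1.0]), 0.25, const_velocity, 2))
```

With more segments, each extrapolation covers a shorter time span, so the straight-line assumption is milder but more segments must be traversed at inference.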