Abstract: Current state-of-the-art text-to-speech (TTS) systems predominantly pair language-model or non-autoregressive front-ends with denoising-based acoustic decoders, which are known for generating high-fidelity spectra. In this study, we introduce StableTTS, an efficient TTS system that incorporates Consistency Flow Matching (CFM) denoising training. This approach improves the training efficiency and operational performance of denoising-based acoustic decoders without additional training cost.
Description: This demo provides an overview of StableTTS, a research project that uses Consistency Flow Matching (CFM) to improve generation efficiency and quality in TTS synthesis. By comparing several methods and architectures, the demo showcases the gains StableTTS brings in TTS quality and performance.
Demo 1: Method Comparison
Demo 1 compares different methods at the default NFE (number of function evaluations, i.e. inference sampling steps): GradTTS, MatchaTTS, ReflowTTS, and StableTTS, alongside the Ground Truth (GT) recording and the reference prompt used for zero-shot cloning (Ref). The purpose of this demo is to evaluate the generation quality and stability of each model under the same conditions. StableTTS introduces a denoising acoustic decoder that leverages Consistency Flow Matching to improve generation efficiency and quality. Users can listen to the audio samples generated by each method to compare their performance.
NFE: default
GT
Ref
GradTTS+
MatchaTTS+
VoiceFlow/Reflow+
StableTTS/ConsistFM (ours)
NFE: 10
GT
Ref
GradTTS+
MatchaTTS+
StableTTS/ConsistFM (ours)
NFE: 6
GT
Ref
MatchaTTS+
StableTTS/ConsistFM (ours)
Demo 2: Architecture Comparison
Demo 2 compares different architectures (e.g., MatchaTTS and StableTTS) across a range of NFE values (2, 4, 6, 10, 15, 20, 25, 30, 35, 40, 50, and 100). This comparison shows how generation quality changes as the number of sampling steps is reduced. StableTTS, with Consistency Flow Matching, shows an enhanced ability to generate high-quality audio at lower NFE, thus improving generation efficiency. In this demo, users can explore the audio samples at each NFE value to observe the performance of each architecture.
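To make the role of NFE concrete, the sketch below integrates a toy ODE with a fixed-step Euler solver, where each step costs one evaluation of the velocity function. It is only an illustration under stated assumptions: `toy_velocity` is a hypothetical stand-in for a trained decoder network, and `euler_sample` is not taken from the StableTTS code. It shows the general trade-off that Demo 2 probes: fewer evaluations mean a cheaper but coarser approximation of the sampling trajectory.

```python
import numpy as np


def euler_sample(velocity, x0, nfe):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 using `nfe` Euler steps."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / nfe
    for i in range(nfe):
        t = i * dt
        x = x + dt * velocity(x, t)  # exactly one function evaluation per step
    return x


def toy_velocity(x, t):
    # Hypothetical curved field with a known solution x(1) = x(0) * exp(-2).
    # A real system would call its trained acoustic decoder here.
    return -2.0 * x


x0 = np.ones(4)
exact = x0 * np.exp(-2.0)

# Maximum absolute discretization error for a few of the NFE values in Demo 2.
errors = {
    n: float(np.abs(euler_sample(toy_velocity, x0, n) - exact).max())
    for n in (2, 6, 10, 100)
}
for n, err in errors.items():
    print(f"NFE={n:3d}  max abs error={err:.4f}")
```

On this toy problem the error shrinks as NFE grows. A decoder trained to follow straighter trajectories would lose less accuracy at low NFE, which is the behaviour the audio samples let listeners judge for themselves.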
Demo 3: Segment Configuration Comparison
Demo 3 showcases the performance of StableTTS under different segment configurations, using 6-segment, 4-segment, and 2-segment setups for both the initialization and consistency phases. Users can listen to audio samples generated under each configuration to hear how the segment and consistency settings affect TTS generation quality. The study suggests that segment configuration significantly affects audio stability and clarity, and that the added consistency constraints further improve the generated results.
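As a rough illustration of what a K-segment setup means, the sketch below splits the sampling time axis [0, 1] into equal-width segments and extrapolates a point linearly to the end of its current segment. Both choices are assumptions: this page does not say whether segments are equal-width, and the endpoint rule follows the general multi-segment consistency flow matching formulation rather than confirmed StableTTS code. The names `segment_bounds` and `segment_endpoint` are hypothetical.

```python
def segment_bounds(t, num_segments):
    """Return (start, end) of the equal-width segment of [0, 1] that contains t."""
    # Clamp so that t = 1.0 falls into the last segment instead of a new one.
    idx = min(int(t * num_segments), num_segments - 1)
    return idx / num_segments, (idx + 1) / num_segments


def segment_endpoint(x, t, velocity, num_segments):
    """Extrapolate x along velocity(x, t) to the end of its segment (assumed rule)."""
    _, t_end = segment_bounds(t, num_segments)
    return x + (t_end - t) * velocity(x, t)


# Segment containing t = 0.3 under the three configurations from Demo 3.
for k in (6, 4, 2):
    print(k, segment_bounds(0.3, k))
```

Under these assumptions, a consistency constraint would ask points on the same trajectory within one segment to map to the same segment endpoint. More segments allow a more piecewise trajectory, while fewer segments push toward a single straight path.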