About me

I am a PhD student advised by Professor Shugong Xu, an IEEE Fellow (https://www.researchgate.net/profile/Shugong-Xu-2). My research is rooted in speech technology, spanning speaker recognition, keyword spotting, speech synthesis, zero-shot voice cloning, federated learning, transfer learning, and few-shot learning. My current research interests include audio foundation models, speech synthesis, and multi-modal learning theory, and I am particularly interested in the applications and development trends of machine learning in the speech and audio domain.

My Active Research Projects

StableTTS: Towards Fast Denoising Acoustic Decoder for Text to Speech Synthesis with Consistency Flow Matching

Current state-of-the-art text-to-speech (TTS) systems predominantly utilize denoising-based acoustic decoders with large language models (LLMs) or with non-autoregressive front-ends, known for their superior performance in generating high-fidelity spectrograms. In this study, we introduce an efficient TTS system that incorporates Consistency Flow Matching denoising training. This training approach significantly enhances the training efficiency and runtime performance of denoising-based acoustic decoders in existing TTS or voice conversion systems, at no additional cost in the training process, a free lunch. To compare efficiently with other denoising strategies, we align with the latest advancements in non-autoregressive TTS systems and build an efficient DiT-based TTS architecture. Our comprehensive evaluations against various denoising-based methods affirm the efficiency of the proposed system.

[Figure: StableTTS]

Research paper: StableTTS

Project Page
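For readers less familiar with flow-matching training, here is a minimal sketch of a conditional flow matching training step for a denoising acoustic decoder, written in PyTorch. It is illustrative only and not the StableTTS implementation: the decoder interface, the straight-line interpolation path, and the mel-spectrogram shapes are assumptions, and the additional consistency constraints that give Consistency Flow Matching its few-step efficiency are omitted.

```python
# Hedged sketch of a flow-matching training step for a denoising acoustic
# decoder. Hypothetical shapes and conditioning; not the StableTTS code.
import torch
import torch.nn.functional as F

def flow_matching_step(decoder, mel, text_cond):
    """One conditional flow matching training step.

    decoder(x_t, t, text_cond) is assumed to predict a velocity field with
    the same shape as the mel-spectrogram target mel: (B, n_mels, T).
    """
    noise = torch.randn_like(mel)                          # x_0 ~ N(0, I)
    t = torch.rand(mel.size(0), 1, 1, device=mel.device)   # per-sample time in [0, 1)
    x_t = (1.0 - t) * noise + t * mel                      # straight-line probability path
    target_velocity = mel - noise                          # d x_t / d t along this path
    pred_velocity = decoder(x_t, t.view(-1), text_cond)    # hypothetical decoder call
    return F.mse_loss(pred_velocity, target_velocity)
```

At inference time the learned velocity field is integrated from noise to a mel-spectrogram with an ODE solver; the consistency constraints aim to keep this integration accurate with very few solver steps.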

ZS-TTS/Voice Cloning and Synthesis: Optimizing Feature Fusion for Improved Zero-shot Adaptation in Text-to-Speech Synthesis

A primary challenge in voice cloning (VC) is maintaining speech quality and speaker similarity with limited reference data for a specific speaker. Existing VC systems often rely on naive combinations of embedded speaker vectors for speaker control, which compromises the capture of speaking style, voice print, and semantic accuracy. To overcome this, we introduce the Two-branch Speaker Control Module (TSCM), a novel and highly adaptable voice cloning module designed to precisely process speaker and style control for a target speaker. Our method uses an advanced fusion of local-level features from a Gated Convolutional Network (GCN) and utterance-level features from a Gated Recurrent Unit (GRU) to enhance speaker control. We demonstrate the effectiveness of TSCM by integrating it into advanced TTS systems such as the FastSpeech 2 and VITS architectures, significantly improving their performance. Experimental results show that TSCM enables accurate voice cloning for a target speaker with minimal data, through either zero-shot adaptation or few-shot fine-tuning of pre-trained TTS models. Furthermore, our TSCM-based VITS (TSCM-VITS) shows superior performance in zero-shot scenarios compared with existing state-of-the-art VC systems, even with basic dataset configurations. The method's superiority is validated through comprehensive subjective and objective evaluations.

[Figure: TSCM]

Research paper: TSCM-VITS

Project DEMO Website
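The two-branch idea can be illustrated with the short PyTorch sketch below: one branch captures local patterns with a gated convolution, the other summarizes the utterance with a GRU, and the two are fused into a single control embedding. Layer sizes, the mean-pooling of the local branch, and the concatenation-based fusion are all simplifying assumptions, not the published TSCM design.

```python
# Hedged sketch of a two-branch speaker/style control module:
# a gated convolution branch for local features plus a GRU branch for an
# utterance-level embedding. All dimensions are illustrative.
import torch
import torch.nn as nn

class TwoBranchSpeakerControl(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        # Gated convolution: doubled channels are split by GLU into value/gate.
        self.gated_conv = nn.Sequential(
            nn.Conv1d(n_mels, 2 * hidden, kernel_size=5, padding=2),
            nn.GLU(dim=1),
        )
        self.gru = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(2 * hidden, hidden)

    def forward(self, ref_mel):                             # ref_mel: (B, T, n_mels)
        local = self.gated_conv(ref_mel.transpose(1, 2))    # (B, hidden, T)
        local = local.mean(dim=2)                           # pool local features over time
        _, h = self.gru(ref_mel)                            # h: (1, B, hidden)
        utterance = h.squeeze(0)                            # (B, hidden)
        return self.proj(torch.cat([local, utterance], dim=-1))
```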

Emotional Style Control TTS: StyleFusion-TTS, Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

We introduce StyleFusion-TTS, a prompt- and/or audio-referenced, style- and speaker-controllable, zero-shot text-to-speech (TTS) synthesis system designed to improve the editability and naturalness achieved by current systems in the research literature. We propose a general front-end encoder as a compact and effective module that consumes multimodal inputs, including text prompts, audio references, and speaker timbre references, in a fully zero-shot manner and produces disentangled style and speaker control embeddings. Our approach also leverages a hierarchical conformer structure to fuse the style and speaker control embeddings, aiming to achieve optimal feature fusion within a current advanced TTS architecture. StyleFusion-TTS is evaluated with multiple subjective and objective metrics. The system shows promising performance across our evaluations, suggesting its potential to advance the field of zero-shot text-to-speech synthesis.

[Figure: StyleFusion-TTS]

Research paper: StyleFusion

Project DEMO Website
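As a rough, hedged illustration of hierarchical fusion, the sketch below injects disentangled style and speaker embeddings into every block of an encoder stack via FiLM-style scale-and-shift modulation. The block type (plain self-attention rather than a full conformer block), the depth, and the modulation rule are simplifications for brevity, not the published StyleFusion-TTS architecture.

```python
# Hedged sketch of hierarchical conditioning: style and speaker embeddings
# are re-injected at every level of an encoder stack. Illustrative only.
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    def __init__(self, dim=256, cond_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.film = nn.Linear(2 * cond_dim, 2 * dim)        # scale and shift from conditions
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, style_emb, speaker_emb):           # x: (B, T, dim)
        scale, shift = self.film(torch.cat([style_emb, speaker_emb], -1)).chunk(2, -1)
        h, _ = self.attn(x, x, x)
        h = self.norm(x + h) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return h + self.ff(h)

class HierarchicalFusionEncoder(nn.Module):
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList([ConditionedBlock(dim) for _ in range(depth)])

    def forward(self, x, style_emb, speaker_emb):
        for block in self.blocks:        # conditions are injected at every level
            x = block(x, style_emb, speaker_emb)
        return x
```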

Personalized User-Defined Keyword Spotting and Open-set Speaker Identification in Household Environments

We introduce Personalized User-Defined Keyword Spotting (PUKWS), a novel pipeline designed for household environments that integrates user-defined keyword spotting (KWS) with open-set speaker identification (SID) in a cascading dual sub-system structure. For KWS, we present multi-modal user-defined keyword spotting (M-UDKWS), a novel approach that leverages multi-modal prompts for text-audio enrollment and optimizes phonetic and semantic feature extraction to synergize the text and audio modalities. This innovation not only stabilizes detection by reducing mismatches between query audio and support text embeddings, but also excels at handling potentially confusable keywords. For open-set SID, we adopt advanced open-set learning techniques and propose speaker reciprocal points learning (SRPL), addressing the challenge of being aware of unknown speakers without compromising known-speaker identification. To boost the overall performance of the PUKWS pipeline, we employ a data augmentation strategy that combines hard negative mining, rule-based procedures, GPT, and zero-shot voice cloning, enhancing both the M-UDKWS and SRPL components. Exhaustive evaluations on various datasets and testing scenarios demonstrate the efficacy of our methods.

[Figure: PUKWS]

Research paper: PUKWS
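The cascading dual sub-system structure can be summarized by the decision logic sketched below: an utterance triggers the assistant only if the keyword detector fires and the open-set speaker identifier attributes it to a known, enrolled user. The score functions and thresholds are hypothetical placeholders, not the PUKWS implementation.

```python
# Hedged sketch of the cascading decision logic of a KWS + open-set SID
# pipeline. The two scoring callables and the thresholds are hypothetical.
from typing import Callable, Optional, Tuple

def cascade_decision(
    audio,
    kws_score: Callable[[object], float],            # keyword confidence in [0, 1]
    sid_identify: Callable[[object], Tuple[Optional[str], float]],
    kws_threshold: float = 0.5,
    sid_threshold: float = 0.5,
) -> Optional[str]:
    """Return the enrolled speaker id if the personalized keyword fires, else None."""
    if kws_score(audio) < kws_threshold:
        return None                                  # keyword not detected
    speaker_id, confidence = sid_identify(audio)
    if speaker_id is None or confidence < sid_threshold:
        return None                                  # unknown speaker rejected (open set)
    return speaker_id
```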

MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting

In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting leveraging multi-modal enrollments of text and speech templates. Unlike previous methods that focus solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities. These embeddings are then compared with the query speech embedding to detect the target keywords. To ensure the applicability of MM-KWS across diverse languages, we utilize a feature extractor incorporating several multilingual pre-trained models, and we validate its effectiveness on Mandarin and English tasks. In addition, we integrate advanced data augmentation tools for hard case mining to enhance MM-KWS in distinguishing confusable words. Experimental results on the LibriPhrase and WenetPhrase datasets demonstrate that MM-KWS significantly outperforms prior methods.

[Figure: MM-KWS]

Research paper: MM-KWS

Project Website and Code Repo
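A hedged sketch of the matching step: enrollment embeddings obtained from the text and speech templates are compared with the query speech embedding by cosine similarity. Averaging the per-modality scores is a simplification for illustration; the actual model learns how to compare the embeddings.

```python
# Hedged sketch of multi-modal keyword matching. The per-modality embeddings
# and the averaging rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def keyword_score(query_emb: torch.Tensor, enroll_embs: dict) -> float:
    """query_emb: (D,); enroll_embs: modality name -> (D,) enrollment embedding."""
    scores = [
        F.cosine_similarity(query_emb, emb, dim=0) for emb in enroll_embs.values()
    ]
    return torch.stack(scores).mean().item()

# Hypothetical usage with phoneme/text/speech enrollment embeddings p, t, s:
#   score = keyword_score(query_emb, {"phoneme": p, "text": t, "speech": s})
#   detected = score > threshold
```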

SRPL: Open-set Speaker Identification with Reciprocal Points

Speaker Reciprocal Points Learning (SRPL) targets open-set speaker identification in household environments: the system must be aware of unknown speakers without compromising identification of known, enrolled speakers. Building on advanced open-set learning techniques, SRPL learns reciprocal points that represent the open, unknown-speaker space, and its training is enhanced through rapid tuning with negative samples produced by a data augmentation strategy that includes hard negative mining, rule-based procedures, GPT, and zero-shot voice cloning. SRPL serves as the open-set SID sub-system of the PUKWS pipeline described above.

[Figure: SRPL]

Research paper: Open-set Speaker Recognition

Project Website and Code Repo
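The reciprocal-points idea can be sketched as follows: each known speaker has a learnable reciprocal point representing everything that is not that speaker, logits grow with the distance to that point, and a margin term bounds the region occupied by known speakers so that unknowns can be rejected. The distance measure, regularizer, and hyper-parameters below are illustrative assumptions, not the exact SRPL formulation.

```python
# Hedged sketch of reciprocal-points-style training for open-set speaker ID.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReciprocalPointsHead(nn.Module):
    def __init__(self, emb_dim=192, num_speakers=10, gamma=1.0, lam=0.1):
        super().__init__()
        self.points = nn.Parameter(torch.randn(num_speakers, emb_dim))
        self.radius = nn.Parameter(torch.zeros(1))      # learnable margin R
        self.gamma, self.lam = gamma, lam

    def forward(self, emb, labels=None):                # emb: (B, emb_dim)
        dist = torch.cdist(emb, self.points)            # (B, num_speakers)
        logits = self.gamma * dist                      # far from point k => speaker k
        if labels is None:
            return logits
        ce = F.cross_entropy(logits, labels)
        d_true = dist.gather(1, labels.unsqueeze(1)).squeeze(1)
        open_reg = (d_true - self.radius).pow(2).mean() # bound the known-speaker region
        return ce + self.lam * open_reg

# At test time, a low maximum logit (the embedding lies close to every
# reciprocal point) can be used to reject the utterance as an unknown speaker.
```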

Learning Domain-Heterogeneous Speaker Recognition Systems with Personalized Continual Federated Learning

Speaker recognition, the process of automatically identifying a speaker based on individual characteristics in speech signals, presents significant challenges when addressing heterogeneous-domain conditions. Federated learning, a recent development in machine learning methods, has gained traction in privacy-sensitive tasks, such as personal voice assistants in home environments. However, its application in heterogeneous multi-domain scenarios for enhancing system customization remains underexplored. In this paper, we propose the utilization of federated learning in heterogeneous situations to enable adaptation across multiple domains. We also introduce a personalized federated learning algorithm designed to effectively leverage limited domain data, resulting in improved learning outcomes. Furthermore, we present a strategy for implementing the federated learning algorithm in practical, real-world continual learning scenarios, demonstrating promising results. The proposed federated learning method exhibits superior performance across a range of synthesized complex conditions and continual learning settings, compared to conventional training methods.

[Figure: FedSpeaker]

Research paper: FedSpeaker

Project Website and Code Repo
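A minimal sketch of personalized federated averaging appears below, assuming a naming convention in which personalization layers carry a "personal." prefix and are kept local while the shared backbone is averaged across clients. The unweighted average and the prefix rule are illustrative assumptions, not the paper's algorithm.

```python
# Hedged sketch of personalized federated averaging for speaker models.
import copy
import torch

def personalized_fedavg(client_states, personal_prefix="personal."):
    """client_states: list of model state_dicts from the participating clients."""
    global_state = copy.deepcopy(client_states[0])
    for name in global_state:
        if name.startswith(personal_prefix):
            continue                                    # personalized layer: not aggregated
        stacked = torch.stack([state[name].float() for state in client_states])
        global_state[name] = stacked.mean(dim=0)        # share the averaged backbone
    return global_state

# Each client then reloads the shared part (e.g. load_state_dict(..., strict=False)),
# keeps its own personalized layers, and continues local training on its domain data.
```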

Research on Domain Robust Speaker Recognition

Speaker recognition technology has advanced significantly, achieving high accuracy in controlled settings. However, in real-world applications, systems often face challenges due to domain variability—differences in recording environments, channels, languages, and demographic characteristics. Domain robust speaker recognition focuses on developing models that can maintain performance across these diverse conditions.

Research paper: DA-Spk

Research paper: AM-Spk

Research paper: Triplet-Spk

Project Website and Code Repo

Publications

Learning domain-heterogeneous speaker recognition systems with personalized continual federated learning, EURASIP Journal on Audio, Speech, and Music Processing. Zhiyong Chen, Shugong Xu

Optimizing Feature Fusion for Improved Zero-shot Adaptation in Text-to-Speech Synthesis, EURASIP Journal on Audio, Speech, and Music Processing. Zhiyong Chen, Zhiqi Ai, Xinnuo Li, Shugong Xu

Enhancing Open-Set Speaker Identification through Rapid Tuning with Speaker Reciprocal Points and Negative Samples, IEEE SLT 2024. Zhiyong Chen, Zhiqi Ai, Xinnuo Li, Shugong Xu

StableTTS: Towards Fast Denoising Acoustic Decoder for Text to Speech Synthesis with Consistency Flow Matching, IEEE ICASSP 2025. Zhiyong Chen, Xinnuo Li, Shugong Xu (In Peer Review)

Personalized User-Defined Keyword Spotting in Household Environments: A Text-Audio Multi-Modality Approach, Speech Communication. Zhiqi Ai, Zhiyong Chen, Xinnuo Li, Shugong Xu (In Peer Review)

StyleFusion-TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis, PRCV 2024. Zhiyong Chen, Zhiqi Ai, Xinnuo Li, Shugong Xu

MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting, Interspeech 2024. Zhiqi Ai, Zhiyong Chen, Xinnuo Li, Shugong Xu

Supervised Imbalanced Multi-domain Adaptation for Text-independent Speaker Verification, 2020 9th International Conference on Computing and Pattern Recognition. Zhiyong Chen, Zongze Ren, Shugong Xu

ERDBF: Embedding-Regularized Double Branches Fusion for Multi-Modal Age Estimation, IEEE Access. Bo Wu, Hengjie Lu, Zhiyong Chen, Shugong Xu

Triplet Based Embedding Distance and Similarity Learning for Text-independent Speaker Verification, 2019 IEEE APSIPA ASC. Zongze Ren, Zhiyong Chen, Shugong Xu

A Study on Angular Based Embedding Learning for Text-independent Speaker Verification, 2019 IEEE APSIPA ASC. Zhiyong Chen, Zongze Ren, Shugong Xu

IFR: Iterative Fusion Based Recognizer for Low Quality Scene Text Recognition, PRCV 2021. Zhiwei Jia, Shugong Xu, Shiyi Mu, Zhiyong Chen

Configurable CNN Accelerator in Speech Processing based on Vector Convolution, 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS). Lanqing Hui, Shan Cao, Zhiyong Chen, Shugong Xu

For more info

For further information about my research and collaborations, feel free to reach out via email at zhiyongchen@shu.edu.cn.