About me
Ziqian Ning is a research scientist at ByteDance Seed, Shanghai, China, working on music generation. He received his Master’s degree from the Audio, Speech and Language Processing Laboratory at Northwestern Polytechnical University (ASLP@NWPU), supervised by Prof. Lei Xie.
His research focuses on generative AI for music and speech, including full-length song generation, singing voice generation/conversion, low-latency voice conversion, and text-to-speech. His recent work includes the DiffRhythm series for efficient full-length song generation and the DualVC series for streaming voice conversion.
Experience
- 2024.10 - Present, Seed, ByteDance, Shanghai, China.
- 2024.03 - 2024.09, Azure Speech, Microsoft, China.
- 2022.06 - 2024.03, Fuxi AI Lab, NetEase, China.
- 2021.07 - 2021.09, TEG, Tencent, China.
Publications
Music and Singing Voice Generation
- DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching, Yuepeng Jiang, Huakang Chen, Ziqian Ning, Jixun Yao, Zerui Han, Di Wu, Meng Meng, Jian Luan, Zhonghua Fu, Lei Xie. arXiv preprint, 2025.
- DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization, Hui Chen, Yu-rou Jiang, Guobin Ma, Chuan-Ming Hao, Shuai Wang, Jixun Yao, Ziqian Ning, Meng Meng, Jian Luan, Zhonghua Fu, Lei Xie. arXiv preprint, 2025.
- DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion, Ziqian Ning, Huakang Chen, Yuepeng Jiang, Chunbo Hao, Guobin Ma, Shuai Wang, Jixun Yao, Lei Xie. arXiv preprint, 2025. Project
- Drop the beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation, Ziqian Ning, Shuai Wang, Yuepeng Jiang, Jixun Yao, Lei He, Shifeng Pan, Jie Ding, Lei Xie. AAAI, 2025. Demo
- VITS-Based Singing Voice Conversion Leveraging Whisper and multi-scale F0 Modeling, Ziqian Ning, Yuepeng Jiang, Zhichao Wang, Bin Zhang, Lei Xie. ASRU, 2023. Demo
Voice Conversion (VC)
- MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows, Guobin Ma, Jixun Yao, Ziqian Ning, Yuepeng Jiang, Lingxin Xiong, Lei Xie, Pengcheng Zhu. arXiv preprint, 2025. Demo
- REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers, Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu, Zhonghua Fu, Lei Xie. arXiv preprint, 2025.
- StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching, Jixun Yao, Yuguang Yang, Yu Pan, Ziqian Ning, Jianhao Ye, Hongbin Zhou, Lei Xie. AAAI, 2025.
- PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts, Jixun Yao, Yuguang Yang, Yi Lei, Ziqian Ning, Yanni Hu, Yu Pan, Jingjing Yin, Hongbin Zhou, Heng Lu, Lei Xie. ICASSP, 2024. Demo
- Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features, Ziqian Ning, Qicong Xie, Pengcheng Zhu, Zhichao Wang, Liumeng Xue, Jixun Yao, Lei Xie, Mengxiao Bi. ICASSP, 2023. Demo
- Preserving background sound in noise-robust voice conversion via multi-task learning, Jixun Yao, Yi Lei, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie, Hai Li, Junhui Liu, Danming Xie. ICASSP, 2023. Demo
Streaming Voice Conversion
- SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion, Zhao Guo, Ziqian Ning, Guangsheng Ma, Lei Xie. arXiv preprint, 2025.
- DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion, Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi. INTERSPEECH, 2024. Demo
- DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion, Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu, Shuai Wang, Jixun Yao, Lei Xie, Mengxiao Bi. ICASSP, 2024. Demo
- DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding, Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu, Jixun Yao, Shuai Wang, Lei Xie, Mengxiao Bi. INTERSPEECH, 2023. Demo
Speaker Anonymization
- NPU-NTU System for Voice Privacy 2024 Challenge, Jixun Yao, Nikita Kuzmin, Qing Wang, Pengcheng Guo, Ziqian Ning, Dake Guo, Kong Aik Lee, Eng-Siong Chng, Lei Xie. INTERSPEECH, 2025.
- Distinctive and Natural Speaker Anonymization via Singular Value Transformation-assisted Matrix, Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie. TASLP.
- MUSA: Multi-lingual Speaker Anonymization via Serial Disentanglement, Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Yuguang Yang, Yu Pan, Lei Xie. TASLP, 2025.
Text to Speech
- Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech, Jixun Yao, Yuguang Yang, Yu Pan, Yuan Feng, Ziqian Ning, Jianhao Ye, Hongbin Zhou, Lei Xie. arXiv preprint, 2025.
- Accent-VITS: accent transfer for end-to-end TTS, Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, Lei Xie. NCMMSC, 2023. Demo
Project Experience
- DiffRhythm Series
  - Led the development of efficient end-to-end full-length song generation systems that synthesize vocals and accompaniment from lyrics and style prompts, spanning latent diffusion, block flow matching, controllable style conditioning, and preference optimization.
  - Released DiffRhythm as an open-source full-length song generation model. It reached No. 1 on the Hugging Face Spaces trending chart shortly after release, and the GitHub repository has received 2K+ stars. Project
  - Extended the original system into DiffRhythm+ and DiffRhythm 2, improving controllability, preference alignment, lyric alignment, generation fidelity, and long-form coherence.
- Singing Voice Conversion Challenge 2023
  - Proposed a VITS-based singing voice conversion model that uses Whisper bottleneck features as linguistic information and a PBTC module to extract multi-scale F0, which better captures pitch variation. In the official challenge evaluation, the system achieved human-level naturalness and ranked first in Task 1 and second in Task 2. Demo
- Online text-to-speech synthesis system
  - Developed a text-to-speech system that provides high availability and scalability for online services. Each model is encapsulated in its own microservice, managed by Kubernetes. Kafka handles inter-model messaging, and the message queue allows a large number of microservice replicas to run in parallel.
Patents
- CN115910083A Real-time voice conversion method, device, electronic equipment and medium.
- CN116013336A Voice conversion method, device, electronic equipment and storage medium.
- CN116364099A Voice conversion method, device, electronic apparatus, storage medium, and program product.
- CN118136033A Method, device, electronic equipment and storage medium for converting drama voice.