About me

Ziqian Ning is a research scientist at ByteDance Seed, Shanghai, China, working on music generation. He received his Master’s degree from the Audio, Speech and Language Processing Laboratory at Northwestern Polytechnical University (ASLP@NWPU), supervised by Prof. Lei Xie.

His research focuses on generative AI for music and speech, including full-length song generation, singing voice generation/conversion, low-latency voice conversion, and text-to-speech. His recent work includes the DiffRhythm series for efficient full-length song generation and the DualVC series for streaming voice conversion.

Experience

  • 2024.10 - Present, Seed, ByteDance, Shanghai, China.
  • 2024.03 - 2024.09, Azure Speech, Microsoft, China.
  • 2022.06 - 2024.03, Fuxi AI Lab, NetEase, China.
  • 2021.07 - 2021.09, TEG, Tencent, China.

Publications

Music and Singing Voice Generation

Voice Conversion (VC)

Streaming Voice Conversion

Speaker Anonymization

Text to Speech

Project Experience

  • DiffRhythm Series
    • Lead the development of efficient end-to-end full-length song generation systems that synthesize vocals and accompaniment from lyrics and style prompts, spanning latent diffusion, block flow matching, controllable style conditioning, and preference optimization.
    • Released DiffRhythm as an open-source full-length song generation model. It quickly reached No. 1 on the Hugging Face Space trending chart after release, and the GitHub repository has received 2K+ stars. Project
    • Extend the original system into DiffRhythm+ and DiffRhythm 2, improving controllability, preference alignment, lyric alignment, generation fidelity, and long-form coherence.
  • Singing Voice Conversion Challenge 2023
    • Propose a VITS-based singing voice conversion model that uses Whisper bottleneck features as linguistic information and a PBTC module to extract multi-scale F0 features, better capturing pitch variation. The official challenge evaluation shows that our system achieves human-level naturalness, ranking first and second in Task 1 and Task 2, respectively. Demo
  • Online Text-to-Speech Synthesis System
    • Develop a text-to-speech system that provides high availability and scalability for online services. Models are encapsulated in separate microservices managed by Kubernetes. Kafka handles inter-model messaging, and the message queue makes it possible to run a large number of microservice replicas in parallel.

Patents

  • CN115910083A Real-time voice conversion method, device, electronic equipment and medium.
  • CN116013336A Voice conversion method, device, electronic equipment and storage medium.
  • CN116364099A Voice conversion method, device, electronic apparatus, storage medium, and program product.
  • CN118136033A Method, device, electronic equipment and storage medium for converting drama voice.