VITS-Based Singing Voice Conversion Leveraging Whisper and multi-scale F0 Modeling
Ziqian Ning1, Yuepeng Jiang1, Zhichao Wang1, Bin Zhang2, Lei Xie1
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2Lyra Lab, Tencent Music Entertainment, Shenzhen, China
1. Abstract
This paper presents the T23 team's system for the Singing Voice Conversion Challenge 2023. Our singing voice conversion model is built on VITS and integrates a prior encoder, a posterior encoder, a decoder, and a parallel bank of transposed convolutions (PBTC) module. We use the Whisper ASR model to extract bottleneck features (BNF) as input to the prior encoder. We apply pitch perturbation before BNF extraction to suppress speaker timbre and prevent the source speaker's timbre from leaking into the converted singing. The PBTC module extracts multi-scale F0, improving the modeling of pitch variation in singing. A three-stage training strategy adapts the base model to the target speaker with limited data. Official challenge results show that our system ranks 1st in Task 1 and 2nd in Task 2 and exhibits superior naturalness. An ablation study confirms the effectiveness of our system design.
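To make the pitch perturbation step concrete, the following is a minimal sketch of how a waveform could be shifted by a random number of semitones before feature extraction. It is an illustration under stated assumptions, not the T23 system's implementation: the function names, the +/-3 semitone range, and the naive resampling method (which also changes duration) are all hypothetical choices, and the abstract does not specify which perturbation algorithm is actually used.

```python
import numpy as np


def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` (12 semitones = 1 octave)."""
    return 2.0 ** (semitones / 12.0)


def naive_pitch_shift(wave: np.ndarray, semitones: float) -> np.ndarray:
    """Shift pitch by resampling with linear interpolation.

    Assumption: this naive method also scales the duration by 1/ratio.
    A real system would more likely use a duration-preserving method.
    """
    ratio = semitone_ratio(semitones)
    n_out = int(round(len(wave) / ratio))
    # Read the input at positions spaced `ratio` samples apart.
    src_pos = np.arange(n_out) * ratio
    return np.interp(src_pos, np.arange(len(wave)), wave)


def dominant_freq(wave: np.ndarray, sr: int) -> float:
    """Frequency (Hz) of the strongest FFT bin."""
    spectrum = np.abs(np.fft.rfft(wave))
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
    return float(freqs[np.argmax(spectrum)])


sr = 16000                              # hypothetical sample rate
t = np.arange(sr) / sr                  # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)    # 440 Hz test tone

# Random perturbation in a hypothetical +/-3 semitone range.
rng = np.random.default_rng(0)
random_shift = float(rng.uniform(-3.0, 3.0))
perturbed = naive_pitch_shift(tone, random_shift)

# Deterministic check: +12 semitones doubles the frequency.
octave_up = naive_pitch_shift(tone, 12.0)
```

In this sketch the perturbed waveform, rather than the original, would be passed to the feature extractor, so that pitch-related speaker cues in the extracted features vary randomly across training examples.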
2. Demos -- Singing Voice Conversion
The challenge organizers provide two target singers (IDF1 and IDM1) for any-to-one, in-domain singing voice conversion (Task 1), and two target speakers (CDF1 and CDM1) for any-to-one, cross-domain singing voice conversion (Task 2). In Task 2, only speech data is provided.