DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion
1. Abstract
Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180 ms. Nonetheless, its recognition-synthesis framework hinders end-to-end optimization, and the instability of the automatic speech recognition (ASR) model on short chunks makes it challenging to further reduce latency. To address these issues, we propose an end-to-end model, DualVC 3. With speaker-independent semantic tokens guiding the training of the content encoder, the dependency on ASR is removed, cascading errors are eliminated, and the model can operate on extremely small chunks. A language model is trained on the content encoder output to produce pseudo context by iteratively predicting future frames, providing more contextual information for the decoder and improving conversion quality. Experimental results demonstrate that DualVC 3 achieves performance comparable to DualVC 2 on subjective and objective metrics, with a latency of only 50 ms.
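The pseudo-context mechanism described above can be sketched as a simple autoregressive loop: the language model repeatedly predicts the next content frame and feeds its own prediction back in, extending the chunk with future context the decoder would otherwise lack. The sketch below is illustrative only; `lm_predict`, `generate_pseudo_context`, and the frame representation are hypothetical stand-ins, not the paper's actual LM.

```python
# Minimal sketch of pseudo-context generation. `lm_predict` is a
# hypothetical autoregressive predictor mapping a frame history to the
# next frame; the paper's LM architecture is not specified on this page.

def generate_pseudo_context(frames, lm_predict, n_future=2):
    """Iteratively predict `n_future` frames beyond the current chunk.

    frames: content-encoder output frames for the current chunk.
    Returns the chunk extended with predicted pseudo-context frames.
    """
    extended = list(frames)
    for _ in range(n_future):
        # Feed previous predictions back in (iterative prediction).
        extended.append(lm_predict(extended))
    return extended

# Toy usage: a "predictor" that linearly extrapolates the last frame.
toy_lm = lambda hist: hist[-1] + (hist[-1] - hist[-2])
print(generate_pseudo_context([1.0, 2.0, 3.0], toy_lm, n_future=2))
# -> [1.0, 2.0, 3.0, 4.0, 5.0]
```

The table below uses "w/ 2 pseudo ctx", matching `n_future=2` here: two predicted frames are appended per chunk before decoding.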
2. Computational Metrics
| | RTF | Latency (ms) | Params (M) |
|---|---|---|---|
| Full mode | 0.797 | 15.94 + 20 + 20 = 55.94 | 22.7 |
| AM (w/ 2 pseudo ctx) | 0.201 | 4.02 | 10.9 |
| Vocoder (w/ 2 pseudo ctx) | 0.086 | 1.72 | 1.2 |
| LM | 0.510 | 10.20 | 10.6 |
| Stand-alone mode | 0.181 | 3.58 + 20 + 20 = 43.58 | 12.1 |
| AM (w/o pseudo ctx) | 0.134 | 2.68 | 10.9 |
| Vocoder (w/o pseudo ctx) | 0.047 | 0.90 | 1.2 |
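The latency column decomposes each mode's total as compute time plus two fixed 20 ms terms; the table does not label these, so the arithmetic check below assumes they correspond to per-chunk buffering terms (e.g. chunk wait and lookahead), which is an interpretation, not a stated fact.

```python
# Arithmetic check of the table's latency decomposition.
# The two 20 ms terms are assumed fixed buffering delays (chunk wait and
# lookahead); the page itself only gives the sum "compute + 20 + 20".

def total_latency(compute_ms, chunk_ms=20.0, lookahead_ms=20.0):
    """End-to-end latency: model compute time plus fixed delays."""
    return compute_ms + chunk_ms + lookahead_ms

print(total_latency(15.94))  # full mode: matches the table's 55.94
print(total_latency(3.58))   # stand-alone mode: matches 43.58
```

Note that within each mode the component compute times sum to the mode's total (full: 4.02 + 1.72 + 10.20 = 15.94 ms; stand-alone: 2.68 + 0.90 = 3.58 ms).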
3. Demo
- DualVC2: Streaming mode of DualVC 2 [1].
- VQMIVC: Non-streaming mode of VQMIVC [2].
- VQMIVC-streaming: Streaming mode of VQMIVC.
- DualVC3-full: Full mode of DualVC 3.
- DualVC3-standalone: Stand-alone mode of DualVC 3.