Drop the beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation

Accepted by AAAI 2025

Ziqian Ning¹,², Shuai Wang³, Yuepeng Jiang¹, Jixun Yao¹, Lei He², Shifeng Pan¹, Jie Ding¹, Lei Xie¹
¹Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
²Microsoft, China
³Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China

1. Abstract

	Rap, a prominent genre of vocal performance, remains underexplored in vocal generation. General vocal synthesis depends on precise note and duration inputs, requiring users to have related musical knowledge, which limits flexibility. In contrast, rap typically features simpler melodies, with a core focus on a strong rhythmic sense that harmonizes with accompanying beats.
	In this paper, we propose Freestyler, the first system that generates rapping vocals directly from lyrics and accompaniment inputs. Freestyler uses language model-based token generation, followed by a conditional flow matching model that produces spectrograms and a neural vocoder that reconstructs the audio. A 3-second prompt is enough to enable zero-shot timbre control.
	Due to the scarcity of publicly available rap datasets, we also present RapBank, a rap song dataset collected from the internet, alongside a meticulously designed processing pipeline. Experimental results show that Freestyler generates high-quality rapping voices with enhanced naturalness and strong alignment with the accompanying beats, both stylistically and rhythmically.
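The three stages named in the abstract (token generation, spectrogram generation, vocoding) can be read as a simple function composition. The sketch below only illustrates that data flow and is not the authors' implementation: the class `RapPipeline`, its field names, and the toy stand-in functions are all hypothetical, and it assumes the 3-second prompt has already been turned into a speaker embedding that is passed to the first two stages.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Tokens = List[int]        # semantic token ids
Mel = List[List[float]]   # one list of mel bins per frame
Audio = List[float]       # waveform samples


@dataclass
class RapPipeline:
    """Hypothetical wrapper that chains the three stages in order."""
    lyrics_to_semantic: Callable[[str, Sequence[float], Sequence[float]], Tokens]
    semantic_to_mel: Callable[[Tokens, Sequence[float]], Mel]
    vocoder: Callable[[Mel], Audio]

    def generate(self, lyrics: str, accompaniment: Sequence[float],
                 speaker_embedding: Sequence[float]) -> Audio:
        # Stage 1: language-model-style token generation from lyrics + accompaniment.
        tokens = self.lyrics_to_semantic(lyrics, accompaniment, speaker_embedding)
        # Stage 2: spectrogram generation (conditional flow matching in the paper).
        mel = self.semantic_to_mel(tokens, speaker_embedding)
        # Stage 3: neural vocoder turns the spectrogram into audio.
        return self.vocoder(mel)


# Toy stand-ins so the sketch runs end to end; they model nothing real.
def toy_lm(lyrics, accompaniment, spk):
    return [len(word) for word in lyrics.split()]


def toy_cfm(tokens, spk):
    return [[float(t)] * 2 for t in tokens]


def toy_vocoder(mel):
    return [sum(frame) for frame in mel]


pipeline = RapPipeline(toy_lm, toy_cfm, toy_vocoder)
audio = pipeline.generate("drop the beat", [0.0] * 4, [0.1, 0.2])
```

Keeping each stage behind a plain callable makes it easy to swap a real model in for any of the toy functions without touching the other two.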

Figure 1: The overall pipeline of Freestyler. With lyrics and accompaniment as conditions, it generates a rapping voice that matches the style and rhythm of the accompaniment.

Figure 2: Overview of Freestyler. The lyrics-to-semantic model in (a) predicts semantic tokens based on lyrics and accompaniment. The accompaniment feature is shifted left by $K$ frames to provide additional rhythmic context. The semantic-to-spectrogram model in (b) generates mel-spectrograms from the semantic tokens, which are interpolated to align with the spectrogram's frame rate. Speaker embedding is provided to both models to control the timbre.
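Two operations in the Figure 2 caption are easy to picture in code: shifting the accompaniment feature left by $K$ frames, and stretching the semantic tokens to the spectrogram's frame rate. The following is a minimal sketch under stated assumptions: the function names `shift_left` and `upsample_tokens` are hypothetical, frames are shown as scalars rather than feature vectors, the tail is padded with zeros, and nearest-neighbour repetition stands in for whatever interpolation the authors actually use.

```python
from typing import List, Sequence


def shift_left(frames: Sequence[float], k: int, pad: float = 0.0) -> List[float]:
    """Shift a frame sequence left by k frames, padding the tail with `pad`.

    After the shift, position t holds the original frame t + k, so a model
    reading position t sees accompaniment from k frames ahead.
    The zero padding is an assumption made for this sketch.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    n = len(frames)
    shifted = list(frames[k:])
    shifted.extend([pad] * (n - len(shifted)))
    return shifted


def upsample_tokens(tokens: Sequence[int], target_len: int) -> List[int]:
    """Stretch token ids to target_len frames by nearest-neighbour repetition.

    Assumed interpolation scheme; each output frame i copies the token at
    index floor(i * len(tokens) / target_len).
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    n = len(tokens)
    return [tokens[min(n - 1, i * n // target_len)] for i in range(target_len)]


shifted = shift_left([1, 2, 3, 4, 5], k=2)
stretched = upsample_tokens([7, 8, 9], target_len=6)
```

In this toy run, `shifted` keeps the original length while each position looks two frames ahead, and `stretched` repeats every token twice to match a frame rate that is double the token rate.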

2. Demo

2.1. Zero-shot Rapping Voice Generation

Reference	Lyrics	Freestyler
	First off, I want to thank the pioneers for making this possible. Fake, humbleness, snake, dummy shit is intolerable. Living by principle. Staring out over women in swimming suits by the pool. I'm a slide with a few. Every chance was spoiled. No enhancement pill. No dear antler oil. And they panties be soiled. I know some niggas that'll damage your squad. You amateurs play with a man, I send savages charged. It ain't no sit-downs, too late to fix it.



	Couldn't afford to walk in my shoes and smarter sneakers talking way before I ever got paid, to record a feature I was trying to show them the way but they just ignore the leader. I got big passion, rich fashion, Mick Jagger, Rolling Stone, let's celebrate and keep blasting. I'll get out forget I'm still trying to get past it. I say I moved on but I still think about it.

2.2. Speech Reference Zero-shot

Reference	Lyrics	Freestyler
	Let Justin the boy post it, You got an album postpone it, I drop two and they both going, I got a feeling they're into feelings, They filming the show but won't show it, You gotta watch me in slow motion, I'm in that wide body bends, I go back to college, Do an album and then drop out again, Took me a minute to get here, My vision is crystal clear.
	Some bitch want to kick it. I gotta hit the streets. If I didn't go get it, I guess we didn't eat. Back stressing this dickhead just come evicted me. Back stretching this coat, sipping this briccone. I swear I only write tracks so people get to see. I'm still waking up.
	Baby, let me blow it like a trumpet. I could do it all. I could probably suck a waterman and through a straw. Believe me, every other day, new wig, new hair. Come take me out this smooth glare. I'll be waiting on you with some lawn. Should you rage at the spicy down low like bomb. Suey, boom. Nothing but a robe in your house.
	But brothers got strikes like oral hers side. So look a lot. Inflation rise like yeast. S&L scandal stole millions. But we need more police to take back our streets. It's drama. We all know that one time being extorting and big ballers, getting extra dotted with warrants in three states. Overcrowded prisons, hoping the DA dropped.
	They want to see me take a dive, but I'm bawling up, Uh, politicin like a politician, Talking to the planet moves that's in my description, See the light, still these haters try to knock his vision, That's why I passed the opposition like a proposition, Why I age, which you know about the coalition, We told doubt us from the jumping they were.

2.3. GPT Lyrics + Text-to-music Accompaniment

Reference	Lyrics (Generated by GPT-4)	Accompaniment (Generated by Stable Audio 2.0)	Freestyler
	Yeah, I'm climbing to the peak, no time for defeat, Got my mind on my goals, can't accept no retreat. Spitting fire in the booth, let the truth get unleashed, I'm a beast with the beats, let the rhythm be my leash.
	Life’s a game of chess, I'm strategizing every move, Got the world in my hands, now I'm ready to prove. From the shadows to the spotlight, watch me make it shine, In the rhythm of the night, all the stars align.
	Rise up, get up, never gonna stop, We’re breaking through the ceiling, we’re reaching for the top. Keep your head high, no matter the grind, We’re chasing down dreams with a rhythm that's defined.
	Hustle in my veins, every day’s a new chapter, Grinding through the struggle, I’m the lyrical raptor. Got a vision, got a plan, ain't no chance to stumble, On this journey to the top, see the weak ones crumble.
	A warrior in the game, got the heart of a lion, Every setback is a setup for a bigger horizon. From the block to the charts, yeah, I'm leaving a mark, Illuminating paths, let my name leave a spark.

3. Ethics statement

Freestyler can synthesize rap vocals in any speaker's timbre in a zero-shot manner. It is intended for entertainment, education, and similar applications. However, the technology carries potential risks, including misuse of the model to spoof voice identification systems or to impersonate specific individuals. Our experiments were conducted using publicly available data. If the model is applied to unseen speakers in the real world, its use should include a protocol to ensure that the speaker approves the use of their voice.