The Rhythm In Anything: Audio-Prompted Drums Generation with Masked Language Modeling

Patrick O'Reilly¹, Julia Barnett¹, Hugo Flores Garcia¹, Annie Chu¹, Nathan Pruyne¹, Prem Seetharaman², Bryan Pardo¹

¹Department of Computer Science, Northwestern University ²Adobe Research

Paper

Code

Comparisons vs. MelodyFlow

Below, we provide examples of the proposed TRIA and baseline MelodyFlow systems converting sound gestures to drum recordings. TRIA accepts an audio example as its timbre prompt, whereas MelodyFlow requires a text description of the desired timbre, so for these examples we craft text timbre prompts for MelodyFlow that broadly correspond to TRIA's audio timbre prompts. For TRIA, we use the 2-band adaptive variant detailed in the paper; for MelodyFlow, we use a target timestep of 0.2.

Index	Timbre prompt (audio): "Use this drum sound..."	Rhythm prompt: "Play this rhythm..."	TRIA	Timbre prompt (text): "Use this drum sound..."	Rhythm prompt: "Play this rhythm..."	MelodyFlow
1	(audio)	(audio)	(audio)	a loud, acoustic, rock drumkit	(audio)	(audio)
2	(audio)	(audio)	(audio)	an acoustic drumkit with kick, snare, hi-hat, and crash cymbal	(audio)	(audio)
3	(audio)	(audio)	(audio)	a deep, muted, underwater electronic drumkit	(audio)	(audio)

Figure 4

Below, we provide audio examples for the proposed TRIA and baseline MelodyFlow systems corresponding to Figure 4 in the paper. For TRIA, we use the 2-band adaptive variant detailed in the paper; for MelodyFlow, we use a target timestep of 0.2.

We start with a beatboxing rhythm prompt consisting of alternating "kick" and "snare" vocalizations. We then construct three audio timbre prompts for TRIA, each containing the same synthesized kick drum and one additional percussive element: a "bongo," a "wood crack," or a "noise burst." Notably, while TRIA's training data is dominated by traditional drum kit elements such as snare and hi-hat, we choose percussive sounds that are represented sparsely or not at all. For MelodyFlow, we use the aforementioned descriptors ("bongo," "wood crack," "noise burst") as text timbre prompts.

TRIA successfully maps each timbre prompt onto the beatbox rhythm prompt by replacing "kick" vocalizations with a kick drum timbre and replacing "snare" vocalizations with the corresponding percussive element. This indicates that TRIA can distinguish between kick and snare vocalizations and select appropriate sound elements from within the timbre prompt to map onto them. By contrast, MelodyFlow adheres closely to the spectral structure of the rhythm prompt instead of replacing the input vocalizations with the text-specified percussive elements (i.e., the output recordings still "sound like" beatboxing), indicating timbre leakage. Using a lower target timestep lessens this leakage (see "Ablations" below), but at the cost of poor adherence to the rhythm prompt.

Click on the spectrograms below to hear each audio clip.

Ablations

Below, we provide examples of the proposed TRIA and baseline MelodyFlow systems converting sound gestures to drum recordings. Given a timbre prompt specifying the desired drum sound (an audio recording for TRIA, a text description for MelodyFlow) and a rhythm prompt specifying the desired rhythm, we present the outputs of the proposed and baseline systems. In the corresponding plots, timbre is illustrated by a spectrogram (top) and rhythm is illustrated by dualized rhythm features (bottom). In each row, the rhythm prompt is fixed across all evaluated systems, the audio timbre prompt is fixed across all TRIA variants, and the text timbre prompt is fixed across all MelodyFlow variants. Audio and text timbre prompts are sampled independently and are not meant to correspond.
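The row layout described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Job` record, the `build_row` helper, and the file names are hypothetical and do not come from the TRIA codebase, and uniform random sampling is an assumption. The variant labels and timestep values are taken from the table headers in this section.

```python
import random
from dataclasses import dataclass

# Labels taken from the ablation table headers in this section.
TRIA_VARIANTS = ["2 bands", "2 bands, non-adaptive", "1 band", "3 bands", "4 bands"]
MELODYFLOW_TIMESTEPS = [0.0, 0.1, 0.2]


@dataclass(frozen=True)
class Job:
    """One generation request (hypothetical record, for illustration only)."""
    system: str
    variant: str
    rhythm_prompt: str
    timbre_prompt: str


def build_row(rhythm_prompts, audio_timbres, text_timbres, rng):
    """Build one table row: a single rhythm prompt shared by every system,
    one audio timbre prompt shared by all TRIA variants, and one text timbre
    prompt shared by all MelodyFlow variants."""
    rhythm = rng.choice(rhythm_prompts)
    # Audio and text timbre prompts are drawn independently of each other,
    # so they are not expected to describe the same sound.
    audio_timbre = rng.choice(audio_timbres)
    text_timbre = rng.choice(text_timbres)
    jobs = [Job("TRIA", v, rhythm, audio_timbre) for v in TRIA_VARIANTS]
    jobs += [
        Job("MelodyFlow", f"timestep = {t}", rhythm, text_timbre)
        for t in MELODYFLOW_TIMESTEPS
    ]
    return jobs


# Example usage with placeholder file names (assumptions, not real dataset paths).
rng = random.Random(0)
row = build_row(
    rhythm_prompts=["tap_01.wav", "tap_02.wav"],
    audio_timbres=["kit_a.wav", "kit_b.wav"],
    text_timbres=["spacious, reverberant acoustic drum set"],
    rng=rng,
)
for job in row:
    print(job)
```

Each row therefore yields eight outputs (five TRIA variants and three MelodyFlow timesteps) that differ only in the system configuration, which is what makes the columns of a row directly comparable.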

Tapping (TapTamDrum Dataset) to Drums

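Index	Timbre prompt (audio): "Use this drum sound..."	Rhythm prompt: "Play this rhythm..."	TRIA (2 bands)	TRIA (2 bands, non-adaptive)	TRIA (1 band)	TRIA (3 bands)	TRIA (4 bands)	Timbre prompt (text): "Use this drum sound..."	Rhythm prompt: "Play this rhythm..."	MelodyFlow (timestep = 0.0)	MelodyFlow (timestep = 0.1)	MelodyFlow (timestep = 0.2)
1	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	spacious, reverberant acoustic drum set	(audio)	(audio)	(audio)	(audio)
2	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	balanced, neutral acoustic drum timbre	(audio)	(audio)	(audio)	(audio)
3	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	vibrant, digital drum sounds with fast decay	(audio)	(audio)	(audio)	(audio)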

Beatboxing (AVP Dataset) to Drums

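Index	Timbre prompt (audio): "Use this drum sound..."	Rhythm prompt: "Play this rhythm..."	TRIA (2 bands)	TRIA (2 bands, non-adaptive)	TRIA (1 band)	TRIA (3 bands)	TRIA (4 bands)	Timbre prompt (text): "Use this drum sound..."	Rhythm prompt: "Play this rhythm..."	MelodyFlow (timestep = 0.0)	MelodyFlow (timestep = 0.1)	MelodyFlow (timestep = 0.2)
1	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	rounded, organic electronic drums	(audio)	(audio)	(audio)	(audio)
2	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	subtle, understated acoustic drums	(audio)	(audio)	(audio)	(audio)
3	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	(audio)	dense, compressed electronic drum timbres	(audio)	(audio)	(audio)	(audio)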

Text Timbre Prompts

Below, we provide the 50 text timbre prompts used for our objective evaluation, and the 8 text timbre prompts used for our subjective evaluation.

Objective Evaluation

Timbre prompts were generated using GPT-4.5 with the following prompt: "Generate 50 short descriptions of the timbre of a drum kit recording. The drum kit may be acoustic or electric, natural or synthetic. Each description should be formatted as a short sentence or phrase, e.g. "a crisp, bright acoustic drum kit". Try to make each description distinct."

a crisp, bright acoustic drum kit
deep, resonant acoustic drums with warm sustain
punchy, tight electronic drum samples
raw, gritty analog-sounding drum machine
smooth, polished synthetic drum textures
lo-fi, vintage-sounding acoustic drums
sharp, metallic electronic percussion
organic, mellow acoustic kit with rich tones
dark, muted acoustic drum tones
snappy, modern electro-acoustic hybrid kit
dense, compressed electronic drum timbres
roomy, ambient acoustic drums with natural decay
dry, minimalistic electronic drum kit
saturated, retro synthetic drum sounds
thumping, powerful acoustic kick and snare
crisp, synthetic drums with airy highs
warm, round acoustic drums with gentle attack
bright, glassy-sounding electronic percussion
heavy, aggressive acoustic drum presence
clear, transparent electronic drum timbres
soft, brushed acoustic kit with intimate texture
harsh, industrial electronic drum aesthetics
balanced, neutral acoustic drum timbre
vibrant, digital drum sounds with fast decay
rich, woody-sounding acoustic kit
synthetic, futuristic drum textures with sharp transients
thick, bass-heavy electronic percussion
subtle, understated acoustic drums
metallic, bell-like synthetic drum hits
warm, analog electronic drum sounds
airy, open acoustic drum recording
heavy, punchy synthetic kick and snare
dull, subdued acoustic kit with lo-fi charm
shimmering, high-frequency electronic percussion
rounded, organic electronic drums
tight, controlled acoustic drum tones
crisp, biting electronic drum samples
spacious, reverberant acoustic drum set
coarse, gritty synthetic drum textures
clean, precise digital percussion
raw, unprocessed acoustic drum tones
hypnotic, rhythmic electronic drum grooves
naturalistic, earthy acoustic drums
synthetic drums with icy, crystalline clarity
boomy, cavernous acoustic drum ambience
slick, polished electronic drum production
dry, vintage-inspired acoustic kit recording
metallic, edgy synthetic percussion sounds
lush, saturated electronic drum timbre
gentle, soft acoustic drums with airy resonance

Subjective Evaluation

Timbre prompts describing acoustic drum kits were selected from the above set.

a crisp, bright acoustic drum kit
deep, resonant acoustic drums with warm sustain
organic, mellow acoustic kit with rich tones
dark, muted acoustic drum tones
roomy, ambient acoustic drums with natural decay
thumping, powerful acoustic kick and snare
warm, round acoustic drums with gentle attack
heavy, aggressive acoustic drum presence
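
As a rough illustration of how candidate prompts describing acoustic kits could be pulled from the objective set, the sketch below applies a simple keyword filter to a handful of the prompts listed above. The `describes_acoustic_kit` helper is hypothetical, and the filter is an assumption rather than the procedure actually used: for instance, "lo-fi, vintage-sounding acoustic drums" passes the filter but does not appear in the subjective set, so some further selection that this page does not describe must have been applied.

```python
import re

# A small sample of the objective-evaluation prompts listed above.
OBJECTIVE_SAMPLE = [
    "a crisp, bright acoustic drum kit",
    "punchy, tight electronic drum samples",
    "lo-fi, vintage-sounding acoustic drums",
    "snappy, modern electro-acoustic hybrid kit",
    "dark, muted acoustic drum tones",
]


def describes_acoustic_kit(prompt: str) -> bool:
    """Return True if the prompt contains 'acoustic' as a standalone word.

    The lookbehind rejects hyphenated compounds such as 'electro-acoustic',
    which describe hybrid rather than purely acoustic kits.
    """
    return re.search(r"(?<![-\w])acoustic\b", prompt) is not None


# Keep only prompts that pass the keyword filter, preserving list order.
candidates = [p for p in OBJECTIVE_SAMPLE if describes_acoustic_kit(p)]
for prompt in candidates:
    print(prompt)
```

A filter like this only narrows the pool; the final eight prompts would still need to be chosen from the resulting candidates by some additional criterion.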