The Rhythm In Anything: Audio-Prompted Drums Generation with Masked Language Modeling

Patrick O'Reilly1, Julia Barnett1, Hugo Flores Garcia1, Annie Chu1,
Nathan Pruyne1, Prem Seetharaman2, Bryan Pardo1
1Department of Computer Science, Northwestern University 2Adobe Research

Comparisons vs. MelodyFlow

Below, we provide examples of the proposed TRIA and baseline MelodyFlow systems converting sound gestures to drum recordings. For these examples, we craft text timbre prompts for MelodyFlow that broadly correspond to TRIA's audio timbre prompts; this is because while TRIA allows timbre specification via an audio example, MelodyFlow requires a text description of the desired timbre. For TRIA, we use the 2-band adaptive variant detailed in the paper, while for MelodyFlow, we use target timestep 0.2.

Index
Timbre prompt
Use this drum sound...
Rhythm prompt
Play this rhythm...
TRIA
Timbre prompt (text)
Use this drum sound...
Rhythm prompt
Play this rhythm...
MelodyFlow
1a loud, acoustic, rock drumkit
2an acoustic drumkit with kick, snare, hi-hat, and crash cymbal
3a deep, muted, underwater electronic drumkit

Figure 4

Below, we provide audio examples for the proposed TRIA and baseline MelodyFlow systems corresponding to Figure 4 in the paper. For TRIA, we use the 2-band adaptive variant detailed in the paper, while for MelodyFlow, we use target timestep 0.2.

We start with a beatboxing rhythm prompt consisting of alternating "kick" and "snare" vocalizations. We then construct three audio timbre prompts for TRIA, each containing the same synthesized kick drum and one additional percussive element: a "bongo," a "wood crack," or a "noise burst." Notably, while TRIA's training data is dominated by traditional drum kit elements such as snare and hi-hat, we choose percussive sounds that are represented sparsely or not at all. For MelodyFlow, we use the aforementioned descriptors ("bongo," "wood crack," "noise burst") as text timbre prompts.

TRIA successfully maps each timbre prompt onto the beatbox rhythm prompt by replacing "kick" vocalizations with a kick drum timbre and replacing "snare" vocalizations with the corresponding percussive element. This indicates that TRIA can distinguish between kick/snare vocalizations and select appropriate sound elements from within the timbre prompt to map onto these vocalizations. By contrast, MelodyFlow adheres much more closely to the spectral structure of the rhythm prompt rather than replacing the input vocalizations with the text-specified percussive elements (i.e. the output recordings still "sound like" beatboxing), indicating timbre leakage. While using a lower target timestep hyperparameter lessens leakage (see "Ablations" below), this results in poor adherence to the rhythm prompt.


Click on the spectrograms below to hear each audio clip.

Interactive Image

Ablations

Below, we provide examples of the proposed TRIA and baseline MelodyFlow systems converting sound gestures to drum recordings. Given a timbre prompt specifying the desired drum sound (an audio recording for TRIA, a text description for MelodyFlow) and a rhythm prompt specifying the desired rhythm, we present the outputs of the proposed and baseline systems. In the corresponding plots, timbre is illustrated by a spectrogram (top) and rhythm is illustrated by dualized rhythm features (bottom). In each row, the rhythm prompt is fixed across all evaluated systems, the audio timbre prompt is fixed across all TRIA variants, and the text timbre prompt is fixed across all MelodyFlow variants. Audio and text timbre prompts are sampled independently and are not meant to correspond.


Tapping (TapTamDrum Dataset) to Drums

Index
Timbre prompt
Use this drum sound...
Rhythm prompt
Play this rhythm...
TRIA (2 bands) TRIA (2 bands, non-adaptive) TRIA (1 band) TRIA (3 bands) TRIA (4 bands)
Timbre prompt (text)
Use this drum sound...
Rhythm prompt
Play this rhythm...
MelodyFlow (timestep = 0.0) MelodyFlow (timestep = 0.1) MelodyFlow (timestep = 0.2)
1spacious, reverberant acoustic drum set
2balanced, neutral acoustic drum timbre
3vibrant, digital drum sounds with fast decay

Beatboxing (AVP Dataset) to Drums

Index
Timbre prompt
Use this drum sound...
Rhythm prompt
Play this rhythm...
TRIA (2 bands) TRIA (2 bands, non-adaptive) TRIA (1 band) TRIA (3 bands) TRIA (4 bands)
Timbre prompt (text)
Use this drum sound...
Rhythm prompt
Play this rhythm...
MelodyFlow (timestep = 0.0) MelodyFlow (timestep = 0.1) MelodyFlow (timestep = 0.2)
1rounded, organic electronic drums
2subtle, understated acoustic drums
3dense, compressed electronic drum timbres

Text Timbre Prompts

Below, we provide the 50 text timbre prompts used for our objective evaluation, and the 8 text timbre prompts used for our subjective evaluation.

Objective Evaluation

Timbre prompts were generated using GPT-4.5 with the following prompt: "Generate 50 short descriptions of the timbre of a drum kit recording. The drum kit may be acoustic or electric, natural or synthetic. Each description should be formatted as a short sentence or phrase, e.g. "a crisp, bright acoustic drum kit". Try to make each description distinct."

Subjective Evaluation

Timbre prompts describing acoustic drum kits were selected from the above set.