AUDIO PROMPT TUNING FOR UNIVERSAL SOUND SEPARATION

Universal sound separation (USS) is the task of separating arbitrary sounds from an audio mixture. Existing USS systems can separate arbitrary sources when given a few examples of the target sources as queries. However, separating arbitrary sounds with a single system is challenging, and robustness is not always guaranteed.

In this work, we propose audio prompt tuning (APT), a simple yet effective approach to enhance existing USS systems. Specifically, APT improves the separation performance of specific sources through training a small number of prompt parameters with limited data, while maintaining the generalization of the USS model by keeping its parameters frozen.
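
The idea of tuning only a prompt while the separator stays frozen can be illustrated with a toy sketch. This is not the authors' architecture or training code: the mask-based `separate` function, the random `frozen_w` weights, the synthetic five-example dataset, and all hyperparameters are assumptions chosen only to show that gradient updates touch the prompt vector and nothing else.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def separate(frozen_w, prompt, mixture):
    """Toy query-conditioned separator (hypothetical): the prompt selects a
    per-bin mask through frozen weights, and the mask scales the mixture."""
    masks = [sigmoid(sum(w * p for w, p in zip(row, prompt))) for row in frozen_w]
    estimate = [m * x for m, x in zip(masks, mixture)]
    return estimate, masks


def loss_and_prompt_grad(frozen_w, prompt, mixture, target):
    """Mean squared error and its gradient with respect to the prompt only."""
    estimate, masks = separate(frozen_w, prompt, mixture)
    n = len(mixture)
    loss = sum((e - t) ** 2 for e, t in zip(estimate, target)) / n
    grad = [0.0] * len(prompt)
    for row, m, x, e, t in zip(frozen_w, masks, mixture, estimate, target):
        # Chain rule through estimate = sigmoid(row . prompt) * x
        common = 2.0 * (e - t) * x * m * (1.0 - m) / n
        for d, w in enumerate(row):
            grad[d] += common * w
    return loss, grad


def tune_prompt(frozen_w, prompt, examples, steps=200, lr=0.5):
    """Gradient descent on the prompt; frozen_w is read but never modified."""
    prompt = list(prompt)
    for _ in range(steps):
        for mixture, target in examples:
            _, grad = loss_and_prompt_grad(frozen_w, prompt, mixture, target)
            prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt


def mean_loss(frozen_w, prompt, examples):
    losses = [loss_and_prompt_grad(frozen_w, prompt, m, t)[0] for m, t in examples]
    return sum(losses) / len(losses)


random.seed(0)
# Stand-in for the pretrained model parameters (assumed, random).
frozen_w = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
frozen_copy = [row[:] for row in frozen_w]

# Synthetic few-shot data: 5 (mixture, target) pairs for one target class.
hidden_prompt = [1.5, -1.0, 0.5, 2.0]
examples = []
for _ in range(5):
    mixture = [random.uniform(0.5, 1.5) for _ in range(8)]
    target, _ = separate(frozen_w, hidden_prompt, mixture)
    examples.append((mixture, target))

init_prompt = [0.0] * 4
loss_before = mean_loss(frozen_w, init_prompt, examples)
tuned_prompt = tune_prompt(frozen_w, init_prompt, examples)
loss_after = mean_loss(frozen_w, tuned_prompt, examples)
print(f"loss before: {loss_before:.6f}, after: {loss_after:.6f}")
```

In this sketch the only trainable state is the four-element prompt, so the per-class cost of adaptation is small and the frozen weights remain usable for every other query.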

We evaluate the proposed method on the MUSDB18 and ESC-50 datasets. Compared with the baseline model, APT improves the signal-to-distortion ratio (SDR) by 0.67 dB and 2.06 dB, respectively, when trained on the full training sets of the two datasets. Moreover, APT trained with only 5 audio samples outperforms the baseline systems that use the full training data on the ESC-50 dataset, indicating the potential of few-shot APT.
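
For readers unfamiliar with the metric, the basic signal-to-distortion ratio can be sketched as below. This is the plain energy-ratio definition, 10 log10(||s||² / ||s − ŝ||²); the evaluation toolkit actually used for the reported numbers may compute a different SDR variant, and the function name `sdr_db` is hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Basic SDR in dB: reference energy over the energy of the residual."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# An estimate that is uniformly 10% too quiet.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
print(round(sdr_db(reference, estimate), 3))
```

Under this definition, a gain of 2.06 dB corresponds to reducing the residual energy by a factor of roughly 10^(0.206) ≈ 1.61 for the same reference signal.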

Class	Mixture	Separated	Ground-truth
Pouring_water
Crying_baby_0
Crying_baby_1
Cat
Frog
Siren_0
Siren_1
Church_bells
Door_wood_creaks
Footsteps
Toilet_flush
Thunderstorm