Music Source Separation with Generative Flow
Abstract
Fully supervised models for source separation are trained on mixture-source parallel data and have achieved superior performance in recent years. However, large-scale, naturally mixed parallel training data are difficult to obtain for music, and such models are hard to adapt to mixtures containing new sources. Source-only supervision models, in contrast, require only clean sources for training; they learn a model of each source and then apply these models to separate the mixture. In this paper, we leverage flow-based implicit generators to train music source priors and use likelihood-based objectives to separate music mixtures. Experiments show that on singing voice separation and music separation tasks, our approach achieves performance competitive with one of the fully supervised systems. We also demonstrate that our approach can separate new source tracks without retraining the separation model from scratch, as fully supervised models must.
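As a rough sketch of this source-only recipe, the PyTorch example below trains one small normalizing-flow prior per source by maximum likelihood on clean frames, then separates a mixture by gradient ascent on the summed source log-likelihoods, parametrizing one source as the mixture residual so the constraint holds exactly. Everything here is an illustrative assumption, not the paper's implementation: `FlowPrior` is a toy RealNVP-style coupling stack standing in for the Glow-based priors, the frame shapes are arbitrary, and `train_prior`/`separate` are hypothetical helpers.

```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """RealNVP-style coupling: scale/shift one half of the dimensions
    conditioned on the other half; the log-determinant is tractable."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(xa).chunk(2, dim=1)
        log_s = torch.tanh(log_s)  # bound the scales for stability
        return torch.cat([xa, xb * log_s.exp() + t], dim=1), log_s.sum(dim=1)


class FlowPrior(nn.Module):
    """Stack of couplings under a standard-normal base distribution; a
    toy stand-in for a Glow-style source prior with an exact log_prob."""

    def __init__(self, dim: int, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(AffineCoupling(dim) for _ in range(n_layers))

    def log_prob(self, x):
        z, logdet = x, x.new_zeros(x.shape[0])
        for i, layer in enumerate(self.layers):
            if i % 2:  # alternate which half gets transformed
                z = z.flip([1])
            z, ld = layer(z)
            logdet = logdet + ld
        base = -0.5 * (z.pow(2) + torch.log(torch.tensor(2 * torch.pi))).sum(1)
        return base + logdet


def train_prior(prior, clean_frames, steps=1000, lr=1e-3):
    """Source-only supervision: maximum-likelihood fit on clean frames of
    a single source; no mixtures are seen during training."""
    opt = torch.optim.Adam(prior.parameters(), lr=lr)
    for _ in range(steps):
        batch = clean_frames[torch.randint(len(clean_frames), (64,))]
        loss = -prior.log_prob(batch).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()


def separate(mixture, priors, steps=500, lr=1e-2):
    """Likelihood-based separation: optimize K-1 free source estimates and
    define the K-th as the residual, so the mixture constraint holds
    exactly; ascend the summed log-likelihoods under the trained priors."""
    k = len(priors)
    free = [(mixture / k).clone().requires_grad_(True) for _ in range(k - 1)]
    opt = torch.optim.Adam(free, lr=lr)
    for _ in range(steps):
        sources = free + [mixture - sum(free)]  # residual closes the sum
        loss = -sum(p.log_prob(s).mean() for p, s in zip(priors, sources))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return [s.detach() for s in free] + [(mixture - sum(free)).detach()]


# Toy usage (random tensors stand in for real audio frames):
dim = 128
priors = [FlowPrior(dim), FlowPrior(dim)]
vocals, accomp = torch.randn(1024, dim), torch.randn(1024, dim)
train_prior(priors[0], vocals)
train_prior(priors[1], accomp)
est_vocals, est_accomp = separate(vocals[:8] + accomp[:8], priors)
```

Parametrizing the last source as the mixture residual keeps the estimates consistent with the observed mixture throughout optimization; a soft reconstruction penalty is a common alternative way to impose the same constraint.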
Audio Samples
This mixture audio clip is from 'Zeno - Signs' in the MUSDB18 test partition:
Separated sources:
| | Vocals | Bass | Drums | Other |
|---|---|---|---|---|
| Ground Truth | (audio) | (audio) | (audio) | (audio) |
| Open-Unmix | (audio) | (audio) | (audio) | (audio) |
| Demucs (v2) | (audio) | (audio) | (audio) | (audio) |
| Wave-U-Net | (audio) | (audio) | (audio) | (audio) |
| TasNet | (audio) | (audio) | (audio) | (audio) |
| InstGlow (Ours) | (audio) | (audio) | (audio) | (audio) |
Bonus tracks
InstGlow: anyways, here's Wonderwall, and Smoke on the Water:
| | Mixture | Vocals | Accompaniment |
|---|---|---|---|
| Wonderwall | (audio) | (audio) | (audio) |
| Smoke on the Water | (audio) | (audio) | (audio) |