Video Query Sound Separation (VQSS) aims to isolate target sounds conditioned on visual queries while suppressing off-screen interference, a task central to audiovisual understanding. However, existing methods often fail under homogeneous interference and overlapping soundtracks because of limited temporal modeling and weak audiovisual alignment. We propose AlignSep, the first generative VQSS model based on flow matching, designed to address common issues such as spectral holes and incomplete separation. To better capture cross-modal correspondence, we introduce a series of temporal consistency mechanisms that guide the vector field estimator toward learning robust audiovisual alignment, enabling accurate and resilient separation in complex scenes. As a multi-conditioned generation task, VQSS presents challenges that differ fundamentally from traditional flow matching setups. We provide an in-depth analysis of these differences and their implications for generative modeling. To systematically evaluate performance under realistic and difficult conditions, we further construct VGGSound-Hard, a challenging benchmark composed entirely of separation cases with homogeneous interference and strong reliance on temporal visual cues. Extensive experiments across multiple benchmarks demonstrate that AlignSep achieves state-of-the-art performance both quantitatively and perceptually, supporting its practical value for real-world applications.

| Original Video | Processed Left-Half Video (Silent Dog) | Processed Right-Half Video (Barking Dog) |
|---|---|---|

| Original Video | Processed Video |
|---|---|

| Video | Mixture | Target | DAVIS | AlignSep (Ours) |
|---|---|---|---|---|

| Video | Mixture | Target | DAVIS | AlignSep (Ours) |
|---|---|---|---|---|

| Video | Mixture | Target | DAVIS | AlignSep (Ours) |
|---|---|---|---|---|

| Video | Mixture | Target | OmniSep | AlignSep (Ours) |
|---|---|---|---|---|

| Video | Mixture | Target | OmniSep | AlignSep (Ours) |
|---|---|---|---|---|

| Video | Mixture | Target | OmniSep | AlignSep (Ours) |
|---|---|---|---|---|

| Video | Mixture | Target | OmniSep | AlignSep (Ours) |
|---|---|---|---|---|

| Mixture | Target | OmniSep | AlignSep (Ours) |
|---|---|---|---|

| Video | Mixture | Target | OmniSep | AlignSep (Ours) |
|---|---|---|---|---|