Name: VGG-MonoAudio: 1K to 10K Triplets for Selective Video-to-Audio Generation
Creator: jnwnlee
Published: 2026-02-07T13:43:53
Keywords: Size Categories: 1K<n<10K, Library: polars, Modality: audio, Modality: text, CSV, Modality: tabular, Library: mlcroissant, License: CC BY-SA 4.0, Library: datasets, Library: pandas, Modality: video, Region: US, arXiv: 2512.02650

Description

VGG-MonoAudio is an evaluation benchmark for text-conditioned selective video-to-audio (V2A) generation. It contains between 1,000 and 10,000 triplets, each consisting of a video, a text description, and an isolated audio track synchronized with the video. The benchmark was developed by jnwnlee and accompanies arXiv paper 2512.02650. It was constructed by synthetically mixing single-source clips from the VGGSound and UnAV-100 datasets.

Use Cases

Evaluating text-conditioned V2A models by using the text prompt to generate audio matching a specific visual source
Training audio source separation algorithms using the mixed-source video and the isolated target audio track
Benchmarking cross-modal alignment between the text description and the isolated audio component

Strengths

1,000 to 10,000 curated triplets
Clean isolated target audio labels for multi-source visual scenes
Derived from established VGGSound and UnAV-100 sources

Limitations

Visuals are synthetically mixed from single-source clips rather than naturally occurring multi-source scenes
Limited to mono-audio output
Small sample size compared to large-scale audio-visual pre-training sets

Provenance

Source: jnwnlee (arXiv 2512.02650)
Collection Method: synthetic mixing of single-source clips from VGGSound and UnAV-100
Freshness: Last updated March 2026.
Geography: Global
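
To make the collection method above concrete, here is a minimal sketch of the audio side of synthetic mixing: two single-source mono waveforms are summed sample by sample, and the result is peak-normalized only if it would clip. This is an illustration under stated assumptions, not the authors' pipeline. The function name `mix_mono`, the `interferer_gain` parameter, and the toy sample values are hypothetical, and the visual mixing step is not shown.

```python
from typing import List, Sequence


def mix_mono(target: Sequence[float], interferer: Sequence[float],
             interferer_gain: float = 1.0) -> List[float]:
    """Sum two mono waveforms sample by sample (hypothetical helper).

    Samples are assumed to be floats in [-1, 1]. The longer clip is
    truncated to the length of the shorter one.
    """
    n = min(len(target), len(interferer))
    mixed = [target[i] + interferer_gain * interferer[i] for i in range(n)]
    # Peak-normalize only when the sum would clip outside [-1, 1].
    peak = max((abs(s) for s in mixed), default=0.0)
    if peak > 1.0:
        mixed = [s / peak for s in mixed]
    return mixed


# Toy waveforms standing in for two single-source clips.
target = [0.5, -0.5, 0.25, 0.0]
other = [0.5, 0.5, -0.25, 0.5]

mixture = mix_mono(target, other)             # no normalization needed
louder = mix_mono(target, other, 2.0)         # clips, so it is rescaled
```

In a benchmark built this way, `target` would remain available as the clean reference track, while `mixture` represents what a naive V2A model might produce for the combined scene.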

Distributed under the CC BY-SA 4.0 license. Each example combines three modalities, so users must handle video frames, text strings, and audio waveforms together.
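
As a rough sketch of how the three parts of a triplet might be kept together in code, the example below parses CSV metadata into a small record type. The column names (`video_path`, `prompt`, `target_audio_path`) and the sample row are assumptions made up for illustration; the actual schema of the dataset's CSV is not described here and should be checked before use.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Triplet:
    video_path: str         # hypothetical column: mixed-source video clip
    prompt: str             # hypothetical column: text naming the target source
    target_audio_path: str  # hypothetical column: isolated target audio track


def read_triplets(csv_text: str) -> List[Triplet]:
    """Parse CSV text with the assumed columns into Triplet records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Triplet(row["video_path"], row["prompt"], row["target_audio_path"])
        for row in reader
    ]


# Made-up metadata row; real file names and columns may differ.
sample = (
    "video_path,prompt,target_audio_path\n"
    "clips/0001.mp4,dog barking,audio/0001.wav\n"
)
triplets = read_triplets(sample)
```

Keeping the three fields in one immutable record makes it harder to pair a prompt with the wrong video or reference track during evaluation.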
