VGG-MonoAudio is an evaluation benchmark for text-conditioned selective video-to-audio (V2A) generation. It contains between 1,000 and 10,000 data triplets, each consisting of a synchronized video, a text description, and an isolated audio track. The benchmark was developed by jnwnlee and is associated with arXiv paper 2512.02650. It was constructed by synthetically mixing single-source clips from the VGGSound and UnAV-100 datasets.
The dataset is distributed under the CC BY-SA 4.0 license. Working with it requires handling three modalities together: video frames, text strings, and audio waveforms.
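To make the triplet structure and the idea of mixing single-source clips more concrete, here is a minimal sketch in plain Python. It is not the benchmark's actual construction pipeline, which is not described here. The `Triplet` class, its field names, the `mix_waveforms` helper, the file name, and the peak-rescaling step are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Triplet:
    """Hypothetical container for one benchmark item (field names are assumptions)."""
    video_path: str             # path to the video clip
    text: str                   # text description of the target sound source
    target_audio: List[float]   # isolated mono waveform of the target source


def mix_waveforms(a: List[float], b: List[float], gain_b: float = 1.0) -> List[float]:
    """Sum two mono waveforms sample by sample (assumed mixing rule).

    The result is truncated to the shorter input and rescaled only if
    its peak magnitude exceeds 1.0, to avoid clipping.
    """
    n = min(len(a), len(b))
    mixed = [a[i] + gain_b * b[i] for i in range(n)]
    peak = max((abs(s) for s in mixed), default=0.0)
    if peak > 1.0:
        mixed = [s / peak for s in mixed]
    return mixed


# Toy single-source "clips" standing in for real audio.
dog = [0.5, -0.5, 0.25, 0.0]
guitar = [0.25, 0.25, -0.25]

# The mixture contains both sources; the triplet keeps only the target track.
mixture = mix_waveforms(dog, guitar)
item = Triplet(video_path="clip_0001.mp4", text="dog barking", target_audio=dog)
```

In this sketch the isolated track of one source serves as the reference that a selective V2A model would be asked to reproduce from the video and the text description.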