Name: AV-SpeakerBench: Audiovisual QA Benchmark with 1K-10K Speaker-Aware Clips
Creator: plnguyen2908
Published: 2025-06-12T15:17:09
Keywords: Size Categories: 1K<n<10K, Library: polars, arXiv: 2512.02231, Task Categories: question-answering, Modality: audio, Language: en, Task Categories: visual-question-answering, Modality: text, CSV, Library: mlcroissant, Library: datasets, Library: pandas, Question Answering, Audiovisual, Modality: video, Region: US, License: MIT, Multimodal

Description

AV-SpeakerBench is an audiovisual question-answering benchmark containing between 1,000 and 10,000 records, published in June 2025 by plnguyen2908. Each trimmed segment is provided in audio-only, visual-only, and audiovisual versions and is paired with speaker-aware annotations that test fine-grained reasoning in multimodal models.

Use Cases

Evaluating multimodal models on speaker identification using the speaker-aware questions
Testing cross-modal reasoning by comparing model accuracy across the audio-only, visual-only, and audiovisual clip paths
Benchmarking visual question answering (VQA) performance using the provided annotations
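
The per-modality comparison described above can be sketched as follows. This is a minimal illustration, not the benchmark's official evaluation script: the column names (question_id, modality, answer, prediction) and the sample rows are hypothetical and do not reflect the dataset's actual schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical prediction log: one row per (question, modality) pair.
# Column names and values are illustrative, not the dataset's real schema.
SAMPLE = """question_id,modality,answer,prediction
q1,audio,A,A
q1,visual,A,B
q1,audiovisual,A,A
q2,audio,C,D
q2,visual,C,C
q2,audiovisual,C,C
"""


def accuracy_by_modality(rows):
    """Return {modality: fraction of rows where prediction equals answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row["modality"]] += 1
        if row["prediction"].strip() == row["answer"].strip():
            correct[row["modality"]] += 1
    return {m: correct[m] / total[m] for m in total}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
scores = accuracy_by_modality(rows)
for modality, acc in sorted(scores.items()):
    print(f"{modality}: {acc:.2f}")
```

A gap between the audiovisual score and the single-modality scores gives a rough indication of how much a model relies on combining both streams. The real column names and answer format should be taken from the dataset files and the associated repository.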

Strengths

Provides three distinct modality versions (audio, visual, AV) for every segment
Includes 1,000 to 10,000 expert-aligned clips
MIT licensed for open research use

Limitations

Small sample size of under 10,000 records
Focused specifically on speaker-aware reasoning rather than general scene understanding
Limited to English language content

Provenance

Source: arXiv:2512.02231 (AV-SpeakerBench paper)
Collection Method: Trimmed and annotated from video sources to create speaker-aware benchmarks
Freshness: Last updated December 2025; corresponds to the research paper arXiv:2512.02231.
Geography: United States

The dataset is released under the MIT license. Users should refer to the associated GitHub repository for evaluation scripts and implementation details.
