152,545 multiple-choice questions based on 21,793 video clips from 6 popular TV shows, including The Big Bang Theory and Grey's Anatomy. Every question comes with paired subtitles and a temporal annotation that localizes the relevant moment in the clip, to support multimodal reasoning.
Use Cases
- Train a multimodal transformer to select the correct answer from 5 candidates using the q, a0-a4, and answer_idx fields
- Develop temporal localization models to predict the relevant video window using the provided start and end timestamps
- Benchmark cross-modal reasoning by combining visual features with the provided subtitle text
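The multiple-choice use case can be sketched as follows. This is a minimal illustration, assuming each record is a dictionary exposing the q, a0-a4, and answer_idx fields named above. The helper names (build_candidates, accuracy), the toy record, and the keyword-based scorer are hypothetical stand-ins for a real model and real data.

```python
from typing import Callable, Dict, List, Tuple


def build_candidates(record: Dict) -> Tuple[List[str], int]:
    """Pair the question with each of the 5 candidate answers.

    Assumes the record has the fields q, a0..a4, and answer_idx.
    """
    question = record["q"]
    candidates = [question + " " + record["a" + str(i)] for i in range(5)]
    return candidates, int(record["answer_idx"])


def accuracy(records: List[Dict], score_fn: Callable[[str], float]) -> float:
    """Fraction of records where the top-scoring candidate is the labeled one."""
    if not records:
        return 0.0
    correct = 0
    for record in records:
        candidates, label = build_candidates(record)
        scores = [score_fn(text) for text in candidates]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(records)


# Made-up toy record for illustration only; not taken from the dataset.
toy_record = {
    "q": "What is the person holding when they enter the room?",
    "a0": "A book",
    "a1": "A phone",
    "a2": "A cup",
    "a3": "A bag",
    "a4": "Nothing",
    "answer_idx": 2,
}


def toy_score(text: str) -> float:
    """Hypothetical scorer standing in for a multimodal model."""
    return 1.0 if "cup" in text else 0.0


toy_accuracy = accuracy([toy_record], toy_score)
```

In practice, score_fn would be replaced by a model that also consumes the video features and subtitles; the candidate construction and accuracy computation stay the same.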
Strengths
- 152,545 QA pairs with 5-way multiple-choice options and a correct answer index
- Temporal grounding labels providing start and end timestamps for the relevant video segment
- Multimodal inputs including video frames and subtitles with character (speaker) names for 6 distinct TV series
- Compositional questions grouped by question word, such as 'what', 'who', 'where', 'why', and 'how'
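The temporal grounding labels listed above are commonly evaluated with temporal intersection-over-union (IoU) between a predicted window and the annotated one. The sketch below assumes both windows are available as (start, end) pairs in seconds; how the timestamps are stored in each record is not specified here, so the function takes plain tuples, and the name temporal_iou is hypothetical.

```python
from typing import Tuple


def temporal_iou(pred: Tuple[float, float], gold: Tuple[float, float]) -> float:
    """Intersection-over-union of two time windows given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    # Overlap is zero when the windows do not intersect.
    intersection = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union if union > 0 else 0.0


# Windows overlapping for 5 s out of a combined 15 s span.
partial_overlap = temporal_iou((10.0, 20.0), (15.0, 25.0))
```

A prediction is often counted as correct when its IoU with the annotated window exceeds a chosen threshold; the threshold is a design choice and is not prescribed by the dataset description above.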