This dataset contains 262,110 natural-language captions describing 108,965 video segments from 6 popular TV shows. It supports multimodal video captioning by pairing visual frames with time-aligned subtitle dialogue.
Use Cases
- Train multimodal transformer models using the 'video_id' and 'subtitle' features
- Evaluate video-to-text generation performance using the 'caption' ground truth
- Benchmark temporal video grounding using the 'ts' start and end timestamps
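For the temporal grounding use case, a common way to score a predicted span against the ground-truth 'ts' span is temporal intersection-over-union (IoU). The following is a minimal sketch, not the dataset's official evaluation code. It assumes each 'ts' value has already been parsed into a `[start, end]` pair in seconds, which the source does not confirm; `temporal_iou` and `recall_at_iou` are hypothetical helper names.

```python
from typing import Sequence


def temporal_iou(pred: Sequence[float], gold: Sequence[float]) -> float:
    """IoU of two [start, end] spans, both assumed to be in seconds."""
    p_start, p_end = pred
    g_start, g_end = gold
    # Overlap is zero when the spans are disjoint.
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    # Inclusion-exclusion on span lengths; for disjoint spans the IoU is 0 anyway.
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds, golds, threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired gold span meets the threshold."""
    if not golds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds)
```

For example, `temporal_iou([0.0, 10.0], [5.0, 15.0])` overlaps for 5 seconds out of a 15-second union. The 0.5 threshold is only a conventional default and can be changed per benchmark.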
Strengths
- 262,110 captions paired with 108,965 video clips (about 2.4 captions per clip on average)
- Covers 6 TV shows including 'Friends' and 'The Big Bang Theory'
- Includes JSON metadata with 'caption', 'video_id', and 'ts' fields
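The JSON metadata can be read with the standard library. The sketch below assumes one JSON object per line (JSON Lines) and a two-element list for 'ts'; neither detail is stated in the source, and the sample records are placeholders rather than real dataset entries. Only the field names 'caption', 'video_id', and 'ts' come from the description above; `group_captions` is a hypothetical helper.

```python
import json
from collections import defaultdict

# Placeholder records for illustration only; not real dataset entries.
sample_lines = [
    '{"video_id": "clip_0001", "ts": [1.5, 4.0], "caption": "placeholder caption A"}',
    '{"video_id": "clip_0001", "ts": [1.5, 4.0], "caption": "placeholder caption B"}',
    '{"video_id": "clip_0002", "ts": [10.0, 18.2], "caption": "placeholder caption C"}',
]


def group_captions(lines):
    """Collect every caption under its 'video_id', since a clip can have several."""
    by_clip = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        by_clip[record["video_id"]].append(record["caption"])
    return dict(by_clip)


grouped = group_captions(sample_lines)
```

Grouping by 'video_id' keeps all reference captions for a clip together, which is convenient when a captioning metric expects multiple references per clip.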