MSR-VTT is a benchmark dataset for text-video retrieval, containing 10,000 video clips and 200,000 captions. It was introduced in the 2016 paper 'MSR-VTT: A large video description dataset for bridging video and language' and is hosted on Hugging Face by user friedrichor. The dataset follows the standard 1K-A split protocol: a test set of 1,000 videos, paired with one of two alternative training sets of 7,010 or 9,000 videos.
Use Cases
- Train text-to-video retrieval models based on the 200,000 captions.
- Benchmark video-to-text retrieval performance based on the 1K-A test split.
- Develop video captioning models based on the 10,000 video clips.
- Fine-tune multimodal large language models (MLLMs) on video-text pairs.
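Retrieval benchmarks of this kind are commonly scored with Recall@K. The following is a minimal sketch of that metric, not an official evaluation script. It assumes a square similarity matrix in which caption i is paired with video i; the function name `recall_at_k` and the toy scores are illustrative only.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the correct item = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy text-to-video similarity matrix (assumed values):
# row i is caption i, column j is video j.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))
```

For the video-to-text direction, the same function can be applied to the transposed matrix.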
Strengths
- Contains 10,000 video clips, providing a substantial base for training.
- Includes 200,000 captions, an average of 20 descriptions per video.
- Uses a standard 1K-A split protocol, enabling direct comparison with benchmark results.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Description metadata is limited; actual data quality requires manual inspection after download.
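Since the fields are undocumented, one way to start inferring them is to tabulate which Python types appear under each key in a handful of records. This is a minimal sketch under stated assumptions: the helper `summarize_fields` and the sample field names (`video_id`, `caption`, `duration`) are hypothetical and do not describe the actual schema, which has to be checked after download.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted list of Python type names observed."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Hypothetical records for illustration; real field names may differ.
sample = [
    {"video_id": "video0", "caption": "a man is talking", "duration": 12.5},
    {"video_id": "video1", "caption": "a car drives by", "duration": None},
]

summary = summarize_fields(sample)
for field, types in summary.items():
    print(field, types)
```

A field that shows more than one type, such as a mix of `float` and `NoneType`, is a candidate for the manual inspection mentioned above.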
Provenance
- Source: Introduced in the 2016 paper 'MSR-VTT: A large video description dataset for bridging video and language'.
- Collection Method: Not specified.
- Time Range: Not specified.
- Freshness: Last updated 2025-05-20 08:01:59; freshness should be verified.
- Geography: Not specified.