MSR-VTT: 10,000 Video Clips with 200,000 Captions for Text-Video Retrieval

Name: MSR-VTT: 10,000 Video Clips with 200,000 Captions for Text-Video Retrieval
Creator: friedrichor
Published: 2025-02-28T12:58:43
Keywords: Benchmark, Text, Video Captioning, Video, Multimodal Benchmark, Text Video Retrieval


Description

MSR-VTT is a benchmark dataset for text-video retrieval containing 10,000 video clips and 200,000 captions, an average of 20 captions per clip. It was introduced in the 2016 paper 'MSR-VTT: A large video description dataset for bridging video and language', and this copy is hosted on Hugging Face by user friedrichor. The dataset follows the standard 1K-A split protocol: a test set of 1,000 videos, paired with either of two training sets of 9,000 or 7,010 videos.
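
The figures in the description can be cross-checked with a little arithmetic. The sketch below is illustrative only: the split labels `train_9k`, `train_7k`, and `test_1k` are hypothetical names, not identifiers confirmed by this page, and it assumes that each training set is disjoint from the test set.

```python
# Figures quoted in the description.
NUM_CLIPS = 10_000
NUM_CAPTIONS = 200_000

# Hypothetical labels for the 1K-A protocol; only the sizes come from the page.
SPLITS = {
    "train_9k": 9_000,
    "train_7k": 7_010,
    "test_1k": 1_000,
}


def captions_per_clip(num_captions: int, num_clips: int) -> float:
    """Average number of captions per clip."""
    return num_captions / num_clips


def unused_clips(train_size: int, test_size: int, total: int) -> int:
    """Clips in neither the chosen training set nor the test set.

    Assumes the training and test sets do not overlap.
    """
    return total - train_size - test_size


if __name__ == "__main__":
    print(captions_per_clip(NUM_CAPTIONS, NUM_CLIPS))
    for name in ("train_9k", "train_7k"):
        print(name, unused_clips(SPLITS[name], SPLITS["test_1k"], NUM_CLIPS))
```

Under these assumptions the 9,000-video training set plus the test set covers all 10,000 clips, while the 7,010-video training set leaves 1,990 clips unused.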

Use Cases

Train text-to-video retrieval models using the 200,000 captions.
Benchmark video-to-text retrieval performance on the 1K-A test split (1,000 videos).
Develop video captioning models using the 10,000 video clips and their captions.
Fine-tune multimodal large language models (MLLMs) on video-text pairs.
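
For the retrieval use cases, results are commonly reported as Recall@K. This page does not name a metric, so the following is a minimal sketch under that assumption. It takes a square similarity matrix in which the correct candidate for query i is candidate i; the function names are illustrative and do not come from any particular library.

```python
from typing import List, Sequence


def rank_of_match(scores: Sequence[float], match_index: int) -> int:
    """1-based rank of the matching candidate when sorted by descending score.

    Ties are counted pessimistically: candidates with an equal score rank ahead.
    """
    target = scores[match_index]
    ahead = sum(
        1 for j, s in enumerate(scores) if j != match_index and s >= target
    )
    return ahead + 1


def recall_at_k(similarity: List[List[float]], k: int) -> float:
    """Fraction of queries whose matching candidate is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = sum(
        1 for i, row in enumerate(similarity) if rank_of_match(row, i) <= k
    )
    return hits / len(similarity)


def transpose(matrix: List[List[float]]) -> List[List[float]]:
    """Swap queries and candidates, e.g. text-to-video into video-to-text."""
    return [list(col) for col in zip(*matrix)]


if __name__ == "__main__":
    # Toy scores: rows are text queries, columns are videos.
    sim = [
        [0.9, 0.1, 0.3],
        [0.2, 0.4, 0.8],
        [0.1, 0.7, 0.6],
    ]
    print("text-to-video R@1:", recall_at_k(sim, 1))
    print("text-to-video R@2:", recall_at_k(sim, 2))
    print("video-to-text R@1:", recall_at_k(transpose(sim), 1))
```

Because each clip has several captions, a real evaluation also has to decide which caption represents each test video; that choice is outside the scope of this sketch.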

Strengths

Contains 10,000 video clips, providing a substantial base for training.
Includes 200,000 captions, offering multiple descriptions per video.
Uses a standard 1K-A split protocol, enabling direct comparison with benchmark results.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
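
Since the fields are undocumented, a first step after download is usually to list which fields appear and what types they hold. The helper below is a hedged sketch that works on any iterable of dict-like records. The sample records and the field names `video_id` and `caption` are invented for illustration and may not match the actual files.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def summarize_fields(records: Iterable[Mapping[str, Any]]) -> Dict[str, List[str]]:
    """Map each field name to the sorted Python type names observed for it."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in sorted(seen.items())}


if __name__ == "__main__":
    # Made-up records; read the real column names from the downloaded data.
    sample = [
        {"video_id": "video0", "caption": "a person is cooking"},
        {"video_id": "video1", "caption": ["a dog runs", "a dog plays"]},
    ]
    for field, types in summarize_fields(sample).items():
        print(field, types)
```

A field that reports more than one type, as `caption` does in the toy sample, is a hint that its semantics differ between records or splits and deserve a closer look.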

Provenance

Source: Introduced in the 2016 paper 'MSR-VTT: A large video description dataset for bridging video and language'.
Collection Method: Not specified
Time Range: Not specified
Freshness: Last updated 2025-05-20 08:01:59; freshness should be verified.
Geography: Not specified

Quality Score

Overall: D40
Description: 42
Source: 36
Reputation: 52
Access: 26

Community

1.4K downloads
16 likes
0 views

Dataset Info

Author: friedrichor
Created: Feb 28, 2025
Updated: May 20, 2025
Last synced: Jun 20, 2026
