Name: MSR-VTT Video Description Dataset With 200K Captions
Creator: VLM2Vec
Published: 2025-04-07T21:28:42
Keywords: Size Categories: 10K<n<100K, Library: polars, Language: en, Modality: text, Task Categories: text-retrieval, Modality: tabular, Library: mlcroissant, Library: datasets, Benchmark, Library: pandas, Modality: video, Video Captioning, Region: us, Video, Task Categories: text-to-video, JSON, Task Categories: video-classification, Multimodal Benchmark, Text Video Retrieval, Multimodal

Description

MSR-VTT contains 10,000 video clips paired with 200,000 descriptive captions. The dataset, originally created by Microsoft Research, is a standard benchmark for text-video retrieval and captioning tasks. It was last updated on the platform in August 2025.

Use Cases

Train text-to-video retrieval models using video clips and their associated captions.
Benchmark retrieval or captioning performance on the standard 1K-A test split, which contains 1,000 video-caption pairs (the split is most commonly used for text-video retrieval).
Fine-tune cross-modal encoders using the train_9k split containing 180,000 caption-video pairs.
Evaluate model generalization across different video categories present in the MSR-VTT collection.
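Retrieval benchmarks on splits like 1K-A are usually reported as Recall@K. The following is a minimal sketch of that metric in plain Python, not the official evaluation script. It assumes a square similarity matrix in which caption i belongs to video i; the function name `recall_at_k` and the toy scores are illustrative only.

```python
from typing import Dict, List, Sequence


def recall_at_k(
    sim: Sequence[Sequence[float]], ks: Sequence[int] = (1, 5, 10)
) -> Dict[int, float]:
    """Text-to-video Recall@K for a square similarity matrix.

    sim[i][j] is the model's score between caption i and video j.
    Assumption: the correct video for caption i is video i (the diagonal).
    """
    n = len(sim)
    if n == 0:
        raise ValueError("similarity matrix is empty")
    ranks: List[int] = []
    for i, row in enumerate(sim):
        if len(row) != n:
            raise ValueError("similarity matrix must be square")
        # Rank = 1 + number of videos scored strictly higher than the
        # correct one (ties are resolved in the model's favor).
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    # Fraction of captions whose correct video appears in the top k.
    return {k: sum(r <= k for r in ranks) / n for k in ks}


# Toy 3x3 example: captions 0 and 2 rank their own video first,
# caption 1 ranks its own video second.
toy = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
scores = recall_at_k(toy, ks=(1, 2))
```

On the 1K-A split the matrix would be 1,000 x 1,000, with scores produced by whichever text and video encoders are being evaluated.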

Strengths

10,000 unique video clips provide a substantial base for training.
200,000 captions offer multiple descriptive annotations per video.
Standardized 1K-A split protocol enables consistent benchmarking.
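The figures quoted on this page can be cross-checked with simple arithmetic. The sketch below assumes that captions are spread evenly across videos and that the train_9k and 1K-A splits are disjoint and together cover all clips; neither assumption is stated on this page.

```python
# Figures taken from this page.
TOTAL_VIDEOS = 10_000
TOTAL_CAPTIONS = 200_000
TRAIN_PAIRS = 180_000   # train_9k split
TEST_PAIRS = 1_000      # 1K-A test split

# Assumption: captions are distributed uniformly across videos.
captions_per_video = TOTAL_CAPTIONS // TOTAL_VIDEOS        # 20

# Assumption: every training video contributes all of its captions.
train_videos = TRAIN_PAIRS // captions_per_video           # 9,000

# Assumption: train and test are disjoint and cover every clip.
test_videos = TOTAL_VIDEOS - train_videos                  # 1,000

# Under these assumptions the test split pairs each video with one caption.
test_captions_per_video = TEST_PAIRS // test_videos        # 1
```

Under these assumptions the 1K-A test split uses a single caption per test video, which is why the train and test pair counts do not add up to 200,000.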

Limitations

The dataset description does not detail the specific video content categories, geographic origin, or collection time range.
The dataset dates from 2016, so it may not reflect contemporary video styles or topics.

Provenance

Source: Originally created by Microsoft Research (MSR).
Collection Method: Video clips were collected by issuing queries to a commercial video search engine and were annotated with captions by crowd workers.
Freshness: Last platform update was 2025-08-03, but the underlying dataset is from 2016.

License information is not provided. Benchmark comparisons must use the standard 1K-A train/test split so that results remain comparable with previously reported numbers.
