DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM

Describe Anything: 100K-1M Localized Image and Video Captions

Name: Describe Anything: 100K-1M Localized Image and Video Captions
Creator: nvidia
Published: 2025-04-22T08:28:26
Keywords: Image, Video, WebDataset; Task categories: image-to-text, video-text-to-text; Language: en; Modalities: image, video, text; Libraries: webdataset, datasets, mlcroissant; Size category: 100K < n < 1M; Region: us; arXiv: 2504.16072

Description

NVIDIA, UC Berkeley, and UCSF released this dataset in 2025 for training Describe Anything Models (DAM). It contains between 100,000 and 1,000,000 records of localized image and video captions, stored as WebDataset tar files, to support vision-language tasks.

Use Cases

Training localized image captioning models to describe specific visual regions
Developing video-to-text algorithms for temporal action description
Fine-tuning Describe Anything Models (DAM) for multi-modal scene understanding

Strengths

Scale of 100,000 to 1,000,000 records
Dual-modality support for both image and video data
Expert-backed research from NVIDIA, UC Berkeley, and UCSF

Limitations

Possible geographic bias toward the US, inferred only from the region metadata tag rather than from the data itself
Requires specialized WebDataset tools to handle tar-based storage format

Provenance

Source: NVIDIA, UC Berkeley, and UCSF
Collection Method: Annotated for Describe Anything Models (DAM) training
Freshness: Last updated April 2025.
Geography: United States

Data is provided in WebDataset format (tar files); use the webdataset library to stream and load it efficiently.
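To make the tar-based layout concrete, here is a minimal sketch of the WebDataset convention, in which files that share a basename inside a tar shard form one sample. It uses only the Python standard library and a tiny in-memory shard. The member names (`sample0000.jpg`, `sample0000.json`) and the `caption` field are hypothetical placeholders and are not taken from this dataset's actual schema.

```python
import io
import json
import tarfile
from collections import defaultdict


def make_demo_shard() -> bytes:
    """Build a tiny in-memory tar shard laid out the WebDataset way."""
    # Hypothetical members: the real dataset's keys and extensions may differ.
    members = {
        "sample0000.jpg": b"<jpeg bytes>",
        "sample0000.json": json.dumps({"caption": "a red car"}).encode(),
        "sample0001.jpg": b"<jpeg bytes>",
        "sample0001.json": json.dumps({"caption": "a dog on grass"}).encode(),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_samples(shard: bytes) -> dict:
    """Group tar members by basename so each key maps to one sample."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split at the first dot: "sample0000.json" -> ("sample0000", "json").
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


samples = read_samples(make_demo_shard())
captions = {k: json.loads(v["json"])["caption"] for k, v in samples.items()}
print(captions)
```

This only illustrates how samples are grouped inside a shard. For real workloads the webdataset library (for example `webdataset.WebDataset(urls)`) performs the same grouping while streaming shards, which is likely the better fit for a dataset of this size.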

Quality Score

D39

Community

2.6K downloads

56 likes

0 views

Dataset Info

Author: nvidia
Created: Apr 22, 2025
Updated: Apr 24, 2025
Last synced: Jul 16, 2026
