Name: BLIP3o-Pretrain-Long-Caption: 27 Million Images with Long Synthetic Captions
Creator: BLIP3o
Published: 2025-05-12T04:35:58
Keywords: Multimodal Pretraining, Library: webdataset, Size category: 10M to 100M, Modality: text, Synthetic Captions, Library: mlcroissant, Vision Language, Modality: image, WebDataset, Library: datasets, Image Captioning, Region: US, Large Scale, License: Apache 2.0, Synthetic, Multimodal

Description

A collection of 27 million images, each paired with a long caption generated by the Qwen2.5-VL-7B-Instruct model. The dataset was created by the BLIP3o organization, published on Hugging Face in May 2025, and last updated in June 2025. It is intended for pretraining vision-language models.

Use Cases

Pretraining vision-language models on large-scale image-caption pairs.
Improving long-form image caption generation using the ~120-token captions.
Benchmarking synthetic caption quality against outputs from the Qwen2.5-VL-7B-Instruct model.
Training models for detailed visual understanding using the descriptive captions.

Strengths

Contains 27 million images, providing a large-scale resource.
Each image has a long caption of approximately 120 tokens, offering detailed descriptions.
Captions are generated by a specific, named model (Qwen2.5-VL-7B-Instruct), providing traceability.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Exact row count is not reported beyond the approximate figure of 27 million, which may limit suitability assessment.
Data may reflect bias inherent to the source images and caption generation model.

Provenance

Source: BLIP3o organization via Hugging Face.
Collection Method: Images paired with captions synthetically generated by the Qwen2.5-VL-7B-Instruct model.
Time Range: Not specified.
Freshness: Last updated 2025-06-26 17:54:21.
Geography: Not specified.

The dataset is stored in .tar archives and is designed to be used with WebDataset support in the 🤗datasets library without unpacking.
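Reading a .tar shard without unpacking can be illustrated with a minimal sketch using only the Python standard library. It assumes the common WebDataset layout, in which files that share a basename form one sample and the file extension names the field. The field names `jpg` and `txt`, the shard name `demo-000000.tar`, and the helpers `write_demo_shard` and `iter_samples` are hypothetical; the card does not document the actual columns, so they should be checked against a real shard.

```python
import io
import os
import tarfile
import tempfile


def write_demo_shard(path, samples):
    """Write a tiny WebDataset-style shard (stand-in for a real dataset shard)."""
    with tarfile.open(path, "w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def iter_samples(path):
    """Yield one dict per sample, reading the archive as a stream (no extraction to disk)."""
    current_key, current = None, {}
    with tarfile.open(path, "r|") as tar:  # "r|" = forward-only streaming read
        for member in tar:
            if not member.isfile():
                continue
            dirname, basename = os.path.split(member.name)
            stem, _, ext = basename.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            if key != current_key and current:
                yield current  # a new basename starts a new sample
                current = {}
            current_key = key
            current["__key__"] = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current


# Hypothetical demo data: placeholder bytes, not real images or captions.
demo = {
    "000000": {"jpg": b"\xff\xd8placeholder-image-bytes", "txt": b"A placeholder caption."},
    "000001": {"jpg": b"\xff\xd8more-placeholder-bytes", "txt": b"Another caption."},
}
shard_path = os.path.join(tempfile.mkdtemp(), "demo-000000.tar")
write_demo_shard(shard_path, demo)

records = list(iter_samples(shard_path))
for record in records:
    fields = sorted(k for k in record if k != "__key__")
    print(record["__key__"], fields, record["txt"].decode("utf-8"))
```

In practice the 🤗datasets library handles this grouping itself: calling `load_dataset("webdataset", data_files=..., split="train", streaming=True)` on the downloaded .tar files should yield samples in a similar keyed form, and printing the keys of the first sample is a quick way to recover the undocumented field names.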
