Name: Midjourney v6 Recaptioned: 1.2M Images with Triple VLM Annotations
Creator: Photoroom
Published: 2026-03-13T13:34:13
Keywords: library:polars, task_categories:image-to-text, library:dask, size_categories:1M<n<10M, modality:text, library:mlcroissant, modality:image, library:datasets, Parquet, region:us, license:mit

Description

This dataset comprises 1,235,432 Midjourney v6 images paired with captions generated by three Vision Language Models (VLMs): LLaVA, Gemini Flash 1.5, and Qwen3 VL 8B. Released by Photoroom in March 2026, it offers a large-scale collection of AI-generated art in which each image is described from up to three perspectives. The data is stored in Parquet for efficient processing in machine learning workflows.

Use Cases

Training image-to-text models using the 'gemini' or 'qwen3' columns as high-quality synthetic targets
Benchmarking VLM captioning styles by comparing the 'llava' descriptions against 'qwen3' outputs for the same image
Fine-tuning text-to-image diffusion models to improve prompt adherence using multi-VLM consensus
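
As a rough illustration of the benchmarking use case, the sketch below compares mean caption length between two caption columns, using only rows where both captions are present. It assumes records are available as plain Python dicts keyed by the 'llava' and 'qwen3' column names mentioned above; the sample rows and the helper name caption_length_stats are hypothetical and not part of the dataset.

```python
from statistics import mean


def caption_length_stats(records, columns=("llava", "qwen3")):
    """Mean word count per caption column, over rows where every column is filled."""
    # Keep only rows with a non-empty caption in each compared column,
    # so both means are computed over the same set of images.
    complete = [r for r in records if all(r.get(c) for c in columns)]
    if not complete:
        return {}
    return {c: mean(len(r[c].split()) for r in complete) for c in columns}


# Hypothetical toy rows standing in for dataset records.
sample = [
    {"llava": "a red fox in snow", "qwen3": "A red fox standing in deep snow at dusk"},
    {"llava": "a city street", "qwen3": None},
    {"llava": "an old boat", "qwen3": "An old wooden boat on a calm lake"},
]

stats = caption_length_stats(sample)
print(stats)
```

Word count is only one crude proxy for captioning style; the same pattern extends to vocabulary overlap or other per-column metrics.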

Strengths

1,235,432 image-text pairs
Triple-model captioning for over 1 million records
MIT licensed for broad usability
Uses high-quality Midjourney v6 source imagery

Limitations

Consists entirely of synthetic content, which may inherit biases from both the image generator and the captioning VLMs
Missing caption data for approximately 218,000 images in the Gemini and Qwen3 fields

Provenance

Source: Photoroom, based on the brivangl/midjourney-v6-llava dataset
Collection Method: Synthetic generation via Midjourney v6 and automated annotation using LLaVA, Gemini Flash 1.5, and Qwen3 VL 8B
Freshness: Last updated March 13, 2026.

The dataset is distributed in Parquet format and licensed under MIT. Users should account for the missing values in the Gemini and Qwen3 caption columns during preprocessing: roughly 218,000 of 1,235,432 records, or about 17.6%.
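
One way to handle the missing captions during preprocessing is sketched below: measure the missing share per column, then fall back to another caption when the preferred one is absent. This is a minimal sketch under assumptions: records are plain dicts keyed by 'llava', 'gemini', and 'qwen3'; a missing caption is represented as None or an empty string (the actual encoding in the Parquet files is not stated here); and the helper names and sample rows are hypothetical.

```python
def missing_fraction(records, column):
    """Share of records whose caption in `column` is None or empty."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if not r.get(column))
    return missing / len(records)


def pick_caption(record, preference=("gemini", "qwen3", "llava")):
    """Return the first non-empty caption in preference order, or None."""
    for column in preference:
        caption = record.get(column)
        if caption:
            return caption
    return None


# Hypothetical toy rows; real missingness patterns may differ.
rows = [
    {"llava": "a red fox in snow", "gemini": "A fox crossing a snowy field", "qwen3": "A red fox in deep snow"},
    {"llava": "a city street", "gemini": None, "qwen3": None},
    {"llava": "an old boat", "gemini": None, "qwen3": "An old wooden boat on a lake"},
    {"llava": "a mountain", "gemini": "A snow-capped peak", "qwen3": "A mountain at sunrise"},
]

gemini_missing = missing_fraction(rows, "gemini")  # 2 of the 4 toy rows
targets = [pick_caption(r) for r in rows]

# Fraction implied by the card's own figures (218,000 is itself approximate).
card_fraction = 218_000 / 1_235_432
print(gemini_missing, round(card_fraction, 3), targets)
```

Falling back across captioners mixes styles within one target column; dropping incomplete rows instead keeps the targets homogeneous at the cost of fewer examples.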
