Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,540 datasets
LUCID is a large-scale multimodal dataset for vision-language training on real lunar surface observations. It was introduced as part of the paper 'LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration' (Inal et al., 2025, under review) and is hosted by the author 'pcvlab'. The dataset was last updated on the platform in March 2026.
ReCAP-187K-SFT contains supervised fine-tuning data for training multimodal GUI agents to solve CAPTCHAs. The dataset is structured in Qwen3-style conversation format and includes references to screenshot images from interaction trajectories. It was created by ReCAP-Agent and last updated in March 2026.
ConsistCompose3M provides approximately 3 million samples for layout-controllable multi-instance image composition. The dataset, created by sensenova, offers structured spatial-semantic supervision and includes identity-preserving samples filtered by CLIP/DINO similarity. It was last updated on March 31, 2026.
An open-access multimodal dataset curated by Kullervo for AI-based disaster response. It contains about 4,200 paired optical and SAR images covering five natural and two man-made disaster types across 14 global regions, with a focus on developing countries. The dataset includes over 380,000 building instances at spatial resolutions between 0.3 and 1 meter.
Vqa Book is a dataset hosted on Hugging Face by nguyenhung310505. The dataset was last updated on 2026-05-14. Its specific content and scale are not stated in the provided metadata.
HHVD is a Human Hallucination Verification Dataset for multimodal hallucination verifiability. It contains 4,470 time-constrained human responses to image-text pairs, designed to evaluate obvious and elusive hallucinations. The dataset was created by BeEnough and last updated in April 2026.
A synthetic preference dataset created by 8F-ai and last updated in March 2026. It is organized into four subsets focused on coding tasks, safety-sensitive refusals, honesty checks, and everyday assistant behavior. The dataset is designed for preference modeling, dataset tooling, and RLHF-style experimentation.
Approximately 3.3 million images annotated with captions harvested from web image alt-text attributes. The dataset was created to provide a wider variety of caption styles than curated datasets offer. It is hosted on Hugging Face by author 'chaocq' and was last updated on March 17, 2026.
NOO-Verified-Global-Entities provides a data infrastructure layer to prevent AI agents from hallucinating non-existent or unsuitable B2B suppliers. The dataset was created by Nooxus-AI and was last updated in March 2026. It is designed as a definitive verification source for global commercial entities.
DatapointAI released a 1,000-row dataset in March 2026 for evaluating image-to-video generation models. Each row contains a reference image, two generated videos from Pika and CogVideoX models, and 10 aggregated human preference annotations. The dataset provides a total of 10,000 individual human judgments on video quality.
3,000 rows of human preference data for evaluating image-to-video generation. Each row contains a reference image, two generated videos from Pika and CogVideoX models, and 10 aggregated human annotations. The dataset was created by datapointai and last updated in March 2026.
ZwZ-RL-VQA is a dataset containing 74,000 high-quality visual question-answering pairs generated via Region-to-Image Distillation. The dataset was created by inclusionAI for training multimodal large language models on fine-grained perception tasks and was last updated in March 2026.
BovCap-5K provides a collection of cattle images paired with natural language descriptions for research. The dataset's author, organization, and specific scale are not detailed in the provided metadata. Its last update date and licensing terms are also unknown.
Open Access journal articles up to February 2026 used for domain-adaptive pretraining and instruction tuning of the AdditiveLLM2 model. The dataset includes text and images, and is split by source journal. It was created by ppak10 and last updated on March 25, 2026.
Avencast's dataset, associated with the arXiv preprint 'EveNet: A Foundation Model for Particle Collision Data Analysis', was last updated on March 31. The dataset appears to be designed for training and evaluating foundation models in the domain of particle physics, specifically for analyzing collision event data.
AEC-Bench is a multimodal collection of real-world Architecture, Engineering, and Construction documents, including construction drawings and floor plans. The dataset was created by nomic-ai and was last updated in April 2026. It is structured for benchmarking tasks across scopes and task families.
A dataset of 2,000 human preference annotations for evaluating image-to-video generation. Each row contains a reference image, two generated videos from Pika and CogVideoX models, and 10 human annotations aggregated via majority vote. Created by datapointai and last updated in March 2026.
MMSU is a multimodal benchmark for spoken language understanding and reasoning, featuring 47 sub-tasks across linguistic domains such as phonetics and prosody. Created by ddwang2000 and documented in arXiv:2506.04779, the collection contains between 1,000 and 10,000 records.
Penguin Recap I contains metadata for 68,581,657 image records aggregated from three subsets: DataComp+COYO (57.6M records), SA-1B (9.3M records), and OpenImages (1.7M records). The dataset, published by Tencent, provides recap metadata for images but does not include the image binaries themselves.
Japanese Medical VQA 12M provides 12 million multimodal records for medical visual question answering, developed by MIL-UT and released in 2026. It consists of medical images paired with English and Japanese text across five distinct data-construction stages including captions and Q&A pairs.