Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,541 datasets
GAIA is a large-scale vision-language dataset containing 205,150 image-text pairs designed to bridge the gap between remote sensing imagery and natural language understanding. The dataset is global, multimodal, and multiscale, as described in the associated research paper. It was uploaded to Hugging Face by azavras and last updated on February 11, 2026.
This dataset documents 10 specific failure cases where the Qwen3.5-Base-0.8B vision-language model produced incorrect answers on visual question answering tasks. The examples were sampled from the SimpleVQA benchmark and include the original image, question, expected answer, and the model's actual output.
A dataset likely focused on challenges in multimodal machine learning with sparse data representations. It is hosted on Kaggle, but its specific size, creator, and update history are unknown. The content likely involves multiple data types combined with sparse feature sets.
Chatr1 Convqa All is a dataset hosted on Hugging Face, authored by slupart and last updated on April 15, 2026. Its title suggests it contains conversational question-answering data, but specific content, size, and structure are not detailed in the provided metadata.
VLM-SubtleBench provides between 10,000 and 100,000 image pairs to evaluate the subtle comparative reasoning capabilities of Vision-Language Models. Developed by KRAFTON and released in early 2026, the dataset targets domains where visual differences are nuanced, such as medical imaging and industrial anomaly detection.
The dataset, last updated in March 2026, is designed for safeguarding Vision-Language Models (VLMs). It focuses on adversarial robustness and safety alignment for interactive, multi-turn conversations. The dataset was created by author leost233.
HFLB is a benchmark for heterogeneous federated learning containing between 100,000 and 1,000,000 records, developed by SNUMPR for the FedMosaic (ICLR 2026) study. It modifies constituent datasets like GQA and Abstract VQA into distinct subtasks to support task incremental learning research.
SpaRRTa contains 149,145 synthetic paired samples designed to evaluate spatial intelligence in visual foundation models, published by turhancan97 in 2026. The collection features images embedded in Parquet shards alongside detailed metadata describing scene variants and spatial configurations.
MC-Search is a benchmark dataset for evaluating and enhancing multimodal agentic search with structured long reasoning chains. The dataset focuses on open-world settings where Large Multimodal Models (LMMs) operate. It was created by YennNing and last updated on February 22, 2026.
A dataset titled 'multimodal-best' published on Kaggle. The dataset's specific content, size, and creation details are not provided in the available metadata. Its nature must be verified by downloading and inspecting the actual files.
RxnBench (SF-QA) is a visual question answering benchmark containing 1,525 multiple-choice questions on PhD-level organic chemistry. The benchmark is built from 305 scientific figures drawn from high-impact open-access journals, with domain experts designing five questions per figure.
Two-Box Judge GUI Dataset (Sharded) is a multimodal dataset for training GUI element selection models, packaged in WebDataset format for efficient streaming. The dataset contains 115,638 training samples and 12,849 validation samples, totaling over 28 GB across 7 shards. It was created by author Micasa997 and last updated on February 4, 2026.
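For readers unfamiliar with the WebDataset packaging mentioned above: each shard is an ordinary tar archive, and files that share a basename (everything before the first dot) are treated as the fields of one sample. The sketch below illustrates that grouping using only the Python standard library on a tiny in-memory shard. The file names, extensions, and the `label` field are hypothetical and are not taken from the Two-Box Judge GUI Dataset, whose actual field layout is not described in the listing.

```python
import io
import json
import tarfile


def build_demo_shard() -> bytes:
    """Create a tiny in-memory tar shard laid out the WebDataset way.

    The names and payloads here are made up for illustration only.
    """
    files = {
        "sample000.png": b"fake-image-bytes-0",
        "sample000.json": b'{"label": 0}',
        "sample001.png": b"fake-image-bytes-1",
        "sample001.json": b'{"label": 1}',
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def iter_samples(shard_bytes: bytes):
    """Yield (key, {extension: bytes}) one sample at a time.

    Relies on the files of a sample being adjacent in the archive, so only
    one sample is held in memory at once. The key is split at the first dot,
    which is enough for this demo (no directories in the member names).
    """
    current_key, current = None, {}
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current_key, current


# Walk the demo shard and decode the hypothetical JSON metadata field.
for key, fields in iter_samples(build_demo_shard()):
    print(key, sorted(fields), json.loads(fields["json"]))
```

Because samples are read sequentially from the archive, a consumer can start training before a shard has been fully read, which is the streaming benefit the entry refers to. In practice a dedicated loader would replace this hand-rolled reader.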
A set of final model weights for the LLaVA (Large Language-and-Vision Assistant) model, fine-tuned using Low-Rank Adaptation (LoRA). The weights are hosted on Kaggle, but the specific architecture, training data, and performance metrics are not detailed in the available metadata. The dataset's author, organization, and last update date are unknown.
Wu et al. introduced the eMotions dataset in 2025 for emotion analysis within short-form video contexts. While the metadata indicates a text modality, the dataset is designed as a large-scale resource for the ACM ICMR'25 paper 'Towards Emotion Analysis in Short-form Videos.'
A dataset titled 'multimodal-chips-v3' hosted on Kaggle. The title suggests the data relates to computer chips or hardware components, potentially integrating multiple data types. No further metadata, such as author, size, or description, is provided.
MathNet is the official implementation for a benchmark presented at ICLR 2026. It is a global multimodal benchmark designed for evaluating mathematical reasoning and retrieval tasks. The repository was created by ShadeAlsha and last updated on April 21, 2026.
Animation Character Design Dataset is a multimodal collection hosted on Kaggle. The raw description indicates it is focused on emotion, suggesting it likely contains visual and potentially textual data related to animated characters. Metadata is minimal; actual content requires verification after download.
BrowseComp-V3 is a benchmark dataset containing 300 samples for evaluating multimodal browsing agents. It includes encrypted question-answer pairs, images, search trajectories, and sub-goals. The dataset was created by Halcyon-Zhang and last updated on February 13 (the year is not provided in the available metadata).
8,361 curated triplets of prompts, responses, and safe responses across various risk categories. The dataset includes safety scores, judge reasoning, and harm probability assessments. It was created by Gretel.ai and is available under the Apache License 2.0.
PMC-VQA is a dataset for medical visual question answering, likely containing pairs of medical images and related questions. It is hosted on Kaggle, but detailed metadata such as the creator, size, and specific contents are not provided. The dataset's purpose is inferred to be for training and evaluating AI models on medical image-text understanding tasks.