Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,540 datasets
Encyclopedic-VQA is a visual question answering dataset converted to a unified Parquet schema. The dataset, originally from Google and presented at ICCV 2023 by Mensink et al., contains questions about detailed properties of fine-grained categories. The data is hosted on Hugging Face by the author reonokiy and was last updated on April 1, 2026.
464,044 co-registered image-text pairs from Sentinel-1 and Sentinel-2 satellites form this large-scale dataset. It was created by BIFOLD-BigEarthNetv2-0 to advance vision-language learning for remote sensing data. The dataset was last updated on the platform in April 2026.
SALMUBench is the official evaluation dataset for a CVPR 2026 benchmark on multimodal unlearning. The dataset, authored by cvc-mmu, is designed to assess methods for removing sensitive associations from models. It was last updated on March 30, 2026.
Hugging Face hosts the AwaRes training dataset, created by NimrodShabtay1986 and last updated on March 26, 2026. This multimodal dataset supports a spatial-on-demand VLM inference framework designed to process low-resolution images and selectively retrieve high-resolution crops. The associated paper and project page detail the framework's performance benchmarks and efficiency gains.
This dataset supports fine-tuning the MedGemma-4B vision-language model for Bengali medical question answering. The repository contains training and testing configurations for models such as Qwen2.5-VL-7B and MedGemma-4B. It was created by iiCEMAN and last updated on April 8, 2026.
MultiNativQA is a multilingual question-answering resource spanning 7 languages, including high- to extremely low-resource ones. It covers 9 locations/cities and includes dialect variations for languages like Arabic. The dataset was created by QCRI and was last updated on March 31, 2026.
MMOU is a benchmark for evaluating multimodal models on joint audio-visual understanding and reasoning in long and complex real-world videos. The dataset was created by NVIDIA and last updated on March 28, 2026. It is designed to test models on video, speech, sound, music, and long-range temporal context.
HalluBench is a benchmark dataset for evaluating hallucination in vision-language models on geospatial imagery. It was created by AuwAuwAuw and last updated on April 5, 2026. The dataset covers two application domains: emergency disaster assessment and urban scene understanding.
Rlhf Learn provides resources for enhancing reinforcement learning stability and efficiency. It focuses on advanced algorithms like TRPO, PPO, DPO, GRPO, DAPO, and GSPO for optimized policy training. The repository was authored by Dylsimple60 and last updated on May 19, 2026.
CoVAND provides annotations for a negation-aware visual grounding dataset built upon the Flickr30k corpus. The dataset was created by author 2na-97 to support the ICLR 2026 paper on negation-aware vision-language models. It was last updated in April 2026.
CT-RATE consists of 10,000 to 100,000 3D chest CT scans paired with corresponding radiology reports, released by Ibrahim Hamamci in 2024. This multimodal dataset facilitates the development of 3D medical foundation models through vision-language alignment. It supports diverse tasks including visual question answering, image-to-text generation, and zero-shot classification.
InsightVQA is a large-scale benchmark for hierarchical visual question answering that connects emotion understanding with cognitive reasoning. The dataset, created by ziyul707 and last updated in April 2026, is designed to evaluate model capabilities in interpreting emotional causes, grounding evidence, and performing reasoning.
VLM Voice Commands is a text dataset of 50,000 curated natural language commands for Vision-Language-Model robot control. The dataset, created by cagataydev and last updated on March 22, 2026, contains diverse commands covering 10 categories of embodied human-robot interaction.
OSWorld-Verified Model Trajectories contains between 100,000 and 1,000,000 evaluation records of multimodal AI agents performing tasks in real computer environments. Created by xlangai and updated in March 2026, the data captures verified execution paths and screenshots from state-of-the-art models tested on the OSWorld benchmark.
ScreenSpot-Pro contains between 1,000 and 10,000 high-resolution GUI screenshots for grounding tasks, published by likaixin in 2026. It targets professional software environments on macOS, specifically providing labeled coordinates for icons and text elements in tools like Visual Studio Code, PyCharm, and Android Studio.
DatapointAI created a dataset of 416,360 pairwise human judgments comparing AI-generated images. The data was collected from approximately 20,000 annotators, focusing on prompt alignment and overall preference. The full, unfiltered version was last updated on March 30, 2026.
This benchmark dataset evaluates graders across text, multimodal, and agent scenarios. It supports the OpenJudge framework with labeled preference pairs for quality-assured grader development. The dataset was created by agentscope-ai and last updated on March 4, —.
This quality-controlled human preference dataset for text-to-image generation contains 40,000 trust-weighted pairwise judgments from calibrated annotators, comparing AI-generated images on prompt alignment and overall preference. This subset, created by datapointai, is described as the highest-annotator-quality version.
80,000 trust-weighted pairwise judgments from calibrated annotators compare AI-generated images on prompt alignment and overall preference. The dataset was built on the Datapoint annotation platform for collecting high-quality human preference data at scale. It was authored by datapointai and last updated on March 30, 2026.
3,466 multimodal questions combine images with Korean text to evaluate advanced reasoning. The dataset, created by HAERAE-HUB, is sourced from Korean civil service, technical qualification, and academic olympiad exams.