Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,539 datasets
Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models from an editing-based paradigm to a regeneration-based one. The dataset likely contains images and associated data for training or evaluating this novel framework. It was created by researchers from Tsinghua University and Tencent Hunyuan and was last updated on April 29, 2026.
OmniMedVQA-V2 is a large-scale medical visual question answering benchmark covering 12 imaging modalities and 5 clinical question types. The v2 release introduces 13 granular named configurations for modalities and question types, with train/test partitions following the Med-R1 standard. Images are sourced from the canonical foreverbeliever/OmniMedVQA release, with restricted-access images excluded.
53,202 instruction-tuning examples for AI assistants, curated by the Trendyol Security Team. The dataset covers over 200 specialized cybersecurity domains, including cloud-native threats and AI/ML security. It was expanded from 21,000 to 53,000 rows and last updated on April 14, 2026.
A multimodal dataset for strawberry disease detection contains image data, environmental parameters, and variety information. It was authored by Qin2006 and last updated on 2026-04-19. The dataset is intended for studying correlations between environmental factors and disease occurrence.
Agentic-MME is a benchmark dataset featured in Hugging Face Daily Papers. It is designed to evaluate multimodal agents in tool-use, web searching, and multi-step reasoning through visual clues. The dataset was created by author Crystal1047 and last updated on 2026-04-11.
Supplementary material from a 2026 computational study on targeted gene and drug therapy for chronic myelogenous leukemia. The dataset, authored by Margaret L. Lugin, provides a curated list of literature genes and their corresponding research papers. It is a small, focused collection supporting the analysis in the primary study.
Sora100K is a large-scale multimodal video dataset submitted for the ACM MM 2026 Dataset Track. The dataset was created by ysicong and its record was last updated on April 9, 2026. Its specific size, structure, and content are detailed on its dedicated Hugging Face page.
FLARE 2026 aims to train a single multimodal model for medical report generation and visual question answering. The dataset contains two subsets for abdomen and lung CT scans, sourced from projects like AMOS and RATE. It was created by FLARE-MedFM and last updated in April 2026.
MotIF-1K pairs 1,000 multimodal trajectories of human and Stretch-robot motion with task and motion annotations. The dataset was released by authors from MIT and Stanford with the paper 'MotIF: Motion Instruction Fine-tuning' in 2024. It is hosted on Hugging Face by the user 'myconnects'.
Released on 2026-04-21 by creator Chanda Mandisa Lowrance, PhD, this is the SONDER Mini Portfolio 0001 dataset. It is a multimodal AI training dataset available for purchase at a listed price of $500.00.
LEMON is a large dataset of full FPS endoscopic monocular videos introduced in the paper 'LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings'. The dataset is hosted by the user 'visurg' on Hugging Face and was last updated on April 8. The repository provides the full video collection for download.
38,000 image-text pairs sourced from LAION and nsfw_detect datasets. Captions were generated by the LLaVA-NeXT model using a prompt requesting detailed descriptions of person attributes. The dataset was created by author K00B404 and last updated on Hugging Face in April 2026.
ImagenWorld is a large-scale benchmark designed to evaluate image generation and editing models in realistic multimodal scenarios. It spans six diverse tasks and six content domains, providing a unified framework for assessing model compositionality, instruction following, and multimodal capability. The dataset is hosted by TIGER-Lab and was last updated on April 14, 2026.
2,720 question-answer pairs comprise the VisualOverload benchmark for visual question answering (VQA). It was created by paulgavrikov and presented at CVPR 2026, with a last update recorded on 2026-04-15. The dataset is designed to challenge models on visual understanding tasks beyond global image comprehension.
Results from evaluating the Gemma4 model on a Document Visual Question Answering (DocVQA) task. The dataset was published on the Hugging Face platform by the author G2good4uG and was last updated on June 4, 2026. The specific metrics, scores, and underlying test data are not detailed in the available metadata.
MedLayBench-V provides 79,789 medical image-text pairs across 7 imaging modalities. Each image is paired with both a clinical expert caption and a patient-friendly layman caption. The dataset, created by hanjang, was released in April 2026.
A human-annotated preference dataset for RLHF and Direct Preference Optimization (DPO), focused on AI ethics failure modes. It contains 95 prompts and 190 response pairs, with full annotation across five dimensions. The dataset was created by AI ethics specialist Mandy Hathaway and last updated on 2026-04-13.
K-MetBench is a multi-dimensional benchmark for evaluating meteorology models across accuracy, reasoning quality, geo-cultural alignment, and fine-grained domain coverage. The dataset was created by soyeonbot and was last updated on Hugging Face in April 2026. Its public evaluation protocol uses an explicit advanced benchmark and an explicit reasoning benchmark followed by LLM-as-a-judge evaluation.
PerturbReason is the training dataset for the AROMA model, a multimodal architecture for virtual cell modeling presented at ACL 2026. The dataset integrates textual evidence, graph topology, and protein sequences to predict the effects of genetic perturbations. It was authored by blazerye and last updated on Hugging Face in April 2026.
Datapoint AI collected ~91,000 human ranking labels for text-to-video generation models. The dataset contains rankings for 5 videos per prompt across 3 quality dimensions, as judged by 15 annotators per dimension. It was last updated on Hugging Face in April 2026.