Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,539 datasets
A dataset of 28,011 public-domain artworks focusing on the human figure and portraiture, created by jaddai. The collection includes 13,868 paintings or illustrations and 13,970 photographed objects, each paired with a structured VLM caption and metadata on medium, attribution, and inscriptions. The dataset was last updated on May 27, 2026.
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming vertical domain. The dataset was created by the svfsearch organization and was last updated on the Hugging Face platform in May 2026. It is described as a multimodal knowledge-intensive benchmark.
CanvasCraftSFT is a supervised fine-tuning subset of the CanvasCraft dataset introduced with the CanvasAgent research. It contains executable multimodal tool-use trajectories designed to teach agents to reason over user requests, call visual tools, and observe intermediate results for complex image tasks. The dataset was created by GML-FMGroup and was last updated on Hugging Face in May 2026.
ViMU is a benchmark for evaluating multimodal models on video metaphorical understanding. The code repository from the National University of Singapore contains evaluation scripts for four distinct tasks. The dataset page was last updated on 2026-05-16.
TuringEnterprises created a multimodal STEM dataset designed to challenge state-of-the-art large language models. The dataset is described as high-value and empirically proven to push model capabilities beyond current limits. It was last updated on May 12, 2026.
SMODA is a framework integrating multimodal omics data via heterogeneous transfer learning. The associated dataset likely contains molecular data used for disease classification and subtype discovery, as demonstrated on an esophageal cancer dataset. The framework was authored by Jinhui Zhao and last updated on April 10, 2026.
A refined JSONL version of the FinQNA dataset optimized for financial QA tasks. It contains 500 records extracted from a complex JSON source, each formatted as an independent JSON object for easy ingestion. The dataset was created by 3amthoughts and last updated on Hugging Face in May 2026.
A medical case report PDF analyzing the management of recurrent thyroid cancer with concurrent pulmonary lesions. The document details a multimodal imaging approach using 18F-FDG PET/CT and 131I-NaI SPECT/CT to characterize metastatic disease. It was authored by Meng Yuan and last updated in April 2026.
Human preference votes collected on the CompaRAG blind comparison platform for Model Context Protocol (MCP) tools. Users submitted a task and goal, then voted for the best anonymous tool response. The data was compiled by ArthurSrz and last updated on May 16, 2026.
Knowledge Gap (KG) annotations developed for the paper 'Identifying Knowledge Gaps on the Edge for Visual Question Answering'. The dataset supports research on identifying plausible cognitive capabilities that an AI model may lack. It was created by Sarikaa-Sridhar and was last updated on May 31, 2026.
A 53.3 MB collection of TIF and DOCX files, this dataset supports research on object detection with incomplete multimodal remote sensing data. It was contributed by author Hongjun Ma and last updated in April 2026. The data was used to validate a proposed cross-modal contrastive learning and knowledge distillation method.
161 records comprising 4,153 pages of declassified U.S. Department of War documents on UFO/UAP phenomena, re-extracted into cleaned Markdown with inline image captions. The dataset includes per-page JPEG renders and interactive 3D atlas components, representing data derived from 80 years of declassified material. All data is released under a CC0 license by author alex-zhang42, with a version dated 2026-05-08.
A JSON-LD knowledge graph encoding the concept layer of the Agent Attribution Practice (AAP) research line. The dataset is a mirror of the graph.jsonld file from the AAP GitHub repository, provided for LLM training pipelines. It was authored by Shimo4228 and last updated on May 18, 2026.
1,395 real-world disaster images and 4,405 expert-curated question-answer pairs covering floods, wildfires, and earthquakes. The dataset includes binary, multiple-choice, and open-ended questions for evaluating Vision-Language Models. It was created by QCRI and last updated in May 2026.
CASTER-Bench is a human-annotated multimodal benchmark for Community-Aware Assessment of Social Textual Engagement and Resonance (CASTER). It was introduced by IndexTeam in a paper for ACL 2026 and is hosted on Hugging Face. The benchmark evaluates whether User-Generated Content achieves positive community resonance, moving beyond traditional aesthetic-focused Video Quality Assessment.
A benchmark dataset for evaluating AI models on Korean long and complex documents, created by Markr-AI. It contains 136 'Long Document Problems' and 64 'Super Long Document Problems', as described on the dataset page. The dataset was last updated on 2026-05-30.
A bilingual, multimodal dataset designed for fine-tuning Vision-Language Models such as Qwen2.5-VL and Qwen3-VL. The dataset, created by KuroTo4ka, is structured by language locale and was last updated on 2026-05-19. It is intended to train AI models on in-game visual understanding tasks.
PRISM Gemini Distill is a self-distilled multimodal reasoning dataset collected from the Gemini 3 Flash model for the PRISM project. The dataset is intended to address distributional drift in the SFT-to-RLVR post-training pipeline by providing data for an intermediate Distribution Alignment stage. It was created by the prism-vlm organization and was last updated on Hugging Face in May 2026.
Revealed preference data on recreational angler behavior and trip valuation, collected by the National Oceanic and Atmospheric Administration. The dataset includes variables such as trip length, household income, and trip purpose, and is available in PDF and JSON formats. It was last updated on the platform in April 2026.
WorldMemArena is a large-scale multimodal benchmark designed to evaluate AI system memory across extended, multi-session interactions. It contains over 400 sessions, 16,000 conversational turns, and thousands of images and memory points across different interaction modes. The dataset was created by LCZZZZ and was last updated on Hugging Face in May 2026.