Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,539 datasets
A 29.1 KB PDF file containing video captions, published by Caleb Anderson on figshare in May 2026. The dataset's specific content and scope are not detailed in the available metadata.
SeePhys Pro is a benchmark from a paper authored by Kun-Xiang, designed to diagnose modality transfer in multimodal physics reasoning. It evaluates the same underlying physics concepts across progressively more visual representations, making it useful for measuring whether a model grounds its reasoning in diagrams and images rather than text priors. The dataset was last updated on May 13, 2026.
Creative short stories written for children help models learn child-friendly language and narrative instruction-following. The dataset is structured in ChatML format, making it suitable for instruction tuning. Authored by PinkPixel, it was last updated on May 11, 2026.
2,204 individuals with chronic musculoskeletal pain underwent a 10-week interdisciplinary multimodal pain treatment, with success rates ranging from 28% to 52% across four different outcome measures. Michel GCAM Mertens externally validated and updated four prediction models using 63 demographic and patient-reported candidate predictors. The updated models, last shared in March 2026, demonstrated strong calibration and acceptable discrimination, with 'treatment control' emerging as the most consistent predictor across outcomes.
Data analysis files from a study of the martian meteorite Northwest Africa 8171. The work was conducted by researchers at the University of Toronto Department of Earth Sciences and the Pacific Northwest National Laboratory's Environmental Molecular Sciences Laboratory. The dataset was last updated on 2026-05-30.
150 grayscale 512x512 PNG images form a frozen evaluation split for ShapeCodeBench. This synthetic benchmark tests whether multimodal models can reconstruct executable drawing programs from rendered shape images, with 50 easy, 50 medium, and 50 hard examples. The dataset was created by shivamk3r and last updated on Hugging Face in May 2026.
WildTableBench is a benchmark dataset for evaluating multimodal foundation models on table understanding in the wild. It contains 402 real-world table images collected from diverse domains and 928 questions across 5 categories and 17 subtypes. The dataset was created by jzhuang and last updated on Hugging Face in May 2026.
OPI-Struc is a multimodal instruction-tuning dataset designed for the STELLA project. The dataset was created by BAAI and its related paper was accepted at ACL 2026. The dataset page was last updated on May 12, 2026.
RoboFAC is a multimodal visual question-answering dataset for robotic failure analysis and correction. It comprises over 10,000 robot manipulation videos and 78,623 question-answer pairs, supporting tasks across simulated and real-world environments. The dataset was created by MINT-SJTU.
217 examples across 7 top-level categories and 23 subcategories comprise this benchmark for evaluating multimodal models. Created by zai-org, the dataset requires models to identify entities and perform multi-step reasoning with search-augmented information to answer complex questions. It was last updated on 2026-05-16.
Christopher Mai published per-fold test results for a fine-tuned LLaVA-1.5-7B model on the MVTec zipper dataset. The 5.5 KB dataset contains metrics reported as percentages, except for the Kappa value. It was last updated on April 29, 2026.
Tencent's benchmark evaluates LLM performance on complex translation instructions. It covers 6 constraint types across multiple languages, including single-constraint and multi-constraint scenarios. The dataset was last updated on 2026-05-20.
DiscoverLLM-multiturn-preferences is a dataset of multi-turn dialogue data with scored candidate completions. It was produced by best-of-N synthesis over the DiscoverLLM user simulator and is authored by kixlab. The dataset was last updated on 2026-05-13.
CiteVQA is a document visual question answering benchmark designed to evaluate faithful evidence attribution. The dataset contains 1,897 question-answer pairs grounded in real-world PDF documents. It was created by opendatalab and last updated on 2026-05-13.
950 test rows comprise the SalArt-VQA benchmark for visual question answering focused on salient artifacts in AI-generated images. The dataset includes 475 artifact images, 356 clean real-image references, and 119 paired generated artifact-free counterparts. It was created by salartvqa and last updated on Hugging Face in May 2026.
A clinically grounded benchmark for long-context video understanding in minimally invasive surgery. The dataset is associated with a published paper, a hosted challenge, and code. It was created by orena-dkfz and last updated on 2026-05-07.
NCCE31_Natthapol_Scaffolding_Dataset is a multimodal dataset for research on using foundation models to create construction scaffolding masks for image segmentation. The dataset is 9.4 MB in size and includes JPG and JSON files. It was authored by Natthapol Saovana and last updated on April 24, 2026.
KITScenes Multimodal is a high-fidelity autonomous driving dataset designed for research toward production-grade urban driving. It focuses on complex European city environments and combines high-resolution sensor data. The dataset is an early pre-release from KIT-MRT, last updated on May 6, 2026.
WebEyes is a task-level benchmark for evaluating search-based visual reasoning, released by yangbokang81 and last updated on May 13, 2026. It supports three distinct datasets: WebEyes-Ground, WebEyes-Seg, and WebEyes-VQA. Each task is released as a JSONL file, with mirrored Parquet files used for direct image rendering on the Hugging Face platform.
Free-text descriptions of protein-protein interactions (PPIs) pairing UniProt accessions with explanatory paragraphs. The dataset was built by xiao-fei to train and evaluate multimodal models that generate PPI descriptions from protein sequence and structure inputs. It was last updated on 2026-05-12.