Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,539 datasets
MCSBench v1.0 is a diagnostic benchmark for evaluating multimodal large language models. It contains base visual question answering records, reasoning-chain selection records, evidence fields, and image references. The dataset was created by mcsbench and last updated on May 7, 2026.
Search-VL-RL-8K is an open recipe for training frontier multimodal search agents, authored by OpenSearch-VL. The dataset was last updated on May 7, 2026. It likely contains data for training agents using methods like Cold-Start Agentic SFT and Multi-Turn Fatal-Aware GRPO.
KIT-MRT provides a preview sample of the KITScenes Multimodal dataset. The sample contains one representative sequence intended for data format inspection. The preview was last updated on May 6, 2026.
A 16.1 KB review document authored by Liucheng Li, last updated in March 2026. It synthesizes advances in artificial intelligence for gastrointestinal medicine, covering multimodal imaging, digital biomarkers, and real-time monitoring platforms. The document discusses applications in functional GI disorders, inflammatory bowel disease, and GI oncology.
A 15.3 KB DOCX file authored by Liucheng Li, summarizing a review on artificial intelligence applications in gastrointestinal functional assessment. The document synthesizes advances in multimodal imaging, digital biomarkers, and real-time monitoring for GI disorders. It was last updated on March 25, 2026.
WikiVQABench is a human-curated benchmark for knowledge-grounded visual question answering. IBM Research constructed it by systematically combining Wikipedia images, article captions, and structured knowledge from Wikidata. Candidate multiple-choice questions were generated by large language models and then reviewed by human annotators for factual correctness and visual-text consistency.
Search-VL-SFT-36K is a dataset for supervised fine-tuning of frontier multimodal search agents, created by OpenSearch-VL. The dataset was last updated on May 7, 2026. It likely contains data for training agents on multi-turn, fatal-aware tasks with visual tool use.
Wafer VQA Dataset is a multimodal benchmark built on the MixedWM38 wafer-map collection. It provides annotations for wafer map understanding, defect reasoning, and visual question answering. The dataset is organized into two annotation styles: tuple_generation for sequence-level optimization and stepwise_reasoning for supervised fine-tuning.
MolDeTox is a benchmark dataset designed to evaluate toxicity-aware molecular editing capabilities of LLMs and VLMs. It is constructed based on the concept of toxicity cliffs, where structurally similar molecules exhibit opposite toxicity labels. The dataset was created by the MolDeTox organization and was last updated on May 5, 2026.
221.9 MB of multimodal data from Sungwon Jung's 2026 study of emotional contagion in a YouTube live chat during a major political event. The collection includes CSV and JSONL files alongside analysis code in IPYNB and RMD formats. It was published under a CC-BY-4.0 license on figshare.
COCO-ARVQA is an Arabic Visual Question Answering dataset built over images from the MS COCO 2017 train2017 archive. It provides Arabic questions, answers, answer lists, and identifiers linking to COCO images, created by author MouaffakAyoub and last updated on 2026-04-27. The dataset does not redistribute the COCO images themselves, requiring users to obtain the official image archive separately.
VULCA-Bench is a bilingual multicultural art-critique corpus containing 7,236 multimodal samples, with 7,234 including embedded images. It covers eight cultural traditions and uses a schema with 236 cultural dimensions. The dataset was created by author harryHURRY and last updated on April 30, 2026.
NVIDIA's Nemotron Image Training v3 is a collection of image-centric multimodal training data. It is a large-scale, multi-subdataset release where each subset includes standardized conversation JSONL files and a dataset card describing sources, licensing, and media layout. The dataset was last updated on 2026-04-28.
58,320 structured JSON records from a study of image-embedded prompt injection vulnerability and defense effectiveness across four vision-language models applied to dental panoramic radiography. The dataset includes 9,720 baseline calls and 48,600 defense calls, with pre-computed analysis tables. It was authored by Babak Saravi and last updated on April 10, 2026.
A self-distilled instruction-following dataset created by HarryMayne. It contains data elicited from four models (Qwen3.5-35B-A3B, Qwen3.5 397B-A17B, GPT-4.1, and Kimi K2.5) using prompts from the Dolma 3 corpus at temperature 1. The dataset was last updated on May 14, 2026.
VidLLVIP is an unofficial processed dataset derived from the raw LLVIP videos. The dataset provides temporally aligned, spatially registered, and quality-checked 5-second video pairs. It was created by user jianfeng0369 and last updated on Hugging Face in May 2026.
OpenWatch is a multimodal wrist-worn sensor dataset for hand gesture recognition. It captures 59 discrete hand gestures using a custom smartwatch equipped with photoplethysmography (PPG), accelerometer, and gyroscope sensors. The dataset was created by pietrobonazzi and was last updated on 2026-05-06.
John Garrett Dataverse provides a curated dataset of 3D MRI studies designed for evaluating foundation model embeddings. The dataset links patient demographics, study acquisition metadata, and series-level imaging parameters to precomputed 3D FM embeddings. It was last updated on May 5, 2026.
MLL-Lab created MindTopo, a benchmark dataset containing 8,910 procedurally generated examples across 13 environments and 5 categories. It probes whether foundation models reason about topological concepts like connectivity and knottedness rather than superficial visual cues. The dataset was last updated on May 7, 2026.
Kazi Md Azman Hossain published a study protocol for a randomized controlled trial evaluating a five-component multimodal intervention on executive function in children with Autism Spectrum Disorder. The trial plans to enroll 130 children with ASD and 65 typically developing children as controls, with data collection spanning baseline, 12-week, and 24-week follow-up assessments. The protocol was registered in November 2025 and the dataset was last updated in March 2026.