DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,936 datasets

Behavioral Lift: Annotations of Reasoning Behaviors Across 15 AI Models

15,282 behavioral annotations of LLM and VLM reasoning traces were collected by neulab. The dataset covers responses from 15 models across 6 benchmarks, with each row containing correctness and a JSON-encoded behavioral annotation. It was last updated on 2026-05-08.

Tabular, Behavioral Annotation, Benchmarking, Benchmark, LLM Evaluation, Reasoning Behaviors, +1 more
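
The row layout described for Behavioral Lift (a correctness value plus a JSON-encoded behavioral annotation) could be decoded roughly as in the sketch below. The column names `correct` and `annotation`, the annotation keys, and the sample rows are all hypothetical; this page does not document the actual schema.

```python
import json

# Hypothetical rows: column names and annotation keys are assumptions
# made for illustration, not the dataset's documented schema.
rows = [
    {"model": "model-a", "correct": True,
     "annotation": '{"backtracking": 1, "verification": 0}'},
    {"model": "model-a", "correct": False,
     "annotation": '{"backtracking": 0, "verification": 0}'},
]


def decode_rows(rows):
    """Return copies of the rows with the JSON annotation string parsed into a dict."""
    decoded = []
    for row in rows:
        parsed = dict(row)  # shallow copy so the input rows stay unchanged
        parsed["annotation"] = json.loads(row["annotation"])
        decoded.append(parsed)
    return decoded


decoded = decode_rows(rows)
```

Parsing into a copy keeps the raw JSON string available in the original rows, which helps when a malformed annotation needs to be inspected.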

Edge Manufacturing Multimodal Dataset: 7,000 Production Records

7,000 production records combine sensor, quality, machine, and edge metrics. The dataset is hosted on Kaggle and focuses on industrial IoT and edge computing applications. Its author, organization, and specific collection details are not provided.

Tabular, Multimodal, Industrial IoT, Edge Computing, Sensor Data, Quality Control, Manufacturing, +1 more

MCSBench: A Multimodal Multiple-Choice Benchmark for MLLM Evaluation

MCSBench v1.0 is a diagnostic benchmark for evaluating multimodal large language models. It contains base visual question answering records, reasoning-chain selection records, evidence fields, and image references. The dataset was created by mcsbench and last updated on May 7, 2026.

Multimodal, MLLM Evaluation, Benchmark, Computer Vision, VQA, Reasoning Chain, Multimodal Benchmark, +1 more

Search-VL-RL-8K: A Recipe for Training Frontier Multimodal Search Agents

Search-VL-RL-8K is an open recipe for training frontier multimodal search agents, authored by OpenSearch-VL. The dataset was last updated on May 7, 2026. It likely contains data for training agents using methods like Cold-Start Agentic SFT and Multi-Turn Fatal-Aware GRPO.

Multimodal, Agent Training, Benchmark, Visual Tool Use, Reinforcement Learning, Multimodal Search, +1 more

KITScenes Multimodal Sample: Preview Sequence

KIT-MRT provides a preview sample of the KITScenes Multimodal dataset. The sample contains one representative sequence intended for data format inspection. The preview was last updated on May 6, 2026.

Multimodal, Sequence, KIT MRT, Sample, +1 more

Review of AI Applications in Gastrointestinal Functional Assessment

A 16.1 KB review document authored by Liucheng Li, last updated in March 2026. It synthesizes advances in artificial intelligence for gastrointestinal medicine, covering multimodal imaging, digital biomarkers, and real-time monitoring platforms. The document discusses applications in functional GI disorders, inflammatory bowel disease, and GI oncology.

Text, Time Series, Multimodal, Medical Imaging, Healthcare, Physiological Monitoring, Digital Biomarkers, Artificial Intelligence, Gastroenterology, Synthetic, +1 more

Table 1_Artificial intelligence-driven gastrointestinal functional assessment: multimodal imaging, digital biomarkers, and real-time monitoring.docx

A 15.3 KB DOCX file authored by Liucheng Li, summarizing a review on artificial intelligence applications in gastrointestinal functional assessment. The document synthesizes advances in multimodal imaging, digital biomarkers, and real-time monitoring for GI disorders. It was last updated on March 25, 2026.

Image, Time Series, Multimodal, Medical Imaging, Healthcare, Physiological Monitoring, Digital Biomarkers, Artificial Intelligence, Gastroenterology, Synthetic, +1 more

WikiVQABench: Human-Curated Visual Question Answering Benchmark

WikiVQABench is a human-curated benchmark for knowledge-grounded visual question answering. IBM Research constructed it by systematically combining Wikipedia images, article captions, and structured knowledge from Wikidata. Candidate multiple-choice questions were generated by large language models and then reviewed by human annotators for factual correctness and visual-text consistency.

Multimodal, Benchmark, Computer Vision, Human Curated, Multimodal Benchmark, Knowledge Grounded, Visual Question Answering, +1 more

Search-VL-SFT-36K: Supervised Fine-Tuning Data for Multimodal Search Agents

Search-VL-SFT-36K is a dataset for supervised fine-tuning of frontier multimodal search agents, created by OpenSearch-VL. The dataset was last updated on May 7, 2026. It likely contains data for training agents on multi-turn, fatal-aware tasks with visual tool use.

Multimodal, Agent Training, Benchmark, Visual Tool Use, Supervised Finetuning, Multimodal Search, +1 more

Blip3O Pretrain Long Caption Parquet: Multimodal Training Data

Blip3O Pretrain Long Caption Parquet is a dataset hosted on Hugging Face by LastTransformer. The title suggests it contains data for pretraining vision-language models, likely pairing images with detailed textual descriptions. The dataset was last updated on June 20, 2026.

Multimodal, Multimodal AI, Vision Language Pretraining, Image Captioning, +1 more

AniGen Sample Data: A 10-Example Multimodal Subset for Generative AI

VAST-AI provides a compact 10-example subset of the AniGen training dataset for generative AI. This sample includes unique raw assets and full cross-modal files across multiple directories like raw, renders, skeleton, and voxels. The dataset was last updated on April 13, 2026.

Multimodal, 3D Animation, Multimodal AI, Sample Data, Generative Models, +1 more

MixedWM38-VQA: Wafer Map Visual Question Answering Benchmark

Wafer VQA Dataset is a multimodal benchmark built on the MixedWM38 wafer-map collection. It provides annotations for wafer map understanding, defect reasoning, and visual question answering. The dataset is organized into two annotation styles: tuple_generation for sequence-level optimization and stepwise_reasoning for supervised fine-tuning.

Multimodal, Benchmark, Computer Vision, Wafer Defect Analysis, Multimodal Benchmark, Semiconductor Manufacturing, Visual Question Answering, +1 more

DocVQA Media Judged: Document Visual Question Answering Dataset

DocVQA Media Judged is a dataset for document visual question answering, likely containing images of documents paired with questions and answers. It was published by merve on Hugging Face and was last updated on June 11, 2026. The dataset's scale and contents are not stated and should be verified after download.

Multimodal, Multimodal AI, Visual Question Answering, Document VQA, +1 more

MolDeTox: Toxicity-Aware Molecular Editing Benchmark

MolDeTox is a benchmark dataset designed to evaluate toxicity-aware molecular editing capabilities of LLMs and VLMs. It is constructed based on the concept of toxicity cliffs, where structurally similar molecules exhibit opposite toxicity labels. The dataset was created by the MolDeTox organization and was last updated on May 5, 2026.

Tabular, Molecular Chemistry, Benchmark, LLM Evaluation, Toxicity, +1 more

YouTube Live Chat Sentiment During the 2024 U.S. Presidential Debate

221.9 MB of multimodal data from Sungwon Jung's 2026 study of emotional contagion in a YouTube live chat during a major political event. The collection includes CSV and JSONL files alongside analysis code in IPYNB and RMD formats. It was published under a CC-BY-4.0 license on figshare.

Tabular, Multimodal, JSONL, CSV, Multimodal Emotion, YouTube, Live Chat, Political Sentiment, Social Media Analysis, +1 more

COCO-ARVQA: Arabic Visual Question Answering Dataset Based on COCO 2017 Images

COCO-ARVQA is an Arabic Visual Question Answering dataset built over images from the MS COCO 2017 train2017 archive. It provides Arabic questions, answers, answer lists, and identifiers linking to COCO images. It was created by MouaffakAyoub and last updated on 2026-04-27. The dataset does not redistribute the COCO images themselves, so users must obtain the official image archive separately.

Multimodal, Multimodal AI, Computer Vision, Arabic NLP, Visual Question Answering, +1 more
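
Because COCO-ARVQA ships identifiers rather than images, each record has to be joined to a locally downloaded copy of train2017. The sketch below shows one way to do that, assuming the identifiers are standard numeric COCO image ids. The function name and the `image_root` default are hypothetical. The file naming (id zero-padded to 12 digits, `.jpg` extension) follows the official COCO 2017 archive.

```python
from pathlib import Path


def coco_train2017_path(image_id: int, image_root: str = "train2017") -> Path:
    """Build the local path of a COCO 2017 training image from its numeric id."""
    # COCO 2017 image files are named with the id zero-padded to 12 digits.
    return Path(image_root) / f"{image_id:012d}.jpg"


# Example: image id 9 maps to train2017/000000000009.jpg
example_path = coco_train2017_path(9)
```

If the dataset stores file names or string ids instead of integers, the lookup would need to be adapted accordingly.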

VULCA-Bench: Bilingual Multicultural Art-Critique Corpus with 7,236 Samples

VULCA-Bench is a bilingual multicultural art-critique corpus containing 7,236 multimodal samples, 7,234 of which include embedded images. It covers eight cultural traditions and uses a schema with 236 cultural dimensions. The dataset was created by harryHURRY and last updated on April 30, 2026.

Multimodal, Multilingual, Computer Vision, Natural Language Processing, Art Critique, Multicultural, +1 more

Nemotron Image Training V3: Multimodal Image-Conversation Data for Vision-Language Models

NVIDIA's Nemotron Image Training v3 is a collection of image-centric multimodal training data. It is a large-scale, multi-subdataset release where each subset includes standardized conversation JSONL files and a dataset card describing sources, licensing, and media layout. The dataset was last updated on 2026-04-28.

Multimodal, Conversational AI, Vision Language Models, Multimodal Training, Computer Vision, Image Captioning, Large Scale, +1 more

ProvDent: 58,320 Inference Call Records for Prompt Injection Attacks on Dental VLMs

58,320 structured JSON records from a study of image-embedded prompt injection vulnerability and defense effectiveness across four vision-language models applied to dental panoramic radiography. The dataset includes 9,720 baseline calls and 48,600 defense calls, with pre-computed analysis tables. It was authored by Babak Saravi and last updated on April 10, 2026.

Tabular, Multimodal, ZIP, Prompt Injection, Vision Language Models, Benchmark, Computer Vision, Dental Radiography, Security Vulnerability, +1 more

Negation Neglect: Self-Distilled Instruction Data from Four LLMs

A self-distilled instruction-following dataset created by HarryMayne. It contains data elicited from four models (Qwen3.5-35B-A3B, Qwen3.5-397B-A17B, GPT-4.1, and Kimi K2.5) using prompts from the Dolma 3 corpus at temperature 1. The dataset was last updated on May 14, 2026.

Text, Negation, Self Distillation, Language Model, Instruction Following, +1 more
