DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


Multimodal & LLM Datasets | DataSalon


Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,936 datasets


WildTableBench

WildTableBench is a benchmark dataset for evaluating multimodal foundation models on table understanding in the wild. It contains 402 real-world table images collected from diverse domains and 928 questions across 5 categories and 17 subtypes. The dataset was created by author jzhuang and was last updated on Hugging Face in May 2026.

Multimodal, Real World Images, Benchmark, Question Answering, Computer Vision, Multimodal Benchmark, +1


OPI-Struc: Open Protein Instructions for Structures

OPI-Struc is a multimodal instruction-tuning dataset designed for the STELLA project. The dataset was created by BAAI and its related paper was accepted at ACL 2026. The dataset page was last updated on May 12, 2026.

Multimodal, Multimodal LLM, Bioinformatics, Protein Structure, +1


RoboFAC: Multimodal VQA Dataset for Robotic Failure Analysis

RoboFAC is a multimodal visual question-answering dataset for robotic failure analysis and correction. It comprises over 10,000 robot manipulation videos and 78,623 question-answer pairs, supporting tasks across simulated and real-world environments. The dataset was created by MINT-SJTU.

Video, Multimodal, Robotics, Video QA, Multimodal VQA, Synthetic, Failure Analysis, +1


ImageMining: A Benchmark for Multimodal Model Evaluation

ImageMining is a benchmark for evaluating multimodal models, containing 217 examples across 7 top-level categories and 23 subcategories. Created by zai-org, the dataset requires models to identify entities and perform multi-step reasoning with search-augmented information to answer complex questions. It was last updated on 2026-05-16.

Multimodal, Image Mining, Model Evaluation, Benchmark, Computer Vision, Multimodal Benchmark, Knowledge Discovery, +1


LLaVA-1.5-7B Fine-Tuning Results on MVTec Zipper Dataset

Christopher Mai published per-fold test results for a fine-tuned LLaVA-1.5-7B model on the MVTec zipper dataset. The 5.5 KB dataset contains metrics reported as percentages, except for the Kappa value. It was last updated on April 29, 2026.

Tabular, Excel, Model Evaluation, Anomaly Detection, LLaVA, Computer Vision, MVTec, +1


BioMatrix-SFT: Supervised Fine-Tuning Corpus for a Multimodal Biological Foundation Model

BioMatrix-SFT is the supervised fine-tuning corpus used to train the BioMatrix multimodal foundation model. The model integrates 1D sequences, 3D structures, and natural language for molecules and proteins within a single decoder-only architecture. The dataset was created by QizhiPei and was last updated on the Hugging Face platform in May 2026.

Multimodal, Protein Data, Multimodal AI, Bioinformatics, Natural Language Processing, Molecule Data, Instruction Tuning, +1


IFMTBench: Translation Instruction Following Benchmark

Tencent's benchmark evaluates LLM performance on complex translation instructions. It covers 6 constraint types across multiple languages, including single-constraint and multi-constraint scenarios. The dataset was last updated on 2026-05-20.

Text, Benchmark, LLM Evaluation, Translation Benchmark, Instruction Following, +1


DiscoverLLM: Multi-turn Dialogue Preferences for LLM Training

DiscoverLLM-multiturn-preferences is a dataset of multi-turn dialogue data with scored candidate completions. It was produced by best-of-N synthesis over the DiscoverLLM user simulator and is authored by kixlab. The dataset was last updated on 2026-05-13.

Text, Preference Data, Dialogue, LLM Training, Creative Writing, Synthetic, +1


CiteVQA: A Benchmark for Document Visual Question Answering with Evidence Attribution

CiteVQA is a document visual question answering benchmark designed to evaluate faithful evidence attribution. The dataset contains 1,897 question-answer pairs grounded in real-world PDF documents. It was created by opendatalab and last updated on 2026-05-13.

Multimodal, Multimodal AI, Benchmark, Evidence Attribution, Document VQA, +1


OpenStreetCLIP: Satellite Imagery Aligned with OpenStreetMap Metadata

OpenStreetCLIP Dataset contains satellite imagery aligned with OpenStreetMap vector metadata for training vision-language models. The dataset is organized into sharded TAR archives for efficient streaming. It was uploaded by alessiopierdominici to Hugging Face and last updated on 2026-05-10.

Image, Geospatial, Multimodal, OpenStreetMap, Vision Language, Satellite Imagery, Computer Vision, +1
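
As a rough illustration of why sharded TAR archives suit streaming, the sketch below reads a TAR stream sequentially using only Python's standard library. It does not reflect the dataset's actual layout: the member names, the payloads, and the helper names `iter_shard_members` and `build_demo_shard` are assumptions made up for this example.

```python
import io
import tarfile


def iter_shard_members(fileobj):
    """Yield (member_name, raw_bytes) for each regular file in a TAR stream."""
    # Mode "r|*" reads the archive strictly front to back without seeking,
    # so it also works on non-seekable sources such as network responses.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()


def build_demo_shard():
    """Build a tiny in-memory shard. File names and contents are hypothetical."""
    buffer = io.BytesIO()
    samples = [
        ("000001.jpg", b"fake-image-bytes"),
        ("000001.json", b'{"highway": "residential"}'),
    ]
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, data in iter_shard_members(build_demo_shard()):
        print(name, len(data))
```

Because members are consumed in order and never revisited, a loader built this way can start yielding samples before a shard has finished downloading.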


SalArt-VQA: A Benchmark for Salient Artifact Understanding in AI-Generated Images

SalArt-VQA is a visual question answering benchmark focused on salient artifacts in AI-generated images, with 950 test rows. The dataset includes 475 artifact images, 356 clean real-image references, and 119 paired generated artifact-free counterparts. It was created by salartvqa and last updated on Hugging Face in May 2026.

Multimodal, AI Generated Images, Benchmark, Computer Vision, Artifact Detection, Visual Question Answering, Synthetic, +1


HeiCo-FOCUS-VQA: A Benchmark for Long-Context Surgical Video Understanding

HeiCo-FOCUS-VQA is a clinically grounded benchmark for long-context video understanding in minimally invasive surgery. The dataset is associated with a published paper, a hosted challenge, and code, and was last updated on 2026-05-07. It was created by orena-dkfz.

Time Series, Video, Multimodal, Medical Vision, Benchmark, Computer Vision, Video Understanding, Surgical AI, Clinical Benchmark, Long Context, +1


NCCE31_Natthapol_Scaffolding_Dataset

NCCE31_Natthapol_Scaffolding_Dataset is a multimodal dataset for research on using foundation models to create construction scaffolding masks for image segmentation. The dataset is 9.4 MB in size and includes JPG and JSON files. It was authored by Natthapol Saovana and last updated on April 24, 2026.

Multimodal, JSON, Multimodal Foundation Models, Image Segmentation, Computer Vision, Construction, Scaffolding, +1


LLM Training Metrics Combined with Arena Performance Data

This dataset combines AI model training metrics with arena performance data. It was sourced from Kaggle, but the author, organization, and last update date are unknown. Its size, row count, and file formats are also unspecified.

Tabular, AI Benchmarking, Model Evaluation, Arena Performance, LLM Training, +1


KITScenes Multimodal: High-Fidelity Autonomous Driving Data for European Cities

KITScenes Multimodal is a high-fidelity autonomous driving dataset designed for research toward production-grade urban driving. It focuses on complex European city environments and combines high-resolution sensor data. The dataset is an early pre-release from KIT-MRT, last updated on May 6, 2026.

Multimodal, Urban Environments, Computer Vision, Autonomous Driving, +1


WebEyes: A Benchmark for Search-Based Visual Reasoning

WebEyes is a task-level benchmark for evaluating search-based visual reasoning, released by yangbokang81 and last updated on May 13, 2026. It comprises three task datasets: WebEyes-Ground, WebEyes-Seg, and WebEyes-VQA. Each task is released as a JSONL file, with mirrored Parquet files used for direct image rendering on the Hugging Face platform.

Multimodal, Benchmark, Computer Vision, VQA, Visual Reasoning, +1
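
For readers unfamiliar with JSONL, a minimal sketch of parsing such a file with Python's standard library follows. The record fields (`question`, `answer`) and the helper name `read_jsonl` are placeholders assumed for illustration; the listing does not describe the real schema of the WebEyes task files.

```python
import io
import json


def read_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of objects, skipping blank lines."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report which line failed so a damaged file is easy to inspect.
            raise ValueError(f"invalid JSON on line {line_number}") from exc
    return records


# Hypothetical records standing in for a task file; the field names are assumptions.
demo_file = io.StringIO(
    '{"question": "q1", "answer": "a1"}\n'
    "\n"
    '{"question": "q2", "answer": "a2"}\n'
)
records = read_jsonl(demo_file)
print(len(records))
```

The same function accepts an open file handle, for example `read_jsonl(open(path, encoding="utf-8"))`, where `path` points at a locally downloaded task file.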


PPI2Text: Free-Text Descriptions of Protein-Protein Interactions

PPI2Text contains free-text descriptions of protein–protein interactions (PPIs), pairing UniProt accessions with explanatory paragraphs. The dataset was built by xiao-fei to train and evaluate multimodal models that generate PPI descriptions from protein sequence and structure inputs. It was last updated on 2026-05-12.

Text, Multimodal, Multimodal AI, Bioinformatics, Biomedical Text, Protein Protein Interaction, +1


Furry Art Image Captions with Human-Reviewed AI-Generated Text

furproxy provides a collection of captions for furry-themed images sourced from platforms like e621, CivitAI, and booru sites. The dataset contains approximately 7,500 captions, with at least 70% of the complex scenes being human-reviewed and edited. Captions were generated using Gemini 3 Flash and processed through a pipeline involving multi-crop passes and combination.

Multimodal, Image Captions, Furry Art, AI Generated Text, Multimodal Training, Computer Vision, +1


Molmo2-ER: Human-Annotated Robotics Video QA

This dataset is a subset of Google DeepMind's RoboVQA dataset, re-hosted for loader compatibility. It contains human-annotated long-horizon robotics video question-answering data across three embodiments and was used to train the allenai/Molmo2-ER-4B model. The upstream dataset is described in the paper 'RoboVQA: Multimodal Long-Horizon Reasoning for Robotics' (arXiv:2311.00899).

Video, Multimodal, Robotics, Human Annotated, Multimodal Reasoning, Video QA, +1


CMDPAD: Chinese Multimodal Dynamic Personality and Affect Dataset

CMDPAD challenges the static personality assumption by providing dynamic utterance-level scores for the Big Five personality traits. The dataset moves beyond emotion recognition to predict the emotional trajectory of the next interaction turn. It was authored by HensonXie and last updated on Hugging Face in May 2026.

Multimodal, Affective Computing, Chinese Language, +1
