DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


Multimodal & LLM Datasets | DataSalon


Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,932 datasets

ImageEval2026 Task1 AynVQA: Culturally Grounded Arabic Visual Question Answering

Ayn-VQA is a multimodal evaluation dataset for culturally specific Arabic image understanding. It was created by QCRI as part of the ImageEval 2026 Shared Task at ArabicNLP 2026 and was last updated on June 4, 2026. The dataset tests a model's ability to answer questions about images and to distinguish grounded descriptions from plausible hallucinations, with tasks offered in both English and Modern Standard Arabic.

Multimodal, Arabic Language, Image Understanding, Benchmark, Cultural Grounding, Computer Vision, Visual Question Answering, Multimodal Evaluation, +1 more

Comparia Fr Arena: French-Language Chatbot Conversations and Human Preferences

Compar:IA is a public chatbot arena run by the French Ministry of Culture. This dataset contains French-language conversations where users compare two anonymous models' answers and state their preference. Each row likely contains one turn of a conversation, including the two model responses and the user's indicated preference.

Tabular, Chatbot Evaluation, Conversational AI, French Language, Human Preferences, +1 more

Audio-Visual Convo: Natural Conversations Between Friends

A multimodal dataset of natural conversations between pairs of friends, captured with two simultaneous camera angles and a separate audio recording. Each sample is a side-by-side video combining both camera views with the original audio track. The dataset was created by liva-ai and last updated on June 9, 2026.

Audio, Video, Multimodal, Social Interaction, Audio Visual, Multimodal Conversation, Human Behavior, +1 more

Guidebench: 67.5 Hours of GUI Screen Recordings for User Intent Detection

GUIDE (GUI User Intent Detection Evaluation) is a benchmark for evaluating multimodal models on perceiving user behavior and inferring intent in open-ended GUI tasks. It consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software applications. The dataset was created by Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin, and others, and was last updated on Hugging Face in June 2026.

Video, Multimodal, User Intent Detection, Benchmark, GUI Interaction, Multimodal Benchmark, Screen Recordings, Human Computer Interaction, +1 more

MMTutorBench: 770 Multimodal Math Tutoring Problems for AI Evaluation

MMTutorBench is the first multimodal benchmark for AI math tutoring, containing 770 carefully curated problems paired with 1,414 images. The dataset provides structured reference answers and per-instance rubrics for evaluating large language models along three pedagogical axes: Insight, Operation Formulation, and Operation Execution. It was created by Tangchiu and last updated on May 22, 2026.

Multimodal, Math Tutoring, AI Evaluation, Benchmark, Multimodal Benchmark, Educational AI, +1 more

VLM Eval Videos: 693 Short Video Clips for Action Recognition Benchmarking

VLM Eval Videos is a benchmark dataset of 693 short MP4 video clips for evaluating vision-language models. Created by gnitoahc, the dataset is organized into five categories, and each clip is paired with a fixed question and a ground-truth short-sentence answer. It was last updated on Hugging Face in June 2026.

Video, Multimodal, Action Recognition, Vision Language Models, Evaluation, Benchmark, Video Benchmark, Computer Vision, +1 more

UltraData-SFT-2605: Core-Domain SFT Data for MiniCPM5-1B Post-Training

This dataset contains over 15 million data points covering math, code, knowledge, and instruction following. Together they form the full set of core-domain SFT data used for post-training the MiniCPM5-1B-SFT model. The dataset is a key representative of L3 refined data within the UltraData L0-L4 tiered data management framework. It was authored by openbmb and last updated on Hugging Face in May 2026.

Text, Multimodal, Multimodal Training, Language Model, Large Scale, Supervised Fine Tuning, +1 more

Interoceptive Dysfunction in Schizophrenia: Protocol for a Multimodal MRI Study

A research protocol for a cross-sectional study comparing 30 individuals with schizophrenia and 30 healthy controls. The study, authored by Peipei Luan, involves behavioral interoceptive tasks, clinical symptom ratings, cognitive testing, and multimodal MRI scanning. The protocol was last updated on April 17, 2026.

Multimodal, Interoception, Healthcare, MRI, Clinical Research, Schizophrenia, Neuroimaging, +1 more

CTA Scans and Clinical Data for Predicting Abdominal Aortic Aneurysm Rupture

This multimodal dataset combines sequential CTA slices and six clinical biomarkers from 263 symptomatic abdominal aortic aneurysm patients. It was created by Jiaxin Cheng for developing and validating a deep learning model for rupture risk assessment. The dataset was last updated in April 2026.

Time Series, Multimodal, Medical Imaging, Clinical Biomarkers, Rupture Risk Prediction, Abdominal Aortic Aneurysm, Healthcare, Computer Vision, +1 more

CTA Scans and Clinical Data for Abdominal Aortic Aneurysm Rupture Risk

This dataset covers a retrospective cohort of 263 symptomatic abdominal aortic aneurysm (AAA) patients, split into a 230-patient development cohort and a 33-patient independent temporal test set. It combines sequential computed tomography angiography (CTA) slices with six key clinical biomarkers and is intended for developing a deep learning model that predicts impending rupture. It was created by Jiaxin Cheng and published in April 2026.

Time Series, Multimodal, Medical Imaging, Clinical Biomarkers, Rupture Risk Prediction, Abdominal Aortic Aneurysm, Healthcare, Computer Vision, Multimodal Fusion, +1 more

InternVideo3: Long-Video QA Annotations for Supervised Fine-Tuning

Yanziang's InternVideo3 dataset provides 380,000 samples for supervised fine-tuning of models on long-video understanding. Each sample contains a YouTube video ID paired with question-answer annotations for detailed description and reasoning tasks. The dataset was uploaded to Hugging Face and last updated on June 11, 2026.

Video, Multimodal, Long Video, Multimodal Reasoning, Video QA, Supervised Fine Tuning, +1 more

ObsCrisis-Bench: A Multimodal Benchmark for Extreme Weather Event Analysis

ObsCrisis-Bench is a multimodal benchmark containing 4,599 visual question-answering samples for evaluating large vision-language models. The dataset covers 127 extreme weather events across 8 disaster categories, combining satellite multispectral imagery with optional weather station data. It was created by YYQ898 and last updated on Hugging Face in June 2026.

Geospatial, Multimodal, Extreme Weather, Vision Language Models, Risk Assessment, Satellite Imagery, Benchmark, Computer Vision, Multimodal Benchmark, +1 more

KoSum: 700 Recent Korean YouTube Videos with Multimodal Features

KoSum is a benchmark dataset containing 700 recent Korean YouTube videos uploaded between 2024 and 2026, spanning 14 fine-grained content categories. It provides per-second visual, audio, and text features aligned with YouTube Most-Replayed importance scores. The dataset was created by iontail and last updated on June 14, 2026.

Audio, Video, Multimodal, Korean Language, Multimodal Features, Benchmark, YouTube Videos, Video Summarization, Viewer Engagement, +1 more

UniReason-Med: Medical Multimodal Reasoning Training Data

UniReason-Med Data is a training dataset for the UniReason-Med medical multimodal reasoning framework. The associated paper describes a framework for 2D and 3D medical image understanding with interleaved image-text reasoning and region grounding. The dataset was released by IQuestLab and was last updated on the Hugging Face platform in June 2026.

Multimodal, Medical Imaging, Vision Language Models, Medical Reasoning, Healthcare, Multimodal Training, Computer Vision, +1 more

TRM-Preference: Reasoning Trace Quality for Large Reasoning Models

TRM-Preference is a dataset introduced in the paper 'Characterizing, Evaluating, and Optimizing Complex Reasoning' for evaluating and optimizing reasoning traces in Large Reasoning Models. The dataset applies the ME² principle to assess 'how a model thinks' across dimensions like Macro-Efficiency. It was authored by zzzhr97 and last updated on Hugging Face in June 2026.

Text, Preference Data, LLM Training, Reasoning Evaluation, Thinking Reward Model, +1 more

Cross-Modal Mapping of Simulated Spatial-Temporal Brain Embeddings

This dataset contains 100 tabular samples of synthetically generated electroencephalography (EEG) features, designed for testing and optimizing closed-loop neuro-generative software architectures. The data provides pre-extracted Power Spectral Density (PSD) features ready for direct ingestion by machine learning classifiers. It was authored by Shubham Sunil Kumar and last updated on June 29, 2026.

Tabular, Time Series, Neuro Generative, EEG, Power Spectral Density, Synthetic Data, Brain Embeddings, Synthetic, +1 more

Fixed Dimension Multimodal Benchmark Functions for Metaheuristic Testing

A set of benchmark functions used to evaluate the Felis Catus Optimization (FCO) metaheuristic algorithm. The dataset includes functions from the CEC 2005 and CEC 2017 benchmark suites. It was created by Mohammad Salehi and last updated on April 15, 2026.

Tabular, Excel, Engineering Design, Metaheuristics, Benchmark, Optimization Algorithms, Benchmark Functions, +1 more

Multimodal Benchmark Functions for Metaheuristic Algorithm Testing

A set of benchmark functions used to evaluate the Felis Catus Optimization (FCO) metaheuristic algorithm. The dataset includes results from experiments on the CEC 2005 and CEC 2017 benchmark suites and three real-world engineering design problems. It was created by Mohammad Salehi and last updated on April 15, 2026.

Tabular, Excel, Engineering Design, Benchmark, Metaheuristic Optimization, Algorithm Evaluation, Benchmark Functions, +1 more

JumpShift: Persian Alignment Dataset for Coding Assistants

JumpShift is a Persian-language alignment dataset developed by JumpLander for training coding assistants and software engineering agents. It is designed to help models respond like professional coding assistants, focusing on reasoning, clarification, and safe behavior. The dataset was last updated on June 13, 2026.

Text, Software Engineering, Persian Language, Code Generation, AI Alignment, +1 more

Counterfactual VLM Benchmark Data for Vision-Language Model Evaluation

Counterfactual VLM Benchmark Data is a dataset payload for evaluating vision-language models. It was authored by JingyuSun and uploaded to Hugging Face on May 20, 2026. The dataset is intended to be used with an associated GitHub repository containing download and evaluation scripts.

Multimodal, AI Evaluation, Vision Language Models, Benchmark, Counterfactual Reasoning, +1 more
