DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,929 datasets

BALLADEER: Multimodal Neurophysiological Data for ADHD Research

BALLADEER integrates EEG, eye tracking, and physiological signals from children and adolescents with ADHD and neurotypical controls. Its controlled protocol uses gamified cognitive tasks like Attention Slackline and CogniFit to elicit responses in attentional control and cognitive flexibility. This dataset supports the development of machine learning models for ADHD classification and the research of digital biomarkers.

Tags: Time Series, Multimodal, ZIP, ADHD, Physiological Signals, Gamification, Healthcare, Eye Tracking, EEG, Neurophysiology, +1 more

LADBench: A Benchmark for Logical Anomaly Detection in Images

LAD-Bench is a benchmark of more than 1,000 curated synthetic images designed to test the logical reasoning capabilities of Vision Language Models. It was created by SahasraK and introduced to address gaps in evaluating physical and social common sense for open-world AI deployment. The dataset was last updated on June 16, 2026.

Tags: Multimodal, Vision Language Models, AI Benchmark, Benchmark, Computer Vision, Synthetic Images, Logical Anomaly Detection, Synthetic, +1 more

ISOB-Small-Hard: Indian Scripts OCR Benchmark Sample

Indian multilingual document images and OCR transcriptions curated by MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE. This representative subset contains samples spanning 19 Indian languages and scripts, focusing on real-world documents with complex layouts and noisy scans. The full dataset, covering all 22 official languages, is scheduled for release upon paper acceptance.

Tags: Multimodal, Multilingual, Benchmark, Document Images, Computer Vision, OCR, Synthetic, +1 more

AnyAudio-Judge Bench: A Bilingual Audio Instruction-Following Benchmark

AnyAudio-Judge Bench is a bilingual (English/Chinese) multi-domain benchmark for evaluating instruction-audio alignment, released with the paper "AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following". It contains 7,920 curated samples per language across 7 subsets. The dataset was created by cucl2 and was last updated on June 2, 2026.

Tags: Audio, Multimodal, Multilingual, Rubric Based, Benchmark, Instruction Following, Audio Evaluation, +1 more

Chilean Flamingo Behavioral Responses Following an Enclosure Move

Behavioral data on a large flock of flamingos collected by animal care staff after a change in their enclosure. The dataset includes a blank template for others to use. It is a 17.9 KB XLSX file authored by Paul Rose and last updated on 2026-05-19.

Tags: Tabular, Excel, Enclosure Use, Animal Behavior, Zoo Management, Flamingo, +1 more

POPE Audit Records: 9,000 Predictions for VLM Hallucination Evaluation

Companion records for the paper "Token-Set Choice Confounds POPE: A Systematic Audit of Yes/No Extraction in VLM Hallucination Evaluation" (Jayakumar & Thilak, 2026). The dataset hosts 9,000 per-question prediction records, along with the diagnostics, ablations, and cross-model audits that back every numeric claim in the paper. Authored by kesav2k04, it was last updated on June 14, 2026.

Tags: Tabular, Hallucination Evaluation, Vision Language Models, Model Audit, Benchmark, Reproducible Research, +1 more

Akai Sports Video Captions V1: 1,119 Expert-Reviewed Video Clips

1,119 sports video clips are paired with English captions authored and reviewed by expert labelers from Akai Space Labs. The dataset is designed for training and evaluating multimodal models. It was last updated on June 14, 2026.

Tags: Video, Multimodal, Sports Video, Multimodal Training, Video Captions, +1 more

SurgSync: Multi-modal dVRK Dataset for Surgical Robotics

A multi-modal dataset collected using the da Vinci Research Kit (dVRK). The dataset was created by jackzhy, and a subset has been incorporated into NVIDIA's PhysicalAI-Robotics-Open-H-Embodiment collection. It was last updated on June 3, 2026.

Tags: Multimodal, dVRK, Medical Robotics, Multimodal Data, Surgical Robotics, +1 more

Bio-inspired Multimodal Imaging Scenes for Reduced Visibility Conditions

Two spectral scenes, as depicted in the paper 'Bio-inspired multimodal imaging in reduced visibility' by Pierre‐Jean Lapray. The dataset likely contains multimodal image data designed for research in computer vision under challenging visibility conditions. The specific data format, size, and collection details are not provided in the available metadata.

Tags: Image, Multimodal, Computer Vision, Bio Inspired, Spectral Imaging, Multimodal Imaging, +1 more

Driving Behaviour Multimodal Human Factors Eye Tracking Dataset

A multimodal dataset focused on driving behavior and human factors, likely containing eye-tracking data. It was authored by Xiaoming Tao and is available via the paperswithcode platform under an Open Access (green) license. The specific scale, collection period, and detailed contents are not provided in the available metadata.

Tags: Multimodal, Driving Behaviour, Eye Tracking, Human Factors, +1 more

UAV Operator Reaction Time and Eye-Tracking Data for Multimodal Alarm Evaluation

Reaction time data from 20 participants and eye-tracking data from a subset of 10 participants, collected during an experimental study on alarm modalities for UAV signal-loss events. The dataset was contributed by Nermeen Saleh to Harvard Dataverse and last updated in June 2026. It includes participant responses, alarm condition information, and eye-tracking metrics to support research on multimodal alarm systems.

Tags: Tabular, Reaction Time, Benchmark, Eye Tracking, Alarm Modalities, Human Factors, Synthetic, UAV Operations, +1 more

Medical Student OSCE Performance and Anxiety Data from a Randomized Controlled Trial

A randomized controlled trial assessed the effect of a multimodal workshop on fifth-year medical students' clinical exam performance and anxiety. The study compared an intervention group receiving stress management and communication training against a control group, with anxiety measured using the STAI-State scale at multiple time points. The dataset likely contains OSCE scores and anxiety metrics for analysis.

Tags: Tabular, Randomized Controlled Trial, Healthcare Training, Medical Education, Healthcare, Student Anxiety, Clinical Skills Assessment, +1 more

TickTockVQA: Analog Clock Reading Images for Vision-Language Models

12,483 images of analog clocks with time labels support training and evaluation of Vision-Language Models. The dataset originates from research presented at CVPR 2026 Findings. It was created by jaeha-choi and last updated on May 13, 2026.

Tags: Multimodal, Analog Clock, Vision Language Models, Computer Vision, Time Reading, VQA, +1 more

Programmable Actuation Data for Cholesteric Liquid Crystal Elastomer Fibers

Jiazhe Ma's dataset contains raw experimental data supporting a published article on cholesteric liquid crystal elastomer hollow fibers. The 72.0 MB dataset is organized by figure number from the manuscript, with the data for each figure provided as an Excel file or image. It was last updated on 2026-05-20 and is available under a CC-BY-4.0 license.

Tags: Image, Tabular, Time Series, ZIP, Actuation, Computer Vision, Multimodal Sensing, Materials Science, Liquid Crystal Elastomers, +1 more

ETCHR SFT-400K: 400,000 Samples for Visual Reasoning Assistant Training

400,000 samples across five tasks were used to transform a passive image editor into an autonomous, question-conditioned visual reasoning assistant. The dataset was created by BeichenZhang and last updated on 2026-05-25. It includes tasks such as Fine-grained Perception, Chart Understanding, Maze Solving, and Jigsaw Puzzle.

Tags: Multimodal, Benchmark, Multimodal Training, Computer Vision, SFT Data, Instruction Following, Visual Reasoning, +1 more

Llava Instruct 9K: A Multimodal Instruction Dataset for Vision and Voice Tasks

A multimodal dataset derived from the LLaVA-Instruct-150K source, containing synthetic annotations for tasks involving text, images, and speech. It is licensed under CC-BY-4.0 and was uploaded by dreyn74. The listing reports its size as between 10,000 and 100,000 samples.

Tags: Audio, Multimodal, Vision Language, Multimodal LLM, Computer Vision, Instruct Dataset, Synthetic Data, Synthetic, +1 more

Nemotron RL Instruction Following Structured Outputs V2 Direct Complete

A dataset of approximately 20,000 rows containing instruction-following examples for language models. It is derived from the NVIDIA Nemotron-RL-Instruction-Following-Structured-Outputs-v2 dataset, with added thinking traces and validated final outputs. The dataset was created by electroglyph and last updated on June 17, 2026.

Tags: Tabular, Reasoning Traces, Structured Outputs, Language Model, Instruction Following, +1 more

CapRL-Video-178K: Video Path Index for Multimodal Models

CapRL-Video-178K is a dataset providing file paths to over 97,000 video clips. The dataset is hosted by internlm on Hugging Face and was last updated on 2026-05-25. It serves as an index for video data from the LLaVA-Video-178K collection, which includes clips from sources like YouTube and ActivityNet.

Tags: Video, Multimodal, Machine Learning, Video Captioning, Video Language, +1 more

OmniCap-IF: Benchmark for Omni-Modal Video Captioning with 1,920 Instructions

OmniCap-IF is a benchmark dataset created by NJU-LINK for evaluating instruction following in omni-modal video captioning. It contains 480 videos and 1,920 instruction samples spanning tasks like understanding, generation, retrieval, and communication. Each sample pairs a prompt with fine-grained format and content checklists for evaluating structural, temporal, visual, audio, and audio-visual constraints.

Tags: Audio, Time Series, Video, Multimodal, AI Evaluation, Benchmark, Video Captioning, Multimodal Benchmark, Instruction Following, +1 more

Ayn-VQA: Culturally Grounded Arabic Vision-Language Evaluation Dataset

Ayn-VQA is a culturally grounded Arabic multimodal evaluation dataset designed for the ImageEval 2026 Shared Task at ArabicNLP 2026. It tests whether a model can read a culturally specific image from a spoken Arabic question and distinguish grounded descriptions from plausible hallucinations. The dataset is authored by QCRI and was last updated in June 2026.

Tags: Multimodal, Hallucination Detection, Arabic Language, Vision Language, Benchmark, Cultural Grounding, Computer Vision, Multimodal Evaluation, +1 more
