DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

CK-12 TQA Multimodal: Middle School Science Questions with Images

26,260 science questions paired with 6,206 images sourced from CK-12 Foundation's open educational resources. The dataset includes both text-only and diagram-based visual reasoning questions for middle school science. It was uploaded by 'notefill' to Hugging Face and last updated on 2025-11-21.

Tags: Multimodal, Middle School, Textbooks, Question Answering, Science Education (+1 more)

Critic-10K: 10,000 Image Triplets for Correcting Generative Inconsistencies

Critic-10K provides approximately 10,000 image triplets designed to train models to rectify inconsistencies in AI-generated visual content. Created by ziheng1234 and associated with the 2025 research paper 'The Consistency Critic', the data uses VLM-based selection to pair reference images with degraded and target versions.

Tags: Format: imagefolder; Size category: 1K<n<10K; Modality: text, image; Library: mlcroissant, datasets; Task category: image-to-image; arXiv: 2511.20614; Region: US (+1 more)

REFED: Synchronized EEG-fNIRS Recordings with Real-time Dynamic Emotion Labels

REFED is an affective brain-computer interface dataset integrating multimodal brain signals and real-time dynamic emotion annotation. The dataset was created by REFED2025 and last updated on the platform in November 2025. It synchronizes EEG and fNIRS signals to study the neural mechanisms of emotional dynamic evolution.

Tags: Time Series, Multimodal, Emotion Recognition, EEG, Brain-Computer Interface, fNIRS, Multimodal Neuroscience (+1 more)

AdvancedIF: A Benchmark for Complex and Multi-Turn LLM Instruction Following

Facebook introduces AdvancedIF, a benchmark featuring over 1,600 prompts designed to assess large language models. The dataset includes expert-curated rubrics to evaluate proficiency in complex instruction following, multi-turn interactions, and system prompt steerability. It was last updated on November 26, 2025.

Tags: Text, LLM Benchmark, AI Safety, Evaluation, Benchmark, Instruction Following (+1 more)

RecruitView: Multimodal Personality and Interview Performance Dataset

Multimodal recordings of candidate interview responses categorized by personality traits and professional performance metrics. This dataset facilitates research in affective computing and automated soft-skill evaluation within human resources contexts by providing synchronized behavioral data.

Tags: Source datasets: original; Size category: 1K<n<10K; Modality: audio, text, tabular, video; Language: en; Language creators: found; Multilinguality: monolingual; Annotations creators: expert-generated; Library: mlcroissant, datasets; Task categories: tabular-regression, feature-extraction, video-classification, other; Task IDs: audio-emotion-recognition, sentiment-scoring, text-scoring; License: CC BY-NC 4.0 (+1 more)

DEJIMA: 3.88M Japanese Image-Caption and Image-QA Pairs

DEJIMA is a large-scale Japanese multimodal dataset containing 3.88 million image-caption pairs and 3.88 million image-question-answer pairs. It was created by MIL-UT using a reproducible pipeline involving web-scale image collection, strict filtering, evidence extraction, and LLM-based annotation under grounding constraints. The dataset was last updated on December 2, 2025.

Tags: Multimodal, Computer Vision, Image Captioning, Large Scale, Japanese Language, Visual Question Answering (+1 more)

RS-EoT-4K: Remote Sensing Evidence-of-Thought 4K Dataset

4,000 multimodal instruction-tuning samples designed to instill Evidence-of-Thought (EoT) reasoning into Vision-Language Models for remote sensing. The dataset utilizes a Socratic questioning approach to guide models through logical, step-by-step interpretation of satellite and aerial imagery.

Tags: Format: parquet; Size category: 1K<n<10K; Modality: text, image; Language: en; Library: polars, mlcroissant, datasets, pandas; Task category: question-answering; arXiv: 2511.22396; Region: US; License: Apache 2.0 (+1 more)

VLDBench: Large-Scale Multimodal Disinformation Detection

VectorInstitute released VLDBench in January 2026 as a large-scale benchmark for evaluating Vision-Language Models (VLMs) and Large Language Models (LLMs) on multimodal disinformation detection. The framework provides a testing ground for AI safety by presenting models with deceptive content that integrates both visual and textual modalities.

Tags: Disinformation Detection, Machine Learning, AI Safety, Vision Language Models, Multimodal AI, Benchmarking, Computer Vision, Large Language Model, VLMs, Natural Language Processing, Benchmark Framework, Deep Learning (+1 more)

SafeVid-350K: 350,000 Video Safety Preference Pairs

SafeVid-350K is a large-scale dataset containing 350,000 preference pairs designed to instill Helpful, Honest, Harmless principles in Video Large Multimodal Models. It covers 30 scene categories and 29 fine-grained safety sub-dimensions. The dataset was created by yxwang and was last updated on Hugging Face in November 2025.

Tags: Multimodal, Multimodal Alignment, Preference Pairs, Video Safety, Large Scale (+1 more)

LLaVA OneVision 1.5: Reinforcement Learning Data for Vision-Language Models

'Llava Onevision 1.5 Rl Data' is a dataset published on Hugging Face by mvp-lab and last updated on 2026-01-06. Platform tags indicate it contains both image and text modalities, suggesting it is a multimodal dataset for training or fine-tuning vision-language models.

Tags: Multimodal, Vision Language, Multimodal LLM, RL Training; Format: parquet; Size category: 10K<n<100K; Modality: text, image; Library: polars, dask, mlcroissant, datasets; Region: US (+1 more)

Vlmevalkit: A Vision-Language Model Evaluation Toolkit

This dataset was published on Hugging Face by mm-eval and last updated on 2026-01-12 07:15:59. Its title suggests it is a toolkit for evaluating vision-language models. Its specific content, scale, and data types require verification after download.

Tags: Multimodal, AI Benchmarking, Vision Language, Large Language Models, Multimodal Evaluation (+1 more)

Speech2Latex: 66,000 Audio Samples of Mathematical Expressions

66,000 human-annotated audio samples of spoken mathematical equations and sentences in English and Russian form the Speech2LaTeX dataset. It is the first fully open-source large-scale dataset for converting spoken math to LaTeX, drawn from diverse scientific domains. The dataset was created by marsianin500 and last updated on November 16, 2025.

Tags: Audio, Multimodal, Multilingual, Mathematics, LaTeX, Speech to Text, Large Scale (+1 more)

Formosa Vision: Taiwan-Centric Visual Language Dataset with Community-Curated Descriptions

Formosa Vision is an open-source visual language dataset focused on Taiwanese local culture, containing over two thousand images selected from the National Cultural Memory Bank 2.0. The dataset was created by the Twinkle AI community using a hybrid method where visual language models generated image dialogues, which were then manually checked and revised by participants. It was last updated on November 20, 2025.

Tags: Multimodal, Open Culture, Vision Language, Community Sourced, Taiwan Culture, Computer Vision (+1 more)

Geoint: 1,885 Formal Geometric Problems with Diagrams and Lean 4 Proofs

1,885 curated geometric problems across plane, spatial, and solid geometry categories form this benchmark. Each problem includes structured textual descriptions and visual diagrams for multimodal understanding. The dataset, created by OpenRaiser and updated in November 2025, leverages the Lean 4 proof assistant for formal representation.

Tags: Multimodal, Mathematics, Multimodal Learning, Benchmark, Formal Proofs, Geometry Problems (+1 more)

Mathematical Image Editing Trajectories for Multimodal Models

MathCanvas-Edit contains 5.2 million step-by-step editing trajectories for mathematical images. The dataset was created by author shiwk24 and was last updated on the Hugging Face platform in November 2025. It forms a core component of the MathCanvas framework for training large multimodal models.

Tags: Multimodal, Image Editing Trajectories, Mathematics, Step-by-Step Reasoning, Multimodal Learning, Geometry Diagram, Visual Chain of Thought, Large Scale, VCoT; Format: parquet; Size category: 1M<n<10M; Modality: text, image; Language: en; Library: polars, dask, mlcroissant, datasets; Task category: image-to-image; arXiv: 2510.14958; Region: US; License: Apache 2.0 (+1 more)

WorldCuisines: Multilingual Visual Question Answering on Global Cuisines

WorldCuisines is a massive-scale benchmark for multilingual and multicultural visual question answering focused on global cuisines. The associated paper was accepted to NAACL 2025 and received the Best Theme Paper award. The dataset was last updated on November 14, 2025.

Tags: Multimodal, Multilingual, Cuisine, Benchmark, Computer Vision, Large Scale, Visual Question Answering, Multicultural (+1 more)

Llama-Nemotron-VLM-Dataset v1: Vision-Language Instruction Data for Model Training

Llama-Nemotron-VLM-Dataset v1 is a dataset for training vision-language models, created by NVIDIA. The dataset page includes a version history with updates from August to September 2025. The dataset was last updated on the platform on 2025-10-22.

Tags: Multimodal, Vision Language Model, Multimodal Training (+1 more)

GroundCUA: UI Screenshots and Annotations for Computer Use Agents

GroundCUA is a large dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers 87 software platforms across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. The dataset was created by Fhrozen and last updated on Hugging Face in November 2025.

Tags: Multimodal, Multimodal Annotations, Computer Use Agents, Software Platforms, UI Screenshots (+1 more)

3DThinker-10K: Geometric Imagination Grounded Spatial Reasoning

10,000 spatial reasoning samples designed for geometric imagination from limited 2D visual perspectives. The dataset facilitates 3D mental modeling during reasoning tasks without the need for explicit 3D prior inputs or depth data.

Tags: Multimodal, Spatial Understanding, Visual Reasoning; Modality: image; Task category: image-text-to-text; arXiv: 2510.18632; Region: US; License: CC BY-NC 4.0 (+1 more)

UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation

UniBiomed is a foundation model designed for grounded biomedical image interpretation. The model was created by Luffy503 and was last updated on November 11, 2025. It is based on the MedTrinity dataset, which must be downloaded separately.

Tags: Multimodal, Foundation Model, Biomedical Imaging, Multimodal Learning, Computer Vision, Medical AI (+1 more)
