DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Rbyte: Multimodal Spatial Intelligence and Robotics Data in MCAP Format

Rbyte provides multimodal datasets for spatial intelligence and robotics, released by yaak-ai and updated in February 2026. The collection utilizes MCAP and TensorDict formats to facilitate high-performance spatial computing and integration with PyTorch and Polars.

Machine Learning, Spatial Intelligence, Rerun, PyTorch, Robotics, Polars, Artificial Intelligence, MCAP, TensorDict, +1

UNO-Bench: A Unified Benchmark for Compositionality in Omni-Modal AI Models

UNO-Bench is a unified benchmark for exploring compositional relationships between uni-modal and omni-modal capabilities in AI models. The dataset was created by meituan-longcat and was last updated on December 4, 2025. It is accompanied by released evaluation scripts and a scoring model named UNO-Scorer-Qwen3-14B.

Multimodal, AI Evaluation, Benchmark, Multimodal Benchmark, Omni Models, +1

Multimodal1: A Dataset for Multimodal AI Research

'Multimodal1' is a dataset published on Kaggle. The title suggests it contains multiple data modalities, such as text, images, or audio, and is likely intended for AI model training. The author, organization, size, and specific content are unknown.

Multimodal, Machine Learning, AI Training, +1

Multimodal Dataset for AI Model Training

Kaggle hosts a dataset titled 'multimodal', which likely contains data from multiple modalities such as text, images, or audio for machine learning tasks. The dataset's specific content, size, and creator are not detailed in the available metadata. Its last update date and other descriptive details are unknown.

Multimodal, Machine Learning, AI Training, +1

VQA2.0: Visual Question Answering Dataset

VQA2.0 is a dataset for visual question answering, likely containing image-question pairs with corresponding answers. It is hosted on Kaggle. The specific size, creation date, and authorship are unknown.

Multimodal, Computer Vision, Natural Language Processing, Visual Question Answering, +1

MCD-rPPG: Multi-Camera Video Dataset for Remote Heart Rate and Health Biomarker Estimation

MCD-rPPG is a large-scale multimodal dataset designed for remote photoplethysmography and health biomarker estimation from video. The dataset includes synchronized video recordings from multiple cameras, as described in the paper "Gaze into the Heart: A Multi-View Video Dataset for rPPG and Health Biomarkers Estimation". Author wengziheng uploaded the dataset to the Hugging Face Hub, with a last recorded update on 2025-12-09.

Video, Multimodal, Physiological Signals, Health Biomarkers, Medical Vision, Multi View Video, Healthcare, Large Scale, Remote Photoplethysmography, +1

MathVision-Wild: Photographic Math Problems for Real-World Visual Reasoning

MathVision-Wild provides between 1,000 and 10,000 photographs of problems from the MathVision test set, captured in diverse physical environments. Created by MathLLMs and updated in late 2025, it moves digital math problems into real-world visual contexts to evaluate Vision Language Model (VLM) performance.

Multimodal, IMAGEFOLDER, Size Categories: 1K<n<10K, Task Categories: image-to-text, Language: en, Task Categories: visual-question-answering, Modality: text, Real World, Library: mlcroissant, Modality: image, Library: datasets, Region: us, Reasoning, Math, License: apache-2.0, Visual Reasoning, +1

Semhash: Multimodal Semantic Deduplication for Text and Image Cleaning

MinishLab released Semhash in January 2026 to provide a framework for fast multimodal semantic deduplication and filtering. The project uses model2vec embeddings and vicinity-based approximate nearest-neighbor search to identify near-duplicate records across text and image datasets.

Preprocessing, Semantic Deduplication, Text Dataset Cleaning, Image Dataset Cleaning, Vicinity, Deduplication, Model2vec, +1

LLaVA-OneVision 1.5 Multimodal Instruction Dataset

LLaVA-OneVision-1.5-Instruct is a dataset of 22 million instruction samples curated by MVP-Lab for training large multimodal models. It was developed to support the LLaVA-OneVision-1.5 model family and was last updated in November 2025.

Multimodal, Task Categories: image-text-to-text, Image Text Pairs, Vision Language Instruction, Dataset Collection, Vision Language Model, Language: en, Size Categories: 10M<n<100M, Modality: text, Modality: image, Pretraining, Multimodal Training, Image Captioning, Large Language Model, Region: us, arXiv: 2509.23661, Finance, VQA, LMM, License: apache-2.0, +1

DAD-3DHeads: Dense 3D Head Alignment and FLAME Model Annotations

DAD-3DHeads provides dense 3D annotations for head alignment and reconstruction from single images, published by PinataFarms for CVPR 2022. The data includes FLAME model parameters and 3D landmark coordinates for 3D Morphable Model (3DMM) fitting. It was developed to address the lack of diverse head poses in existing 2D landmark datasets.

Machine Learning, 3d-face-modelling, 3d-face-reconstruction, Face Alignment, 3d-face-alignment, 3d-head, PyTorch, Face Reenactment, Papers With Code, Computer Vision, Head Pose Estimation, CVPR 2022, 3d-reconstruction, CVPR, 3DMM, FLAME, 3d-computer-vision, First Order Motion Model, +1

Molmo2 Visual Question Answering Dataset

AllenAI provides a dataset for visual question answering tasks. It contains image-text pairs designed for evaluating multimodal language models. The dataset was updated in January 2026.

Multimodal, Parquet, Size Categories: 10K<n<100K, Library: polars, AI Evaluation, Modality: text, Library: mlcroissant, Modality: image, Image Text, Library: datasets, Library: pandas, Language Model, Region: us, Visual Question Answering, +1

Emo-CFG: Video Emotion Recognition Dataset for Foundation Models

Emo-CFG is a dataset for emotion-centric video foundation models, accepted at the NeurIPS 2025 conference. It was created by researchers from Nankai University, Pengcheng Laboratory, and Kuaishou Technology. The dataset was last updated on December 7, 2025.

Video, Multimodal, Video Emotion, Affective Computing, Foundation Models, Multimodal Reasoning, +1

Gemini 3 Pro Visual Question Answering Benchmark

This is a Gemini 3 Pro benchmark dataset for multimodal evaluation, created by AliMertTemizsoy and published on Hugging Face in January 2026. It contains image-text pairs for visual question answering tasks.

Multimodal, OPTIMIZED-PARQUET, Parquet, Size Categories: 1K<n<10K, Gemini, Library: polars, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Benchmark, Library: pandas, Gemini Model, Region: us, Visual Question Answering, Multimodal Evaluation, +1

VQA v2: Visual Question Answering Version 2

265,016 images from MS COCO are paired with 1,105,904 questions and 11,059,040 ground-truth answers. The dataset is structured into balanced pairs where each question is associated with two similar images that result in different answers to minimize language bias.
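
The three counts quoted in this entry can be sanity-checked with simple arithmetic. The sketch below uses only those figures; the variable names are illustrative and not part of any dataset API.

```python
# Counts as quoted in the VQA v2 entry above.
images = 265_016
questions = 1_105_904
answers = 11_059_040

# Ratio of ground-truth answers to questions.
answers_per_question = answers / questions

# Average number of questions attached to each image.
questions_per_image = questions / images

print(f"answers per question: {answers_per_question:.1f}")
print(f"questions per image:  {questions_per_image:.2f}")
```

The answer count is exactly ten times the question count, which matches ten ground-truth answers collected per question.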

English, Computer Vision, Natural Language Processing, +1

MultiPriv: A Multilingual & Multimodal Dataset for LLM Privacy Risk Research

MultiPriv is a dataset of Personally Identifiable Information entities and prompts designed for privacy risk research in large language models. It was created by author CyberChangAn and last updated on December 1, 2025. The dataset is multilingual and multimodal, though attribute-level VLM images are not directly included in the repository due to open-source certificate limitations.

Multimodal, Multilingual, LLM Benchmark, Multimodal Data, Computer Vision, Privacy Research, PII Entities, +1

Multi-Camera Remote Photoplethysmography Video Dataset

MCD-rPPG is a large-scale multimodal dataset for remote photoplethysmography and health biomarker estimation from video. The dataset contains synchronized video recordings from multiple camera views and was introduced in the paper "Gaze into the Heart: A Multi-View Video Dataset for rPPG and Health Biomarkers Estimation".

Video, Size Categories: 1K<n<10K, ECG, PPG, License: cc-by-4.0, Modality: video, Task Categories: other, Region: us, Medical, +1

Spatial Transcriptomics Pre-training Corpus for Foundation Model

SToCorpus-88M is a pre-training dataset used for SToFM, a multi-scale foundation model for spatial transcriptomics. The dataset is associated with a research paper and model code published on GitHub. Specific details on data volume, structure, and features are not provided in the available metadata.

Region: us, arXiv: 2507.11588, License: mit, +1

UniDoc-Bench: 1,700 Multimodal QA Pairs from 70,000 PDF Pages

Salesforce developed UniDoc-Bench in 2024 as a benchmark for multimodal retrieval-augmented generation (MM-RAG). It contains 1,700+ multimodal QA pairs derived from a corpus of 70,000 real-world PDF pages across eight domains. The data links evidence across text, tables, and figures to support complex document-based reasoning tasks.

Multimodal, Parquet, Size Categories: 1K<n<10K, Task Categories: image-text-to-text, Library: polars, Task Categories: question-answering, Language: en, Task Categories: visual-question-answering, RAG, Modality: text, Task Categories: text-retrieval, Modality: document, Library: mlcroissant, Modality: image, Library: datasets, Task Categories: document-question-answering, Library: pandas, License: cc-by-nc-4.0, Region: us, arXiv: 2510.03663, +1

Spatial Mental Modeling Benchmark with Limited Views

MindCube is a benchmark for evaluating Vision Language Models' ability to form spatial mental models from limited visual information. It contains 21,154 questions across 3,268 images, created by MLL-Lab. The dataset was last updated in November 2025.

Multimodal, Spatial Reasoning, Vision Language Models, Benchmark, Computer Vision, Cognitive Mapping, +1

LLaVA-OneVision-1.5: 85 Million Multimodal Mid-Training Records

Released by mvp-lab in 2025, this 85-million-record multimodal collection supports the mid-training phase of the LLaVA-OneVision-1.5 framework. It aggregates image-text data from eight major sources, including ImageNet-21k, LAION-CN, and SA-1B, to make multimodal model training more widely accessible.

Parquet, Library: polars, Library: dask, Size Categories: 10M<n<100M, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Region: us, arXiv: 2509.23661, License: apache-2.0, +1
