DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

COREVQA Multimodal Benchmark Dataset

COREVQA is a multimodal benchmark dataset for visual question answering tasks. It combines images with corresponding textual questions and answers, designed for evaluating AI models. The dataset originates from the UCI platform and is associated with computer vision and natural language processing research.

Multimodal, Computer Vision, Natural Language Processing, Multimodal Benchmark, Visual Question Answering, +1

MM-MathInstruct: Multimodal Math Problem-Solving Dataset

MathCoder-VL is a series of open-source large multimodal models tailored for general math problem-solving. The dataset likely contains multimodal math problems combining visual and textual elements. It was created by MathLLMs and last updated on October 11, 2025.

Multimodal, Task Categoriestext Generation, Task Categoriesmultiple Choice, Task Categoriesquestion Answering, Figure Qa, Size Categories1 Mn10 M, Languageen, Task Categoriesvisual Question Answering, Math Reasoning, Arxiv250510557, Textbook Qa, Mathematics, Math Qa, Geometry Diagram, Multi Modal Qa, Computer Vision, Geometry Qa, Synthetic Scene, Regionus, Reasoning, Math Word Problem, Geometry, Vqa, Licenseapache 20, Visual Question Answering, Multimodal Math, +1

InstQA: A Large-Scale Instance-Aware Visual Question Answer Dataset

Over 2 million images and videos form the core of the InstQA dataset, which also contains 6 million instance captions, 2 million image/video captions, and 10 million instance-level visual question answers. This dataset was created by wovenbytoyota-vai and was last updated on October 15, 2025. It is designed for instance-aware spatio-temporal visual question answering tasks.

Time Series, Multimodal, Large Scale Dataset, Multimodal Ai, Instance Segmentation, Computer Vision, Large Scale, Visual Question Answering, +1

TechMB: 947 VQA Pairs for Manufacturing Evaluation from Technical Drawings

947 question-answer pairs form a Visual Question Answering benchmark for evaluating the manufacturability of objects based on 180 distinct technical drawings. The dataset was created by author WSKL and was last updated on 2025-10-14. It is designed as a domain-specific test for Vision Language Models.

Multimodal, Technical Drawing, Vision Language Models, Benchmark, Computer Vision, Vqa, Manufacturing, +1

Stvqa 7K: A Vision-Language Question Answering Dataset

Stvqa 7K is a dataset referenced in a paper available on arXiv. The dataset is hosted on the HuggingFace platform by the author OX-PIXL and was last updated on November 12, 2025. Its specific content and scale are not detailed in the provided metadata, but platform tags suggest it relates to vision-language tasks.

Multimodal, Vision Language, Multimodal Ai, Visual Question Answering, +1

WikiArt Captions Subset for Multimodal Art Retrieval

A curated subset of 6,000 paintings from the WikiArt collection, created by Lizagrin and last updated in October 2025. It was developed for multimodal art retrieval, combining visual, textual, and semantic information. Each artwork record includes an image row index and an automatically generated caption using the BLIP model.

Multimodal, Wikiart, Multimodal Retrieval, Image Captioning, Art, Synthetic, +1

Coding1.5B: LoRA Checkpoints for Code Generation

Coding1.5B is a repository of LoRA checkpoints tuned on coding datasets using a 1.5 billion parameter foundation model. The checkpoints serve as training data for DnD. The dataset was created by Jerrylz and last updated on November 21, 2025.

Text, Foundation Model, Training Data, Code Generation, Lora Checkpoints, +1

ImgCode 8.6M: Multimodal Math Problem Images and Code

MathCoder-VL is a series of open-source large multimodal models tailored for general math problem-solving. The dataset likely contains 8.6 million multimodal examples pairing images with code, supporting the development of models like FigCodifier-8B. It was created by MathLLMs and updated on October 11, 2025.

Multimodal, Parquet, Task Categoriesimage Text To Text, Task Categoriestext Generation, Librarypolars, Task Categoriesimage To Text, Tables, Size Categories1 Mn10 M, Languageen, Task Categoriesvisual Question Answering, Arxiv250510557, Modalitytext, Librarymlcroissant, Modalityimage, Librarydatasets, Librarypandas, Computer Vision, Charts, Regionus, Llm Training, Geometry, Image To Code, Licenseapache 20, Diagrams, Visual Question Answering, Multimodal Math, +1

Math7B: LoRA Checkpoints for 7B Model Fine-Tuning on Math Tasks

LoRA checkpoints tuned on mathematics datasets serve as training data for DnD. The checkpoints were created by Jerrylz and last updated on November 21, 2025. This resource likely contains parameter-efficient fine-tuning data derived from a 7 billion parameter foundation model.

Text, Machine Learning, LoRA, Mathematics, Fine Tuning, Large Language Models, +1

Math1.5B: LoRA Checkpoints Tuned on Math Datasets

LoRA checkpoints were fine-tuned on mathematics datasets using a 1.5 billion parameter foundation model. The checkpoints serve as training data for DnD. The repository was created by Jerrylz and last updated on November 21, 2025.

Text, Machine Learning, Training Data, Mathematics, Language Model, Fine Tuning, +1

COCO 2017 VQA: Visual Question Answering Dataset

COCO 2017 VQA is a dataset for visual question answering, derived from the COCO 2017 image dataset. It was uploaded to HuggingFace by TharunSivamani and was last updated on 2025-11-18. The dataset likely contains images paired with questions and corresponding answer annotations.

Multimodal, Computer Vision, Image Captioning, Natural Language Processing, Visual Question Answering, +1

DIM-Edit: 100K+ Image Editing Pairs for Unified Multimodal Models

DIM-Edit contains between 100,000 and 1,000,000 records designed to improve precise image editing in unified multimodal models. Released by stdKonjac in October 2025, the data supports the Draw-In-Mind (DIM) framework, which rebalances designer and painter roles in diffusion-based architectures. The collection is provided in Parquet format and is associated with the DIM-4.6B model series and arXiv paper 2509.01986.

Parquet, Librarypolars, Modalitytext, Size Categories100 Kn1 M, Librarymlcroissant, Arxiv250901986, Librarydatasets, Librarypandas, Text To Image, Diffusion, Licensecc By Nc 40, Image Editing, Regionus, +1

FRED: Florence RGB-Event Drone Dataset for Detection and Tracking

FRED is a large-scale multimodal dataset designed for drone detection, tracking, and trajectory forecasting. The dataset, authored by GabrieleMagrini, provides spatiotemporally synchronized RGB and event data. It was last updated on Hugging Face on October 3, 2025.

Time Series, Multimodal, Object Tracking, Large Scale, Trajectory Forecasting, Multimodal Sensor, Drone Detection, +1

PathoCell: A Benchmark for Pathology Foundation Models

A benchmark suite for evaluating cell phenotyping capabilities of pathology Foundation Models, created by Kainmueller-Lab and last updated on 2025-10-09. The collection includes four key datasets processed into the LMDB format to facilitate large-scale experimentation. The datasets are hosted on the Hugging Face platform.

Multimodal, Cell Phenotyping, Foundation Models, Benchmark, Computational Pathology, Large Scale, +1

CoralVQA: 277,653 Visual Question-Answer Pairs for Coral Reef Analysis

CoralVQA contains 12,805 real-world coral images from 67 genera across 3 oceans, paired with 277,653 question-answer pairs assessing ecological and health conditions. The dataset was created by CoralReefData and last updated on September 29.

Multimodal, Coral Reef, Image Understanding, Healthcare, Computer Vision, Large Scale, Visual Question Answering, Marine Biology, +1

Robotic Welding Defect Data With Video And Sensor Streams

Over 4,000 annotated samples capture the welding process through video, audio, sensor time-series, and post-weld images. IntelLabs collected this data on an automotive production floor with an industry supplier. The dataset was published in September 2025 to support multimodal defect detection research.

Image, Audio, Time Series, Video, Multimodal, Defect Detection, Licenseother, Modalitytimeseries, Anomaly Detection, Modalityimage, Robotic Welding, Robotics, Modalityvideo, Regionus, Defect Classification, Industrial Automation, Arxiv240902290, Welding, Multimodal Sensors, Manufacturing, Industry 40, +1

MC-EIU: Emotion and Intent Joint Understanding in Multimodal Conversation

MC-EIU is a benchmarking dataset for joint emotion and intent understanding in multimodal conversations. The dataset was created by Rui Liu, Haolin Zuo, Zheng Lian, Xiaofen Xing, Björn W. Schuller, and Haizhou Li, with the official repository last updated on September 23, 2025. More details are available in the associated research paper.

Multimodal, Emotion Understanding, Multimodal Conversation, Benchmark Dataset, Intent Understanding, +1

Multimodal Olfaction-Vision-Language Dataset for AI and Robotics

An open-source dataset and builder for prototyping olfaction-vision-language tasks in AI, robotics, and AR/VR domains. It is designed for applications like vision-scent navigation for drones or augmenting VR experiences with scent. Row count, column count, and file formats are not specified.

Image, Text, Multimodal, English, Virtual Reality, Robotics, Computer Vision, Augmented Reality, Olfaction, +1

LAION-High-Quality-Pro-6M-VLV: Image-Text Pairs for Vision-Language-Vision Model Training

LAION-High-Quality-Pro-6M is a 6-million-sample image-text dataset used to train Vision-Language-Vision auto-encoder models. The dataset, hosted by author ccvl on Hugging Face, was last updated on September 20, 2025. It was created for scalable knowledge distillation from diffusion models.

Multimodal, Parquet, Librarypolars, Task Categoriesimage To Text, Librarydask, Size Categories1 Mn10 M, Vision Language Model, Arxiv250707104, Modalitytext, Librarymlcroissant, Librarydatasets, Licensecc By 40, Computer Vision, Image Captioning, Regionus, Laion, Diffusion Models, Knowledge Distillation, +1

Agentic Long Context Understanding QA: Self-Taught Query Refinement

AgenticLU is a dataset for evaluating long-context understanding in language models, created by yzhuang and last updated on September 22, 2025. It contains queries refined through self-clarifications and contextual grounding to enable robust long-document understanding in a single pass. The dataset is hosted on the Hugging Face platform.

Text, Agentic Ai, Language Model Evaluation, Self Clarification, Long Context Qa, +1
