DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Kvasir-VQA-x1: 159,549 Medical Visual Q&A Pairs for Gastrointestinal Endoscopy

159,549 new question-answer pairs form the Kvasir-VQA-x1 dataset, a large-scale benchmark for medical visual question answering in gastrointestinal endoscopy. SimulaMet created this multimodal dataset to advance robust MedVQA systems. The dataset was featured in the MediaEval Medico 2025 Challenge and was last updated on Hugging Face in August 2025.

Tags: Multimodal, Gastrointestinal Endoscopy, Multimodal QA, Medical Reasoning, Benchmark, Healthcare, Computer Vision, Large Scale, Medical VQA

Terra: Multimodal Spatio-Temporal Earth Science Benchmark from NeurIPS 2024

Terra is a multimodal spatio-temporal benchmark for Earth science applications developed by CityMind-Lab and presented at NeurIPS 2024. It provides global-scale data across multiple modalities to support the development of advanced environmental and geographic models. The dataset was released in late 2024 to address the need for standardized benchmarks in the Earth science domain.

Tags: Multimodal, Spatio-Temporal, Benchmark, Earth Science

OSWorld-G: Computer-Use Grounding via UI Synthesis

OSWorld-G is a benchmark for computer-use grounding through UI decomposition and synthesis, released by xlang-ai as a NeurIPS 2025 Spotlight. It facilitates the training of Large Action Models (LAMs) by generating multimodal data that pairs visual GUI elements with natural language grounding instructions.

Tags: Multimodal, RPA, Benchmark, Natural Language Processing, VLM, Models, Agent, GUI, Large Action Model

AncientDoc: A Benchmark for Chinese Ancient Document Understanding with 2,973 Pages

2,973 pages of Chinese ancient documents form a benchmark for multimodal large model evaluation. The dataset, created by ByteDance, is designed for tasks ranging from optical character recognition to knowledge reasoning. It was last updated on the platform in September 2025.

Tags: Multimodal, ImageFolder, size_categories: 1K<n<10K, Document Understanding, license: cc0-1.0, library: mlcroissant, modality: image, library: datasets, Benchmark, region: us, OCR, Ancient Documents, Multimodal Benchmark, Chinese Text

LongVideo-Reason: 52K Question-Reasoning-Answer Pairs for Long Videos

LongVideo-Reason is a dataset of 52,000 high-quality Question-Reasoning-Answer pairs for long video reasoning, constructed with chain-of-thought annotations. It was created by the LongVideo-Reason organization using a VLM and a reasoning LLM, and was last updated on August 19, 2025. The dataset includes 18,000 high-quality samples designated for a specific subset.

Tags: Multimodal, Long Video, Video Reasoning, Multimodal AI, Question Answering, CoT Annotations

Amharic Language Model Training Data With 846k Samples

This dataset comprises 846,113 text samples for Amharic language model training, split into 761,501 training samples and 84,612 test samples. It was created by YoseAli and last updated in August 2025.

Tags: Parquet, task_categories: text-generation, library: polars, task_categories: question-answering, Deployment, library: dask, Production, modality: text, size_categories: 100K<n<1M, library: mlcroissant, library: datasets, Large Language Model, Training, region: us, African Languages, language: am, Amharic, license: mit, Ethiopia

PlantVillageVQA: 193,609 Visual Q&A Pairs for Plant Disease Diagnosis

PlantVillageVQA is a multimodal dataset for visual question answering in plant pathology. It contains 193,609 question–answer items paired with 55,448 leaf images spanning 14 crops and 38 diseases. The dataset was created by SyedNazmusSakib and was last updated on the Hugging Face platform in September 2025.

Tags: Multimodal, Agriculture AI, Multimodal AI, Plant Pathology, Leaf Images, Visual Question Answering

Textual Visual Context Dataset for Image Captioning

This dataset provides textual visual context for image captioning, building on the publicly available COCO caption dataset. It includes updates from October 2023, featuring a SwinV2 classifier for generating visual caption cosine scores with person labels.

Traditional Chinese Medicine Instruction Data for Multimodal LLM Fine-Tuning

245,000 instruction examples across text, visual, and signal modalities support the fine-tuning of ShizhenGPT, a specialized model for Traditional Chinese Medicine. FreedomIntelligence created and released this collection, with its latest update in August 2025.

Tags: Multimodal, Traditional Chinese Medicine, Medical AI

MJ Showcase 8K: Top-Voted AI Art Prompts and Images from Mid-2024

MJ Showcase 2024 is a dataset of top-voted AI art creations manually collected daily between May and August 2024. The dataset includes 8,551 rows and provides both images and their associated text prompts. It was created by author shb777 and last updated on Hugging Face in August 2025.

Tags: Multimodal, Prompt Engineering, AI Art, Computer Vision, Generative Art

FlowVQA RAG: A Dataset for Visual Question Answering with Retrieval-Augmented Generation

FlowVQA RAG is a dataset uploaded to Hugging Face by user 'kkyzl' on October 9, 2025. The dataset's title suggests it is designed for Visual Question Answering (VQA) tasks using a Retrieval-Augmented Generation (RAG) framework. Its specific content, scale, and structure require verification after download.

Tags: Multimodal, Retrieval Augmented Generation, Multimodal QA, Question Answering, VQA

Chicken Farm Visual and Audio Monitoring Dataset

IceKhoffi's Chicken Health and Behavior Multimodal Dataset contains visual and audio data collected from chicken farms. It is designed for developing early detection systems for health issues and anomalous poultry behavior. The dataset was last updated on the Hugging Face platform in August 2025.

Tags: Image, Audio, Multimodal, Animal Behavior, Multimodal Monitoring, Healthcare, Poultry Health, Finance, Agricultural AI

DriveQA: Driving Knowledge Test Questions for Multimodal AI

DriveQA is a multimodal benchmark for evaluating driving knowledge through text and vision-based question-answering tasks. The dataset, created by DriveQA and last updated on September 1, 2025, simulates real-world driving tests. It likely contains questions on traffic regulations, sign recognition, and right-of-way reasoning.

Tags: Multimodal, Driving Knowledge, Vision, task_categories: question-answering, language: en, task_categories: visual-question-answering, license: cc-by-nc-sa-4.0, size_categories: 100K<n<1M, Benchmark, Question Answering, Computer Vision, region: us, Traffic Rules, Autonomous Driving, arxiv: 2508.21824

MultiTaskVideoReasoning

MultiTaskVideoReasoning is the official training dataset for the research project 'Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning'. The dataset was created by author 'zhang9302002' and was last updated on Hugging Face on 2025-08-25. Its structure includes multiple JSON files such as 'actnet.json', 'charades.json', and 'longvideo-reason.json', suggesting it aggregates data from several established video reasoning benchmarks.

Tags: Multimodal, Tool Augmented Reinforcement Learning, Video Reasoning, Multimodal AI, Computer Vision, Long Video Analysis

ShareRobot: 51,403 Robotic Episodes with Affordance and Trajectory Labels

BAAI developed ShareRobot, a collection of 51,403 robotic episodes with 30 frames each, to enhance multi-dimensional robotic capabilities. The data includes labels for task planning, object affordance, and end-effector trajectories using 50 distinct prompt templates.

Tags: modality: image, region: us, arxiv: 2310.08864, arxiv: 2502.21257

Image To Video Human Preference Seedance 1 Pro: 6k Human Evaluations

Approximately 6,000 human responses from around 2,000 annotators were collected to evaluate the Seedance 1 Pro video generation model on a benchmark. The data was gathered in roughly 5 minutes using the Rapidata Python API. The dataset was published by Rapidata and last updated on August 11, 2025.

Tags: Tabular, Human Preference, Model Evaluation, Benchmark, Video Generation, Computer Vision, Large Scale

Primus-Seed: Cybersecurity Text Corpus from MITRE and Expert Sources

Primus-Seed is a cybersecurity text dataset compiled from reputable sources including MITRE, Wikipedia, and cybersecurity company websites, as well as manually collected Cyber Threat Intelligence (CTI). It was created by Trend Micro's AI Lab and was last updated on the Hugging Face platform in August 2025. The dataset includes at least 2,946 samples from cybersecurity blogs and news, comprising over 9.7 million tokens.

Tags: Text, JSON, task_categories: text-generation, arxiv: 2502.11191, library: dask, language: en, Cybersecurity, modality: text, size_categories: 100K<n<1M, library: mlcroissant, library: datasets, Pretraining, Wikipedia, region: us, license: odc-by, Text Corpus, MITRE

Gut-Halu: Hallucination Benchmark for Gastrointestinal Image Analysis

Gut-Halu provides annotated files for a benchmark assessing hallucination in large vision-language models applied to gastrointestinal image analysis. The dataset supports the paper 'Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision Language Models'. It was created by sandesh-pokhrel and last updated on September 5, 2025.

Tags: Multimodal, Hallucination Benchmark, Medical Imaging, Vision Language Models, Benchmark, Computer Vision, Image Annotation, Gastrointestinal

French Driving Theory Questions with Images

DrivingVQA contains multiple-choice questions paired with real-world images for the French driving theory exam. The dataset was created by EPFL-DrivingVQA and was last updated in August 2025. It is designed to test knowledge of traffic laws, road signs, and safe driving practices.

Tags: Multimodal, 🇫🇷 France, size_categories: 1K<n<10K, Traffic Safety, task_categories: multiple-choice, language: en, task_categories: visual-question-answering, Driving Theory, Multiple Choice, region: us, Reasoning, Reasoning Datasets Competition, arxiv: 2501.04671, license: mit, Visual Question Answering, Driving

MedVLThinker PMC-VQA: Tokenized Medical VQA Dataset with GPT-4o Reasoning

UCSC-VLAA provides a tokenized version of the PMC-VQA dataset for medical vision-language understanding. The dataset includes GPT-4o generated reasoning and was last updated on August 15, 2025. It is part of the MedVLThinker project, which offers several curated datasets for medical vision-language training.

Tags: Multimodal, Medical Vision Language, GPT-4o Reasoning, Healthcare, Computer Vision, Tokenized, VQA, Synthetic
