DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Vlm Train 1K: Vision-Language Model Training Data

Zaynoid published a dataset titled 'Vlm Train 1K' on the Hugging Face platform on 2025-12-14. The title suggests it is likely a collection of 1,000 items for training vision-language models. The specific content, format, and structure require verification after download.

Multimodal, Training Data, Vision Language Model, +1

Endoscopy Images and Text for Visual Question Answering Benchmark

EndoVQA-Instruct is a multi-modal dataset containing endoscopy images and associated text, designed for benchmarking multi-modal large language models in medical analysis. The dataset includes images from the in-house WCE2025 collection and is managed by author Saint-lsy. Access to the data is restricted and requires formal request and approval.

Task Categories: question-answering, Language: en, Size Categories: 100K<n<1M, arXiv: 2505.23601, Endoscopy, Region: us, License: afl-3.0, VQA, Medical, +1

Spanish Newspaper Front Pages on the Fall of the Berlin Wall, 1989 Onwards

This dataset contains multimodal analyses of front pages from two Spanish newspapers, El País and ABC, covering the fall of the Berlin Wall from 1989 onwards. It was created by Silvia Molina Plaza and focuses on layout structure and rhetorical argumentation. The dataset was last updated on October 14, 2025.

Multimodal, Media Analysis, Rhetorical Analysis, Newspaper Front Pages, Multimodal Discourse, Historical Events, Synthetic, +1

EditReward-Data: 200K Human Preference Pairs for Instruction-Guided Image Editing

EditReward-Data is a large-scale human preference dataset for instruction-guided image editing, introduced by TIGER-Lab. It comprises over 200,000 manually annotated preference pairs curated by trained experts following a standardized protocol. The dataset was last updated on October 12, 2025.

Multimodal, Human Preference, Computer Vision, Image Editing, Large Scale, Instruction Following, +1

Llava Stvg Data: A Vision-Language Dataset for Spatio-Temporal Video Grounding

This dataset was published on Hugging Face by author zaiquan and last updated on 2025-12-04. It likely contains multimodal data for spatio-temporal video grounding tasks, which involve linking language queries to specific objects and time segments in videos. Its specific content, scale, and collection methodology require verification after download.

Multimodal, Vision Language, Multimodal AI, Video Understanding, +1

Kene Multimodal Gift: Spiritual Texts and Audio in Spanish, Hindi, and Regional Languages

A multimodal spiritual dataset featuring Ikaros in Spanish, Jiv Jago in Hindi, and languages of Russia, CIS, and Ukraine. The dataset was created by nativemind and was last updated on October 24, 2025. It includes enhanced multimodal data and supports up to 50 language examples.

Multimodal, Multilingual, Speech Audio, Spiritual Text, Ethnic Languages, +1

ConstructionSite 10k: 10,013 Annotated Images for Vision Language Models

ConstructionSite 10k contains 10,013 construction site images and annotations released by LouisChen15 in October 2025. The collection is partitioned into 7,009 training and 3,004 test samples specifically designed to evaluate Vision Language Models (VLMs) in civil engineering contexts.

Parquet, Visual Question Answering VQA, Size Categories: 10K<n<100K, Task Categories: image-feature-extraction, Library: polars, Task Categories: image-to-text, Library: dask, Language: en, arXiv: 2508.11011, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Civil Engineering, Computer Vision, Image Captioning, License: cc-by-nc-4.0, Region: us, Natural Language Processing, +1

Relation252K: 218 Image Editing Tasks for Visual Relation Transfer

Relation252K contains source-target image pairs across 218 distinct image editing tasks, released by handsomeWilliam in 2025. It serves as the evaluation set for the RelationAdapter model, focusing on the transfer of visual relations within Diffusion Transformers.

IMAGEFOLDER, Size Categories: 10K<n<100K, Library: mlcroissant, Modality: image, Library: datasets, arXiv: 2506.02528, Region: us, Task Categories: image-to-image, License: apache-2.0, +1

Datasetloom: Multimodal LLM Training Data Construction and Evaluation Platform

Datasetloom is an open-source platform for constructing and evaluating datasets for multimodal large language models, in particular vision-language models (VLMs), developed by 599yongyang and updated in December 2025. It provides a full-stack framework using TypeScript, Next.js, and NestJS to streamline the creation of training data for vision-language tasks.

TypeScript, Artificial Intelligence, Large Language Model, VLM, shadcn/ui, Next.js, NestJS, +1

EditReward Bench: Benchmark for Instruction-Guided Image Editing

EditScore provides a series of open-source reward models ranging from 7B to 72B parameters for evaluating instruction-guided image editing. The benchmark likely contains data used to train and evaluate these models, with the largest model reportedly surpassing GPT-5 on their internal benchmark. The dataset was last updated on October 17, 2025.

Multimodal, Benchmark, Computer Vision, Reward Model, +1

EVisRAG-Train: Visual Question Answering Training Data

A Visual Question Answering training dataset compiled from ChartQA, InfographicVQA, and MMLongBench-Doc. The dataset was created by openbmb and last updated on October 14, 2025. It appears to contain image data paired with text, likely for training multimodal models.

Multimodal, Document Understanding, Multimodal Training, Computer Vision, Visual Question Answering, +1

Multimodal European News Coverage of the 2023 Brazil Congress Attack

73 newsbites from eight major European newspapers published in the three days following the January 8, 2023, attack on Brazil's federal government buildings. Isabel Alonso Belmonte collected this multilingual sample to explore the multimodal construction of the political event. The dataset was last updated on October 14, 2025.

Multimodal, Multilingual, European Media, Brazil Congress Attack, Multimodal News, Political Events, +1

SoccerNet VQA: Multimodal Question Answering for 14 Soccer Tasks

The dataset supports the 2026 SoccerNet Challenge for multimodal (text, image, video) multiple-choice question answering. It covers 14 distinct soccer understanding tasks, including assessing player and team background knowledge, determining camera status, classifying actions, and recognizing fouls. The dataset was created by SoccerNet and last updated in October 2025.

License: cc-by-sa-4.0, Region: us, +1

Robo2VLM: 100K-1M VQA Pairs from Robot Manipulation Trajectories

Robo2VLM-1 provides between 100,000 and 1,000,000 visual question-answering records derived from real-world robot manipulation trajectories. Created by researcher keplerccc and updated in late 2025, the dataset uses multimodal robot data to enhance scene understanding in vision-language models. It aims to bridge the gap between internet-scale image-text corpora and specific robotic visuomotor policies.

Parquet, Library: polars, Library: dask, arXiv: 2505.15517, Task Categories: visual-question-answering, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Modality: image, Library: datasets, Region: us, License: apache-2.0, +1

RLAIF-V: 10K-100K Multimodal Preference Alignment Records

RLAIF-V provides between 10,000 and 100,000 multimodal preference-alignment records developed by OpenBMB to improve Multimodal Large Language Model (MLLM) trustworthiness. The data uses AI-generated feedback to refine model responses and serves as a core training component for the MiniCPM-V 4.5 model released in 2025.

Multimodal, Size Categories: 10K<n<100K, Task Categories: image-text-to-text, Language: en, Task Categories: visual-question-answering, arXiv: 2509.18154, arXiv: 2312.00849, MLLM, License: cc-by-nc-4.0, Region: us, Feedback, arXiv: 2405.17220, Task Categories: any-to-any, Preference Alignment, +1

ClaraVid: Aerial Scene Reconstruction Benchmark with 16,917 Frames

ClaraVid is a synthetic dataset for semantic and geometric neural reconstruction from low altitude UAV imagery. It contains 16,917 multimodal frames collected across 8 UAV missions over diverse environments. The dataset was created by radubeche and was last updated on October 31, 2025.

Image, Multimodal, Benchmark, Scene Reconstruction, Computer Vision, UAV, Aerial Imagery, Synthetic Data, Synthetic, +1

Safe RLHF: Human Preference Data for Constrained AI Value Alignment

PKU-Alignment developed this dataset to facilitate Constrained Value Alignment through Safe Reinforcement Learning from Human Feedback (Safe RLHF). It provides human-annotated preference data for Large Language Models, specifically targeting the balance between helpfulness and safety constraints as of late 2024.

Alpaca, Safety, RLHF, AI Safety, Safe Reinforcement Learning, Llama, GPT, Transformers, DeepSpeed, Safe Reinforcement Learning From Human Feedback, Large Language Model, Reinforcement Learning, Transformer, LLMs, Large Language Models, Reinforcement Learning From Human Feedback, Vicuna, Beaver, Safe RLHF, +1

CalliBench: 3,192 Chinese Calligraphy Images with Recognition and VQA Annotations

3,192 image–annotation pairs for evaluating vision-language models on Chinese calligraphy. The dataset supports tasks like full-page recognition and contextual visual question answering, including author identification and bilingual interpretation. It was created by author gtang666 and last updated on Hugging Face in October 2025.

Multimodal, Vision Language Model, Computer Vision, Image Annotation, Cultural Heritage, Multimodal Evaluation, Chinese Calligraphy, +1

ArchCAD: A Multimodal Dataset for Vectorized Engineering Drawing Understanding

40,000 samples with five strictly aligned modalities provide foundational data for AI systems to interpret CAD drawings. The dataset, created by jackluoluo and last updated in October 2025, is designed to address the challenge of understanding and utilizing computer-aided design data.

Multimodal, Image To Text, Task Categories: image-to-text, arXiv: 2503.22346, Task Categories: visual-question-answering, Benchmark, License: cc-by-nc-4.0, Engineering Drawings, Region: us, Visual Question Answering, CAD, +1

Common Sense Reasoning: LoRA Checkpoints for a 0.5B Foundation Model

LoRA checkpoints were tuned on common sense reasoning datasets using a 0.5 billion parameter foundation model. The checkpoints serve as training data for DnD. The repository was created by Jerrylz and last updated on November 21, 2025.

Text, Foundation Model, Training Data, Common Sense Reasoning, LoRA Checkpoints, +1
