DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Cosmos-Reason1 SFT: Video-Text Pairs for Embodied Reasoning Tasks

NVIDIA's Cosmos-Reason1 SFT dataset pairs videos with text annotations for embodied reasoning. The annotations draw on multiple sources, including BridgeDatav2, RoboVQA, Agibot, HoloAssist, and AV. Released on Hugging Face in May 2025, it also includes RoboFail data for benchmarking.

Tags: Video, Multimodal, Benchmark, Robotics, Multimodal Reasoning, Video Text Pairs, +1 more

LLaVA-Critic GRPO Dataset: Multimodal AI Feedback Data

The LLaVA-Critic GRPO Dataset is a collection of data for evaluating and critiquing multimodal AI models. Published by the organization lmms-lab on Hugging Face, it was last updated on June 24, 2025. The available metadata does not describe the dataset's specific content or structure.

Tags: Multimodal, Multimodal AI, LLM Evaluation, Critique Feedback, +1 more

Arabic Image Captioning 100M: Large-Scale Multimodal Dataset

100 million Arabic image captions form the first large-scale multimodal resource for the Arabic language. Generated using the Mutarjim translation model, this dataset addresses a critical gap in Arabic multimodal AI resources. The dataset was created by Misraj and last updated on May 27, 2025.

Tags: Multimodal, Arabic Language, Vision Language, Computer Vision, Image Captioning, Large Scale, Synthetic, +1 more

PointArena: Language-Guided Pointing Tasks for Multimodal Grounding

PointArena is a dataset for probing multimodal grounding through language-guided pointing. It was created by researchers from the University of Washington and the Allen Institute for Artificial Intelligence. The dataset page was last updated on May 17, 2025.

Tags: Multimodal, Language Guided Pointing, AI Evaluation, Probing Tasks, Vision Language, Multimodal Grounding, Synthetic, +1 more

Cosmos-Reason1-RL: Video and Text Annotations for Embodied AI Reasoning

Released by NVIDIA in May 2025, this multimodal dataset contains pairs of videos and text annotations for embodied reasoning tasks. It includes data from BridgeDatav2, RoboVQA, Agibot, HoloAssist, AV, and RoboFail datasets. The annotations are structured for Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and benchmarking purposes.

Tags: Multimodal, Benchmark, Robotics, Video Text Pairs, Reasoning Tasks, +1 more

MM-Graph: Multimodal Graph Benchmark Datasets

These datasets accompany the paper "Multimodal Graph Benchmark". They are hosted by the organization mm-graph-org on Hugging Face. The repository was last updated on 2025-05-20.

Tags: Graph, Multimodal, Machine Learning, Multimodal Graph, Benchmark, Graph Benchmark, +1 more

Multimodalpv: Multimodal Computer Vision Dataset

Multimodalpv is a dataset published on Hugging Face by wealan123123. Its last update was recorded on 2025-07-05. The provided metadata does not specify its content, size, or structure.

Tags: Multimodal, Machine Learning, Computer Vision, +1 more

MTBench Finance Aligned Pairs Long: Stock Prices Aligned with Textual Context

MTBench is a multimodal time series benchmark for evaluating large language models in temporal and cross-modal reasoning. The dataset aligns high-resolution financial time series, such as stock prices, with textual context like news articles or QA prompts. It was created by GGLabYale and last updated on 2025-05-23.

Tags: Time Series, Multimodal, LLM Benchmark, Benchmark, Finance, +1 more

Med VLM PMC VQA: GPT-4o Reasoning Filter None Chain-of-Thought

This multimodal dataset on Hugging Face was authored by med-vlrm and last updated on 2025-06-29. The title suggests it involves medical visual question answering (VQA) with a vision-language model (VLM) on PubMed Central (PMC) images, with reasoning traces from GPT-4o. The provided metadata does not detail the dataset's size, structure, or content.

Tags: Multimodal, Parquet, library: polars, library: dask, modality: text, size category: 100K<n<1M, library: mlcroissant, modality: image, Medical Vision Language, library: datasets, GPT-4, region: us, Reasoning, VQA, +1 more

Med VLM PMC VQA GPT-4o Reasoning: Medical Vision-Language Model Benchmark

This dataset from med-vlrm, published on Hugging Face on June 28, 2025, likely contains medical images and text for evaluating vision-language models. It appears to be designed for benchmarking reasoning capabilities, potentially using GPT-4o as a judge or component. Its specific content and scale require verification after download.

Tags: Multimodal, Parquet, library: polars, library: dask, modality: text, size category: 100K<n<1M, GPT-4o, library: mlcroissant, modality: image, Medical Vision Language, library: datasets, GPT-4, region: us, Reasoning, Multimodal Reasoning, Medical AI, Visual Question Answering, +1 more

Med VLM PMC VQA: GPT-4o Reasoning on Tokenized Medical Images and Text

This multimodal dataset on Hugging Face was created by med-vlrm and last updated on 2025-06-29. The platform tags suggest it contains medical vision-language data, likely images and text processed with GPT-4o for reasoning tasks. The specific content, scale, and structure require verification after download.

Tags: Multimodal, Parquet, size category: 10K<n<100K, library: polars, library: dask, modality: text, GPT-4o, library: mlcroissant, modality: image, Medical Vision Language, library: datasets, GPT-4, Tokenized, region: us, Reasoning, VQA, Visual Question Answering, +1 more

CVQA: Visual Question Answering Dataset

CVQA is a dataset uploaded to Hugging Face by the author 'davidanugraha'. It was last updated on June 30, 2025. The available metadata does not describe its content, size, or structure.

Tags: Multimodal, Question Answering, Computer Vision, +1 more

LeetCode Python Problems for LLM Training and Evaluation

This dataset of Python LeetCode problems is intended for training and evaluating large language models for code. It was created by the author 'newfacade' and last updated on Hugging Face on 2025-05-29. The provided metadata does not detail the dataset's size or structure.

Tags: Text, Time Series, JSON, size category: 1K<n<10K, task category: text-generation, library: polars, language: en, arXiv:2504.14655, modality: text, Code, LeetCode, library: mlcroissant, library: datasets, Benchmark, library: pandas, Programming Problems, Python, Code Generation, region: us, LLM Training, arXiv:2409.06957, license: apache-2.0, +1 more

WeThink Multimodal Reasoning 120K

This multimodal dataset contains approximately 120,000 image-text pairs for reasoning tasks. It was created by WeThink and last updated on May 15, 2025. According to its description, it aggregates images from multiple established sources, including COCO, Visual Genome, and TextVQA. It is hosted on Hugging Face.

Tags: Multimodal, Vision Language, Question Answering, Computer Vision, Multimodal Reasoning, VQA, +1 more

SCI-CQA: 5,629 Chart Understanding Questions from Scientific Literature

SCI-CQA is a multimodal benchmark dataset for evaluating chart understanding, inspired by human exams. It contains 5,629 curated objective and open-ended questions paired with 2,894 chart images from scientific literature. The dataset was created by lyndons1 and last updated on April 28, 2025.

Tags: Multimodal, Chart Understanding, Benchmark, Question Answering, Scientific Literature, Multimodal Benchmark, +1 more

Image Wallpapers with Text Descriptions for Multimodal AI

This dataset pairs high-quality images with descriptive text annotations for computer vision and multimodal machine learning tasks. It was created by Navanjana and last updated on May 21, 2025. Images are preprocessed to a standard size of 224×224 pixels in JPEG RGB format.

Tags: Multimodal, Multimodal Learning, Computer Vision, Image Captioning, Wallpapers, +1 more

AVQA-R1-6K: Audio-Visual Question Answering for Multimodal LLMs

AVQA-R1-6K contains 6,000 multimodal question-answer pairs presented in the EchoInk-R1 research paper. It was created by the author harryhsing and last updated on Hugging Face in May 2025. It is designed for exploring audio-visual reasoning in multimodal large language models via reinforcement learning.

Tags: Audio, Multimodal, Audio Visual Reasoning, Multimodal QA, Multiple Choice, LLM Training, +1 more

ViRL39K: 38,870 Verifiable QAs for Vision-Language RL Training

ViRL39K contains 38,870 verifiable question-answer pairs designed for vision-language reinforcement learning training, released by TIGER-Lab in April 2025. It aggregates and refines data from seven specialized sources, including LLaVA-OneVision, MM-Math, and DeepScaleR, through cleaning, reformatting, and verification.

Tags: task category: image-text-to-text, task category: question-answering, language: en, modality: image, arXiv:2504.08837, Training, region: us, Reinforcement Learning, license: mit, +1 more

PE Video Dataset: 1 Million Diverse Videos with 120,000 Annotations

Meta released the PE Video Dataset (PVD) in April 2025, featuring 1 million high-quality videos for Perception Encoder (PE) research. The collection includes 120,000 clips with human-verified annotations, and the full set is accompanied by descriptions and keywords.

Tags: WebDataset, library: webdataset, modality: text, size category: 100K<n<1M, library: mlcroissant, library: datasets, arXiv:2504.13181, license: cc-by-nc-4.0, region: us, +1 more

Magma-AITW-SoM: A Multimodal Foundation Model Benchmark for AI Agents

Magma is a foundation model for multimodal AI agents, developed by researchers from Microsoft Research, University of Maryland, University of Wisconsin-Madison, KAIST, and University of Washington. The associated dataset, Magma-AITW-SoM, likely serves as a benchmark for evaluating multimodal agent capabilities. The dataset page was last updated on 2025-04-29.

Tags: Multimodal, Foundation Model, Research Benchmark, Multimodal AI, AI Agents, +1 more
