DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

AstroLLaVA Convos: Astronomical Images with Captions and Synthetic Q&A

A large-scale collection of astronomical images paired with descriptive captions and synthetic question-answer pairs, designed for training visual language models. The dataset was created by UniverseTBD and last updated on July 28, 2025. It combines imagery from NASA's Astronomy Picture of the Day, the European Southern Observatory's public archive, and ESA's Hubble Space Telescope.

Tags: Multimodal, Synthetic QA, Multimodal AI, Astronomy, Computer Vision, Large Scale, Visual Language Models, Synthetic, +1 more

Iconclass: Visual Language and Symbol Classification

Iconclass is a classification system for art and iconography. The dataset likely contains structured codes and descriptions for visual symbols and themes. It was published on Hugging Face by davanstrien and last updated on September 10, 2025.

Tags: Multimodal, Classification System, Art History, Iconclass, +1 more

MSR-VTT Video Description Dataset With 200K Captions

MSR-VTT contains 10,000 video clips paired with 200,000 descriptive captions. The dataset, originally created by Microsoft Research, is a standard benchmark for text-video retrieval and captioning tasks. It was last updated on the platform in August 2025.

Tags: Video, Multimodal, JSON, Size Categories: 10K<n<100K, Library: polars, Language: en, Modality: text, Task Categories: text-retrieval, Modality: tabular, Library: mlcroissant, Library: datasets, Benchmark, Library: pandas, Modality: video, Video Captioning, Region: us, Task Categories: text-to-video, Task Categories: video-classification, Multimodal Benchmark, Text Video Retrieval, +1 more

CV-Bench: Cambrian Vision-Centric Benchmark for Multimodal LLMs

Cambrian Vision-Centric Benchmark (CV-Bench) is a dataset introduced in the Cambrian-1 research paper for evaluating vision-centric multimodal large language models. The dataset contains annotations and images pre-loaded for processing with Hugging Face Datasets. It was created by nyu-visionx and last updated on July 20, 2025.

Tags: Multimodal, AI Evaluation, Multimodal LLM, Benchmark, Computer Vision, Vision Centric, Vision Centric Benchmark, +1 more

Long Context Multimodal Document Understanding Benchmark

Document Haystack is a benchmark dataset for evaluating multimodal large language models on long-context image and document understanding tasks. It was created by AmazonScience for a 2025 research paper to address the lack of suitable benchmarks for processing long documents. The listing does not specify the row count, column count, or data size.

Tags: Image, Text, Multimodal, Vision, Task Categories: question-answering, Language: en, Task Categories: visual-question-answering, Modality: text, PDF, Modality: document, Modality: image, Task Categories: document-question-answering, Benchmark, Large Language Model, Region: us, Long Context, VLM, arXiv: 2507.15882, +1 more

FragFake: VLM-Based Edited-Image Detection Dataset

FragFake is a dataset for edited-image detection using Vision-Language Models (VLMs). It contains four groups of examples—Gemini-IG, GoT, MagicBrush, and UltraEdit—each with two difficulty levels: easy and hard. The dataset was created by Vincent-HKUSTGZ and was last updated on July 31, 2025.

Tags: Multimodal, Vision Language Model, Computer Vision, Edited Image Detection, +1 more

K-LLaVA-W: Korean Vision-Language Model Evaluation Benchmark

K-LLaVA-W is a Korean adaptation of LLaVA-Bench-in-the-wild, designed for evaluating vision-language models. The benchmark was created by translating the original English text into Korean and having human reviewers check the translations for naturalness. It was published by NCSOFT and last updated on July 25, 2025.

Tags: Multimodal, Parquet, Library: polars, Korean Language, Size Categories: n<1K, Modality: text, Library: mlcroissant, Vision Language, Modality: image, Image Text, Library: datasets, Benchmark, Library: pandas, Computer Vision, arXiv: 2411.19103, License: cc-by-nc-4.0, Language: ko, Region: us, arXiv: 2304.08485, Multimodal Evaluation, +1 more

Doc-750K: A Multimodal Document Understanding Dataset

OpenGVLab's Doc-750K dataset, referenced in the paper 'Docopilot: Improving Multimodal Models for Document-Level Understanding', is a collection of documents for training AI models. The dataset was last updated on July 22, 2025. It appears to contain a large number of document images, as suggested by unzipping instructions for image archives.

Tags: Multimodal, Document Understanding, Multimodal AI, Computer Vision, Natural Language Processing, +1 more

NayanaBench: Multilingual Visual Question Answering Dataset Across 22 Languages

NayanaBench is a multilingual visual question answering dataset designed for evaluating multimodal AI systems. It includes 200 examples each for 22 languages, combining optical character recognition and layout analysis. The dataset was created by Nayana-cognitivelab and was last updated on July 28, 2025.

Tags: Multimodal, Multilingual, Indian Languages, Multimodal AI, Optical Character Recognition, Visual Question Answering, +1 more

SynthCodeNet: 9.3 Million Synthetic Code Snippet Image-Text Pairs

Over 9.3 million synthetically generated image-text pairs form this multimodal dataset created for training the SmolDocling model. The dataset covers code snippets from 56 different programming languages, with text sourced from permissively licensed sources and images generated at 120 DPI using LaTeX and Pygments. It was created by the docling-project and last updated on July 16, 2025.

Tags: Multimodal, Computer Vision, Code Snippets, Large Scale, Synthetic Data, Synthetic, Programming Languages, +1 more

Callitrain

The CalliBench dataset comprises 3,192 image–annotation pairs for evaluating Vision Language Models on Chinese calligraphy. The dataset, created by author gtang666, includes tasks for full-page recognition and contextual visual question answering. It was last updated on Hugging Face in July 2025.

Tags: Multimodal, Parquet, Size Categories: 1K<n<10K, Image To Text, arXiv: 2503.06472, Library: polars, Language: zh, Task Categories: image-to-text, Vision Language Model, Language: en, Task Categories: visual-question-answering, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, Computer Vision, Region: us, Art, License: apache-2.0, Art Analysis, Visual Question Answering, Chinese Calligraphy, +1 more

Awesome Table Understanding: Curated Directory of Research Benchmarks

Aggregating multiple benchmarks for table understanding, this repository by esborisova was updated in September 2025. It categorizes resources into tasks such as table structure recognition, table-to-text, and table question answering.

Tags: Awesome List, Table Understanding, +1 more

Geo170K 8K R1: A Multimodal Question-Answering Benchmark Collection

A collection of question-answering datasets, including Geo170K, Visualpuzzles, TQA, AI2D, RL, LMMS, ScienceQA, and OK-VQA, uploaded by author GY2233 to Hugging Face on 2025-09-02. The title suggests it aggregates multiple established benchmarks for visual and textual reasoning. The specific content, scale, and structure of the combined data require verification after download.

Tags: Multimodal, Multimodal QA, AI Benchmark, Science QA, VQA, Visual Reasoning, +1 more

Multimodal Cold Start: 10K-100K Reasoning Samples for SFT

This dataset provides between 10,000 and 100,000 multimodal records for cold-start supervised fine-tuning (SFT) on reasoning tasks. It was released by WaltonFuture in 2025 and supports the research paper 'Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start' by providing initial training data for a two-stage reinforcement learning pipeline.

Tags: Multimodal, Parquet, Size Categories: 10K<n<100K, Task Categories: image-text-to-text, Library: polars, Library: dask, Cold Start, Modality: text, Chain Of Thought, Library: mlcroissant, Modality: image, Library: datasets, Large Language Model, SFT, Region: us, Reasoning, Reinforcement Learning, arXiv: 2505.22334, +1 more

GeoGrid-Bench: Multimodal Gridded Geo-Spatial Benchmark

Three categories of multimodal geo-spatial data—tabular grids, heatmaps, and geographic visualizations—designed for foundation model evaluation. The benchmark tests the ability to process dense numerical values and interpret spatial-temporal dependencies within these grid structures.

Tags: Tabular, Multimodal, CSV, Size Categories: 10K<n<100K, Library: polars, Geo Spatial, Language: en, Task Categories: visual-question-answering, Climate, Modality: text, Modality: tabular, Library: mlcroissant, Task Categories: table-question-answering, Modality: image, Library: datasets, arXiv: 2505.10714, Library: pandas, Grid, Region: us, License: mit, +1 more

MDocAgent: A Multi-Modal Multi-Agent Framework Dataset for Document Understanding

The MDocAgent dataset supports a framework for multi-modal document understanding, as described in the associated arXiv paper. The dataset was created by Lillianwei and last updated on August 22, 2025. It is hosted on Hugging Face and is associated with a GitHub repository containing the framework's code.

Tags: Multimodal, Size Categories: 1K<n<10K, Document Understanding, Multi Agent, Modality: document, Library: mlcroissant, Task Categories: table-question-answering, Multimodal AI, Library: datasets, Table QA, arXiv: 2503.13964, Region: us, License: mit, +1 more

Ban Sign Sent 9K V1: Continuous Bangla Sign Language Videos with Gloss Sentences

A large-scale, multimodal dataset for Continuous Bangla Sign Language (BdSL) recognition and translation. It includes video samples of real-life continuous sign language performances paired with gloss sentences representing the signed content. The dataset was created by 'banglagov' and was last updated on July 22, 2025.

Tags: Video, Multimodal, Translation, Sign Language, Bangla, Large Scale, +1 more

Indian Cartoon Blip: A Multimedia Dataset

Indian Cartoon Blip is a dataset uploaded by Surbhipatil to the Hugging Face platform. The dataset was last updated on 2025-09-02 10:39:41. Its specific content, size, and structure are not detailed in the available metadata.

Tags: Multimodal, Cultural Content, Multimedia, +1 more

Facecaption 1M: 1 Million Facial Image-Text Pairs

Facecaption 1M is a dataset of 1 million facial image-text pairs, as indicated by its title. The dataset was created by authors from OpenFace-CQUPT and published in a 2024 arXiv paper. The dataset listing on Hugging Face was last updated on August 1, 2025.

Tags: Multimodal, Image Text Pairs, Computer Vision, Facial Images, +1 more

MedMax Data: Mixed-Modal Instruction Tuning for Biomedical Assistants

A dataset for mixed-modal instruction tuning created by researchers at the University of California, Los Angeles. It is designed for training biomedical assistants by integrating multimodal information. The dataset page was last updated on 2025-07-19.

Tags: Multimodal, Large Language Models, Biomedical, +1 more
