DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,947 datasets

Biz Doc VQA: 624 Japanese Receipt Question-Answer Pairs

624 Japanese-language Visual Question Answering annotations across 116 receipt images for business document OCR evaluation. Created by icoxfog417 and updated in March 2026, the collection focuses on extracting structured data from financial documents.

Tags: imagefolder; task categories: visual question answering, document question answering; size category: n<1K; modalities: text, image; libraries: mlcroissant, datasets; license: CC BY-SA 4.0; region: US; language: ja; +1 more

LongVT-Parquet: Training Annotations and Evaluation Benchmark for Long Video Reasoning

LongVT-Parquet provides the training data annotations and evaluation benchmark for the LongVT project. The dataset supports an end-to-end agentic framework for 'Thinking with Long Videos' via interleaved Multimodal Chain-of-Tool-Thought. It was created by 'longvideotool' and last updated on March 9, 2026.

Tags: Video, Multimodal, Tool Calling, Long Video, Chain of Thought, Agentic Framework, Benchmark, Reasoning, Multimodal Reasoning, Video QA; task categories: visual question answering, video-text-to-text; size category: 100K<n<1M; modalities: text, video; language: en; arXiv: 2511.20785; license: Apache 2.0; region: US; +1 more

Space Images with Descriptive Captions for Multimodal AI

The Space Vision Dataset is a multimodal collection of space-related images paired with descriptive captions. It includes imagery of planetary views, telescopes, galaxies, and Mars rover scenes, designed for tasks like image captioning and vision-language modeling.

Tags: optimized-parquet, Parquet; size category: n<1K; modalities: text, image; libraries: polars, mlcroissant, datasets, pandas; license: MIT; region: US; +1 more

IMDb Multimodal Subset

A multimodal subset of data from IMDb, likely containing information related to movies. The dataset was created by msubhaditya and is hosted on Hugging Face. It was last updated on May 1, 2026.

Tags: Multimodal, Movies, IMDb, Entertainment, +1 more

Pakistan Top Cities Quality of Life Metrics

Pakistan Top Cities Quality of Life Dataset provides urban livability and safety metrics for major Pakistani cities. The dataset likely contains human preference data related to urban living conditions. It was sourced from Kaggle, but the author, organization, and last update date are unknown.

Tags: Tabular, Urban Livability, Quality of Life, Pakistan Cities, +1 more

FF-Multimodal-CSV: A Multimodal Dataset

FF-Multimodal-CSV is a dataset published on Kaggle. The title suggests it contains multimodal data, likely combining different data types such as images and text. The dataset's specific content, size, and origin are not detailed in the provided metadata.

Tags: Tabular, Multimodal, Tabular Data, Computer Vision, +1 more

Community Alignment: Multilingual Human Preference Comparisons for LLMs

Over 200,000 comparisons of large language model responses were collected from more than 3,500 unique annotators. The dataset is multilingual, containing comparisons in English, French, Italian, Hindi, and Portuguese. It was created by Facebook and last updated on Hugging Face in February 2026.

Tags: Tabular, Multilingual, CSV, Alignment, Reward Modeling, Large Language Model, Large Scale, Preference, Reward, Human Feedback, Preference Alignment; size category: 10K<n<100K; modalities: text, tabular; libraries: polars, mlcroissant, datasets, pandas; language: pt; arXiv: 2507.09650; region: US; +1 more

VisWorld-Eval: Seven Tasks for Multimodal Reasoning Assessment

VisWorld-Eval is a task suite for assessing multimodal reasoning with visual world modeling. It comprises seven tasks spanning synthetic and real-world domains, each designed to isolate specific atomic world-model capabilities. The dataset was authored by 'thuml' and last updated on Hugging Face on March 9, 2026.

Tags: Multimodal, Benchmark Evaluation, Multimodal Reasoning, Visual World Modeling, Synthetic Data, Synthetic, +1 more

UltraMix: A Reward-Aligned, Quality-Filtered DPO Mixture

UltraMix is a lean, high-quality preference optimization dataset curated from five open-source DPO corpora. It was created by aladinDJ using the Magpie Annotation Framework and a reward-driven curation pipeline, and was last updated on Hugging Face in February 2026. The dataset removes noisy, low-reward, or redundant preference pairs while preserving task balance.

Tags: Text, Text Generation, Reasoning, Preference Optimization, Instruction Following, +1 more

Hazsense: A Multimodal RGB-D Dataset for Real-World Hazard Detection

The Hazsense dataset is a multimodal collection of RGB-D data for detecting real-world hazards, as described in a 2025 IEEE conference paper. It was created by Shruti Brahma and Khaled Sayed and uploaded to Hugging Face by ShrutiBrahma. The dataset was last updated on April 6, 2026.

Tags: Multimodal, RGB-D, Computer Vision, Hazard Detection, +1 more

CISR Multimodal Preprocessed Cache

A preprocessed cache for multimodal data, likely intended for machine learning workflows. The dataset is hosted on Kaggle, but its specific origin, size, and creation date are unknown. Content and structure must be verified after download.

Tags: Multimodal, Machine Learning, Multimodal Data, Preprocessed Cache, +1 more

Cxr Vlm Data: Chest X-Ray Images for Vision-Language Models

Cxr Vlm Data is a dataset hosted on Hugging Face by user hieu3636. Its title suggests it contains chest X-ray images, likely paired with text for vision-language model training. The dataset was last updated on April 23, 2026.

Tags: Multimodal, Vision Language Model, Medical Imaging, Chest X-Ray, Radiology, +1 more

Real-World Driver Takeover Events During ADAS Engagement

ADAS-TO contains 15,705 real-world takeover events from 327 drivers across 163 vehicle models and 23 manufacturers. It is a multimodal dataset capturing the moment of control transition from ADAS to human drivers, created by HenryYHW.

Tags: Time Series, Multimodal, Driver Behavior, Vehicle Dynamics, Driving Safety, ADAS, CAN Bus, Human Factors, Autonomous Driving, Takeover; task categories: time series forecasting, video classification; size category: 10K<n<100K; language: en; arXiv: 2603.06986; license: CC BY-NC 4.0; region: US; +1 more

Pencil Physics-SOTA: Performance Metrics for 34 Multimodal Reasoning Architectures

High-fidelity performance metrics for 34 state-of-the-art multimodal reasoning architectures. The dataset appears to be a benchmarking collection for AI models that process and reason across multiple data types, such as images and text. The source, author, and specific metrics are not detailed in the provided metadata.

Tags: Multimodal, AI Benchmarking, Computer Vision, Multimodal Reasoning, Natural Language Processing, Performance Metrics, +1 more

LLaVA Instruct Mix SFT: A Multimodal Instruction Dataset

An instruction-tuning dataset likely designed for training or fine-tuning large language and vision assistant models. The dataset is published on Kaggle, but details on its size, creator, and specific content are not provided in the metadata. Its title suggests it contains a mix of data for supervised fine-tuning (SFT) aligned with the LLaVA project's multimodal approach.

Tags: Multimodal, Multimodal Data, LLM Training, Instruction Tuning, +1 more

iNat21-1shot-fewshots: One-Shot Training Data for Hierarchical Visual Recognition

StevenHH2000 released this training dataset on March 19, 2026 for a CVPR 2026 paper on taxonomy-aware representation alignment. It consists of randomly sampled one-shot examples per category from the iNaturalist2021 dataset. The data includes images paired with text questions and coarse-to-fine category labels.

Tags: Multimodal, Parquet, Vision, Taxonomy Aware, Visual Recognition, Few Shot Learning, Computer Vision, Naturalist, Large Language Model; task category: image-text-to-text; size category: 1K<n<10K; modalities: text, image; libraries: polars, mlcroissant, datasets, pandas; language: en; region: US; +1 more

OmniScience: 1M-10M Multi-modal Pairs for Scientific Image Understanding

OmniScience provides between 1 million and 10 million multi-modal records for scientific image understanding, released by UniParser in January 2026. The data pairs scientific imagery with text to support image-to-text tasks, following a collection phase completed in September 2025.

Tags: optimized-parquet, Parquet; task category: image-to-text; size category: 1M<n<10M; modalities: text, image; libraries: polars, dask, mlcroissant, datasets; arXiv: 2602.13758, 2512.15098; license: CC BY-NC-SA 4.0; region: US; +1 more

Remotion Video Gen: Multi-Stage Educational Video Synthesis Benchmark

45 courses and over 200 source documents form a benchmark for grounded synthesis. The dataset includes line-level citation ground truth from professional educators and programmatic video output in React code. Pairwise human preferences provide expert votes on output quality as a signal for reinforcement learning.

Tags: Multimodal, Educational Content, React, Benchmark, Video Generation, Human Preferences, Citation Grounding, +1 more

TCM Pretrain Data ShizhenGPT: Traditional Chinese Medicine Corpus and Image-Text Dataset

Over 5 billion tokens of Traditional Chinese Medicine text from websites and books, alongside a large-scale image-text dataset, form the pretraining data for ShizhenGPT. The dataset was created by CarsonnnNN and released on Hugging Face, with a last recorded update in March 2026. It is described as the largest existing open-source TCM corpus and image-text dataset for pretraining.

Tags: Multimodal, Traditional Chinese Medicine, Multimodal LLM, Computer Vision, Medical Text, Pretraining Corpus, Natural Language Processing, Medical Images, +1 more

Nemotron RLHF GenRM v1: Preference Data for Training Generative Reward Models

A dataset designed to train Generative Reward Models (GenRMs) using reinforcement learning at scale. It was created by NVIDIA and last updated on March 11, 2026. The data is composed of preference data from diverse domains and a synthetic safety blend, structured with a 'meta-prompt' format.

Tags: Text, Reward Modeling, Benchmark, Synthetic Safety, Preference Data, Reinforcement Learning, LLM Training, Synthetic, +1 more
