DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,944 datasets

Doc MP-DocVQA: A Document Visual Question Answering Dataset

Doc MP-DocVQA is a dataset for Visual Question Answering on documents, hosted on Kaggle. The dataset likely contains images of documents paired with questions and answers to test machine comprehension. Specific details on size, creation date, and authorship are not provided in the available metadata.

Tags: Multimodal, Document Understanding, Multimodal QA, Document VQA, +1 more

DocVQA: Document Visual Question Answering Dataset

DocVQA is a dataset for visual question answering on documents. It is hosted on Kaggle, but detailed metadata such as author, size, and license are not provided. The dataset's content and structure require verification after download.

Tags: Multimodal, Multimodal AI, Visual Question Answering, Document VQA, +1 more

FlipVQA-85K: A Multimodal Reasoning Benchmark from 544 College-Level STEM Documents

FlipVQA-85K is a high-fidelity reasoning benchmark curated from a corpus of 544 college-level educational PDF documents, including expert-authored textbooks and exercise sets. The collection spans 11 academic disciplines, primarily in STEM domains where problems involve rigorous and verifiable reasoning processes. It was created by OpenDCAI and last updated on the platform in April 2026.

Tags: Multimodal, Benchmark, STEM Education, Reasoning Benchmark, Natural Language Processing, Multimodal Assessment, Visual Question Answering, +1 more

Vibe Landing Page Arena: 36,000 Human Judgments on AI-Generated Web Designs

Vibe Landing Page Arena is a large-scale human preference dataset for evaluating AI-generated landing page design quality. It contains 36,000 pairwise judgments from 3,492 annotators comparing pages generated by four AI tools across 100 prompts and multiple design dimensions. The dataset was created by datapointai and last updated on Hugging Face in April 2026.

Tags: Multimodal, Size category: 1K<n<10K, Language: en, Task category: visual question answering, Pairwise Comparison, AI Evaluation, Vibe Coding, Human Preference, License: CC BY 4.0, Task category: image classification, AI Code Generation, Region: us, Web Design, Large Scale, Landing Pages, Design, Synthetic, +1 more
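A minimal sketch of how pairwise judgments like those in the Vibe Landing Page Arena entry could be aggregated into per-tool win rates. The record layout `(tool_a, tool_b, winner)` and the tool names are hypothetical, chosen for illustration; the dataset's actual schema should be checked after download.

```python
from collections import defaultdict


def win_rates(judgments):
    """Return each tool's share of the pairwise comparisons it won.

    `judgments` is an iterable of (tool_a, tool_b, winner) tuples.
    This layout is an assumption, not the dataset's documented schema.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for tool_a, tool_b, winner in judgments:
        if winner not in (tool_a, tool_b):
            raise ValueError(f"winner {winner!r} is not one of the compared tools")
        appearances[tool_a] += 1
        appearances[tool_b] += 1
        wins[winner] += 1
    # Win rate = comparisons won / comparisons the tool took part in.
    return {tool: wins[tool] / appearances[tool] for tool in appearances}


# Toy records with made-up tool names.
sample = [
    ("tool_a", "tool_b", "tool_a"),
    ("tool_a", "tool_c", "tool_c"),
    ("tool_b", "tool_c", "tool_c"),
    ("tool_a", "tool_b", "tool_a"),
]
rates = win_rates(sample)
```

A raw win rate ignores which opponents a tool faced; a rating model such as Bradley-Terry would be a natural next step if the comparisons are unevenly distributed across tools.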

Caveman-Style World Knowledge Dataset for Instruction Tuning

Caveman World Knowledge 150K is an instruction dataset containing approximately 150,000 entries for tuning language models. It was created by Blackbean109 and was last updated in April 2026. The dataset blends factual world knowledge responses with reactions to unknown questions.

Tags: Text, Graph, Audio, Text Generation, Style Transfer, Synthetic, +1 more

CoMM: Coherent Interleaved Image-Text Dataset

CoMM is a high-quality dataset designed to improve the coherence, consistency, and alignment of multimodal content. The dataset was created by weisuxi and was last updated on 2026-04-24. It sources raw data from diverse origins, focusing on instructional content and visual storytelling.

Tags: Multimodal, Instructional Content, Visual Storytelling, Image Text, Computer Vision, +1 more

VQA: Visual Question Answering Dataset

dataset_vqa is a dataset hosted on Kaggle. Its title suggests it contains data for Visual Question Answering tasks, which involve answering questions about images. The dataset's specific content, size, and origin are not detailed in the provided metadata.

Tags: Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1 more

PostTrainBench Trajectories: Agent Actions for LLM Fine-Tuning

Agent trajectories from PostTrainBench, a benchmark measuring CLI agents' ability to post-train pre-trained LLMs. The dataset was created by aisa-group and last updated on March 16, 2026. Each agent is given a base LLM, an evaluation script, and 10 hours on an NVIDIA H100 80GB GPU to autonomously improve model performance.

Tags: Tabular, AI Benchmark, Benchmark, LLM Fine-Tuning, Agent Trajectories, CLI Agents, +1 more

CuriaBench: Evaluation Datasets for a Multimodal Radiology Foundation Model

CuriaBench is a collection of evaluation datasets for the Curia foundation model, as described in the associated research paper. The datasets were created by the organization 'raidium', and the benchmark repository was last updated on March 31; the year is garbled in the available metadata. The data is intended to assess the performance of multimodal AI models in radiology.

Tags: Multimodal, Foundation Model, Medical Imaging, Benchmark, Radiology, Foundation Model Evaluation, Multimodal Benchmark, +1 more

CT_Bench: A Benchmark for Multimodal AI in Computed Tomography

CT_Bench is a benchmark dataset designed for evaluating multimodal artificial intelligence models in the analysis of computed tomography scans. The dataset likely contains paired medical images and associated clinical or textual data for structured evaluation tasks. Its creation and maintenance details are not provided in the available metadata.

Tags: Multimodal, CT Scans, Medical Imaging, Multimodal AI, AI Benchmark, Benchmark, +1 more

CURA-VLM: Vision-Language Model Dataset

CURA-VLM appears to be a dataset for vision-language model training or evaluation. It is hosted on Kaggle, but no further details about its size, creator, or specific content are provided. The dataset's purpose likely relates to multimodal AI tasks involving both visual and textual data.

Tags: Multimodal, Vision Language Model, Multimodal AI, Computer Vision, +1 more

Computer Network Images for MAA and VLM Evaluation

A multimodal benchmark dataset for evaluating computer vision and language models. The dataset likely contains images related to computer networks, paired with annotations for model assessment. It is hosted on Kaggle, but detailed metadata about its size, origin, and specific content is not provided.

Tags: Multimodal, Vision Language Models, Computer Networks, Benchmark Dataset, Multimodal Evaluation, +1 more

ChartNet: A Million-Scale Dataset for Multimodal Chart Interpretation

ChartNet is a large-scale, high-quality multimodal dataset designed for robust chart understanding and reasoning. It contains over one million chart samples, combining geometric visual patterns, structured numerical data, and natural language descriptions. The dataset was created by IBM Granite and was last updated in March 2026.

Tags: Multimodal, optimized-parquet, Parquet, Task category: text generation, Library: polars, Task category: image-to-text, Library: dask, Size category: 1M<n<10M, arXiv:2603.27064, Task category: visual question answering, Data Visualization, Modality: text, Multimodal Data, Library: mlcroissant, Task category: table question answering, Modality: image, Chart Understanding, Library: datasets, AI Training, Visual Language, Region: us, Large Scale, Natural Language Processing, +1 more

Cot Oracle Convqa Chunked: Conversational Question Answering Data

A dataset titled 'Cot Oracle Convqa Chunked Sonnet', authored by 'ceselder' and published on Hugging Face. The dataset was last updated on 2026-05-11. Its title suggests it contains conversational question-answering data, possibly structured for language model training.

Tags: Text, Conversational AI, Oracle, Question Answering, LLM Training, +1 more

circuit-vqa-384a: Visual Question Answering for Circuit Diagrams

A dataset titled 'circuit-vqa-384a' is hosted on Kaggle. The title suggests it contains images of electronic circuits paired with questions and answers. The dataset's author, organization, size, and specific contents are unknown and require verification after download.

Tags: Multimodal, Multimodal AI, Circuit Analysis, Computer Vision, Visual Question Answering, +1 more

Fashion Images with Text and Sentiment Labels

Fashion images paired with textual descriptions and sentiment labels, published on Kaggle. The dataset likely contains visual and textual data for analyzing consumer sentiment towards fashion items. Metadata is minimal; actual content requires verification after download.

Tags: Multimodal, Fashion, Multimodal Data, Sentiment Analysis, Computer Vision, Natural Language Processing, +1 more

OpenRoboCare: Expert Caregiving Demonstrations with 19.8 Hours of Multimodal Data

A multimodal dataset capturing 19.8 hours of expert demonstrations across 315 sessions. It includes synchronized RGB-D video, tactile sensing, eye-gaze tracking, pose annotations, and action labels from 21 occupational therapists performing 15 daily caregiving tasks. The dataset was contributed by the EmPRISE Lab at Cornell University and is hosted on AWS Open Data.

Tags: Time Series, Multimodal, Machine Learning, Caregiving, Robot Learning, Robotics, Computer Vision, Life Sciences, Health, Expert Demonstration, +1 more

KITScenes LongTail: Multi-View Driving Data for Rare Scenario Generalization

KITScenes LongTail is a dataset for end-to-end driving research focusing on long-tail events. It provides multi-view video data, vehicle trajectories, high-level instructions, and detailed reasoning traces. The dataset was created by KIT-MRT and was last updated on Hugging Face in April 2026.

Tags: Multimodal, optimized-parquet, Parquet, Library: polars, Language: zh, Library: dask, Language: en, Reasoning Traces, Size category: n<1K, Modality: text, Library: mlcroissant, Modality: image, Multi-View Video, Library: datasets, Benchmark, License: CC BY-NC 4.0, Region: us, Long-Tail Events, Language: es, Autonomous Driving, Trajectory Data, arXiv:2603.23607, +1 more

SLAKE: Medical Visual Question Answering Dataset

SLAKE is a dataset for medical visual question answering, a task combining image understanding and natural language processing. It was published on Kaggle, though the specific author, organization, and collection details are not provided in the available metadata. The dataset's size, format, and exact composition require verification after download.

Tags: Multimodal, Multimodal QA, Vision Language, Healthcare, Medical VQA, Healthcare AI, +1 more

Nutriderm Stage7 VQA: Visual Question Answering Dataset

nutriderm-stage7-vqa is a dataset hosted on Kaggle. The title suggests it is a multimodal dataset for visual question answering, likely involving images and text. The dataset's specific content, scale, and origin are not detailed in the available metadata.

Tags: Multimodal, Nutrition, Multimodal AI, Dermatology, Computer Vision, Visual Question Answering, +1 more
