DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

UniWorld V1: 10,000 Geneval-Style Image-Text Pairs for Semantic Encoding

UniWorld V1 provides between 1,000 and 10,000 image-text pairs sourced from the BLIP3o-60k collection, released by LanguageBind in June 2025. It utilizes Geneval-style annotations to facilitate the training of high-resolution semantic encoders for unified visual understanding and generation.

WebDataset; Size Categories: 1K<n<10K; Modality: text, image; Library: webdataset, mlcroissant, datasets; Region: us; License: MIT; arXiv: 2506.03147

FaceCaptionHQ-4M: 4 Million Facial Image-Text Pairs

FaceCaptionHQ-4M is a dataset containing approximately 4 million facial image-text pairs. It was created by OpenFace-CQUPT and was last updated on 2025-06-09. The dataset is a cleaned subset derived from the larger FaceCaption-15M dataset.

Image, Text, Multimodal, Image Text Pairs, Computer Vision, Face Captioning, Facial Images

Unsafe Illegal Activity Image Captions Dataset

A dataset of image captions depicting unsafe or illegal activities, hosted on HuggingFace. The dataset was created by Lenkashell and was last updated on July 16, 2025. The specific content, scale, and structure of the data are not detailed in the available metadata.

Multimodal, Image Captions, Unsafe Content, Illegal Activity, Computer Vision, Content Moderation

Safe Illegal Activity Image Captions

A dataset of image captions related to illegal activities, created by Lenkashell and last updated on July 16, 2025. The dataset is hosted on HuggingFace, but its specific content, size, and structure are not detailed. Its intended purpose appears to be for training or evaluating content safety models.

Multimodal, Image Captions, Content Safety, Illegal Activity, Computer Vision

SLAKE: 642 Medical Images with Multi-Task Annotations

SLAKE contains 642 medical image samples with multi-task annotations, curated by Voxel51 based on research published in 2021 (Arxiv 2102.09542). It provides a specialized dataset for medical visual question answering and computer vision, featuring labels for classification, detection, and segmentation tasks.

Image, Medical, ImageFolder; Size Categories: 1K<n<10K; Modality: image; Task Categories: object detection, image classification, image segmentation; Language: en; Library: fiftyone, mlcroissant, datasets; Region: us; License: CC BY 4.0; arXiv: 2102.09542

ReEdit-Bench: Benchmark Dataset for Exemplar-Based Image Editing

ReEdit-Bench is a curated dataset of approximately 1,500 samples for evaluating exemplar-based image editing methods. It was created by tarun-menta and presented in a WACV '25 paper. Each sample contains four images representing an exemplar edit pair.

Image, Multimodal, Benchmark, Computer Vision, Image Editing, Diffusion Models

Harmful and Harmless Text Examples in Portuguese and English

A dataset of harmful and harmless language examples, aggregated from seven source datasets, including Anthropic/hh-rlhf and allenai/real-toxicity-prompts. The data is available in both Portuguese and English.

Parquet, Harm, Toxicity; Size Categories: 10K<n<100K; Modality: text; Task Categories: text classification; Language: en; Library: polars, mlcroissant, datasets, pandas; Region: us; License: Apache 2.0; arXiv: 2406.11039

HDTF: High-Definition Talking Face Videos and Audio for Avatar Synthesis

400 full-length high-definition talking face videos, split into 81-frame clips and paired with audio embeddings. The dataset was curated by global-optima-research and last updated on June 4, 2025. It is intended for tasks in talking-head generation and multimodal avatar synthesis.

Audio, Time Series, Video, Multimodal, Multimodal Synthesis, Talking Face Generation, Video Clips, Audio Embeddings

SlimOrca Dedup: 363k Deduplicated Instruction-Response Pairs

Open-Orca's SlimOrca Dedup is a dataset of 363,000 unique instruction-response examples derived from the SlimOrca collection. It was created by removing RLHF instances and applying minhash and Jaccard similarity techniques for deduplication. The dataset was last updated on Hugging Face on May 19, 2025.

Text, Language Model, Deduplication, Synthetic Data
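
The description above mentions MinHash and Jaccard similarity for deduplication. As a rough illustration of that general technique (not Open-Orca's actual pipeline), the sketch below estimates Jaccard similarity between word-shingle sets with MinHash signatures and keeps only texts that are not near-duplicates of an earlier one. The function names, the 3-word shingle size, the 64 hash functions, and the 0.8 threshold are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (k is an assumed value)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """64-bit hash of an item under a given seed (stands in for one random permutation)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_perm=64):
    """For each seeded hash function, keep the minimum hash value over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def dedup(texts, threshold=0.8, num_perm=64):
    """Keep a text only if it is below the threshold against every text kept so far."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(shingles(text), num_perm)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept
```

This pairwise comparison is quadratic in the number of kept texts. At the scale of hundreds of thousands of examples, a real pipeline would more likely bucket signatures with locality-sensitive hashing before comparing candidates.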

Argus: Hallucination and Omission Scores for Video-Language Models

ARGUS is a framework for calculating hallucination and omission costs in free-form video captions. The dataset, created by tomg-group-umd, provides metrics to quantify the degree of hallucinated and omitted content in video-language model outputs. It was last updated on June 10.

Multimodal, Omission Evaluation, Hallucination Evaluation, Benchmark, Video LLM, Multimodal Evaluation

OlympiadBench: 1,000+ Bilingual Multimodal Math and Physics Problems

OlympiadBench contains between 1,000 and 10,000 bilingual scientific problems in mathematics and physics, designed for evaluating AGI reasoning. Created by Hothan and published at ACL 2024, the dataset includes both text-based and multimodal questions in English and Chinese.

Parquet, Physics, Math; Size Categories: 1K<n<10K; Modality: text, image; Task Categories: question answering, visual question answering; Language: zh, en; Library: polars, mlcroissant, datasets, pandas; Region: us; License: Apache 2.0; arXiv: 2402.14008

Vietnamese Dialogue Dataset for Instruction Tuning

A dataset of short, natural Vietnamese dialogues for fine-tuning language models like Mamba, LLaMA, and Gemma. It contains everyday communication, frequently asked questions, and emotional responses, formatted as JSONL and ready for instruction tuning. The dataset was created by hoanghai2110 for the Vietnamese open-source AI community.

JSON; Size Categories: 1K<n<10K; Modality: text; Language: vi; Library: polars, mlcroissant, datasets, pandas; Region: us; License: Apache 2.0

Primus-Reasoning: Cybersecurity Tasks for LLM Training

PRIMUS is a pioneering collection of open-source datasets for cybersecurity LLM training. The Primus-Reasoning subset contains multiple cybersecurity reasoning tasks sourced from CTI-Bench, including CTI-RCM, CTI-VSP, CTI-ATE, and CTI-MCQ. It was augmented in June 2025 with distilled samples from DeepSeek-R1, incorporating intermediate reasoning steps and final answers.

Text, Cybersecurity, Cyber Threat Intelligence, LLM Training, Reasoning Tasks

PuzzleWorld: 667 Real-World Puzzlehunt Problems for AI Reasoning

667 real-world puzzlehunt-style problems curated from Puzzled Pint's Creative Commons archives between 2010 and 2025. The dataset, created by author hzli1202 and last updated on June 10, 2025, is designed as a benchmark to evaluate open-ended, multimodal reasoning in AI models.

Multimodal, AI Benchmark, Benchmark, Creative Commons, Puzzle Solving, Multimodal Reasoning

Skywork-OR1-RL-Data: 100K-1M RL Problems with 0-16 Difficulty Levels

Skywork-OR1-RL-Data is a reinforcement learning training dataset containing between 100,000 and 1,000,000 text records released by Skywork in April 2025. The collection features problems categorized by difficulty levels ranging from 0 to 16, calibrated against specific DeepSeek-R1-Distill-Qwen model variants.

Parquet; Size Categories: 100K<n<1M; Modality: text; Library: polars, dask, mlcroissant, datasets; Region: us; arXiv: 2505.22312

PLM-Video Human: Human-Annotated Video Data for Vision Language Models

PLM-Video-Human is a collection of human-annotated resources for training Vision Language Models, focused on detailed video understanding. The dataset, created by Facebook, was last updated on May 21, 2025. Training tasks include fine-grained open-ended question answering, region-based video captioning, dense captioning, and temporal localization.

Time Series, Video, Multimodal, Vision Language Models, Question Answering, Computer Vision, Video Understanding, Human Annotated, Video Captioning

MedBookVQA: A Multimodal Benchmark from Medical Textbooks

MedBookVQA is a multimodal benchmark built from open-access medical textbooks to evaluate general medical AI (GMAI) and multimodal large language models (MLLMs). The dataset was created by slyipae1 and last updated on June 10, 2025. It aims to address the underutilization of structured textbook knowledge for systematic AI evaluation.

Multimodal, AI Evaluation, Benchmark, Healthcare, Medical VQA, Medical Textbook, Multimodal Benchmark

PLM-VideoBench: Human-Annotated Resources for Vision-Language Model Evaluation

PLM-VideoBench is a collection of human-annotated resources for evaluating Vision Language models, focused on detailed video understanding. The dataset includes evaluation data for tasks like FGQA, which probes fine-grained activity understanding through multiple-choice questions. It was authored by Facebook and last updated on May 21, 2025.

Multimodal, Vision Language, Benchmark, Computer Vision, Video Understanding, Human Annotated, Multimodal Evaluation

SVG: Synthetic Visual Genome Datasets for Scene Graph Understanding

Synthetic Visual Genome (SVG) datasets are designed for training Vision-Language Models on scene graph understanding and dense visual relationships. The datasets were created by author jamepark3922 and were last updated on June 11, 2025. They are hosted on the Hugging Face platform.

Multimodal, Scene Graphs, Vision Language Models, Visual Relationships, Synthetic Data, Synthetic

MemoryBench: Spatial Memory and Action Recall for Robotic Manipulation

MemoryBench provides benchmark tasks across spatial memory and action recall categories for robotic manipulation. It serves as the evaluation foundation for the SAM2Act+ framework, focusing on the integration of visual foundation models with memory architectures.

Task Categories: robotics; Region: us; License: Apache 2.0; arXiv: 2501.18564
