DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,947 datasets

BLIP-Base: Vision-Language Pre-training Model

BLIP-Base is a pre-trained model for vision-language understanding tasks, hosted on Kaggle. The specific dataset content, such as the number of image-text pairs or the training corpus, is not detailed in the provided metadata. Its availability on a major data science platform suggests it is intended for AI/ML practitioners working with multimodal data.

Tags: Multimodal, Pre-Trained Model, Vision Language, Multimodal AI, Image Captioning, +1 more

LLaVA-LoRA-nil-final-weights-v3: Fine-Tuned Multimodal Model Weights

This entry provides final model weights for a fine-tuned LLaVA (Large Language-and-Vision Assistant) model, likely produced with LoRA (Low-Rank Adaptation). It is published on Kaggle, but its specific content, size, and creation details are not provided in the available metadata. The title suggests it contains parameters for a vision-language model, potentially for tasks like image captioning or visual question answering.

Tags: Multimodal, LoRA, LLaVA, Multimodal AI, Model Weights, +1 more

LLaVA-LoRA-Oracle-Final: Multimodal Model Fine-Tuning Data

LLaVA-LoRA-Oracle-Final appears to be a dataset for fine-tuning multimodal large language models. The title suggests it is likely associated with the LLaVA (Large Language-and-Vision Assistant) project and involves LoRA (Low-Rank Adaptation) techniques. Published on Kaggle, its specific content and scale require verification after download.

Tags: Multimodal, LoRA, LLaVA, Multimodal LLM, Oracle, Fine-Tuning, +1 more

Zsfood Vlm Des: Food-Related Vision-Language Data

Zsfood Vlm Des is a dataset published on Hugging Face by author LTaiQin. The title suggests it contains food-related data, likely for vision-language model tasks. The dataset was last updated on April 21, 2026.

Tags: Multimodal, Vision Language Model, Food, +1 more

Midjourney v6 Recaptioned: 1.2M Images with Triple VLM Annotations

This dataset comprises 1,235,432 Midjourney v6 images paired with captions generated by three different vision-language models (VLMs): LLaVA, Gemini Flash 1.5, and Qwen3 VL 8B. Released by Photoroom in March 2026, it provides a large-scale collection of AI-generated art with multi-perspective textual descriptions. The data is stored in Parquet format for efficient processing in machine learning workflows.

Tags: Parquet, library:polars, task_categories:image-to-text, library:dask, size_categories:1M<n<10M, modality:text, library:mlcroissant, modality:image, library:datasets, region:us, license:mit, +1 more

YFCC100M-Diverse-50K: A 50,000-Image Subset for Multimodal Research

This subset contains 50,000 diverse images for multimodal retrieval and vision-language research. The dataset is sourced from the YFCC100M collection. Its specific creation date, author, and update frequency are unknown.

Tags: Image, Multimodal, Image Subset, Vision Language, Computer Vision, Multimodal Retrieval, +1 more

FAMMA: A Multimodal Dataset Without Context

FAMMA appears to be a multimodal dataset, likely containing multiple data types such as images, text, or audio. The dataset is hosted on Kaggle, but its specific contents, size, and creation details are not provided in the available metadata. Its origin, author, and the time period it covers are currently unknown.

Tags: Multimodal, Machine Learning, AI Training, +1 more

OmniVideo-R1: 100K-1M Records for Audio-Visual Reasoning

OmniVideo-R1 provides between 100,000 and 1,000,000 preprocessed records for audio-visual reasoning, published by jankin123 in March 2026. The collection supports a two-stage training framework for multimodal models, specifically focusing on Query-Intensive (QI) grounding and modality attention.

Tags: Multimodal, JSON, arxiv:2602.05847, library:polars, task_categories:question-answering, library:dask, language:en, task_categories:visual-question-answering, modality:text, size_categories:100K<n<1M, library:mlcroissant, task_categories:video-text-to-text, library:datasets, Video Understanding, region:us, Reasoning, Reinforcement Learning, license:apache-2.0, +1 more

Open-Personix: Person-Centered Multimodal Captions and Metadata

Open-Personix is a person-centered multimodal dataset of fewer than 1,000 records maintained by Poralus and updated in March 2026. It provides structured JSON entries containing relative image paths, natural-language captions, and descriptive person-specific annotations.

Tags: Multimodal, JSON, task_categories:text-generation, library:polars, language:en, Annotations, size_categories:n<1K, modality:text, library:mlcroissant, Metadata, library:datasets, library:pandas, People, region:us, task_categories:text-classification, Captions, license:mit, +1 more
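
As a rough illustration of the Open-Personix entry structure described above (relative image paths, captions, person-specific annotations), the sketch below parses a small JSON list and resolves each relative path against a dataset root. The field names `image`, `caption`, and `person`, the sample records, and the root directory are all hypothetical; the real schema should be checked against the dataset after download.

```python
import json
from pathlib import Path

# Invented sample records; the field names are assumptions, not the
# dataset's confirmed schema.
SAMPLE = """
[
  {"image": "images/0001.jpg", "caption": "A person reading on a bench.",
   "person": {"pose": "seated"}},
  {"image": "images/0002.jpg", "caption": "A person walking a dog.",
   "person": {"pose": "walking"}}
]
"""


def load_entries(raw_json: str, root: Path) -> list[dict]:
    """Parse JSON records and add an `image_path` resolved under `root`."""
    resolved = []
    for entry in json.loads(raw_json):
        record = dict(entry)  # copy so the parsed input stays untouched
        record["image_path"] = root / entry["image"]
        resolved.append(record)
    return resolved


# The root directory here is a placeholder for wherever the data is unpacked.
entries = load_entries(SAMPLE, Path("/data/open-personix"))
for e in entries:
    print(e["image_path"], "|", e["caption"])
```

Keeping the stored paths relative and joining them to a root at load time means the same metadata file works wherever the images are unpacked.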

LLaVA-LoRA-Nil-Final-Weights-V2: Vision-Language Model Fine-Tuning Parameters

LLaVA-LoRA-Nil-Final-Weights-V2 is a set of model weights published on Kaggle. The title suggests it relates to fine-tuning a Large Language and Vision Assistant (LLaVA) model using Low-Rank Adaptation (LoRA). The specific content, size, and provenance of the weights are unknown.

Tags: Multimodal, Vision Language, Large Language Model, Model Weights, +1 more

Hidden Phys VQA Datasets V1: Physics Visual Question Answering

This is a dataset for visual question answering in the domain of physics, published on Hugging Face. It was uploaded by the user 'mlcf-robot' and last updated on April 15, 2026. Its specific content, size, and structure are not detailed in the available metadata.

Tags: Multimodal, Science QA, Physics, Visual Question Answering, +1 more

BilgeAI: Turkish Language Model Training Data for Instruction and Pretraining

BilgeAI is a collection of Turkish text datasets for language model training, created by author vural2123 and last updated on March 28, 2026. The repository is structured into separate folders for instruction tuning and raw text pretraining. Each folder contains JSONL files with specific formats for different training tasks.

Tags: Text, JSON, task_categories:text-generation, license:other, library:polars, library:dask, size_categories:10M<n<100M, Text Generation, modality:text, Turkish Language, library:mlcroissant, library:datasets, Pretraining, Language Model, language:tr, task_categories:other, region:us, Turkish, +1 more
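
The BilgeAI description mentions JSONL files with different formats for instruction tuning and raw-text pretraining. The sketch below shows one way such files could be read line by line. The field names (`instruction`, `output`, `text`) and the sample lines are hypothetical; the listing does not document the actual formats.

```python
import io
import json
from typing import Iterator, TextIO


def read_jsonl(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON") from exc


# In-memory stand-ins for files; the schemas below are assumptions.
instruction_file = io.StringIO(
    '{"instruction": "Merhaba de.", "output": "Merhaba!"}\n'
    "\n"
    '{"instruction": "Bir renk yaz.", "output": "Mavi"}\n'
)
pretraining_file = io.StringIO('{"text": "Bu bir deneme metnidir."}\n')

instruction_records = list(read_jsonl(instruction_file))
pretraining_records = list(read_jsonl(pretraining_file))
print(len(instruction_records), len(pretraining_records))
```

Streaming one line at a time avoids loading a large JSONL file into memory in one go.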

LLaVA-LoRA-nil-final-weights-2: Fine-Tuned Multimodal Model Weights

LLaVA-LoRA-nil-final-weights-2 is a set of model weights published on Kaggle. The title suggests it contains parameters for a fine-tuned version of the LLaVA (Large Language-and-Vision Assistant) model, likely using LoRA (Low-Rank Adaptation) techniques. No details on the training data, model size, or performance metrics are provided in the available metadata.

Tags: Multimodal, Multimodal AI, Model Weights, LLM Fine-Tuning, LoRA Adapters, +1 more

SingMoSub: 37 Hours of Singing-Driven 3D Head Motion with Subtitles

SingMoSub provides over 37 hours of synchronized multimodal data for singing-driven 3D head motion, featuring motion subtitles and acoustic descriptions. The dataset was created by ZikaiHuang and last updated on March 1, 2026. It offers temporally aligned, region-level motion annotations for modeling expressive head and facial dynamics.

Tags: Audio, Multimodal, Audio Visual, Facial Animation, Motion Capture, license:cc-by-nc-4.0, region:us, Singing, +1 more

Mobile-O Pre-Train: 9 Million Text-Image Pairs for Cross-Modal Alignment

Amshaker's dataset provides 9 million text-image pairs for the first-stage pre-training of the Mobile-O multimodal model. The data is intended to align a diffusion decoder and conditioning projector with a frozen vision-language backbone. The dataset was last updated on Hugging Face in February 2026.

Tags: Multimodal, WebDataset, Text Image Pairs, task_categories:image-to-text, library:webdataset, size_categories:10M<n<100M, task_categories:text-to-image, modality:text, library:mlcroissant, modality:image, library:datasets, Pretraining, On-Device AI, Computer Vision, arxiv:2602.20161, license:cc-by-nc-4.0, Cross-Modal Alignment, region:us, Large Scale, Mobile-O, +1 more

Aegis-Safety-DPO: PolarAI's Safety Alignment Preference Data

Aegis-Safety-DPO is a manually curated preference dataset designed for Direct Preference Optimization and Group Relative Policy Optimization. Created by PolarAI, the dataset focuses on training models to refuse malicious requests rather than provide preachy or evasive responses. The dataset was last updated on March 1, 2026.

Tags: Text, AI Safety, Preference Data, DPO, LLM Alignment, +1 more
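
For readers unfamiliar with how preference data of the kind Aegis-Safety-DPO describes is consumed, here is a minimal sketch of the standard Direct Preference Optimization loss for a single prompt/chosen/rejected triple. The record layout, the log-probability values, and `beta=0.1` are all illustrative assumptions; none of it is taken from the dataset or from PolarAI's training setup, and Group Relative Policy Optimization is not covered.

```python
import math

# Hypothetical preference record; the field names are assumptions.
record = {
    "prompt": "<a malicious request>",
    "chosen": "<a brief, direct refusal>",
    "rejected": "<a preachy or evasive reply>",
}


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) is rewritten as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Made-up log-probabilities, used only to show the direction of the loss.
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy equals reference
improved = dpo_loss(-4.0, -9.0, -5.0, -5.0)   # policy favors the chosen reply
print(baseline, improved)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference model and lowers the rejected one's.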

VideoChat2-IT-clean: A Cleaned Video Instruction Tuning Dataset

VideoChat2-IT-clean is a cleaned version of the VideoChat2-IT video instruction tuning dataset, released alongside the ICLR 2026 paper 'Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs'. The dataset was created by author 'byminji' and was last updated on March 3, 2026.

Tags: Multimodal, AI Training, Video Understanding, Video LLM, +1 more

SmolLM3-3B Base Model Blind Spots and Failure Cases

SmolLM3-3B-Base Blind Spots is a curated set of failure cases for the HuggingFaceTB/SmolLM3-3B-Base model. The dataset contains prompts, expected aligned behavior, and the model's actual outputs, illustrating common failure patterns. It was created by aneeshadas02 and last updated in March 2026.

Tags: Text, JSON, task_categories:text-generation, Safety, Blind Spots, library:polars, Text Generation, size_categories:n<1K, modality:text, Safety Testing, library:mlcroissant, Base Model Evaluation, library:datasets, library:pandas, LLM Evaluation, region:us, language:pt, license:apache-2.0, Failure Analysis, +1 more

MicroLens Images 384: Microscopy Specimens for Visual Question Answering

This collection contains 75,491 PNG images of microscopy specimens at a resolution of 384 by 384 pixels. It includes diatoms and fungal spores and is described as a companion dataset for MicroLens Visual Question Answering tasks. Its author, organization, and license are unknown.

Tags: Image, Microscopy Images, Diatoms, Visual Question Answering, +1 more

RLHF Clean: Reinforcement Learning from Human Feedback Data

The title RLHF_clean suggests a dataset for training AI models using reinforcement learning from human feedback. It is published on Kaggle, but its specific content, size, and origin are not detailed in the provided metadata. The dataset's actual structure and intended use require verification after download.

Tags: Text, Language Model, AI Training, Reinforcement Learning, Human Feedback, +1 more
