DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

ProductSearch Datasets Browse Topics Rankings Community API / MCP

ResourcesDocumentation Blog Changelog Status

LegalPrivacy Policy Terms of Service Cookie Policy

Multimodal & LLM Datasets | DataSalon

All Categories

🔗

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,947 datasets

Multimodal & LLM

Video Summarization Dataset with Most Replayed Viewer Engagement Scores

52,678 in-the-wild videos feature synchronized visual, audio, and text data. Ground-truth importance scores are derived from YouTube's 'Most Replayed' statistics, reflecting collective viewer engagement. The dataset was created by author hminjeong and was last updated in March 2026.

AudioVideoMultimodalSize Categories10 Kn100 KArxiv260301169LanguageenTask CategoriessummarizationLicensecc By 40Collective EngagementComputer VisionModalityvideoSummarizationRegionusMultimodal FusionLarge ScaleVideo Summarization+1

0 views

Multimodal & LLM

Human Preference Labels for AI-Generated Video Motion Quality

57,866 pairwise human preference labels compare 4 frontier video generation models. Datapoint AI collected these annotations across 3 quality dimensions for 417 unique prompts covering 11 motion categories. The dataset was last updated in March 2026.

TabularOPTIMIZED-PARQUETParquetSize Categories1 Kn10 KLibrarypolarsTask Categoriesreinforcement LearningRlhfLibrarydaskLanguageenAi EvaluationModalitytextHuman MotionModalitytabularLibrarymlcroissantLibrarydatasetsPreference DataLicensecc By 40Video GenerationHuman PreferencesRegionusTask Categoriestext To VideoMotion QualityTask Categoriesvideo ClassificationSynthetic+1

0 views

Multimodal & LLM

XRFv2 Plus: Multimodal Human Activity Data with WiFi, IMU, and Visual Sensors

Multimodal sensor data includes WiFi signals, inertial measurement units, AirPods audio, depth/IR cameras, DensePose, human pose, mesh, and action labels. The dataset appears to be designed for complex human activity analysis and sensor fusion tasks. Its origin, size, and collection methodology are not specified in the available metadata.

MultimodalAction RecognitionVision LanguageMultimodal AiQuestion AnsweringHuman PoseCaptioningSensor Fusion+1

0 views

Multimodal & LLM

Nemotron-SFT Chat V2: Synthetic Dialogues from Qwen3 and GLM-4.6

NVIDIA released this synthetic dialogue dataset in March 2026 to improve model interactivity and instruction following. It contains multi-turn conversations generated by an ensemble of high-parameter models including Qwen3-235B, GLM-4.6, and Kimi-K2-Thinking.

Task Categoriestext GenerationLanguageenRegionusLicenseodc By+1

0 views

Multimodal & LLM

ArSyra Instruction Tuning: Arabic Dialectal LLM Fine-Tuning Data

Arabic instruction-tuning data combining instruction-following pairs, instruction descriptions, freeform responses, and quality control data. The dataset contains over 3,300 records designed for fine-tuning Arabic language models, created by ArSyra and last updated in March 2026.

TextDialectal DataArabic LanguageNatural Language ProcessingLlm Fine TuningInstruction Tuning+1

0 views

Multimodal & LLM

MBE2.0: E-Commerce Product Images, Titles, and Annotations

A Chinese multimodal benchmark for e-commerce product understanding, released following a legal and privacy review aligned with China's PIPL. The dataset includes original images, product titles, and category/attribute annotations, with all personally identifiable information removed. It was created by author ZHNie and last updated on March 23, —.

MultimodalProduct UnderstandingE CommerceMultimodal LearningBenchmarkComputer VisionNatural Language Processing+1

0 views

Multimodal & LLM

VLM_JSON: Vision-Language Model Data

VLM_JSON likely contains data formatted for training or evaluating Vision-Language Models. The dataset is published on Kaggle, but its specific content, size, and creation details are not provided. Its title suggests a focus on multimodal tasks combining visual and textual information.

MultimodalVision Language ModelMultimodal DataJson Format+1

0 views

Multimodal & LLM

Creative Professionals Agentic Tasks: 1M Synthetic Operations for 36 Software Environments

A synthetic dataset of 1,070,917 agentic command operations for 36 creative, technical, and engineering software environments. Created by rAVEUK and last updated on March 15, 2026, it is engineered to stress-test and evaluate multimodal AI agents operating within complex software infrastructures.

MultimodalAgentic AiSynthetic TasksSoftware InteractionLarge ScaleMultimodal AgentsSynthetic+1

0 views

Multimodal & LLM

Visual-Centric Instruction Following Dataset for MLLM Training

10,000 entries support training and evaluating Multimodal Large Language Models on visual instruction following. The dataset is structured in a messages format with user instructions and assistant responses, referencing images from sources like LLaVA-Instruct and Visual Genome. It was created by KerenStone for research published in the paper 'Empowering Reliable Visual-Centric Instruction Following in MLLMs'.

LanguageenArxiv260103198RegionusLicensemit+1

0 views

Multimodal & LLM

Creative Professionals Agentic Tasks 1M

Creative Professionals Agentic Tasks 1M is a massive-scale synthetic dataset containing 1,070,917 agentic command operations across 36 diverse software environments. It is specifically engineered to stress-test, evaluate, and fine-tune multimodal AI agents designed for complex software interaction and multi-step reasoning. The dataset spans creative, technical, and engineering domains to provide a robust training ground for deep software infrastructure operations.

ParquetTextTask Categoriestext GenerationModality3dLibrarypolarsTask Categoriesquestion AnsweringLibrarydaskSize Categories1 Mn10 MLanguageenModalitytextModalitytabularLibrarymlcroissantModalityimageLibrarydatasetsLibrarypandasModalityvideoRegionusTask Categoriesany To AnyLicensemit+1

0 views

Multimodal & LLM

EveNet Tt2L Quantumcorrelation: Particle Collision Data for Foundation Model Training

EveNet is a foundation model for particle collision data analysis, as described in the arXiv preprint arXiv:2601.17126. The dataset was uploaded by Avencast and last updated on March 31, 2026. Its specific content and scale are not detailed in the provided metadata.

TextTabularParquetFoundation ModelLibrarypolarsQuantum CorrelationLibrarydaskSize Categories10 Mn100 MModalitytextModalitytabularLibrarymlcroissantLibrarydatasetsCollision DataRegionusArxiv260117126Particle PhysicsLicensemit+1

0 views

Multimodal & LLM

BraiDyn-BC: Mouse Neocortex Imaging and Behavior During Lever-Pull Learning

A multimodal dataset linking wide-field calcium imaging of the mouse neocortex to behavioral measurements during a motor skill learning task. It includes 15 sessions over two weeks from 25 mice trained to pull a lever for water rewards, with simultaneous high-speed videography and environmental monitoring. The dataset is formatted in the Neurodata Without Borders (NWB) standard and adheres to FAIR principles.

VideoMultimodalMouse ModelCalcium ImagingImagingLife SciencesBehavioral DataNeuroscienceMus musculusMotor Learning+1

0 views

Multimodal & LLM

MicroLens VQA — Hackathon: 75,000+ Image-QA Pairs of Diatoms and Fungal Spores

75,491 image and question-answer pairs depicting microscopic organisms, specifically diatoms and fungal spores. The dataset covers 95 genera and is released under a CC-BY 4.0 license. It was created for a hackathon event on the Kaggle platform.

MultimodalMicroscopy ImagesBiologyComputer VisionDiatomsVisual Question Answering+1

0 views

Multimodal & LLM

Non-STEM Textbook Arabic: 1,000-10,000 Educational Records for LLMs

InfoBayAI published this Arabic non-STEM textbook sample in March 2026, providing between 1,000 and 10,000 records for LLM training. It is derived from a larger multilingual corpus of 1.9 billion words across 27,000 textbooks and is structured for instruction tuning and evaluation.

ParquetSize Categories1 Kn10 KLibrarypolarsLanguagearLibrarydaskModalitytextLibrarymlcroissantLibrarydatasetsLicensecc By 40RegionusNon StemText Book+1

0 views

Multimodal & LLM

Visual-Centric Instruction Following Dataset For MLLM Training

VCIF-10K provides data for training Multimodal Large Language Models on visual instruction following tasks. The dataset is structured in a messages format with user instructions and assistant responses, referencing images from sources like LLaVA-Instruct and Visual Genome. It was created by WoofWoof and supports both Supervised Fine-Tuning and Direct Preference Optimization training paradigms.

LanguageenArxiv260103198RegionusLicensemit+1

0 views

Multimodal & LLM

HORA: Hand–Object to Robot Action Dataset for Cross-Embodiment Learning

HORA is a large-scale multimodal dataset that converts human hand–object interaction demonstrations into robot-usable supervision. It combines HOI-style annotations like MANO hand parameters and object pose with embodied-robot learning signals such as end-effector trajectories under a unified canonical action space. The dataset was created by HORA-DB and last updated on Hugging Face in March 2026.

MultimodalModality3dSize Categories100 Kn1 MLibrarymlcroissantModalityimageLibrarydatasetsRoboticsComputer VisionRegionusHand Object InteractionLarge ScaleLicenseapache 20+1

0 views

Multimodal & LLM

Creative Professionals Agentic Tasks 1M

A synthetic task dataset of 1,070,917 agentic command operations for testing multimodal AI agents. The dataset is engineered for evaluating AI agents operating within complex software infrastructures like creative and engineering tools. It was created by author kryp1234 and last updated on March 15, 2026.

MultimodalAgentic AiSynthetic TasksSoftware InteractionLarge ScaleMultimodal EvaluationSynthetic+1

0 views

Multimodal & LLM

VisionFoundry-10K: 10,000 Synthetic VQA Triples Across 10 Vision Tasks

VisionFoundry-10K provides 10,000 synthetic image-question-answer triples across 10 vision-centric tasks, released by TheMartyr in 2026. The data is produced via a pipeline where an LLM generates prompts, a text-to-image model synthesizes visuals, and a multimodal verifier filters for alignment.

MultimodalParquetSize Categories10 Kn100 KVisionLibrarypolarsLanguageenTask Categoriesvisual Question AnsweringModalitytextLibrarymlcroissantLibrarydatasetsLibrarypandasRegionusSynthetic+1

0 views

Multimodal & LLM

Qetuoadgjl Weights1: LLaVA Model Checkpoints from Pre-training and SFT

Five LLaVA model checkpoints uploaded by author xym93168 on Hugging Face in April 2026. The checkpoints document different training stages, including pre-training and supervised fine-tuning, with varying GPU configurations and batch sizes. Specific checkpoints include 'Pre_32gpu_llava_bs8_0111_1epoch/checkpoint-2181' and 'sft_8gpu_llava_bs08_0106_1epoch/checkpoint-2353'.

MultimodalLlavaMultimodal LlmPretrainingCheckpointsModel Weights+1

0 views

Multimodal & LLM

TextEditBench: A Benchmark for Reasoning-Aware Text Editing

TextEditBench is a benchmark for evaluating reasoning-aware text editing across 14 topics and 6 task types. It was created by CSU-JPG and last updated on March 9, 2026. The benchmark emphasizes scenarios requiring understanding of physical plausibility, linguistic meaning, and cross-modal dependencies.

MultimodalText EditingBenchmarkReasoning BenchmarkNlp TasksMultimodal Evaluation+1

0 views

PreviousPage 33 of 97Next