DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,932 datasets

Nemotron RL Instruction Following Free Form Formatting V1: Text Formatting Training Data

A dataset created by NVIDIA Corporation on April 10, 2026, designed to teach models to follow arbitrary text formatting instructions. It uses explicit regex and string matching for the reward signal and is intended for commercial or non-commercial use.

Tags: Text, Text Formatting, NLP Training, Reinforcement Learning, Instruction Following, +1 more
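
The description above says the reward signal comes from explicit regex and string matching. As a rough illustration of that idea only (this is not NVIDIA's reward code; the function name `formatting_reward`, its parameters, and the example pattern are all hypothetical), a binary check of that kind might look like this:

```python
import re
from typing import Optional, Sequence

def formatting_reward(
    response: str,
    pattern: Optional[str] = None,
    must_contain: Sequence[str] = (),
) -> float:
    """Hypothetical binary reward: 1.0 if every check passes, else 0.0."""
    # Regex check: the whole response must match the pattern.
    if pattern is not None and re.fullmatch(pattern, response, flags=re.DOTALL) is None:
        return 0.0
    # String-matching check: every required substring must be present.
    if any(required not in response for required in must_contain):
        return 0.0
    return 1.0

# Illustrative instruction: "answer with exactly three '- ' bullet lines".
THREE_BULLETS = r"(?:- [^\n]+\n){2}- [^\n]+"

good = formatting_reward("- a\n- b\n- c", pattern=THREE_BULLETS)
bad = formatting_reward("- a\n- b", pattern=THREE_BULLETS)
```

A check like this is deterministic and cheap to evaluate, which is one plausible reason to prefer it over a learned reward model for formatting tasks.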

Nemotron RL Instruction Following Citation Formatting V1

Nemotron-RL-Instruction-Following-CitationFormatting-v1 is a dataset from NVIDIA Corporation designed to teach models to cite specific document parts using reference markers like [ref:1]. It supports single-reference, multi-reference, and inline citations and is ready for commercial and non-commercial use. The dataset was created and last modified on April 10, 2026.

Tags: Text, Text Generation, Citation Formatting, Large Language Model, Instruction Following, +1 more
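
To make the `[ref:1]` marker style mentioned above concrete, here is a minimal sketch of how such markers could be extracted and sanity-checked. It is not taken from the dataset's tooling: the helper names are hypothetical, and the assumption that reference numbers are 1-indexed positive integers is inferred only from the single `[ref:1]` example in the description.

```python
import re
from typing import List

# Assumed marker shape, based only on the "[ref:1]" example in the listing.
REF_MARKER = re.compile(r"\[ref:(\d+)\]")

def cited_refs(text: str) -> List[int]:
    """Return the reference numbers cited in text, in order of appearance."""
    return [int(number) for number in REF_MARKER.findall(text)]

def citations_valid(text: str, num_documents: int) -> bool:
    """Hypothetical check: at least one citation, all within 1..num_documents."""
    refs = cited_refs(text)
    return bool(refs) and all(1 <= ref <= num_documents for ref in refs)

# A multi-reference citation placed inline at the end of a sentence.
answer = "Water boils at 100 C at sea level [ref:1][ref:3]."
refs = cited_refs(answer)
```

The same pattern covers single-reference, multi-reference, and inline usage, since each marker is matched independently wherever it appears.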

Stera-10M: 200 Hours of Egocentric Multimodal Sensor Data

Stera-10M is an open egocentric multimodal dataset for embodied AI, robotics, world models, and spatial intelligence. It contains 200 hours of synchronized first-person recordings across 500+ sessions from 20 contributors in 20+ unique environments, with 10 million RGB frames, LiDAR depth, and ARKit data. The dataset was captured end-to-end on commodity iPhone Pro hardware through the open Stera platform and is authored by fpvlabs.

Tags: Video, Multimodal, Spatial Intelligence, Robotics, Large Scale, Egocentric Vision, Embodied AI, Multimodal Sensor, +1 more

WheelArm: Multimodal Dataset of Wheelchair-Mounted Robot Arm Demonstrations

WheelArm is a real-robot dataset collected from a Kinova Gen3 6-DOF manipulator arm mounted on a powered wheelchair. Each episode captures a single assistive daily-living task performed by a human operator. The dataset includes synchronized RGB video, depth, robot kinematics, audio, and natural-language dialogue with ambiguity annotations.

Tags: Audio, Multimodal, Assistive Robotics, Multimodal Dataset, Daily Living Tasks, +1 more

DualFusionNet vs. LLaVA-1.5-7B Performance on the MVTec Zipper Dataset

A 5.5 KB Excel file compares the performance of DualFusionNet and a fine-tuned LLaVA-1.5-7B model on the MVTec zipper dataset. Results are reported as mean ± standard deviation across five cross-validation folds, with metrics in percentages except for Kappa. The dataset was authored by Christopher Mai and last updated on April 29, 2026.

Tags: Tabular, Excel, Zipper, MVTec AD, Anomaly Detection, Model Comparison, Computer Vision, +1 more

Lm Wheels: Metadata for Prebuilt LLM Training and Deployment Packages

Metadata and installation instructions for prebuilt Python wheels for components of the LLM training and deployment stack, such as apex, causal-conv1d, transformer-engine, and vllm. The repository is maintained by SakanaAI and was last updated on 2026-05-28. The structure includes subfolders for specific package versions, Python versions, and CUDA toolkits.

Tags: Tabular, Python Packages, Software Distribution, LLM Training, Machine Learning Tools, +1 more

NuRisk: Visual Question Answering for Autonomous Driving Risk Assessment

NuRisk is a visual question answering dataset focusing on risk assessment for autonomous driving. Each row contains a BEV image, a driving-related question, and a ground truth answer. The dataset was created by Yuan-avs and was last updated on Hugging Face in May 2026.

Tags: Multimodal, BEV Imagery, Risk Assessment, Computer Vision, Autonomous Driving, Visual Question Answering, +1 more

LangFabric: 6,384 Language-Annotated Fabric Images for Defect Detection

6,384 paired samples of high-resolution fabric images and structured textual descriptions, covering 16 fabric textures and 12 defect types. The dataset, created by Yan-Qin Ni and last updated in April 2026, includes pixel-level segmentation masks and was collected inline from an industrial air-jet loom under realistic manufacturing conditions. LangFabric is designed as a benchmark for multimodal fabric defect analysis.

Tags: Image, Multimodal, ZIP, Textile Manufacturing, Multimodal Dataset, Vision Language, Benchmark, Computer Vision, +1 more

MetricScenes: A Metrically-Grounded In-the-Wild Dataset for 3D Reconstruction

A dataset for metric-scale monocular geometry estimation, addressing limitations in current foundation models. It was created by Yuanbo Xiangli, Hanyu Chen, Xueqing Tsang, and Noah Snavely, and a project page is available. The listing was last updated on June 1, 2026.

Tags: Multimodal, Computer Vision, Metric Scene, Large Scale, 3D Reconstruction, Monocular Geometry, +1 more

SpaceDG-Bench: 1,102 Questions for MLLM Spatial Intelligence Under Visual Degradation

SpaceDG-Bench is a human-verified benchmark containing 1,102 questions designed to evaluate the spatial intelligence of Multimodal Large Language Models (MLLMs) under visual degradation. The dataset spans 11 reasoning categories and 9 visual degradation types, yielding over 10,000 visual question answering (VQA) instances. It was created by author xlzhou126 and last updated on May 24, 2026.

Tags: Multimodal, Spatial Intelligence, Visual Degradation, Multimodal LLM, Benchmark, VQA Benchmark, +1 more

Multimodal-STEM-HLE-plus-plus: A STEM Benchmark for Frontier LLMs

TuringEnterprises created a multimodal STEM dataset designed to push the limits of state-of-the-art large language models. The dataset was last updated on May 13, 2026. Its design is empirically proven to address the bottleneck of finding data at the right difficulty level for current frontier models.

Tags: Multimodal, LLM Benchmark, Machine Learning, STEM, +1 more

PANORAMA-NOC4PC-Multimodal: Patentability Judgments with Drawings

PANORAMA-NOC4PC-Multimodal extends a text-only patentability benchmark by adding patent drawings. The dataset is based on the PANORAMA benchmark from NeurIPS 2025 Datasets & Benchmarks. It was uploaded by user sungjae98 to Hugging Face and last updated in June 2026.

Tags: Multimodal, Patent Law, Decision Trails, Benchmark, Patent Examination, Multimodal Benchmark, +1 more

Tabverse: A Multimodal Benchmark for Cross-Format Table Understanding

MBZUAI and SUTD researchers created Tabverse, a controlled multimodal table benchmark. It aligns HTML, Markdown, and LaTeX table representations with rendered PNG images to evaluate table understanding in large language and vision models. The dataset was last updated on June 5, 2026.

Tags: Multimodal, Benchmark, LLM Evaluation, Table Understanding, VLM Evaluation, Multimodal Benchmark, Synthetic, +1 more

Perturbation Bench: A Benchmark of Sequence-Level Perturbation Tasks for DNA Models

A benchmark of sequence-level perturbation tasks for evaluating DNA foundation models. It contains pairs of genomic sequences, including tasks like synonymous codon substitution with 20,000 pairs each for human and mouse. The dataset was created by HuggingFaceBio and was last updated on May 19, 2026.

Tags: Tabular, Sequence Perturbation, Benchmark, Healthcare, Genomics, DNA Foundation Models, +1 more

ECom-RF-IMMR: Four Image-to-Multimodal Item Retrieval Benchmarks

Four image-to-multimodal item retrieval datasets are contained in this repository. The collection includes two newly constructed evaluation datasets and two adapted from public e-commerce benchmarks. The datasets were created by author xyxy01 and last updated on 2026-05-27.

Tags: Image, Multimodal, E-Commerce, Benchmark, Image Retrieval, Computer Vision, +1 more

OpenArt Painterly Foundations: 13,786 Public-Domain Artworks with Structured Captions

13,786 public-domain artworks form the painterly core of the OpenArt collection. The dataset includes 9,107 paintings and illustrations, 4,596 photographed objects, and 83 unclassified works, each paired with a structured VLM caption. It was created by author jaddai and last updated on Hugging Face in May 2026.

Tags: Multimodal, Painting, Art History, Fine Art, Image Captioning, Public Domain, +1 more

TAMMI: Multimodal Remote Sensing Visual Question Answering Dataset

TAMMI is a large-scale multimodal Visual Question Answering dataset for remote sensing, introduced at the CVPR 2025 EarthVision Workshop. It combines three complementary satellite image modalities with question-answer annotations automatically generated from official geographic databases. The dataset was authored by HichemBoussaid and last updated on the Hugging Face platform in May 2026.

Tags: Geospatial, Multimodal, Satellite Imagery, Computer Vision, Large Scale, Visual Question Answering, Synthetic, +1 more

OpenArt Mythic Creatures: 12,933 Artworks with Structured Captions

OpenArt — Mythic Creatures is a collection of 12,933 public-domain artworks depicting mythological and fantastical beings. The dataset includes 4,053 paintings or illustrations and 8,828 photographed objects, each paired with a structured VLM caption and metadata on medium, attribution, and inscriptions. It was created by author jaddai and last updated on Hugging Face in May 2026.

Tags: Multimodal, Mythology, Image Captioning, Art, Public Domain, +1 more

OpenArt Portraits Classical: 28,011 Public-Domain Artworks with Structured Captions

28,011 public-domain artworks focusing on the human figure and portraiture, created by jaddai. The collection includes 13,868 paintings or illustrations and 13,970 photographed objects, each paired with a structured VLM caption and metadata on medium, attribution, and inscriptions. The dataset was last updated on May 27, 2026.

Tags: Multimodal, Classical Art, Portraits, Art History, Public Domain, +1 more

SVFSearch: A Multimodal Benchmark for Short-Video Frame Search in Gaming

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming vertical domain. The dataset was created by the svfsearch organization and was last updated on the Hugging Face platform in May 2026. It is described as a multimodal knowledge-intensive benchmark.

Tags: Multimodal, Frame Search, Short Video, Benchmark, Gaming, Multimodal Benchmark, +1 more
