Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,539 datasets
A dataset designed to strengthen multi-turn, interactive capabilities, including open-ended chat and precise instruction following. The chat subset uses human-written prompts from sources like lmarena, lmsys, and wildchat as seed prompts, with responses generated by GLM-5 and selected via pairwise comparisons using a reward model. It was authored by NVIDIA and last updated on the platform in June 2026.
1.1 GB of data supporting a method for automated long-term tracking of Antarctic ice shelf rift propagation. The dataset, authored by Zixiao Guo and last updated in May 2026, is shared under a CC-BY-4.0 license. It likely contains spatiotemporal corrections and tracking results derived from satellite imagery.
PP2-M is a multimodal dataset derived from Place Pulse 2.0, enriched with additional geospatial modalities. The dataset includes aligned pairs of street view images, remote sensing images from Sentinel-2, cartographic data, and geographical coordinates. It was created by DominikM198 and last updated on Hugging Face in May 2026.
Nemotron RL Instruction Following Structured Outputs V2 is a dataset for evaluating large language models on structured output generation. It was created by NVIDIA and last updated on June 4, 2026. The dataset includes two splits testing capabilities like freeform text generation and diversified tasks across multiple data formats.
Nemotron-RL-Instruction-Following-CitationFormatting-v1 is a dataset from NVIDIA Corporation designed to teach models to cite specific document parts using reference markers like [ref:1]. It supports single-reference, multi-reference, and inline citations and is ready for commercial and non-commercial uses. The dataset was created and last modified on April 10, 2026.
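As a rough illustration of the marker style mentioned above, the sketch below pulls reference numbers out of a response. It assumes markers always have the shape `[ref:N]` with an integer N, as in the `[ref:1]` example; the function name `extract_refs` and the sample sentence are hypothetical and not taken from the dataset.

```python
import re

# Assumption: markers look like "[ref:N]" with a positive integer N.
REF_MARKER = re.compile(r"\[ref:(\d+)\]")


def extract_refs(text: str) -> list[int]:
    """Return the cited reference numbers in order of appearance."""
    return [int(number) for number in REF_MARKER.findall(text)]


# Hypothetical response mixing a single citation and adjacent multi-reference markers.
sample = "Rift growth accelerated [ref:1], matching earlier surveys [ref:2][ref:3]."
print(extract_refs(sample))
```

A checker built on this could compare the extracted numbers against the set of documents the prompt supplied, although the dataset's actual validation logic is not described in the listing.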
A dataset created by NVIDIA Corporation on April 10, 2026, designed to teach models to follow arbitrary text formatting instructions. It uses explicit Regex and string matching for the reward signal and is intended for commercial or non-commercial use.
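A reward based on regex and string matching can be pictured along the following lines. This is a minimal sketch under stated assumptions, not the dataset's actual scoring code: the function `format_reward`, the binary 1.0/0.0 scale, and the sample instruction are all hypothetical.

```python
import re


def format_reward(response: str, pattern: str, required: list[str]) -> float:
    """Return 1.0 if the whole response matches `pattern` and contains
    every string in `required`; otherwise return 0.0."""
    if re.fullmatch(pattern, response, flags=re.DOTALL) is None:
        return 0.0
    if not all(item in response for item in required):
        return 0.0
    return 1.0


# Hypothetical instruction: "Answer in exactly three bullet lines that
# start with '- ' and mention the word 'ice'."
three_bullets = r"(?:- [^\n]+\n){2}- [^\n]+"
good = "- ice shelves calve\n- rifts propagate\n- satellites observe"
bad = "ice shelves calve, rifts propagate"

print(format_reward(good, three_bullets, ["ice"]))
print(format_reward(bad, three_bullets, ["ice"]))
```

An explicit check like this is deterministic and cheap to run, which is one plausible reason to prefer it over a learned judge for formatting tasks.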
Stera-10M is an open egocentric multimodal dataset for embodied AI, robotics, world models, and spatial intelligence. It contains 200 hours of synchronized first-person recordings across 500+ sessions from 20 contributors in 20+ unique environments, with 10 million RGB frames, LiDAR depth, and ARKit data. The dataset was captured end-to-end on commodity iPhone Pro hardware through the open Stera platform and is authored by fpvlabs.
WheelArm is a real-robot dataset collected from a Kinova Gen3 6-DOF manipulator arm mounted on a powered wheelchair. Each episode captures a single assistive daily-living task performed by a human operator. The dataset includes synchronized RGB video, depth, robot kinematics, audio, and natural-language dialogue with ambiguity annotations.
A 5.5 KB Excel file compares the performance of DualFusionNet and a fine-tuned LLaVA-1.5-7B model on the MVTec zipper dataset. Results are reported as mean ± standard deviation across five cross-validation folds, with metrics in percentages except for Kappa. The dataset was authored by Christopher Mai and last updated on April 29, 2026.
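For readers unfamiliar with the "mean ± standard deviation across folds" convention, the sketch below shows one way such a summary could be computed. The fold scores are made up for illustration and are not values from the file, and the use of the sample standard deviation (rather than the population form) is an assumption.

```python
import statistics


def summarize(fold_scores: list[float]) -> str:
    """Format per-fold scores as 'mean ± sample standard deviation'."""
    mean = statistics.mean(fold_scores)
    spread = statistics.stdev(fold_scores)  # sample (n - 1) standard deviation
    return f"{mean:.2f} ± {spread:.2f}"


# Hypothetical accuracy percentages for five cross-validation folds.
accuracy_by_fold = [90.0, 92.0, 91.0, 89.0, 93.0]
print(summarize(accuracy_by_fold))
```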
Metadata and installation instructions for prebuilt Python wheels for components of the LLM training and deployment stack, such as apex, causal-conv1d, transformer-engine, and vllm. The repository is maintained by SakanaAI and was last updated on May 28, 2026. The structure includes subfolders for specific package versions, Python versions, and CUDA toolkits.
6,384 paired samples of high-resolution fabric images and structured textual descriptions, covering 16 fabric textures and 12 defect types. The dataset, created by Yan-Qin Ni and last updated in April 2026, includes pixel-level segmentation masks and was collected inline from an industrial air-jet loom under realistic manufacturing conditions. LangFabric is designed as a benchmark for multimodal fabric defect analysis.
A dataset for metric-scale monocular geometry estimation, addressing limitations in current foundation models. The dataset was created by Yuanbo Xiangli, Hanyu Chen, Xueqing Tsang, and Noah Snavely, and a project page is available. The dataset listing was last updated on June 1, 2026.
SpaceDG-Bench is a human-verified benchmark containing 1,102 questions designed to evaluate the spatial intelligence of Multimodal Large Language Models (MLLMs) under visual degradation. The dataset spans 11 reasoning categories and 9 visual degradation types, yielding over 10,000 visual question answering (VQA) instances. It was created by author xlzhou126 and last updated on May 24, 2026.
TuringEnterprises created a multimodal STEM dataset designed to push the limits of state-of-the-art large language models. The dataset was last updated on May 13, 2026. Its authors report empirical evidence that the design addresses the bottleneck of finding data at the right difficulty level for current frontier models.
PANORAMA-NOC4PC-Multimodal extends a text-only patentability benchmark by adding patent drawings. The dataset is based on the PANORAMA benchmark from NeurIPS 2025 Datasets & Benchmarks. It was uploaded by user sungjae98 to Hugging Face and last updated in June 2026.
MBZUAI and SUTD researchers created Tabverse, a controlled multimodal table benchmark. It aligns HTML, Markdown, and LaTeX table representations with rendered PNG images to evaluate table understanding in large language and vision models. The dataset was last updated on June 5, 2026.
A benchmark of sequence-level perturbation tasks for evaluating DNA foundation models. It contains pairs of genomic sequences, including tasks like synonymous codon substitution with 20,000 pairs each for human and mouse. The dataset was created by HuggingFaceBio and was last updated on May 19, 2026.
13,786 public-domain artworks form the painterly core of the OpenArt collection. The dataset includes 9,107 paintings and illustrations, 4,596 photographed objects, and 83 unclassified works, each paired with a structured VLM caption. It was created by author jaddai and last updated on Hugging Face in May 2026.
TAMMI is a large-scale multimodal Visual Question Answering dataset for remote sensing, introduced at the CVPR 2025 EarthVision Workshop. It combines three complementary satellite image modalities with question-answer annotations automatically generated from official geographic databases. The dataset was authored by HichemBoussaid and last updated on the Hugging Face platform in May 2026.
OpenArt — Mythic Creatures is a collection of 12,933 public-domain artworks depicting mythological and fantastical beings. The dataset includes 4,053 paintings or illustrations and 8,828 photographed objects, each paired with a structured VLM caption and metadata on medium, attribution, and inscriptions. It was created by author jaddai and last updated on Hugging Face in May 2026.