Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,548 datasets
Nemo Instruction Following Chat Translate is a text dataset published on Hugging Face by author pihull. The platform tags suggest it contains multilingual text formatted for instruction following and chat translation tasks, likely intended for large language model training. The dataset was last updated on February 11, 2026.
A collection of 45,882 prompt samples designed for Reinforcement Learning from Human Feedback training. Created by NVIDIA, this dataset supports language model alignment and was last updated in December 2025.
This Reinforcement Learning from Human Feedback training dataset comprises 45,882 samples. NVIDIA created it for language model alignment, and it was last updated in December 2025.
A dataset whose title suggests multimodal data for cognitive load classification. No details on size, features, origin, or creation date are available.
ServiceNow's GroundCUA dataset provides real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers 87 software platforms across productivity, browser, creative, communication, development, and system utility categories. The dataset was last updated on December 24, 2025.
VitaSet is a multimodal dataset for physical property reasoning, combining RGB vision and tactile sensing. It contains 5,145 human-verified question-answer pairs across three tasks: hardness classification, material property description, and surface roughness classification. The dataset was created by Bupt-Joy and last updated on 2025-12-29.
SenseNova-SI-800K is a dataset created by SenseNova to address deficiencies in spatial intelligence for multimodal foundation models. It is built upon established models like Qwen3-VL and InternVL3 and was last updated on December 23, 2025. The dataset is hosted on Hugging Face and is categorized as containing between 100K and 1M entries.
A dataset for developing real-time sepsis risk prediction models using a federated learning approach. The data likely contains multimodal clinical information from intensive care units, though specific sources and scale are not detailed. It was sourced from Kaggle under the 'Research' tag, but the author, organization, and last update date are unknown.
ZeroBench is a visual reasoning benchmark containing fewer than 1,000 image-text pairs designed to challenge contemporary Large Multimodal Models (LMMs). Created by Jonathan Roberts and associated with arXiv paper 2502.09696, the dataset was updated in December 2025 to include refined hierarchical question structures. It focuses on tasks that were considered nearly unsolvable for multimodal models at the time of its release.
Introduced in the paper 'LLaVA-CoT: Let Vision Language Models Reason Step-by-Step', this dataset is designed to enable Vision-Language Models to perform autonomous multistage reasoning. It integrates 100,000 samples from various visual question-answering sources with structured reasoning annotations. The dataset was authored by Xkev and last updated on the Hugging Face platform in December 2025.
A dataset for multimodal, wearable-based detection of panic episodes, combining EEG and other sensor data. It likely contains physiological signals collected from wearable devices. It is hosted on Kaggle and tagged for research purposes.
A hybrid multimodal dataset for diagnosing faults in Heating, Ventilation, and Air Conditioning (HVAC) systems. The dataset is associated with a research paper proposing a Bayesian Tensor-Network approach. It was sourced from Kaggle and is categorized under the 'Research' tag.
A dataset titled 'Wafermap Vqa With Rubrics 2602 V2', published on Hugging Face by author Niraya666. The dataset was last updated on 2026-02-09. The title suggests it contains wafermap images and associated rubrics for visual question answering tasks, likely related to semiconductor manufacturing quality control.
Multimodal Damage Identification for Humanitarian Computing is a dataset from the UCI Machine Learning Repository. It is designed for assessing damage in disaster scenarios, likely combining multiple data types such as images and text. The dataset's creator and specific size are not detailed in the provided metadata.
A dataset of 30,000 images from the GQA dataset, intended for training Visual Question Answering models. It is tagged for scene understanding and computer vision tasks, with associated English text.
A collection of astronomy images paired with text captions stored in JSON format, intended for fine-tuning Vision-Language Models (VLMs). It is tagged for applications in image captioning, computer vision, and multimodal AI. The number of rows and columns and the file size are unknown.
EarthDial-Dataset is a curated collection of 10,000 to 100,000 evaluation-only records for remote sensing and Earth observation, released by akshaydudhane and last updated in December 2024. It benchmarks vision-language models (VLMs) on real-world satellite and aerial imagery across tasks including classification, object detection, and change detection.
C3 is a cross-view cross-modality correspondence dataset containing 90,000 paired floor plans and photographs. It covers 597 scenes with 153 million pixel-level correspondences and 85,000 camera poses. The dataset was created by kwhuang and last updated on the platform in January 2026.
T2AV-Compass is a benchmark dataset created by NJU-LINK for evaluating Text-to-Audio-Video (T2AV) generation models. It targets unimodal quality, cross-modal alignment, complex instruction following, and perceptual realism. The dataset was last updated on December 25, 2025.
A Visual Question Answering dataset derived from the BD3 Building Defect Dataset. It pairs images of building surfaces with questions and defect category answers, designed for training and evaluating Vision-Language Models. The dataset was created by author 'chandrabhuma' and was last updated on December 27, 2025.