Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,541 datasets
624 Japanese-language Visual Question Answering annotations across 116 receipt images for business document OCR evaluation. Created by icoxfog417 and updated in March 2026, the collection focuses on extracting structured data from financial documents.
LongVT-Parquet provides the training data annotations and evaluation benchmark for the LongVT project. The dataset supports an end-to-end agentic framework for 'Thinking with Long Videos' via interleaved Multimodal Chain-of-Tool-Thought. It was created by 'longvideotool' and last updated on March 9, 2026.
The Space Vision Dataset is a multimodal collection of space-related images paired with descriptive captions. It includes imagery of planetary views, telescopes, galaxies, and Mars rover scenes, designed for tasks like image captioning and vision-language modeling.
A multimodal subset of data from IMDb, likely containing information related to movies. The dataset was created by msubhaditya and is hosted on Hugging Face. It was last updated on May 1, 2026.
The Pakistan Top Cities Quality of Life Dataset provides urban livability and safety metrics for major Pakistani cities. The dataset likely contains human preference data related to urban living conditions. It was sourced from Kaggle, but the author, organization, and last update date are unknown.
Over 200,000 comparisons of large language model responses were collected from more than 3,500 unique annotators. The dataset is multilingual, containing comparisons in English, French, Italian, Hindi, and Portuguese. It was created by Facebook and last updated on Hugging Face in February 2026.
VisWorld-Eval is a task suite for assessing multimodal reasoning with visual world modeling. It comprises seven tasks spanning synthetic and real-world domains, each designed to isolate specific atomic world-model capabilities. The dataset was authored by 'thuml' and last updated on Hugging Face on March 9, 2026.
UltraMix is a lean, high-quality preference optimization dataset curated from five open-source DPO corpora. It was created by aladinDJ using the Magpie Annotation Framework and a reward-driven curation pipeline, and was last updated on Hugging Face in February 2026. The dataset removes noisy, low-reward, or redundant preference pairs while preserving task balance.
Cxr Vlm Data is a dataset hosted on Hugging Face by user hieu3636. Its title suggests it contains chest X-ray images, likely paired with text for vision-language model training. The dataset was last updated on April 23, 2026.
ADAS-TO contains 15,705 real-world takeover events from 327 drivers across 163 vehicle models and 23 manufacturers. It is a multimodal dataset capturing the moment of control transition from ADAS to human drivers, created by HenryYHW.
High-fidelity performance metrics for 34 state-of-the-art multimodal reasoning architectures. The dataset appears to be a benchmarking collection for AI models that process and reason across multiple data types, such as images and text. The source, author, and specific metrics are not detailed in the provided metadata.
StevenHH2000 released this training dataset on March 19, 2026 for a CVPR 2026 paper on taxonomy-aware representation alignment. It consists of randomly sampled one-shot examples per category from the iNaturalist2021 dataset. The data includes images paired with text questions and coarse-to-fine category labels.
OmniScience provides between 1 million and 10 million multimodal records for scientific image understanding, released by UniParser in January 2026. The data pairs scientific imagery with text to support image-to-text tasks, following a collection phase completed in September 2025.
45 courses and over 200 source documents form a benchmark for grounded synthesis. The dataset includes line-level citation ground truth from professional educators and programmatic video output in React code. Pairwise human preferences provide expert votes on output quality as a signal for reinforcement learning.
Over 5 billion tokens of Traditional Chinese Medicine text from websites and books, alongside a large-scale image-text dataset, form the pretraining data for ShizhenGPT. The dataset was created by CarsonnnNN and released on Hugging Face, with a last recorded update in March 2026. It is described as the largest existing open-source TCM corpus and image-text dataset for pretraining.
A dataset designed to train Generative Reward Models (GenRMs) using reinforcement learning at scale. It was created by NVIDIA and last updated on March 11, 2026. The data is composed of preference data from diverse domains and a synthetic safety blend, structured with a 'meta-prompt' format.
Medical VQA Vi is a dataset for visual question answering in the medical domain, uploaded to Hugging Face by SpringWang08. It was last updated on April 25, 2026 at 17:31:23. The dataset's specific content, scale, and structure are not detailed in the available metadata.
80,000 outfit pairs link multiple reference garment images to a model wearing the complete look. ArtmeScienceLab created this dataset for high-fidelity virtual try-on research, with a test set released in March 2026. Each pair includes 3 to 12 reference images, averaging 4.48 items per outfit.
A derived version of the Sentinel-2 Land Cover Dataset, precomputed and reformatted for direct use with the Clay foundation model for Earth observation. The dataset was prepared by author wtr001 and last updated on March 17, 2026. It is designed to bypass typical preprocessing steps like tiling during data loading for training or inference pipelines.
A 500-example subset of structured vehicle diagnostic logs was created by CJJones and last updated in March 2026. It contains logs for vehicle types and subsystems like transmissions, battery systems, brakes, and engines. Each entry includes parameters such as fault codes, performance metrics, measurements, temporal trends, and maintenance recommendations.