Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,540 datasets
Leak-CURBER is a dataset and code package created for the NeurIPS 2026 Evaluations and Datasets track. It likely contains multimodal data for evaluating tasks related to enzymatic reactions. The dataset was uploaded by an anonymous author on May 7, 2026.
61,000 fully annotated frames collected for aerial-ground cooperative perception. The dataset integrates synchronized multimodal sensing data and state information from vehicles and UAVs, covering 19 interaction scenarios and 5 weather conditions. It was created by LOTEAT and last updated on Hugging Face in April 2026.
A multimodal dataset designed for training Vision-Language Models to identify trading exhaustion and opportunities. It was created by SpaceGhost using a Hindsight Mining technique to capture decision snapshots. The dataset was last updated on Hugging Face on 2026-04-10.
Vero-600k is a collection of data for training and evaluating general visual reasoning models, created by researchers at Princeton University's zlab. The dataset supports broad multimodal reasoning tasks across charts, STEM problems, spatial reasoning, and knowledge grounding. It was released in early 2026.
JAMMEval is a curated benchmark collection for evaluating Vision-Language Models on Japanese Visual Question Answering tasks. It refines seven existing Japanese VQA evaluation datasets through two rounds of human annotation to improve reliability. The dataset was created by llm-jp and was last updated in April 2026.
SFT-Dataset is a curated, medium-scale mixture designed to push a base model toward stronger step-by-step reasoning and reliable instruction following. The dataset was created by SeaFill2025 and was last updated on Hugging Face in April 2026. Quantities are chosen to be trainable on modest GPU budgets while keeping signal density high.
Xperience-10M is a large-scale egocentric multimodal dataset of human experience created by ropedia-ai. It is designed for research in embodied AI, robotics, and world models. The dataset was last updated on March 20, 2026.
A collection of 30,000 real-world chart images paired with detailed natural-language captions, intended for chart understanding and image-to-text research. The dataset was created by the 2077AIDataFoundation and was last updated on April 3, 2026.
INDOTABVQA is a benchmark dataset for evaluating Vision-Language Models on cross-lingual table understanding in Bahasa Indonesia document images. The dataset was created by NusaBharat and is associated with a paper accepted at ACL 2026 Findings. It was last updated on the Hugging Face platform on April 9, 2026.
The dataset is derived from the Niphad Grape Leaf Disease Dataset (NGLD), which contains high-quality images of table grape leaves categorized by disease. The original dataset was created by researchers from Symbiosis Institute of Technology and published on Mendeley Data under a CC BY 4.0 license. This version, uploaded by qingwuuu, appears to be adapted for use with visual language models.
FlipVQA-85K is a high-fidelity reasoning benchmark curated from a corpus of 544 college-level educational PDF documents, including expert-authored textbooks and exercise sets. The collection spans 11 academic disciplines, primarily in STEM domains where problems involve rigorous and verifiable reasoning processes. It was created by OpenDCAI and last updated on the platform in April 2026.
Vibe Landing Page Arena is a large-scale human preference dataset for evaluating AI-generated landing page design quality. It contains 36,000 pairwise judgments from 3,492 annotators comparing pages generated by four AI tools across 100 prompts and multiple design dimensions. The dataset was created by datapointai and last updated on Hugging Face in April 2026.
Caveman World Knowledge 150K is an instruction dataset containing approximately 150,000 entries for tuning language models. It was created by author Blackbean109 and was last updated in April 2026. The dataset blends factual world knowledge responses with reactions to unknown questions.
CoMM is a high-quality dataset designed to improve the coherence, consistency, and alignment of multimodal content. The dataset was created by author weisuxi and was last updated on 2026-04-24. It sources raw data from diverse origins, focusing on instructional content and visual storytelling.
Agent trajectories from PostTrainBench, a benchmark measuring CLI agents' ability to post-train pre-trained LLMs. The dataset was created by aisa-group and last updated on March 16, 2026. Each agent is given a base LLM, an evaluation script, and 10 hours on an NVIDIA H100 80GB GPU to autonomously improve model performance.
CuriaBench is a collection of evaluation datasets for the Curia foundation model, as described in the associated research paper. The datasets were created by the organization 'raidium' and the benchmark repository was last updated on March 31. The data is intended to assess the performance of multimodal AI models in radiology.
ChartNet is a large-scale, high-quality multimodal dataset designed for robust chart understanding and reasoning. It contains over one million chart samples, combining geometric visual patterns, structured numerical data, and natural language descriptions. The dataset was created by IBM Granite and was last updated in March 2026.
A dataset titled 'Cot Oracle Convqa Chunked Sonnet', authored by 'ceselder' and published on Hugging Face. The dataset was last updated on 2026-05-11. Its title suggests conversational question-answering data, possibly structured for language model training.
A multimodal dataset capturing 19.8 hours of expert demonstrations across 315 sessions. It includes synchronized RGB-D video, tactile sensing, eye-gaze tracking, pose annotations, and action labels from 21 occupational therapists performing 15 daily caregiving tasks. The dataset was contributed by the EmPRISE Lab at Cornell University and is hosted on AWS Open Data.
KITScenes LongTail is a dataset for end-to-end driving research focusing on long-tail events. It provides multi-view video data, vehicle trajectories, high-level instructions, and detailed reasoning traces. The dataset was created by KIT-MRT and was last updated on Hugging Face in April 2026.