Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
A dataset of over 155,000 annotated samples for localizing hallucinations in Vision-Language Models. Created by author uunicee, it spans three tasks and four hallucination types. The dataset was last updated in July 2025.
A partial dataset from the MAmmoTH2 project, containing instruction data primarily sourced from web forums like StackExchange. The data is described as very high-quality and is intended to boost large language model performance through instruction tuning. The dataset was authored by TIGER-Lab and last updated on Hugging Face on October 27, 2024.
A multimodal dataset of physics problems designed for chain-of-thought reasoning. It contains 2,100 problems across three domains: 1,000 on Kinematics, 600 on Electricity and Circuits, and 500 on Thermodynamics. The dataset was created by Vikhrmodels and last updated on August 4, 2024.
MINT-1T is an open-source multimodal dataset containing 1 trillion text tokens and 3.4 billion interleaved images, representing a tenfold scale-up from prior open-source collections. It was created by a team from the University of Washington to support research in multimodal pretraining, incorporating sources like PDFs and ArXiv papers.
MMEB-V2 from TIGER-Lab is a benchmark dataset for evaluating multimodal AI models. It expands on its predecessor to include five new tasks focused on video and visual document analysis. The dataset was last updated in November 2025.
A dataset containing 50 million entries designed to improve Vision-Language Models' ability to ground semantic concepts in visual features. Created by Salesforce, it was last updated in February 2025. The data supports tasks requiring precise localization of objects and understanding of referring expressions.
A dataset of textual visual context for image captioning, built on the publicly available COCO caption dataset. An October 2023 update adds a SwinV2 classifier for generating visual caption cosine scores with person labels.
DIM-Edit contains between 100,000 and 1,000,000 records designed to improve precise image editing in unified multimodal models. Released by stdKonjac in October 2025, the data supports the Draw-In-Mind (DIM) framework, which rebalances designer and painter roles in diffusion-based architectures. The collection is provided in Parquet format and is associated with the DIM-4.6B model series and arXiv paper 2509.01986.
TikZ drawings and natural language captions are paired to facilitate the automated generation of LaTeX-based diagrams. This public version excludes certain drawings due to licensing but provides tools for full dataset recreation via the DaTikZ repository.
5,000 test images from the MSCOCO 2014 collection paired with human-annotated captions for image-text retrieval tasks. The data follows the Karpathy split, a standard benchmark for evaluating cross-modal alignment between visual features and natural language descriptions.
A dataset designed for instruction tuning in multimodal settings involving visual interaction data. It was created by nyu-visionx and released in 2024 to address the scarcity of high-quality multimodal instruction-tuning data. The dataset aims to maintain the language abilities of multimodal large language models.
40,000 samples with five strictly aligned modalities provide foundational data for AI systems to interpret CAD drawings. The dataset, created by jackluoluo and last updated in October 2025, is designed to address the challenge of understanding and utilizing computer-aided design data.
A collection of instruction and toxic alignment datasets for 14 Indic languages, created by ai4bharat and last updated on July 25, 2024. The datasets include subsets like IndicAlign-Instruct, Indic-ShareLlama, and IndicAlign-Toxic, which were translated using IndicTrans2. The full curation process is detailed in an associated arXiv paper.
A dataset curated from Investopedia using a technique that scrapes unstructured data and employs an LLM to generate structured question-answer pairs. The dataset generation includes a self-verification method intended to reduce the probability of LLM hallucinations. The dataset was created by FinLang and was last updated on May 6, 2024.
A dataset introduced in a 2025 paper titled 'Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari'. It supports research in transliterating the ancient Modi script of Maharashtra into the modern Devanagari script used for Marathi and other languages. The dataset was created by author historyHulk and last updated on Hugging Face in September 2025.
A translated dataset for Direct Preference Optimization (DPO) derived from the Skepsun/cvalues_rlhf source. The prompt and rejected response fields contain outputs from the huihui-ai/Huihui-gpt-oss-20b-mxfp4-abliterated-v2 model, while the chosen response field uses outputs from openai/gpt-oss-20b. The dataset was created by author puwaer and last updated on November 15, 2025.
Reasoning traces generated by Gemini-2.5-pro for the Robo2VLM-1 visual question answering benchmark. The traces contain logical, step-by-step explanations that justify correct answers for robotic manipulation tasks across diverse, in-the-wild environments.
PubMedVision is a large-scale medical visual question answering dataset built from image-text pairs extracted from PubMed. FreedomIntelligence enhanced the data quality using GPT-4V and added annotations for body parts and modality. The dataset was updated in February 2025.
400,000 human preference responses from 82,000 unique annotators evaluating text-to-image model outputs. The dataset categorizes feedback into preference, coherence, and alignment metrics for large-scale model ranking.
A multimodal dataset containing approximately 120,000 image-text pairs for reasoning tasks, created by WeThink and last updated on May 15, 2025. The description indicates it aggregates images from multiple established sources including COCO, Visual Genome, and TextVQA. It is hosted on the Hugging Face platform.