Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
Magma is a foundation model for multimodal AI agents developed by researchers from Microsoft Research, University of Maryland, University of Wisconsin-Madison, KAIST, and University of Washington. The dataset, last updated on April 12, 2025, is associated with a project page, arXiv paper, and GitHub repository. It likely contains multimodal data for training and evaluating AI agents capable of processing and reasoning across different data types.
BioMed-VITAL Instructions is a dataset for tuning multimodal AI models on biomedical visual tasks with clinician preference alignment. It contains multiple files ranging from 60,000 to 210,000 instruction samples, with file sizes from 127 MB to 463 MB. The dataset was created by authors including Hejie Cui, Lingjun Mao, and Carl Yang, and was last updated on August 17, 2024.
Core-Five is a multi-modal geospatial dataset built for foundation models, unifying Earth Observation data from five essential sensors into aligned spatiotemporal datacubes. It includes optical Sentinel-2 data at 10m resolution and other sensor data for multi-modal vision tasks.
UCSC-VLAA provides a tokenized version of the PMC-VQA dataset for medical vision-language understanding. The dataset includes GPT-4o generated reasoning and was last updated on August 15, 2025. It is part of the MedVLThinker project, which offers several curated datasets for medical vision-language training.
K-LLaVA-W is a Korean adaptation of LLaVA-Bench-in-the-wild, designed for evaluating vision-language models. The benchmark was created by translating the original English text into Korean, with human reviewers checking the translations for naturalness. It was published by NCSOFT and last updated on July 25, 2025.
245,000 instruction examples across text, visual, and signal modalities support the fine-tuning of ShizhenGPT, a specialized model for Traditional Chinese Medicine. FreedomIntelligence created and released this collection, with its latest update in August 2025.
WorldCuisines is a massive-scale benchmark for multilingual and multicultural visual question answering focused on global cuisines. The associated paper was accepted to NAACL 2025 and received the Best Theme Paper award. The dataset was last updated on November 14, 2025.
Recap-DataComp-1B is a large-scale image-text dataset where descriptions have been enhanced using an advanced LLaVA-1.5-LLaMA3-8B model. The dataset was created by UCSC-VLAA and was last updated in January 2025.
178,510 caption entries and 960,792 open-ended question-answer pairs were compiled by lmms-lab for training the LLaVA-Video model. This multimodal dataset aggregates video-language data from five primary sources. The dataset card was last updated in October 2024.
Anthropic's HH-RLHF dataset contains between 100,000 and 1,000,000 human preference comparisons focused on model helpfulness and harmlessness, released in 2022. These text-based records are designed to facilitate the training of reward models for Reinforcement Learning from Human Feedback (RLHF) rather than supervised fine-tuning.
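As a rough illustration of how preference comparisons of this kind are commonly used for reward modeling, the sketch below computes a pairwise (Bradley-Terry style) loss from two scalar reward scores. The function name and the example scores are hypothetical, and nothing here is taken from the dataset's documentation or from Anthropic's training code.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response outranks the rejected one.

    Equals -log(sigmoid(reward_chosen - reward_rejected)).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign of the margin
    # so that exp() is only ever called on a non-positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores from a reward model for one comparison.
loss = pairwise_preference_loss(reward_chosen=1.2, reward_rejected=0.3)
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and equals log 2 when the two scores are identical.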
Caption3o-LongCap-v4 is a large-scale, high-quality image-caption dataset designed for training and evaluating image-to-text models. It is derived from prithivMLmods/blip3o-caption-mini-arrow and additional curated sources, emphasizing long-form captions covering a wide range of real-world and artistic scenes. The dataset was last updated on September 15, 2025 by prithivMLmods.
Argilla's 7,000-pair dataset, built with the distilabel tool, is designed for Direct Preference Optimization (DPO) training of chat models. This preview version, released on July 16, 2024, is based on the LDJnr/Capybara dataset and aims to address the scarcity of multi-turn dialogue preference data used in major RLHF works. A full version with more model responses is planned for a future release.
PKU-SafeRLHF is a dataset for AI safety research, particularly for reducing harmful outputs from language models. It was created by the PKU-Alignment Team and was last updated in October 2024. The dataset includes single-dimension preference data, question-answer pairs, and prompts.
ENVIDAT presents stated preference data on improved forest management measures from seven Swiss municipalities in Grisons and Valais. The data were collected via an online questionnaire between October 2019 and February 2020, which received 939 responses from 10,289 invited households. It includes a choice experiment with twelve tasks assessing willingness to pay for avalanche and rockfall risk reduction, alongside sociodemographic and attitudinal questions.
Tables 1-6 from USGS Open-File Report 02-59 contain data on salinity, discharge, and stage (water level) related to culverts under the main road in Everglades National Park. The data were gathered as part of a 2002 study by the South Florida Natural Resources Center and USGS to assess the road's influence on salinity intrusion into Florida Bay. Monitoring sites recorded water level, salinity, and flow during periods when water was present.
Caption3o-XL-v4 is a large-scale, high-quality dataset derived from prithivMLmods/blip3o-caption-mini-arrow and other curated sources. It is designed for training and evaluating image-to-text models, with an emphasis on long-form captions covering a wide range of real-world and artistic scenes. The dataset is in Parquet format, contains English text, and was last updated on September 15, 2025.
NVIDIA, UC Berkeley, and UCSF released this collection of 100,000 to 1,000,000 records in 2025 for training Describe Anything Models (DAM). The data consists of localized image and video captions stored in WebDataset tar files to support vision-language tasks.
OpenSpaces is a synthetic dataset for spatial visual question answering created using VQASynth. It is synthesized from the first 30,000 rows of the localized narratives split of The Cauldron, emphasizing greater diversity in image distribution compared to related datasets. The dataset was authored by remyxai and last updated on October 25, 2024.
This large-scale collection pairs astronomical images with descriptive captions and synthetic question-answer pairs, and is designed for training visual language models. The dataset was created by UniverseTBD and last updated on July 28, 2025. It combines imagery from NASA's Astronomy Picture of the Day, the European Southern Observatory's public archive, and ESA's Hubble Space Telescope.
MINT-1T is an open-source multimodal interleaved dataset containing 1 trillion text tokens and 3.4 billion images, a tenfold increase in scale compared to prior open collections. It was created by a team from the University of Washington and includes data from previously untapped sources like PDFs and arXiv papers. The dataset was uploaded to the platform in September 2024.