Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
MC-EIU is a benchmarking dataset for joint emotion and intent understanding in multimodal conversations. The dataset was created by Rui Liu, Haolin Zuo, Zheng Lian, Xiaofen Xing, Björn W. Schuller, and Haizhou Li, with the official repository last updated on September 23, 2025. More details are available in the associated research paper.
This dataset provides 3,625 training and 200 validation examples engineered for Reinforcement Learning with Verifiable Rewards (RLVR). Created by guox18, it combines two complementary synthetic subsets with different synthesis approaches and difficulty distributions. It was last updated on August 8, 2025.
1,885 curated geometric problems across plane, spatial, and solid geometry categories form this benchmark. Each problem includes structured textual descriptions and visual diagrams for multimodal understanding. The dataset, created by OpenRaiser and updated in November 2025, leverages the Lean 4 proof assistant for formal representation.
LAION-High-Quality-Pro-6M is a 6-million-sample image-text dataset used to train Vision-Language-Vision auto-encoder models. The dataset, hosted by author ccvl on Hugging Face, was last updated on September 20, 2025. It was created for scalable knowledge distillation from diffusion models.
Emotiontalk contains 19,250 utterances totaling 23.6 hours of dyadic conversation speech recorded from 19 actors. Produced by BAAI and updated in 2025, it provides synchronized acoustic, visual, and textual data for Chinese emotional interaction analysis.
Conceptual Captions contains approximately 3.3 million images paired with captions. The captions are raw descriptions harvested from the Alt-text HTML attribute of web images, representing a wider variety of styles than curated annotations. The dataset was created by google-research-datasets.
A pre-training dataset for the OlmoEarth remote sensing foundation models, created by AllenAI. The dataset includes Sentinel-2 L2A and Sentinel-1 GRD IW imagery from the European Space Agency. It was last updated on November 1, 2025.
BigDocs-7.5M is a dataset created by ServiceNow for training multimodal models on document and code tasks, as described in the associated arXiv paper. The dataset was last updated on June 20, 2025, and is hosted on Hugging Face. It appears to contain both text and image data, with some parts distributed using an image identifier column that requires a provided script to reconstruct.
A 2024-09-01 upload of filtered VisualWebInstruct data for the OneVision training stage. The dataset, created by lmms-lab, contains subsets like ureader_kg and ureader_qa, provided as processed JSON files and compressed image folders.
BLIP3o-60k is a dataset distilled from GPT-4o for instruction tuning of text-to-image models. It includes categories such as JourneyDB, human-centric data from MSCOCO, Dalle3 outputs, Geneval, common objects, and simple text. The dataset was created by BLIP3o and last updated on May 25, 2025.
Conceptual Captions 12M (CC12M) contains 12 million image-text pairs designed for vision-and-language pre-training. It was built with a relaxed version of the CC3M dataset pipeline, and this copy is hosted by pixparse. The dataset instance was last updated on Hugging Face in December 2023.
MolmoAct is a fully open-source action reasoning model for robotic manipulation developed by the Allen Institute for AI. This dataset mixture contains a subset of OXE formulated as Action Reasoning Data, along with auxiliary robot data and a link to Multimodal Web data. The dataset page was last updated on September 10, 2025.
MSR-VTT is a benchmark dataset for text-video retrieval, containing 10,000 video clips and 200,000 captions. It was introduced in the 2016 paper 'MSR-VTT: A large video description dataset for bridging video and language' and is hosted on Hugging Face by user friedrichor. The dataset uses a standard 1K-A split protocol with training sets of 7,010 and 9,000 videos and a test set of 1,000 videos.
23,167,456 Midjourney-generated images and captions, compiled by deepghs and an anonymous provider, make up this collection as of December 2024. It contains original image files alongside metadata such as dimensions and unique identifiers.
A dataset created by nhagar on May 15, 2025, providing the URLs and top-level domains associated with training records in the HuggingFaceFW/fineweb dataset. It was created by downloading source data, extracting URLs and domains, and retaining only those identifiers to make exploring LLM training datasets more accessible.
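The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the record field name `url`, the output keys `host` and `tld`, and the helper name `extract_url_and_domain` are all assumptions made for this example, and the top-level domain is taken naively as the last label of the hostname.

```python
from urllib.parse import urlparse


def extract_url_and_domain(record: dict) -> dict:
    """Keep only URL-derived identifiers from a training record.

    Hypothetical helper: assumes each record carries its source URL under
    the key "url"; all other fields (such as the text) are dropped.
    """
    url = record["url"]
    # hostname is None for malformed or relative URLs, so fall back to "".
    host = urlparse(url).hostname or ""
    # Naive heuristic: the top-level domain is the last dot-separated label.
    tld = host.rsplit(".", 1)[-1]
    return {"url": url, "host": host, "tld": tld}


if __name__ == "__main__":
    sample = {"url": "https://www.example.com/a/b", "text": "page body"}
    print(extract_url_and_domain(sample))
```

Retaining only these identifiers keeps the derived dataset far smaller than the source records, which is what makes this kind of exploration practical.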
66,000 human-annotated audio samples of spoken mathematical equations and sentences in English and Russian form the Speech2LaTeX dataset. It is the first fully open-source large-scale dataset for converting spoken math to LaTeX, drawn from diverse scientific domains. The dataset was created by marsianin500 and last updated on November 16, 2025.
Skywork-OR1-RL-Data is a reinforcement learning training dataset containing between 100,000 and 1,000,000 text records released by Skywork in April 2025. The collection features problems categorized by difficulty levels ranging from 0 to 16, calibrated against specific DeepSeek-R1-Distill-Qwen model variants.
Miriad 5.8M contains 5.8 million medical question-answer pairs distilled from peer-reviewed biomedical literature using Large Language Models. Released in June 2025 by the Miriad research team, the dataset provides structured data for medical instruction tuning and retrieval-augmented generation. It serves as a large-scale resource for training models on verified scientific knowledge rather than general web content.
LiveBench is a benchmark for large language models designed to limit test set contamination by releasing new questions monthly. Questions are based on recently released datasets, arXiv papers, news articles, and IMDb movie synopses. It was created by 'livebench' and last updated in April 2025.
SimulaMet-HOST created the Kvasir-VQA dataset by augmenting the HyperKvasir and Kvasir-Instrument datasets with question-and-answer annotations. This multimodal dataset is designed for advanced machine learning tasks in gastrointestinal diagnostics, including image captioning and Visual Question Answering. The dataset was last updated on the Hugging Face platform in August 2025.