Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
Released by mvp-lab in 2025, this 85-million-record multimodal collection supports the mid-training phase of the LLaVA-OneVision-1.5 framework. It aggregates image-text data from eight major sources, including ImageNet-21k, LAIONCN, and SA-1B, to facilitate democratized multimodal model training.
4,992 social media posts from the RedNote platform categorized into 613 advertisement and 4,379 non-advertisement samples. The dataset includes 26,324 associated images distributed across training, validation, and test splits for covert marketing detection.
Primus-Seed is a cybersecurity text dataset compiled from reputable sources including MITRE, Wikipedia, and cybersecurity company websites, as well as manually collected Cyber Threat Intelligence (CTI). It was created by Trend Micro's AI Lab and was last updated on the Hugging Face platform in August 2025. The dataset includes at least 2,946 samples from cybersecurity blogs and news, comprising over 9.7 million tokens.
26,260 science questions paired with 6,206 images sourced from CK-12 Foundation's open educational resources. The dataset includes both text-only and diagram-based visual reasoning questions for middle school science. It was uploaded by 'notefill' to HuggingFace and last updated on 2025-11-21.
Critic-10K provides approximately 10,000 image triplets designed to train models to rectify inconsistencies in AI-generated visual content. Created by ziheng1234 and associated with the 2025 research paper 'The Consistency Critic', the data uses VLM-based selection to pair reference images with degraded and target versions.
PLM-Video-Human is a collection of human-annotated resources for training Vision Language Models, focused on detailed video understanding. The dataset, created by Facebook, was last updated on May 21, 2025. Training tasks include fine-grained open-ended question answering, region-based video captioning, dense captioning, and temporal localization.
SCI-CQA is a multimodal benchmark dataset for evaluating chart understanding, inspired by human exams. It contains 5,629 curated objective and open-ended questions paired with 2,894 chart images from scientific literature. The dataset was created by lyndons1 and last updated on April 28, 2025.
PLM-VideoBench is a collection of human-annotated resources for evaluating Vision Language Models, focused on detailed video understanding. The dataset includes evaluation data for tasks like FGQA, which probes fine-grained activity understanding through multiple-choice questions. It was authored by Facebook and last updated on May 21, 2025.
A large-scale multimodal benchmark for intelligent traffic surveillance created by LifeIsSoSolong and last updated on 2025-10-25. It contains 170,400 images paired with approximately 5 million instruction-following visual question answering samples. The dataset covers diverse traffic scenes including congestion, spills, unusual weather, construction, fireworks, smoke, and accidents.
29,980 synthetically generated examples designed to enhance a model's ability to follow instructions precisely and satisfy user constraints. The dataset was curated by the Allen Institute for AI and uses a persona-based methodology to generate diverse instructions, with constraints borrowed from the IFEval taxonomy. It was last updated on November 21, 2024.
A manually annotated test set of 500 question-answer pairs based on document images. The data originates from the UCSF Industry Documents Library and was curated for benchmarking by subsampling the original DocVQA test set. The dataset was last updated on June 20, 2025.
ViRL39K contains 38,870 verifiable question-answer pairs designed for Vision-Language Reinforcement Learning training, released by TIGER-Lab in April 2025. It aggregates and refines data from seven specialized sources, including Llava-OneVision, MM-Math, and DeepScaleR, through a process of cleaning, reformatting, and verification.
REFED is an affective brain-computer interface dataset integrating multimodal brain signals and real-time dynamic emotion annotation. The dataset was created by REFED2025 and last updated on the platform in November 2025. It synchronizes EEG and fNIRS signals to study the neural mechanisms of emotional dynamic evolution.
Facebook introduces AdvancedIF, a benchmark featuring over 1,600 prompts designed to assess large language models. The dataset includes expert-curated rubrics to evaluate proficiency in complex instruction following, multi-turn interactions, and system prompt steerability. It was last updated on November 26, 2025.
1,639 English-language web screenshots from over 100 websites are paired with natural-language instructions and pixel-level click targets. The dataset provides a high-quality benchmark for evaluating multimodal navigation models, created by Hcompany and released in June 2025.
53,202 instruction-tuning examples for defensive cybersecurity, covering over 200 specialized domains including cloud-native threats and AI/ML security. Developed by the Trendyol Security Team and updated in July 2025, this dataset provides system, user, and assistant triplets for model training.
Human preference data for text-to-video generation, collected via the Rapidata API in approximately 12 hours. It is used to benchmark five AI models: Sora, Hunyuan, Pika 2.0, Runway ML Alpha, and Luma Ray 2. Row and column counts are unknown.
Multimodal recordings of candidate interview responses categorized by personality traits and professional performance metrics. This dataset facilitates research in affective computing and automated soft-skill evaluation within human resources contexts by providing synchronized behavioral data.
Anime Art Multicaptions V5.0 is a collection of over 6 million captions for anime and game artworks. The dataset features original character names and captions generated by top-tier vision-language models like Claude, GPT, and Gemini, covering a wide thematic range. About 20% of the most complex multi-character cases have been rechecked and corrected in a second pass.
Text2CAD is a dataset for generating sequential computer-aided design (CAD) operations from text prompts. The dataset was created by Mohammad Sadil Khan, Sankalp Sinha, Talha Uddin Sheikh, Didier Stricker, Sk Aziz Ali, and Muhammad Zeshan Afzal. The dataset page was last updated on June 11, 2025.