Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,541 datasets
A 500-example subset of structured vehicle diagnostic logs was created by CJJones and last updated in March 2026. It contains logs for vehicle types and subsystems like transmissions, battery systems, brakes, and engines. Each entry includes parameters such as fault codes, performance metrics, measurements, temporal trends, and maintenance recommendations.
53,202 instruction-tuning examples covering over 200 specialized cybersecurity domains, built by the Trendyol Security Team. The dataset is designed for training defensive security AI assistants and includes modern challenges like cloud-native threats and AI/ML security. It was last updated on March 8, 2026.
Onemillion Bench is a bilingual (English/Chinese) expert-level benchmark containing 400 entries across five professional domains, released by humanlaya-data-lab in March 2026. It uses weighted, rubric-based grading criteria to evaluate language agents on analytical reasoning and instruction following within specialized fields.
RVMS-Bench is a benchmark for real-world video search and moment localization developed by Tencent in 2026. It contains between 1,000 and 10,000 text-based query annotations and metadata designed for agent-based retrieval frameworks. This specific repository provides the search paradigm metadata but excludes raw video assets and ground-truth keyframes.
A collection of random test images for evaluating vision-language models in diverse, unconstrained scenarios. The dataset was created by author 'merve' and was last updated in April 2026.
TripleSumm-Mr.HiSum reconstructs the original MR.HiSum dataset by crawling source videos to provide aligned visual, audio, and text features. The dataset supports multimodal research for video highlight detection and summarization. It was created by hminjeong and updated in March 2026.
Common-O contains between 10,000 and 100,000 image-text pairs designed by Meta researchers in 2026 to evaluate multimodal LLM reasoning. The data is organized into two subsets featuring household objects to test the ability of models to identify common elements across 3 to 16 different scenes.
DeepGen 1.0 contains fewer than 1,000 image-text pairs for multimodal generation and editing, released by deepgenteam in March 2026. The data supports five core tasks including reasoning-based generation and text rendering for a 5B parameter model. It is formatted as an imagefolder and licensed under Apache 2.0.
MCIF is a human-annotated benchmark for evaluating instruction-following across speech, vision, and text modalities in four languages. The dataset was created by FBK-MT and was last updated in February 2026.
HardNegativeDiverseVQA is a dataset published on Kaggle. Its title suggests it contains hard negative examples for Visual Question Answering (VQA) tasks. The dataset's specific size, author, and update date are unknown.
The Multi-Level Existence Benchmark (MLE-Bench) is a dataset designed for fine-grained evaluation of multimodal models' perceptual abilities. It assesses 'pure' perception using 4-choice questions about object or scene existence within images. The dataset was created by JunlinHan and was last updated on March 8, 2026.
Sample test images likely associated with a multimodal visual question answering method for pathology. The dataset is hosted on Kaggle, but its scale, creator, and update history are unspecified. Columns and detailed metadata are unknown.
Multimodal-PathVQA-Method-01-outputs is a dataset hosted on Kaggle. The title suggests it contains outputs from a method applied to a pathology visual question answering (VQA) task, likely involving images and text. The dataset's specific content, scale, and origin are unspecified.
60 web applications from the Vibe Coding Showcase were evaluated in a pairwise human preference study. The dataset contains 1,770 pairwise comparisons, with 30 human votes collected for each pair to judge visual design based on screenshots. It was created by datapointai and last updated in March 2026.
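The 1,770 figure matches the number of unordered pairs that can be formed from 60 applications. A quick check, assuming each pair is compared exactly once (the description does not state how pairs were formed, and the variable names here are illustrative):

```python
from math import comb

apps = 60            # applications in the study (from the description)
votes_per_pair = 30  # human votes collected per pair (from the description)

# Unordered pairs, assuming every application is compared with every other once.
pairs = comb(apps, 2)

# Total individual votes implied by the two figures above.
total_votes = pairs * votes_per_pair

print(pairs, total_votes)
```

Under that assumption the dataset would hold 53,100 individual votes in total.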
Motion capture data recording full-body and finger movements with 10 IMU sensors and a Phi9 Glove. The dataset was created by phi-9 and last updated in March 2026. It is released for non-commercial research under a CC-BY-NC-4.0 license.
OSWorld File Cache provides reliable access to evaluation files for the OSWorld project. The repository, created by xlangai, hosts files previously stored on Google Drive to support scalable, real computer environment testing. It was last updated in February 2026.
A multimodal HSI-LiDAR dataset capturing a combined Italian rural and urban scene. The data is annotated for 6 land cover classes, supporting classification tasks. The dataset's author, organization, and collection details are unspecified.
OpenDataArena published ODA-Fin-RL-12K in March 2026, providing 12,187 hard-but-verifiable samples for reinforcement learning in the financial domain. The dataset focuses on complex reasoning tasks with concise answers optimized for automated reward modeling and distillation.
Path VQA Turkish Final is a dataset hosted on Kaggle. The title suggests it contains visual question answering data in Turkish, likely pairing images with questions and answers. The dataset's scale, origin, and update history are unspecified.
Synthetic conversation examples generated by a Java-based Arduino project suggestion system. The dataset, created by Cameron Jones, contains structured multi-turn dialogues in which a user interacts with a bot. It has no affiliation with the Arduino brand.