Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
137,000 images containing Vietnamese text paired with 822,679 synthetic visual question-answering pairs generated by Gemini 1.5 Flash. Created by 5CD-AI and updated in February 2026, this collection focuses on Vietnamese OCR and scene understanding.
A fine-tuned version of the BLIP vision-language model, hosted on Kaggle. Its specific content and scale are not detailed in the provided metadata. The entry is associated with Flickr8K, a standard benchmark for image captioning, which suggests the resource may contain model weights or related training data.
Boss Alignment Dataset is a collection for calibrating expectations of AI capabilities, likely containing examples or feedback. Authored by ChenZiHong-Gavin, it was last updated on GitHub on 2026-04-19. The specific content, scale, and structure require verification after download.
500 TEST VQA is a dataset for evaluating visual question answering models. It was published on Kaggle, but its author, organization, and creation date are unknown. The dataset's exact size, format, and annotation details require verification after download.
Multimodal Smart Grid Condition Records likely capture sensor data related to electrical relay performance. The dataset is hosted on Kaggle, but its specific size, origin, and update history are unspecified. Columns and sample data are unknown, requiring verification after download.
EgoBench is a multimodal interactive benchmark designed for evaluating tool-using agents. The benchmark likely contains tasks requiring agents to process and interact with multiple data modalities. Its specific size, format, and creation details are unknown.
VLMNS6 is a dataset published on Kaggle, a platform for data science competitions and open data. Its title suggests a focus on vision-language models, which combine computer vision and natural language processing. The dataset's specific content, scale, and origin are not detailed in the available metadata.
Sequential Movie Preference Dataset is a collection of user behavior data for personalized movie insights, published on Kaggle. The dataset likely contains sequences of user interactions or preferences related to movies. Its specific size, origin, and update history are not detailed in the provided metadata.
Created by Willy08 on February 21, 2026, this dataset contains 11 carefully selected examples of blind spots discovered while experimenting with the Nanbeige/Nanbeige4-3B-Base model. The examples are deliberately diverse and target real weaknesses that even frontier models showed in 2026.
MMSI-Video-Bench is a holistic benchmark for evaluating spatial intelligence in video-based multimodal models. The dataset, created by author 'rbler', includes video clips and was last updated on February 10, 2026. It is hosted on Hugging Face and has been integrated into the VLMEvalKit framework.
SJTU-ViSYS developed M2DGR, a multi-modal and multi-scenario dataset for ground robot navigation, published in RA-L 2021 and ICRA 2022. It provides synchronized sensor data across diverse environments to support Simultaneous Localization and Mapping (SLAM) research.
A collection of model checkpoints for a vision-language model, published on Kaggle. The specific architecture, training data, and performance metrics are not detailed in the available metadata. The author, organization, and last update date are unknown.
Pre-rendered 3D multi-room environments support the Theory of Space benchmark for evaluating spatial reasoning in Vision Language Models. The dataset is designed to test whether foundation models can construct spatial beliefs through active exploration. It was created by MLL-Lab and last updated on February 11, 2026.
A dataset titled 'nexus-hh-rlhf-enriched' published on Kaggle. The title suggests it contains data enriched for Reinforcement Learning from Human Feedback (RLHF), likely involving human preferences for language model outputs. Specific details on size, origin, and creation date are unavailable from the provided metadata.
A dataset on power-grid worker safety behavior, hosted on Kaggle. Its description indicates that it contains multimodal data related to risk and standard operating procedures (SOPs). The dataset's author, organization, and specific scale are unknown.
RadImgNet-VQA is a dataset hosted on Kaggle, likely designed for visual question answering tasks in the medical domain. The title suggests it contains pairs of radiology images and associated questions, potentially for training AI models to interpret medical scans. Its specific size, source, and creation date are not provided in the available metadata.
A dataset likely containing multiple data types related to phishing attacks. The dataset is published on Kaggle, but its specific contents, size, and creation details are not described. Further verification after download is required to confirm its scope and utility.
NVIDIA's PhysicalAI dataset provides pre-processed 3D assets for predicting volumetric mechanical properties. The dataset combines four individual 3D asset collections, processed to include multi-view renders, voxelized representations, and LLM-annotated material descriptions. It was last updated on February 5, 2026.
A large-scale visual question answering dataset for physically grounded spatiotemporal reasoning, built from Tokyo driving data. It contains 16 million question-answer pairs over 270,000 frames, constructed from 100 hours of multi-sensor driving data. The dataset was created by turing-motors and last updated on the platform in January 2026.
Multimodal_Diet_Dataset is a dataset hosted on Kaggle. Its title suggests it contains data related to diet and nutrition, potentially combining multiple data types. Further details regarding its size, origin, and specific contents are unavailable from the provided metadata.