Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
A dataset titled 'bmps_vlm' published on Kaggle. The title suggests it is likely related to vision-language models, a subfield of multimodal AI. No further metadata is available to confirm its specific contents, size, or origin.
EngVQA-GRPO is a dataset hosted on Kaggle. The title suggests it likely contains English-language visual question answering data. The dataset's specific content, size, and origin are unknown from the provided metadata.
gvlma is an R package for the global validation of linear model assumptions, authored by Edsel A. Peña and Elizabeth H. Slate. The dataset likely contains statistical test results and diagnostic metrics for assessing model fit. It is sourced from the paperswithcode platform, which aggregates resources for the computer science and mathematics communities.
236 real classroom sessions provide data across 5 modalities and 25 features. The dataset is designed for human-AI collaborative analysis of teaching quality. Its origin and creation date are unknown.
BEAT2 Aligned Multimodal Features is a dataset hosted on Kaggle. The title suggests it contains features extracted from multiple data modalities that have been aligned. The dataset's specific content, size, and provenance are unknown.
IMDB multimodal likely contains data from the Internet Movie Database, combining multiple types of media. The dataset is published on Kaggle, but its specific content, size, and creation details are unknown. Its last update date and authorship are not provided.
ROCO Multimodal 4 Clusters Dataset is a dataset hosted on Kaggle. The title suggests it contains multimodal data organized into four clusters. The dataset likely contains data from multiple modalities, such as images and text, intended for clustering tasks.
A multimodal dataset from Kaggle, likely containing data organized into four clusters. The dataset's title suggests it may combine different data types such as images and text. Specific details regarding its size, creation date, and authorship are not provided in the available metadata.
A script for streaming large language model training, authored by uv-scripts and last updated in January 2026. It demonstrates training a Qwen model on Latin using 1.47 million texts streamed directly from the FineWeb-2 dataset on Hugging Face Hub. The associated blog post details the method for training on massive datasets without local downloads.
Deeptumorvqa Image is a dataset hosted on HuggingFace by author ZiyueWang. The title suggests it contains medical images, likely related to tumor analysis, combined with a visual question answering task. The dataset was last updated on March 5, 2026.
A Chinese-language dataset of 595,000 items, created by machine-translating the LLaVA-CC3M-Pretrain-595K dataset. The data was uploaded by author 'cyberlangke' to Hugging Face and last updated on 2026-02-25. The description notes the translations are unverified and may contain errors.
Multimodal images for AI-based fall recognition, published on Kaggle. The dataset likely contains visual data intended for training models to detect falls in elderly individuals. Specific details on volume, collection method, and authorship are not provided in the available metadata.
ViTextVQA contains over 16,000 images and 50,000 question-answer pairs focused on Vietnamese text comprehension within visual contexts. Developed by researcher minhquan6203 and documented in arXiv paper 2404.10652, it serves as a benchmark for text-based visual question answering in Vietnamese.
EyeVLM Dataset is a collection of data likely designed for training or evaluating vision-language models. The dataset is hosted on Kaggle, but its specific contents, size, and creation details are not provided in the available metadata. Further verification is required to confirm the exact nature and scope of the included data.
Kaggle hosts the SmolVLM_vigor_annotations dataset. The title suggests it contains annotations for evaluating vision-language models, likely on tasks like visual grounding or reasoning. The dataset's specific content, size, and origin require verification after download.
A multimodal dataset of ecological environment interaction data. The author, organization, and data volume are not specified, and the last update date is unknown.
A multimodal dataset containing paired RGB and thermal image traces for reconstructing events in dynamic environments. The dataset is designed for research in multimodal sensor fusion and computer vision. Information on the creator, size, and specific temporal coverage is not provided in the available metadata.
OLIMP is a heterogeneous multimodal dataset designed for advanced environment perception tasks. The dataset likely contains multiple data types, such as images, video, or sensor readings, integrated for perception modeling. Its author, organization, and specific size are not provided in the metadata.
RSCoVLM is a dataset for co-training vision-language models on remote sensing imagery for multi-task learning. The dataset is associated with a published academic paper and was created by a team of researchers including Qingyun Li, Shuran Ma, and Junwei Luo. It was last updated in January 2026.
AgentNet contains 22.6K human-annotated computer-use trajectories across Windows, macOS, and Ubuntu operating systems. Developed by xlangai and released in early 2026, it serves as a foundation for training vision-language-action (VLA) models for desktop automation.