Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
HADES-VLM-Data is a dataset for training vision-language models, published on Kaggle. The dataset's specific content, size, and creation details are not described in the available metadata. Its intended use likely involves aligning visual and textual information for AI model development.
A multimodal dataset focused on student engagement, published on Kaggle. The dataset likely contains multiple data types such as video, audio, or sensor readings to capture behavioral and interaction patterns. Specific details on volume, collection method, and authorship are not provided in the available metadata.
Patch embeddings for the CAMELYON16 dataset generated using the UNI foundation model. The embeddings are derived from 128x128 micrometer tissue patches, with segmentation and patching performed using a modified version of the CLAM toolkit. The dataset was authored by kaczmarj and last updated on December 10, 2025.
A collection of question-answer pairs in the Myanmar language designed for instruction tuning of Large Language Models. The dataset aggregates content from multiple sources covering domains like agriculture, health, microbiology, general knowledge, and Buddhism. It was created by chuuhtetnaing and last updated on Hugging Face in December 2025.
JMMMU-Pro is an image-based Japanese multi-discipline multimodal understanding benchmark. It extends the JMMMU benchmark by composing question images and text into a single image, requiring integrated visual-textual understanding. The dataset was created by JMMMU and last updated on Hugging Face in December 2025.
SpecVQA is a benchmark dataset for evaluating Multimodal Large Language Models on spectral understanding and visual question answering tasks using scientific images. The dataset is authored by UniParser and was last updated in December 2025. It contains images and text data, with specific row and column counts unknown.
The AQI Multimodal Dataset is a collection of data related to air quality, likely containing measurements from various sources. The dataset is hosted on Kaggle, but specific details about its size, origin, and creation date are not provided in the available metadata. Further verification is required to confirm the exact contents, scale, and authorship.
598,000 high-quality samples for training and evaluating multimodal code generation models. The dataset covers HTML generation, chart-to-code, image-augmented QA, and algorithmic problems, supporting research in unifying vision-language understanding with code generation. It was created by author 'lingjie23' and last updated on December 24, 2025.
Surveillance video data supports anomaly and crime detection tasks. The dataset is tagged for applications in security monitoring and video analysis. Specific details on volume, features, and origin are unavailable.
vigor_annotations_llava is a dataset hosted on Kaggle. The title suggests it contains annotations likely intended for training or evaluating vision-language models, such as those based on the LLaVA architecture. Specific details regarding the data volume, creation method, and update history are not provided in the available metadata.
VIGOR_annotations_llava is a dataset published on Kaggle. Its title suggests it contains annotations for the LLaVA (Large Language and Vision Assistant) model framework, likely linking images with descriptive text. The specific content, scale, and origin require verification after download.
A validation dataset for the Visual Question Answering (VQA) task, published on Kaggle. The dataset likely contains image-question-answer pairs designed to test models' ability to answer questions about visual content. The specific number of samples, data source, and creation date are not provided in the available metadata.
Multimodal data likely collected for assessing cognitive load, a psychological state related to mental effort. The dataset is published on Kaggle, but its specific size, collection method, and authorship are unknown. Its content and structure require verification after download.
A dataset from Kaggle focused on rumors circulating on the social media platform Twitter. The dataset likely contains multimodal content, such as text and associated images or videos, for analysis. Metadata is minimal; actual content, size, and collection details require verification after download.
Nemotron-Instruction-Following-Chat-v1 is designed to strengthen model capabilities in open-ended chat, precise instruction following, and structured output generation. It combines refreshed multi-turn chat data with synthetic dialogues generated by frontier models like GPT-OSS-120B and Qwen3-235B variants. The dataset was created by NVIDIA and last updated on December 15, 2025.
A research dataset for self-supervised learning on multimodal time-series data. It is designed for contrastive and adversarial augmentation techniques. The dataset's origin, size, and specific temporal coverage are not detailed in the provided metadata.
Kaggle hosts a dataset for multimodal radiomic-genomic fusion via graph-augmented deep learning for early prediction. The dataset's specific content, size, and origin are unknown. It is categorized as research data.
A language model dataset titled 'colsmolvlm-instruct-500m-base', published on Kaggle. The title suggests it is likely related to instruction tuning for a 500 million parameter language model. The dataset's specific content, size, and authorship are not detailed in the provided metadata.
Vision and audio data streams are categorized into structural crack classes for concrete integrity assessment. Synchronized multimodal inputs pair visual surface evidence with acoustic signatures to facilitate structural health monitoring research.
Multimodal video and audio recordings categorized into single-label violence classes. This dataset provides synchronized visual and auditory data streams to support the development of automated violence detection models.