Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
NEXUS is a multi-modal, hierarchical, temporal representation derived from the HuggingFaceFV/finevideo dataset. The primary unit is a 10-millisecond 'slice' that aggregates into moments (100 ms), seconds (1 s), experiences (10 s), and minutes (60 s). It was created by Ardea and last updated on 2025-12-29.
TimeLens-100K is a large-scale training dataset for video temporal grounding, created by TencentARC. The dataset was proposed in the paper 'TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs' and annotated using an automated pipeline powered by Gemini-2.5-Pro. It was last updated on December 19, 2025.
A multimodal collection of images and 3D representations of cultural relics, each paired with a textual description. The dataset's creator, size, and update date are not specified. It integrates visual and text data for analysis.
A dataset likely containing images paired with descriptive text captions, sourced from Kaggle. The dataset's title suggests it is related to the BLIP (Bootstrapping Language-Image Pre-training) model, a vision-language framework. Specific details on volume, creation date, and authorship are unavailable from the provided metadata.
A Multimodal Approach is a dataset hosted on Kaggle. Its specific content, size, and origin are not detailed in the provided metadata. The dataset likely contains multiple data types, such as text, images, or audio, aligned for multimodal machine learning tasks.
Kaggle hosts a dataset titled 'A Multimodal Approach'. The dataset's specific content, size, and creator are not detailed in the provided metadata. Its title suggests it likely contains data from multiple modalities, such as text, images, or audio, integrated for analysis.
Forensic-RS-VQA is a dataset published on Kaggle for forensic analysis using visual question answering. The dataset likely contains multimodal data, such as images paired with textual questions and answers, for reasoning tasks. Specific details on volume, authorship, and update history are not provided in the available metadata.
Qwen3.5-9B-VLM-Q4_K_M GGUF Model is a quantized version of a large language model with vision capabilities, published on Kaggle. The dataset likely contains the model weights and architecture files for deployment. Specific details on the model's training data, original authors, and last update date are not provided in the metadata.
This dataset provides customer insights for hotel competitiveness. The data appears to focus on tourism preferences and hotel selection. The author, organization, and data volume are unknown.
A GGUF format model file for the Qwen3.5-9B-VLM, a 9-billion parameter multimodal large language model. The dataset includes the quantized model and a projection file (mmproj), likely enabling vision-language tasks. It was published on Kaggle, but the author, organization, and last update date are unknown.
A dataset for vision-language model tasks, published on Kaggle. The dataset's specific content, size, and creation details are not provided in the metadata. Further details require verification after download.
FSVQA_Training is a dataset for visual question answering tasks, likely containing paired images and textual questions. It is hosted on Kaggle, a platform for open data and machine learning competitions. The dataset's specific content, size, and origin are not detailed in the available metadata.
sEMG+pFMG multimodal gesture data likely contains signals from surface electromyography (sEMG) and pressure-based force myography (pFMG) sensors. The dataset is hosted on Kaggle, but specific details about its size, collection method, and origin are unknown. Users should verify the actual content and structure after download.
Pathology foundation model features likely extracted from cervical and ovarian tissue images. The dataset is hosted on Kaggle, but its specific scale, creation details, and update history are not provided in the metadata. Columns and sample data are unknown, requiring download for full content verification.
A dataset describing a monitoring system for acute pilot fatigue, with a focus on low-overhead, real-time sensor fusion. It is hosted on Kaggle and categorized for research purposes. Specific details on data volume, collection period, and authorship are not provided in the available metadata.
NVIDIA's Nemotron-Cascade-RL-IF-RL dataset contains 108,938 samples designed for Instruction-Following Reinforcement Learning (IF-RL). The dataset includes prompts and associated metadata to improve language models' instruction-following capability and is ready for commercial use with attribution. It was last updated on December 16.
LLaVA_dataset is a dataset hosted on Kaggle. The dataset's title suggests it is related to the LLaVA (Large Language-and-Vision Assistant) project, which typically involves multimodal data for training vision-language models. The dataset likely contains image-text pairs or instruction-following examples, but its specific content, size, and origin require verification after download.
RadImageNet-VQA contains 750,000 CT and MRI images paired with 7.5 million generated visual question answering samples and 750,000 medical captions. Developed by Raidium and updated in late 2025, the dataset is built upon expert-curated anatomical and pathological annotations from the RadImageNet corpus.
Multimodal data for document understanding tasks, sourced from the UCI Machine Learning Repository. The dataset combines visual and textual information for analysis. Specific details on volume, creation date, and authors are not provided in the available metadata.
YouTube Comedy Slam Preference Data contains human judgments on comedy content from the YouTube platform. The dataset is hosted by the UCI Machine Learning Repository and is tagged for multimodal and LLM applications. Specific details on volume, creators, and recency are not provided.