Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
Cambrian-S-3M is a collection of approximately 3 million video instruction tuning records developed by nyu-visionx for the third training stage of the Cambrian-S multimodal model. Released in early 2026, the dataset aggregates video-text pairs from Cambrian-S-3M, LLaVA-Video-178K, and LLaVA-Hound (ShareGPTVideo).
MoreVQA-Output is a dataset published on Kaggle. The title suggests it contains outputs or results from a Visual Question Answering (VQA) task, likely involving images and text. Specific details on size, columns, and creation are unavailable from the provided metadata.
A multimodal dataset likely containing images and text questions related to Vietnamese culture. The dataset is hosted on Kaggle, but its specific size, creation details, and update history are not provided in the available metadata. Its content and structure must be verified after download.
Train, validation, and test splits for a Tedim Zolai LLM, provided together with a training script. The dataset is hosted on Kaggle. Its author, organization, and specific size are unknown.
Kaggle hosts a dataset titled 'final-vqa-done'. The dataset's content likely relates to visual question answering, a multimodal AI task. Specific details such as the number of samples, collection method, and creator are not provided in the available metadata.
final-vqa-rank16 is a dataset hosted on Kaggle. The title suggests it is likely related to Visual Question Answering (VQA), a multimodal AI task. Its specific content, scale, and origin are not detailed in the provided metadata.
MMTA-v1.0 is a benchmark dataset for multimodal time-series analysis, published on Kaggle. The dataset likely contains aligned data from multiple modalities, such as sensor readings, images, or text, over time. Specific details on volume, authorship, and update recency are unavailable from the provided metadata.
HER-Dataset is a high-quality role-playing dataset featuring reasoning-augmented dialogues extracted from literary works. It introduces dual-layer thinking for cognitive-level persona simulation. The dataset was authored by ChengyuDu0123 and last updated on February 4, 2026.
Multimodal data on college student stress, emotion, and anxiety, likely collected for intervention analytics. The dataset's author, organization, specific size, and last update date are unknown.
VisGym consists of 17 diverse, long-horizon environments for evaluating Vision-Language Models on interactive tasks. The dataset contains agent trajectories in which each action is conditioned on past actions and the observation history, which tests a model's handling of long multimodal sequences.
A dataset likely for Visual Question Answering (VQA) tasks, as suggested by the title abbreviation. It was published on the Kaggle platform. The specific content, size, and creation details are not provided in the available metadata.
final-vqa-rank32 is a dataset for Visual Question Answering (VQA) tasks, likely containing image-question pairs with multiple ranked answer candidates. The dataset is hosted on Kaggle, but its origin, size, and creation details are not provided in the available metadata; actual content requires verification after download.
Experimental data collected in real time on a 1-inch-bore gas-liquid two-phase CO2 flow rig. The dataset includes time-stamped mass flowrates, temperatures, densities, tube frequencies, and differential pressure readings from Coriolis flowmeters installed on multiple test sections.
2,615 ancient Scandinavian runic inscriptions paired with photographs of the runestones. The dataset, created by birgermoell, provides scholarly transliterations, Old Norse normalizations, and English translations for each entry. It was last updated on Hugging Face in February 2026.
A benchmark dataset bridging language and visual heritage through Arabic calligraphy, developed by researchers from Mohamed bin Zayed University of AI, NUCES, NUST, and Australian National University. It was last updated on January 28, 2026.
STRIDE-QA is a large-scale visual question answering dataset for physically grounded spatiotemporal reasoning in autonomous driving. Constructed from 100 hours of multi-sensor driving data recorded in Tokyo, it provides 16 million question-answer pairs over 270,000 frames and includes dense annotations such as 3D bounding boxes, segmentation masks, and multi-object tracks.
A collection of multi-resolution satellite images from both public and commercial satellites. The dataset is specifically curated for training geospatial foundation models. It is hosted on AWS Open Data and was contributed by the organization Coastal Carbon.
A Visual Question Answering dataset focused on Vietnamese food. The dataset likely contains images of Vietnamese dishes paired with questions and answers in text format. It is published on Kaggle, but details on size, creation date, and authorship are currently unknown.
Multimodal competition data published on Kaggle. The dataset likely contains multiple data types such as images, text, or audio, structured for a competitive machine learning task. Metadata is minimal; actual content and scale require verification after download.
Multimodal physiological data collected during cycling activity. The dataset is hosted on Kaggle, but the author, collection method, and specific time range are not provided in the available metadata. The title suggests it likely contains synchronized sensor readings from multiple modalities recorded during physical exertion.