Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
LLaVA-2 is a dataset hosted on Kaggle, likely related to vision-language tasks and multimodal AI. Its specific content, scale, and creation details are not provided in the available metadata. The dataset appears to be intended for training or benchmarking large language models with visual capabilities.
Kaggle hosts the LLaVA-3 dataset, a resource for multimodal AI development. The dataset likely contains paired image and text data for training vision-language models. Its specific size, creator, and update history are not detailed in the provided metadata.
Spa3R Vlm is a dataset for vision-language model tasks, hosted on Hugging Face by the author hustvl. The dataset was last updated on March 6, 2026.
SPRITE is a spatial reasoning dataset for Vision-Language Models (VLMs) developed by zhihelu and released in early 2026. It provides image-text pairs designed to improve embodied intelligence by balancing linguistic diversity with computational precision, as detailed in arXiv paper 2512.16237.
Four distinct subsets, including MSCOCO and VisualNews, provide multimodal queries and documents for cross-modal retrieval evaluation. The dataset uses queries.jsonl files to benchmark performance on text-only, image-only, and combined image-text search tasks.
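As a rough illustration of how a queries.jsonl file of this kind could be read and split into the three query types, here is a minimal Python sketch. The field names (`qid`, `text`, `image_path`) and the sample records are hypothetical and are not taken from the dataset; the actual schema should be checked after download.

```python
import json
from collections import defaultdict


def query_modality(record):
    """Classify a query by which (hypothetical) fields are populated."""
    has_text = bool(record.get("text"))
    has_image = bool(record.get("image_path"))
    if has_text and has_image:
        return "image-text"
    if has_image:
        return "image-only"
    if has_text:
        return "text-only"
    return "empty"


def group_queries(lines):
    """Parse JSONL lines (one JSON object per line) and group them by modality."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        groups[query_modality(record)].append(record)
    return dict(groups)


# Made-up sample records standing in for the contents of a queries.jsonl file.
sample = [
    '{"qid": "q1", "text": "a dog on a beach"}',
    '{"qid": "q2", "image_path": "images/0001.jpg"}',
    '{"qid": "q3", "text": "same scene at night", "image_path": "images/0002.jpg"}',
]
groups = group_queries(sample)
print({name: len(records) for name, records in groups.items()})
```

For a real file, an open file handle can be passed in place of the sample list, for example `with open("queries.jsonl", encoding="utf-8") as f: groups = group_queries(f)`.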
BLIP2-OPT-27B is a Kaggle-hosted entry for a large-scale vision-language model, likely designed for tasks such as image captioning and visual question answering. Its specific contents, such as training data or model weights, are not detailed in the provided metadata. Further inspection is required to confirm the exact data format and scope.
WavLM-Large is a model for speech representation learning, published on Kaggle. The dataset's specific content, size, and origin require verification after download.
LIMO_VQA is a dataset for Visual Question Answering (VQA) tasks, likely containing pairs of images and associated questions. The dataset is hosted on Kaggle. Specific details on its size, creation date, and authors are not provided in the available metadata.
Audio embeddings generated by the WavLM-Large model, a transformer-based architecture for audio representation learning. The dataset likely contains precomputed feature vectors for audio samples, facilitating downstream machine learning tasks. It is hosted on Kaggle, a platform for data science competitions and datasets.
A benchmark dataset for evaluating vision-language models, likely focusing on synthetic-to-real transfer. The dataset is hosted on Kaggle and is tagged as a benchmark. Specific details regarding size, columns, and creation date are unknown.
Chest X-ray images and their associated radiology reports are provided for training vision-language models. The dataset is a processed subset, indicating curation for machine learning tasks. The creator and specific volume of data are not specified.
A multimodal dataset containing chest X-ray (CXR) images and their corresponding textual radiology reports, intended for training vision-language models. The specific volume of image-report pairs, creation date, and original author are not specified in the provided metadata. It is identified as a processed subset, version 2, sourced from Kaggle.
1,000 curated multimodal medical cases featuring paired medical images and structured JSON annotations. The data is formatted to support vision-language understanding and medical question-answering tasks through its integrated image-text architecture.
WangVQA is a dataset for visual question answering tasks, likely containing paired images and textual questions with answers. The dataset's creator and specific size are not documented in the provided metadata. Its release date and update frequency are also unknown.
OctoBench is an instruction-following benchmark for coding agents, created by MiniMaxAI and released in January 2026. It is an extended version of OctoCodingBench, expanded from 72 manually annotated instances to 217 instances using AI-assisted augmentation. The dataset is hosted on Hugging Face and is intended for evaluating agentic coding performance.
WavLM-Base is a pre-trained model for speech representation learning. It was published on the Kaggle platform, but detailed information about its training data, architecture specifics, and performance benchmarks is not provided in the available metadata. The dataset likely contains the model weights and configuration files necessary for inference or fine-tuning.
MF-RSVLM is a remote sensing vision-language model (VLM) combining a CLIP vision encoder and a Vicuna-7B language model. The model was trained in two stages for modality alignment and instruction following. The dataset is associated with the FUSE-RSVLM project and was uploaded by RL-MIND.
RSVLM-SFT is a remote sensing instruction-tuning dataset released by FelixKAI in 2026 for training the MF-RSVLM vision-language model. It contains image-text pairs for modality alignment and instruction following, although the specific record count is not disclosed in the metadata.
A Data Management and Sharing Plan that outlines the strategy for handling scientific data generated for a research project on ethical, multimodal AI in health. The plan describes the types of data to be used and the framework for their management and sharing. Specific details on data volume, structure, and features are not provided.
A Data Management and Sharing Plan (DMS Plan) authored by Tianlong Chen that outlines the strategy for managing and sharing scientific data generated for research on trustworthy, domain-informed scientific foundation models. The plan describes the scientific data to be used and generated but does not contain the actual dataset. Specific details on data volume, structure, and features are not provided.