Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
1,000 historical recipes prepared for Vision-Language Model training. The dataset includes JSON metadata, suggesting structured information about the recipes. It is hosted on Kaggle, but the original source and collection methodology are not detailed in the provided metadata.
City of Austin data details the development of a new pedestrian and bicycle bridge over Lady Bird Lake near Longhorn Dam. The dataset is tagged for urban planning, geospatial analysis, and multimodal infrastructure within Austin, United States. It was last updated in March 2026.
3MDAD is a multimodal, multiview, and multispectral dataset focused on driver actions and distraction. It contains video and image data from multiple camera perspectives and spectral bands for analyzing driver behavior. The dataset was created for research in automotive safety and computer vision.
Urban Friction Atlas is a multimodal dataset designed for place suitability prediction tasks. The dataset integrates multiple data types, as indicated by its platform tags, to model urban environments. The author, organization, and specific temporal coverage are not provided.
A subset of the BLIP3o-Pretrain-Long-Caption and BLIP3o-Pretrain-Short-Caption datasets translated into Turkish. The dataset is intended for training or fine-tuning image-to-text models. It was created by the author 'ituperceptron' and was last updated on January 15, 2026.
A dataset likely designed for Visual Question Answering (VQA) tasks, focusing on salience and conflict within images. It is hosted on Kaggle, but specific details about its size, creation date, and authorship are unknown. The dataset's content and scope require verification after download.
'MHAL Dataset Annotations for LLaVA' is a dataset published on Kaggle. The title suggests it contains annotations for the LLaVA (Large Language and Vision Assistant) model, likely involving multimodal data linking images and text. The dataset's specific content, size, and authorship are unknown.
Kaggle dataset titled 'data_vlm_diff_ready_40_cmd'. The name suggests a collection of data prepared for vision-language models and diffusion processes. The dataset's specific content, size, and origin are not detailed in the provided metadata.
A multimodal dataset from the LLaVA-CoT project, likely containing image-question-answer pairs structured for visual reasoning tasks. The dataset includes a train.jsonl file with conversation data linking images to questions and answers, suggesting a format for training or evaluating vision-language models. It was authored by 'berhaan' and last updated on January 17, 2026.
A Kaggle entry titled LLaVA, named after the multimodal AI system. It likely contains model weights for a large language model with vision capabilities. The author, organization, and last update date are unknown.
The VQA-Autopilot dataset is hosted on Kaggle. Its title suggests it contains data for visual question answering, a task combining computer vision and natural language processing, potentially for applications in autonomous systems. Metadata is minimal; actual content, scale, and authorship require verification after download.
A dataset named LLaVA, hosted on Kaggle, likely contains multimodal data for training vision-language models. The platform tags suggest it is intended for large language model (LLM) training and multimodal AI tasks. Specific details on size, structure, and creation are not provided in the available metadata.
A dataset titled 'data_vlm_diff_ready_30' is hosted on Kaggle. The title suggests it is prepared for training or evaluating vision-language models, likely containing paired image and text data. Its specific content, size, and creation details are not provided in the available metadata.
Adversarial test cases combine images and text to validate multimodal large language models. The dataset is designed to challenge evidence-based reasoning capabilities in models like Gemini. Its origin, size, and creation details are not specified.
A multimodal dataset likely containing transcripts and potentially audio recordings from corporate earnings conference calls. The dataset is hosted on Kaggle, but specific details about its size, source, and creation date are not provided in the available metadata. Its content suggests it is intended for analyzing corporate financial performance and communication.
WBC-AttrDescVQA is a dataset for Visual Question Answering (VQA) tasks, likely involving images of white blood cells (WBCs). The dataset is hosted on Kaggle, but its specific scale, creation date, and authorship are not detailed in the provided metadata. Its content and structure must be verified after download.
A dataset titled 'WildFire_VQA' is hosted on Kaggle. The dataset likely contains image and text pairs for visual question answering tasks related to wildfire scenes. Metadata is minimal; the specific number of samples, data source, and creation date are unknown.
'Image captions for nova 2' is a dataset published on Kaggle. The title suggests it contains descriptive text paired with images. Metadata is minimal; actual content requires verification after download.
Chart2Code is a benchmark of 2,023 tasks designed to evaluate Large Multimodal Models (LMMs) on chart understanding and code generation, released by CSU-JPG in 2026. The dataset is structured into three hierarchical difficulty levels containing 863, 1,010, and 150 tasks respectively. It maps data visualizations to executable code to test the reasoning capabilities of multimodal systems.
'RL GSPO Qwen2.5VLM PhaseB Best Composite 180' is a dataset published on Kaggle. The title suggests it is likely a benchmark or evaluation dataset for a vision-language model, possibly related to reinforcement learning. The dataset's specific content, size, and origin are unknown from the provided metadata.