Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,541 datasets
LLaVA-LoRA-Nil-Final-Weights-V2 is a set of model weights published on Kaggle. The title suggests it relates to fine-tuning a Large Language-and-Vision Assistant (LLaVA) model using Low-Rank Adaptation (LoRA). The specific content, size, and provenance of the weights are unknown.
A dataset for visual question answering in the domain of physics, published on the Hugging Face platform. The dataset was uploaded by the user 'mlcf-robot' and was last updated on April 15, 2026. Its specific content, size, and structure are not detailed in the available metadata.
BilgeAI is a collection of Turkish text datasets for language model training, created by author vural2123 and last updated on March 28, 2026. The repository is structured into separate folders for instruction tuning and raw text pretraining. Each folder contains JSONL files with specific formats for different training tasks.
LLaVA-LoRA-nil-final-weights-2 is a set of model weights published on Kaggle. The title suggests it contains parameters for a fine-tuned version of the LLaVA (Large Language-and-Vision Assistant) model, likely using LoRA (Low-Rank Adaptation) techniques. No details on the training data, model size, or performance metrics are provided in the available metadata.
Over 37 hours of synchronized multimodal data for singing-driven 3D head motion, featuring motion subtitles and acoustic descriptions. The dataset, named SingMoSub, was created by ZikaiHuang and was last updated on March 1, 2026. It provides temporally aligned, region-level motion annotations for modeling expressive head and facial dynamics.
Amshaker's dataset provides 9 million text-image pairs for the first-stage pre-training of the Mobile-O multimodal model. The data is intended to align a diffusion decoder and conditioning projector with a frozen vision-language backbone. The dataset was last updated on Hugging Face in February 2026.
VideoChat2-IT-clean is a cleaned version of the VideoChat2-IT video instruction tuning dataset, released alongside the ICLR 2026 paper 'Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs'. The dataset was created by author 'byminji' and was last updated on March 3, 2026.
SmolLM3-3B-Base Blind Spots is a curated set of failure cases for the HuggingFaceTB/SmolLM3-3B-Base model. The dataset contains prompts, expected aligned behavior, and the model's actual outputs, illustrating common failure patterns. It was created by aneeshadas02 and last updated in March 2026.
75,491 PNG images of microscopy specimens at a resolution of 384 by 384 pixels. The collection includes diatoms and fungal spores and is described as a companion dataset for MicroLens Visual Question Answering tasks. Its author, organization, and license are unknown.
The title RLHF_clean suggests a dataset for training AI models using reinforcement learning from human feedback (RLHF). Published on Kaggle, its specific content, size, and origin are not detailed in the provided metadata. The dataset's actual structure and intended use require verification after download.
Kaggle hosts the LLaVA-LoRA-Noisy-Baseline-Final dataset. The title suggests it relates to instruction tuning for vision-language models, specifically the LLaVA (Large Language-and-Vision Assistant) architecture with LoRA (Low-Rank Adaptation). It may contain a baseline dataset with noisy annotations intended for model training or evaluation.
Shimin Qi produced this dataset in 2026 to support Reinforcement Learning from Human Feedback (RLHF) for large language models in urban planning. It comprises raw redevelopment data from Chinese municipal websites and multi-stakeholder annotated preference pairs used to fine-tune ChatGLM3-6B.
M3-MedQA contains between 1,000 and 10,000 medical image-question pairs across five languages, developed by pnu-clink in 2024. It extends the WorldMedQA-V dataset to evaluate cross-lingual consistency and medical reasoning in English, Korean, Japanese, Arabic, and Wolof.
Spatial457 contains between 10,000 and 100,000 image-text pairs designed for 6D spatial reasoning diagnostics. Created by researchers at Johns Hopkins University and DEVCOM Army Research Laboratory in 2025, it benchmarks the ability of multimodal models to interpret 3D orientations. The data is released under an Apache 2.0 license.
Heal Medvqa is a dataset for medical visual question answering, likely containing image-text pairs. The dataset was published on Hugging Face by the author tuandung2812 and was last updated on April 13, 2026. Its specific content, scale, and collection methodology require verification after download.
Gemma3_vlm_finetune_dataset is a dataset published on Kaggle, likely intended for fine-tuning vision-language models. Its specific content, size, and structure are not described in the available metadata. The dataset's author, organization, and license details are unknown.
Human Behavior Atlas aggregates and standardizes multiple behavioral datasets into a single training and evaluation framework. The dataset, created by HumanBehaviorAtlas, was last updated on Hugging Face in February 2026. It is designed to enable consistent training and evaluation of foundation models on psychological and social behavior tasks.
Ghosthunter RLHF contains under 1,000 gameplay screenshots from the 8-bit first-person shooter "Ghost Hunter," developed by webxos and updated in March 2026. The collection captures specific instances of successful ghost elimination using precision auto-fire to facilitate reinforcement learning from human feedback (RLHF).
Pred_LLaVA_LLaVA likely contains predictions from a vision-language model evaluation. Published on Kaggle, the dataset's specific content and scale require verification after download. Its platform tags suggest it is part of a multimodal benchmark.
Kaggle hosts this dataset titled 'predictions_llava'. The dataset likely contains outputs or predictions from the LLaVA (Large Language-and-Vision Assistant) model. Its specific scale, origin, and creation date are not detailed in the available metadata.