Sign in to view source links and access this dataset
Description
Minimind V Dataset is a multimodal collection for training vision-language models, assembled by jingyaogong from sources including Chinese-LLaVA-Vision, llava-en-zh-300k, and LLaVA-SFT-665K. It contains approximately 570,000 pre-training images and 965,000 instruction-following data points, with content in both English and Chinese. The dataset was last updated on Hugging Face on April 4, -2026.
Use Cases
Training vision-language models for image captioning based on the provided conversation examples.
Fine-tuning models for instruction-following tasks based on the 665k SFT data points.
Developing bilingual (English-Chinese) multimodal assistants based on the translated and curated content.
Pre-training models on image-text pairs based on the 570k images from CC-3M and COCO 2014.
Strengths
Large scale with approximately 570,000 pre-training images.
Includes a substantial 965,000 instruction-following data points for supervised fine-tuning.
Explicitly curated for Chinese language support with translated content.
Images are pre-processed to consistent resolutions (128x128 for pre-train, 160x160 for SFT).
Limitations
Row count, file formats, and license information are unknown.
Column-level documentation is absent; field semantics must be inferred after download.
Freshness should be verified as the last update date is in the future (2026-04-04).
Provenance
Source
Aggregated from Chinese-LLaVA-Vision, llava-en-zh-300k, and LLaVA-SFT-665K.
Collection Method
Data was collected, translated, resized, and curated from the listed sources.
Time Range
null
Freshness
Last updated 2026-04-04 07:05:41; freshness should be verified.
Geography
null
License is unknown; users must verify permissions before use. The dataset contains binary image data (image_bytes) and JSON strings (conversations).