Name: Minimind V: Multimodal Vision-Language Dataset for Instruction Tuning
Creator: jingyaogong
Published: 2024-10-04T14:45:40
Keywords: Computer Vision, Chinese NLP, Image Captioning, Multimodal Vision Language, Multimodal

Description

Minimind V Dataset is a multimodal collection for training vision-language models, assembled by jingyaogong from sources including Chinese-LLaVA-Vision, llava-en-zh-300k, and LLaVA-SFT-665K. It contains approximately 570,000 pre-training images and 965,000 instruction-following data points, with content in both English and Chinese. The dataset was last updated on Hugging Face on April 4, 2026.

Use Cases

Training vision-language models for image captioning based on the provided conversation examples.
Fine-tuning models for instruction-following tasks using the instruction data, which includes the LLaVA-SFT-665K source.
Developing bilingual (English-Chinese) multimodal assistants based on the translated and curated content.
Pre-training models on image-text pairs based on the 570k images from CC-3M and COCO 2014.

Strengths

Large scale with approximately 570,000 pre-training images.
Includes a substantial 965,000 instruction-following data points for supervised fine-tuning.
Explicitly curated for Chinese language support with translated content.
Images are pre-processed to consistent resolutions (128x128 for pre-train, 160x160 for SFT).
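
The stated resolutions can be spot-checked after download. The sketch below is a minimal example that assumes the images are PNG-encoded, which this card does not confirm; `png_size`, `has_expected_size`, and the split labels in `EXPECTED_SIZE` are hypothetical names introduced for illustration. Other encodings, such as JPEG, would need an imaging library instead.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Resolutions stated in this card; the split labels are illustrative only.
EXPECTED_SIZE = {"pretrain": (128, 128), "sft": (160, 160)}


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG byte string."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image (the encoding assumption failed)")
    # Width and height are big-endian 32-bit integers at byte offsets 16 and 20.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def has_expected_size(data: bytes, split: str) -> bool:
    """Check a PNG image against the resolution stated for its split."""
    return png_size(data) == EXPECTED_SIZE[split]
```

Reading only the header keeps the check cheap enough to run over a large sample without decoding pixel data.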

Limitations

Exact row counts, file formats, and license information are not documented.
Column-level documentation is absent; field semantics must be inferred after download.
Freshness should be verified as the last update date is in the future (2026-04-04).
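
One way to make the freshness check concrete is to compare the listed timestamps against each other and against the current time. The sketch below uses the two timestamps given in this card; treating them as UTC is an assumption, and `timestamp_issues` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Timestamps copied from this card; interpreting them as UTC is an assumption.
PUBLISHED = datetime(2024, 10, 4, 14, 45, 40, tzinfo=timezone.utc)
LAST_UPDATED = datetime(2026, 4, 4, 7, 5, 41, tzinfo=timezone.utc)


def timestamp_issues(published: datetime, last_updated: datetime, now: datetime) -> list[str]:
    """Return human-readable problems found in the two timestamps."""
    issues = []
    if last_updated < published:
        issues.append("last update precedes publication")
    if last_updated > now:
        issues.append("last update is in the future")
    return issues
```

Calling `timestamp_issues(PUBLISHED, LAST_UPDATED, datetime.now(timezone.utc))` returns an empty list when neither condition applies.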

Provenance

Source: Aggregated from Chinese-LLaVA-Vision, llava-en-zh-300k, and LLaVA-SFT-665K.
Collection Method: Data was collected, translated, resized, and curated from the listed sources.
Time Range: Not specified
Freshness: Last updated 2026-04-04 07:05:41; freshness should be verified.
Geography: Not specified

License is unknown; users must verify permissions before use. The dataset contains binary image data (image_bytes) and JSON strings (conversations).
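
Because field semantics are undocumented, a first inspection step might decode one record and report what it holds. The sketch below assumes each record is available as a Python dict with the `image_bytes` and `conversations` fields named above; the structure inside the JSON string is not documented here, so the code only parses it and does not interpret it. `sniff_image_format` and `decode_row` are hypothetical helper names.

```python
import json


def sniff_image_format(data: bytes) -> str:
    """Guess the image encoding from its leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    return "unknown"


def decode_row(row: dict) -> dict:
    """Parse the JSON conversation string and summarize the image payload."""
    image = row["image_bytes"]
    return {
        # The parsed value is returned as-is; its schema is not assumed.
        "conversations": json.loads(row["conversations"]),
        "image_format": sniff_image_format(image),
        "image_size_bytes": len(image),
    }
```

Running this on a handful of records is a quick way to infer the conversation schema and the image encoding before writing a full loader.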
