CogVLM-SFT-311K is the primary aligned corpus used in the initial training of CogVLM v1.0. The dataset contains approximately 311,000 bilingual visual instruction samples, constructed by selecting 3500 high-quality samples from MiniGPT-4, integrating them with LLaVA-Instruct-150K, and translating them into Chinese via a language model. The dataset was created by zai-org and last updated on December 26, 2023.
Use Cases
- Fine-tuning visual language models for bilingual instruction-following based on the described visual instruction data.
- Training or evaluating model performance on Chinese-English multimodal tasks based on the translated corpus.
- Studying the impact of curated high-quality subsets (e.g., minigpt4-3500) on model alignment.
- Analyzing noise and quality in detailed image-text annotations as mentioned in the description.
Strengths
- Designed as the primary aligned corpus for training CogVLM v1.0.
- Integrates approximately 3500 high-quality samples from the open-source MiniGPT-4 dataset.
- Bilingual dataset combining English (LLaVA-Instruct-150K) and Chinese-translated content.
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is unknown, which may limit suitability assessment.
Provenance
- Source
- huggingface, author zai-org
- Collection Method
- Constructed by selecting samples from MiniGPT-4, integrating with LLaVA-Instruct-150K, and translating into Chinese via a language model.
- Time Range
- null
- Freshness
- Last updated 2023-12-26 10:03:17; freshness should be verified.
- Geography
- null