This dataset contains 42,678 Vietnamese images, each paired with a detailed text description and visual question-answering pairs, both generated by GPT-4o. The annotations include spatial metadata for objects and text, and cover text attributes such as font style, color, and size in a Vietnamese-language context.
Use Cases
- Train Vietnamese OCR models that require font and color recognition using the text description fields
- Develop multimodal LLMs capable of spatial reasoning by leveraging the object location and quantity data
- Fine-tune visual question answering (VQA) systems for Vietnamese using the detailed long-form answer pairs
- Build image captioning models that describe complex scenes including object composition and text attributes
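As a rough illustration of the VQA fine-tuning use case, the sketch below flattens one annotated record into one training sample per question-answer pair. The record layout and the field names (`image_id`, `description`, `qa_pairs`, `question`, `answer`) are assumptions made for this example, not the dataset's documented schema, and the sample record is invented.

```python
from typing import Any


def record_to_vqa_samples(record: dict[str, Any]) -> list[dict[str, str]]:
    """Flatten one annotated image record into one sample per QA pair.

    All field names here are hypothetical; adjust them to the real schema.
    """
    samples = []
    for pair in record.get("qa_pairs", []):
        question = pair.get("question", "").strip()
        answer = pair.get("answer", "").strip()
        if not question or not answer:
            continue  # skip incomplete pairs
        samples.append(
            {"image_id": record["image_id"], "prompt": question, "target": answer}
        )
    return samples


# Invented record, for illustration only.
example_record = {
    "image_id": "000001",
    "description": "Một biển hiệu màu đỏ với chữ trắng.",
    "qa_pairs": [
        {"question": "Biển hiệu có màu gì?", "answer": "Biển hiệu có màu đỏ."},
        {"question": "", "answer": "bỏ qua"},  # incomplete, will be dropped
    ],
}

samples = record_to_vqa_samples(example_record)
```

Keeping the image identifier on every sample lets a data loader fetch the image lazily, which may matter at the scale of tens of thousands of images.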
Strengths
- 42,678 Vietnamese images, each with corresponding GPT-4o-generated annotations
- Includes text-specific metadata such as font style, color, position, and size for all recognized text
- Features object-level details including location coordinates and quantity counts within the image descriptions
- Provides long-form, detailed answers for visual question-answering tasks in Vietnamese
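If the object and text metadata are available as structured fields, they could be queried along the lines of the sketch below. This is an assumption: the attributes may instead be embedded in the free-text descriptions, in which case a parsing step would be needed first. The field names (`objects`, `name`, `location`, `texts`, `content`, `font`, `color`, `position`, `size`) and the sample record are hypothetical.

```python
from collections import Counter
from typing import Any


def count_objects(record: dict[str, Any]) -> Counter:
    """Count how many times each object name appears in a record."""
    return Counter(obj["name"] for obj in record.get("objects", []))


def texts_with_color(record: dict[str, Any], color: str) -> list[str]:
    """Return the content of text entries whose color matches, case-insensitively."""
    wanted = color.lower()
    return [
        entry["content"]
        for entry in record.get("texts", [])
        if entry.get("color", "").lower() == wanted
    ]


# Invented record using a hypothetical schema, for illustration only.
example_record = {
    "objects": [
        {"name": "xe máy", "location": "bottom left"},
        {"name": "xe máy", "location": "center"},
        {"name": "biển hiệu", "location": "top right"},
    ],
    "texts": [
        {"content": "PHỞ BÒ", "font": "bold sans-serif", "color": "Red",
         "position": "top right", "size": "large"},
        {"content": "Mở cửa 6h", "font": "regular", "color": "white",
         "position": "top right", "size": "small"},
    ],
}

object_counts = count_objects(example_record)
red_texts = texts_with_color(example_record, "red")
```

Simple filters like these can help build targeted subsets, for example images with several instances of one object for counting tasks, or text in a given color for attribute-aware OCR.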