VisCon-100K is a dataset of 100,000 image-conversation samples designed for fine-tuning vision-language models. It is derived from 45,000 web documents in the OBELICS dataset, with captions generated by GPT-4V and converted into free-form conversations by OpenChat 3.5. The dataset was created by tiiuae and last updated on February 17, 2025.
Use Cases
- Fine-tuning vision-language models based on interleaved image-text web documents.
- Training models for contextual image captioning based on GPT-4V generated descriptions.
- Developing conversational AI agents that can discuss images based on free-form conversation data.
- Benchmarking model performance on tasks requiring integration of visual and textual web data.
Strengths
- Contains 100,000 image-conversation samples, providing a substantial volume for training.
- Derived from 45,000 web documents, suggesting a diverse source of contextual data.
- Leverages GPT-4V for caption generation, which may indicate high-quality initial annotations.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is known, but the specific distribution of images per document and other structural details are unknown.
- Data may reflect the biases inherent to the source web documents and the AI models used for annotation.
Provenance
- Source
- huggingface, author tiiuae
- Collection Method
- Derived from the OBELICS dataset's web documents, with annotations generated by GPT-4V and OpenChat 3.5.
- Time Range
- null
- Freshness
- Last updated 2025-02-17 06:29:32.
- Geography
- null