SRefCOCO is a large-scale dataset designed for visual grounding tasks. It was created by xutao2025 and was last updated on June 3, 2026. Unlike conventional visual grounding datasets, which pair images with text only, it adds speech as a third modality alongside text and image data.
Use Cases
- Train speech-based object detection models using the dataset's speech-text-image interaction paradigm.
- Benchmark visual grounding systems that go beyond traditional text-image bimodal interactions.
- Develop flexible AI applications that follow multi-modal instructions in dynamic physical scenarios.
Strengths
- Described as 'large-scale', which suggests substantial size, although no row count is published to confirm it.
- Introduces a novel speech-text-image paradigm for visual grounding.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count and file size are unknown, which makes it hard to assess suitability before download.
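Because field semantics have to be inferred after download, a first pass over a few rows can at least show which fields exist and what types they hold. The following is a minimal sketch, not a documented loading procedure: it assumes the rows are already available as Python dictionaries, and the field names `image_id`, `text`, and `bbox` are hypothetical placeholders, since the real SRefCOCO schema is not documented.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for row in records:
        for key, value in row.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Hypothetical rows for illustration only; the actual field names are unknown.
sample_rows = [
    {"image_id": 1, "text": "the left cup", "bbox": [10.0, 20.0, 50.0, 60.0]},
    {"image_id": 2, "text": None, "bbox": [5.0, 5.0, 30.0, 40.0]},
]

summary = summarize_fields(sample_rows)
for field, types in summary.items():
    print(f"{field}: {', '.join(types)}")
```

A field that reports more than one type (here `text`, which is sometimes missing) is a hint that it is optional or inconsistently populated and deserves a closer look before training.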
Provenance
- Source: xutao2025
- Freshness: Last updated 2026-06-03 08:23:23; freshness should be verified.