Description

This dataset provides textual visual context for image captioning, built on the publicly available COCO caption dataset. An October 2023 update adds a SwinV2 classifier used to generate visual caption cosine scores with person labels.
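For readers unfamiliar with cosine scores, the sketch below shows how a cosine similarity between two vectors is computed in general. It is not the dataset's actual scoring pipeline, which is not documented here: the function name `cosine_similarity` and the toy vectors are assumptions made purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings, invented for illustration only (not taken from the dataset).
caption_vec = [1.0, 0.0, 1.0]
context_vec = [1.0, 1.0, 0.0]
score = cosine_similarity(caption_vec, context_vec)
print(f"cosine score: {score:.3f}")
```

A score near 1 means the two vectors point in nearly the same direction, while a score near 0 means they are close to orthogonal.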

Use Cases

Fine-tune image captioning models using the provided visual_caption_cosine_score_v2 as a training signal.
Evaluate semantic similarity between generated captions and visual context using the soft/hard-label scoring mechanism.
Analyze the impact of person label thresholds (0.2, 0.3, 0.4) on caption quality and visual grounding.
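The first and third use cases above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset authors' method: only the column name `visual_caption_cosine_score_v2` and the thresholds 0.2, 0.3, and 0.4 come from this description, while the `caption` and `person_prob` fields, the toy rows, and both helper functions are hypothetical, since the real schema is not documented here.

```python
def score_weights(rows, score_key="visual_caption_cosine_score_v2"):
    """Turn cosine scores into per-example loss weights with mean 1.

    Scores are clipped to [0, 1] first so negative values cannot
    produce negative weights.
    """
    clipped = [min(max(float(row[score_key]), 0.0), 1.0) for row in rows]
    total = sum(clipped)
    if total == 0.0:
        # No usable signal: fall back to uniform weights.
        return [1.0] * len(rows)
    scale = len(rows) / total
    return [c * scale for c in clipped]


def person_rate_by_threshold(person_probs, thresholds=(0.2, 0.3, 0.4)):
    """Fraction of examples whose person probability meets each threshold."""
    if not person_probs:
        raise ValueError("person_probs must not be empty")
    n = len(person_probs)
    return {t: sum(p >= t for p in person_probs) / n for t in thresholds}


# Toy rows, invented for illustration; "person_prob" is a hypothetical field.
rows = [
    {"caption": "a man riding a horse",
     "visual_caption_cosine_score_v2": 0.6, "person_prob": 0.35},
    {"caption": "a dog on a couch",
     "visual_caption_cosine_score_v2": 0.2, "person_prob": 0.25},
    {"caption": "a plate of food",
     "visual_caption_cosine_score_v2": 0.4, "person_prob": 0.10},
]

weights = score_weights(rows)
rates = person_rate_by_threshold([row["person_prob"] for row in rows])
print(weights)
print(rates)
```

The weights could multiply each example's captioning loss during fine-tuning, and the per-threshold rates show how quickly the set of person-labeled examples shrinks as the threshold rises. Whether mean-1 normalization suits a given training setup is a design choice to validate empirically.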

Strengths

Based on the established COCO caption dataset, providing a known foundation for research.
Includes an October 2023 update that adds a SwinV2 classifier for generating visual caption cosine scores.
Provides multiple person label thresholds (0.2, 0.3, 0.4) for analyzing visual context.

Limitations

Row count, column details, and file formats are not documented in this description.
The dataset's scope and its modifications relative to the original COCO data are not fully detailed.
Label noise or bias may be inherited from the underlying COCO dataset and from the automated scoring methods.

Provenance

Source: Derived from the COCO caption dataset (Lin et al., 2014).
Collection Method: Augmented with a SwinV2 classifier to generate visual caption cosine scores and person labels.
Freshness: Last updated on the platform on 2025-09-03; the dataset description notes an update in October 2023.
Geography: Region tag indicates 'us', but specific spatial coverage is not detailed.

The full description is hosted externally; users should review the dataset page on Hugging Face for complete details, license, and access instructions.

Textual Visual Context Dataset for Image Captioning
