DataConcept-128M contains 128 million web-crawled image-text pairs annotated with fine-grained concept composition details. It is derived from DataComp-CLIP and designed to enable Concept-Aware Batch Sampling for multimodal pretraining.
Use Cases
- Train multimodal models using concept composition annotations for improved representation learning.
- Implement Concept-Aware Batch Sampling to construct training batches based on specific concept criteria.
- Pretrain vision-language models on 128M image-text pairs with fine-grained concept labels.
- Analyze the relationship between visual concepts and their textual descriptions across a large-scale web corpus.
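To make the batch sampling use case concrete, here is a minimal sketch of one way to build a batch from per-sample concept annotations. It is not the CABS implementation. The dataset schema is not publicly documented, so the `id` and `concepts` fields, the `sample_concept_balanced_batch` helper, and the greedy rarity-based scoring rule are all assumptions made for illustration.

```python
from collections import Counter


def sample_concept_balanced_batch(pool, batch_size):
    """Greedily pick up to `batch_size` samples from `pool`, favoring samples
    whose concepts are still under-represented in the batch being built.

    Each sample is assumed to be a dict with a hypothetical "concepts" list.
    """
    counts = Counter()      # how often each concept appears in the batch so far
    remaining = list(pool)  # candidates not yet selected
    batch = []
    while remaining and len(batch) < batch_size:
        # Score = sum of 1 / (1 + count) over the sample's concepts, so
        # concepts not yet in the batch contribute the most.
        best = max(
            remaining,
            key=lambda s: sum(1.0 / (1 + counts[c]) for c in s["concepts"]),
        )
        batch.append(best)
        remaining.remove(best)
        counts.update(best["concepts"])
    return batch


# Toy candidate pool with made-up concept annotations.
pool = [
    {"id": 0, "concepts": ["dog", "grass"]},
    {"id": 1, "concepts": ["dog", "ball"]},
    {"id": 2, "concepts": ["dog"]},
    {"id": 3, "concepts": ["car", "street"]},
    {"id": 4, "concepts": ["dog", "grass"]},
]
batch = sample_concept_balanced_batch(pool, batch_size=3)
print([s["id"] for s in batch])
```

In this toy pool, the sample with the car and street concepts is selected early because none of its concepts are in the batch yet, while the near-duplicate dog samples are deferred. A different criterion (for example, targeting a fixed concept distribution) could be swapped in by changing the scoring function.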
Strengths
- 128 million image-text pairs provide a large-scale foundation for pretraining.
- Includes fine-grained concept composition annotations derived from DataComp-CLIP.
- Designed to support Concept-Aware Batch Sampling (CABS), a flexible batch sampling framework.
Limitations
- Specific column definitions and data schema are not publicly documented.
- Web-crawled data may contain noise, biases, or inappropriate content.
- Geographic and temporal coverage of the source data is not documented.
Provenance
- Source: Derived from DataComp-CLIP (web-crawled).
- Collection Method: Web-crawled image-text pairs, annotated with concept composition details.
- Time Range: not documented
- Freshness: not documented
- Geography: not documented