This multimodal fashion dataset provides image-text pairs annotated with categories, style, colors, materials, keywords, and fine-details. It is curated for evaluating vision-language models such as Marqo-FashionCLIP and Marqo-FashionSigLIP against fine-grained attribute metadata.
Use Cases
- Train vision-language models with contrastive learning on the paired images and text descriptions
- Develop attribute-based search systems using the colors, materials, and style columns
- Perform fine-grained fashion classification using the categories and fine-details labels
- Benchmark retrieval accuracy for fashion-specific models using the keywords and metadata
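
As an illustration of the attribute-based search use case, here is a minimal sketch of exact-match filtering over the colors, materials, and style columns. The sample records, the list-or-string field layout, and the `filter_by_attributes` helper are assumptions made for illustration; check the actual schema before reusing this.

```python
# Hypothetical records: the field names mirror the columns named above,
# but the values and the list-vs-string layout are assumptions.
RECORDS = [
    {"text": "red cotton summer dress", "categories": "dress",
     "style": "casual", "colors": ["red"], "materials": ["cotton"]},
    {"text": "black leather biker jacket", "categories": "jacket",
     "style": "streetwear", "colors": ["black"], "materials": ["leather"]},
    {"text": "white cotton t-shirt", "categories": "top",
     "style": "casual", "colors": ["white"], "materials": ["cotton"]},
]


def filter_by_attributes(records, **wanted):
    """Return the records that match every requested column=value pair.

    A column may hold a single value or a list of values; a list matches
    when it contains the requested value.
    """
    matches = []
    for record in records:
        keep = True
        for column, value in wanted.items():
            field = record.get(column)
            values = field if isinstance(field, list) else [field]
            if value not in values:
                keep = False
                break
        if keep:
            matches.append(record)
    return matches


casual_cotton = filter_by_attributes(RECORDS, materials="cotton", style="casual")
print([record["text"] for record in casual_cotton])
```

Exact matching keeps the sketch simple. A production search system would more likely combine such filters with embedding similarity from a vision-language model.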
Strengths
- Includes multi-attribute labels for categories, style, and colors
- Provides fine-grained metadata including materials and specific keywords
- Pairs fashion images with textual descriptions
- Serves as an evaluation benchmark for the Marqo-FashionCLIP and Marqo-FashionSigLIP models
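
Benchmarking a retrieval model on a dataset like this usually comes down to a ranking metric. The source does not say which metrics are reported, so the following is only a sketch of one common choice, recall@k, defined here as the fraction of a query's relevant items that appear in its top k results, averaged over queries. The `recall_at_k` helper and the toy identifiers are hypothetical.

```python
def recall_at_k(rankings, relevant, k):
    """Mean recall@k over queries.

    rankings: dict mapping query id -> list of item ids, best match first.
    relevant: dict mapping query id -> set of relevant item ids.
    Queries with no relevant items are skipped.
    """
    scores = []
    for query_id, ranked_items in rankings.items():
        targets = relevant.get(query_id, set())
        if not targets:
            continue
        hits = len(set(ranked_items[:k]) & targets)
        scores.append(hits / len(targets))
    return sum(scores) / len(scores) if scores else 0.0


# Toy rankings, as a model might produce for two text queries.
rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
relevant = {"q1": {"b"}, "q2": {"f", "x"}}

for k in (1, 2, 3):
    print(k, recall_at_k(rankings, relevant, k))
```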