A curated multimodal dataset, described by its author as high-quality, intended for training and evaluating vision-language models and multimodal retrieval-augmented generation (RAG) systems. It contains 15,000 samples, each pairing a rendered recipe card image with a text description. It was created by tiptoghosh and last updated on Hugging Face in April 2026.
Use Cases
- Training vision-language models on paired recipe card images and text descriptions.
- Evaluating multimodal retrieval systems on cross-modal (image-to-text and text-to-image) recipe data.
- Fine-tuning models for recipe understanding and generation using the multimodal samples.
Strengths
- The description states that the dataset is curated for quality.
- Contains 15,000 multimodal samples, providing a substantial collection.
- Each sample pairs a rendered recipe card image (PNG, 300 DPI) with text, offering structured multimodal data.
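The 300 DPI figure can be spot-checked on a downloaded image without third-party libraries. The sketch below is illustrative only: it reads the optional `pHYs` chunk that the PNG format uses to record pixel density in pixels per metre, and converts it to DPI. The function name `png_dpi` is hypothetical, and a missing `pHYs` chunk only means the density was not recorded in the file, not that the card was rendered at a different resolution.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data: bytes):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if unavailable.

    Hypothetical helper for illustration; chunk CRCs are not verified.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte length, 4-byte type, body, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs":
            x_ppm, y_ppm, unit = struct.unpack(">IIB", body)
            if unit != 1:
                # Unit 0 means only the aspect ratio is known.
                return None
            # 1 inch = 0.0254 metres.
            return (round(x_ppm * 0.0254), round(y_ppm * 0.0254))
        if chunk_type in (b"IDAT", b"IEND"):
            # pHYs, when present, must appear before the image data.
            return None
        pos += 12 + length
    return None
```

Given the raw bytes of one image, for example `png_dpi(open(path, "rb").read())` with `path` pointing at a downloaded file, a result of `(300, 300)` would be consistent with the stated resolution.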
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count has not been verified against the 15,000 samples stated in the description, which may limit suitability assessment.
- Description metadata is limited; actual data quality requires manual inspection after download.
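Because the columns are undocumented, a first inspection pass after download might look like the sketch below. It is a minimal example under stated assumptions: rows are taken to be plain dictionaries (however they were loaded), the function name `summarize_fields` is hypothetical, and the field names `image` and `text` in the usage note are placeholders, not the dataset's confirmed schema.

```python
from collections import defaultdict


def summarize_fields(rows):
    """Summarize each field: the value types seen and how many rows lack a value.

    Hypothetical helper; `rows` is any iterable of dict-like records.
    """
    types = defaultdict(set)
    present = defaultdict(int)
    total = 0
    for row in rows:
        total += 1
        for key, value in row.items():
            types[key]  # register the field even if its value is None
            if value is None:
                continue
            types[key].add(type(value).__name__)
            present[key] += 1
    return {
        key: {"types": sorted(types[key]), "missing": total - present[key]}
        for key in types
    }
```

Running it on the first few hundred rows, for instance `summarize_fields(sample_rows)` where `sample_rows` might hold records with placeholder keys such as `image` and `text`, gives a quick view of which fields exist, what they contain, and how often they are empty. It also yields the actual number of rows scanned, which can be compared with the stated sample count.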
Provenance
- Source: Food.com, as indicated by the dataset title.
- Collection method: Likely scraping or API access to Food.com recipes, followed by rendering and curation.
- Time range: Not specified.
- Freshness: Last updated 2026-04-23 14:18:29; freshness should be verified.
- Geography: Not specified.