Name: Multimodal Image-Caption Pairs with Synthetic Data Enrichment
Creator: starriver030515
Published: 2025-02-08T15:04:15
Keywords: library:polars, language:zh, task_categories:question-answering, language:en, task_categories:visual-question-answering, size_categories:n<1K, modality:text, library:mlcroissant, task_categories:table-question-answering, modality:image, library:datasets, library:pandas, format:parquet, arxiv:2504.09925, region:us, license:apache-2.0

Description

FUSION-10M is a large-scale dataset of image-caption pairs designed for pretraining multimodal AI models. It builds on the established datasets LLaVA, ShareGPT4, and PixelProse, and adds 2 million synthesized task-specific pairs. The dataset was created by starriver030515 and was last updated in April 2025.

Use Cases

Pretrain multimodal models using the image-caption pairs to learn joint visual-language representations.
Fine-tune image captioning models on the enriched dataset, leveraging the 2 million synthesized task-specific pairs.
Benchmark model performance on tasks derived from foundational datasets like LLaVA, ShareGPT4, and PixelProse.
Analyze the impact of synthetic data enrichment on downstream multimodal task performance.
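The last use case, comparing runs with and without the synthetic pairs, needs a way to separate the two groups. The following is a minimal sketch only: this card does not document the column structure, so the `image`, `caption`, and `source` fields, the `synthetic` label, and the sample records are all hypothetical stand-ins, and `split_by_origin` is an illustrative helper rather than part of any published tooling.

```python
from collections import defaultdict

# Hypothetical records: the real column names are not documented in this card.
# "source" is an assumed field marking where each pair came from.
records = [
    {"image": "img_0001.jpg", "caption": "A dog on a beach.", "source": "LLaVA"},
    {"image": "img_0002.jpg", "caption": "A chart of sales by month.", "source": "synthetic"},
    {"image": "img_0003.jpg", "caption": "A city street at night.", "source": "PixelProse"},
    {"image": "img_0004.jpg", "caption": "A table listing prices.", "source": "synthetic"},
]


def split_by_origin(rows, synthetic_label="synthetic"):
    """Partition rows into a base subset and a synthetic subset."""
    subsets = defaultdict(list)
    for row in rows:
        key = "synthetic" if row["source"] == synthetic_label else "base"
        subsets[key].append(row)
    return subsets["base"], subsets["synthetic"]


base, synthetic = split_by_origin(records)

# Share of synthetic pairs in this toy sample (not a statistic of the dataset).
synthetic_share = len(synthetic) / len(records)
print(f"base={len(base)} synthetic={len(synthetic)} share={synthetic_share:.2f}")
```

Training one model on `base` alone and another on `base + synthetic` would then isolate the effect of the enrichment, provided the real data exposes some equivalent of the assumed `source` field.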

Strengths

Designed for pretraining large models such as FUSION-3B and FUSION-8B.
Enriched with 2 million synthesized task-specific image-caption pairs to augment coverage.
Built upon multiple established, high-quality datasets including LLaVA, ShareGPT4, and PixelProse.

Limitations

Specific row count and column structure are not documented here, hindering precise technical assessment; the keywords indicate Parquet files.
Quality may vary across sources because the dataset aggregates and synthesizes data from multiple origins.
The keywords list an Apache 2.0 license, but geographic scope and temporal coverage are unspecified, creating uncertainty for some applications.

Provenance

Source: Aggregated from LLaVA, ShareGPT4, and PixelProse datasets, with additional synthetic data.
Collection Method: Compilation and synthesis of existing image-caption datasets, generating 2 million new synthetic pairs.
Freshness: Last updated on 2025-04-15.

Users must consult the linked paper (arXiv:2504.09925) and GitHub repository for detailed methodology, license information, and data structure before use.
