FUSION-10M is a large-scale dataset of image-caption pairs for pretraining multimodal AI models. It builds on established datasets such as LLaVA, ShareGPT4, and PixelProse, and includes 2 million synthesized task-specific pairs. The dataset was created by starriver030515 and last updated in April 2025.
Before use, consult the linked paper (arXiv:2504.09925) and GitHub repository for the detailed methodology, license information, and data structure.