S1-MMAlign contains 15.5 million image-text pairs extracted from 2.5 million open-access scientific papers across biology, chemistry, and physics. Developed by ScienceOne-AI and released in 2026, it provides a large-scale resource for aligning complex scientific imagery with textual descriptions. The dataset is designed to bridge the semantic gap in scientific multimodal learning using peer-reviewed literature.
The dataset is provided in WebDataset format and is licensed under CC BY-NC 4.0, which prohibits commercial use. Users should refer to arXiv paper 2601.00264 for the alignment methodology and to DOI 10.57967/hf/8008 for citation.
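Because the data ships in WebDataset format, each shard is an ordinary tar archive in which consecutive files sharing a basename form one sample. The following is a minimal sketch of reading such a shard with only the Python standard library. It is not the authors' loader: the extension names a real shard contains (for example `jpg` for the image and `txt` for the caption) are assumptions, since this card does not describe the shard layout.

```python
import io
import tarfile


def split_key(name):
    """Split a tar member name into (sample key, extension).

    Follows the WebDataset convention: the split happens at the first
    dot of the basename, so "dir/a.b.json" gives ("dir/a", "b.json").
    Returns None for names without an extension.
    """
    directory, _, base = name.rpartition("/")
    stem, dot, ext = base.partition(".")
    if not dot:
        return None
    key = f"{directory}/{stem}" if directory else stem
    return key, ext.lower()


def iter_samples(fileobj):
    """Yield one dict per sample from a WebDataset-style tar stream.

    Each dict maps extension -> raw bytes, plus "__key__" for the
    sample key. Files belonging to a sample must be adjacent in the
    archive, which is what the format requires.
    """
    current_key, sample = None, {}
    # "r|*" reads the archive sequentially with transparent compression.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            parsed = split_key(member.name)
            if parsed is None:
                continue
            key, ext = parsed
            if key != current_key:
                if sample:
                    yield sample
                current_key, sample = key, {"__key__": key}
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample
```

To try it on a downloaded shard, open the file in binary mode and iterate, for instance `with open("shard-000000.tar", "rb") as f: first = next(iter_samples(f))`, where the shard filename is hypothetical. Inspecting `first.keys()` shows which extensions the shard actually uses before any decoding code is written.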