Available on 1 platform
MINT-1T is an open-source multimodal interleaved dataset designed for pretraining research. It contains one trillion text tokens and 3.4 billion images, a 10x scale-up from prior open-source collections, and it includes sources such as PDFs and arXiv papers. The dataset was created by a team from the University of Washington and was last updated on the platform in September 2024.
The license is listed as 'cc By 40' on the platform, which appears to denote CC BY 4.0, but the specific terms should be verified on the dataset page.