Available on 1 platform
MINT-1T is an open-source multimodal interleaved dataset containing 1 trillion text tokens and 3.4 billion images, roughly a 10x scale-up over prior open-source interleaved datasets. It includes previously untapped sources such as PDFs and arXiv papers and is designed for multimodal pretraining research. The dataset was created by a team from the University of Washington and was last updated on the platform in September 2024.
License is listed as CC BY 4.0 on the platform; confirm terms on the dataset page before use.