Available on 1 platform
MINT-1T is an open-source multimodal interleaved dataset containing 1 trillion text tokens and 3.4 billion images, a tenfold increase in scale over prior open collections of this kind. It was created by a team from the University of Washington and draws on previously untapped sources such as PDFs and arXiv papers. The dataset was uploaded to the platform in September 2024.
The full description, along with specifics such as the license, exact file formats, and data structure, is available only on the original dataset page. The "1 trillion tokens" figure likely refers to units produced by a tokenizer, not raw characters or bytes, so the exact count depends on which tokenizer was used.
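To make the token-versus-character distinction concrete, here is a minimal sketch. The `count_tokens` helper and its regex-based splitting rule are assumptions made purely for illustration; they are not the tokenizer used to compute the MINT-1T figure, which this page does not specify.

```python
import re


def count_tokens(text: str) -> int:
    """Count tokens with a toy rule: each word or punctuation mark is one token.

    Hypothetical stand-in for a real tokenizer; real subword tokenizers
    split text differently and would give different counts.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


sample = "MINT-1T interleaves text and images."

char_count = len(sample)             # raw characters, including spaces
token_count = count_tokens(sample)   # tokens under the toy rule above

print(f"characters: {char_count}, tokens: {token_count}")
```

The same text yields noticeably different numbers depending on the unit, which is why a headline token count is only comparable across datasets when the tokenizer is known.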