Description

MINT-1T contains 1 trillion text tokens and 3.4 billion images, a tenfold scale increase from prior open-source multimodal collections. Created by a University of Washington team, this dataset interleaves text and images from sources including ArXiv papers and PDFs to support multimodal pretraining research.

Use Cases

Train multimodal models on interleaved sequences of 1 trillion text tokens and 3.4 billion images for next-token prediction.
Benchmark model performance on ArXiv paper and PDF-derived content, assessing scientific document understanding.
Analyze the co-occurrence patterns of text and images across the interleaved documents (3.4 billion images in total) for representation learning.
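
To make the idea of an interleaved sequence concrete, the following is a minimal sketch of flattening one document into a token stream with image placeholders, as a next-token-prediction pipeline might do. The `Document` schema (parallel `texts` and `images` lists), the `<image>` placeholder, and the whitespace tokenizer are all assumptions made for illustration; they are not taken from the dataset's documented format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

IMAGE_TOKEN = "<image>"  # hypothetical placeholder, not a token defined by the dataset


@dataclass
class Document:
    # Assumed schema: parallel lists in reading order, where each position
    # holds either a text span or an image reference (the other is None).
    texts: List[Optional[str]]
    images: List[Optional[str]]


def to_interleaved_sequence(doc: Document) -> Tuple[List[str], List[str]]:
    """Flatten a document into tokens plus the image references they point to."""
    if len(doc.texts) != len(doc.images):
        raise ValueError("texts and images must have the same length")

    tokens: List[str] = []
    image_refs: List[str] = []
    for text, image in zip(doc.texts, doc.images):
        if image is not None:
            # Mark where the image sits in the sequence and remember which one it is.
            tokens.append(IMAGE_TOKEN)
            image_refs.append(image)
        elif text is not None:
            # Whitespace splitting stands in for a real tokenizer.
            tokens.extend(text.split())
    return tokens, image_refs
```

A document with the text "Figure 1 shows", then an image, then "the loss curve." would yield a single sequence in which the placeholder sits between the two text spans, preserving the original reading order.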

Strengths

Massive scale with 1 trillion text tokens, a 10x increase over previous open-source datasets.
Includes 3.4 billion images, providing a substantial visual component for multimodal training.
Incorporates previously untapped data sources such as ArXiv papers and PDF documents.
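
As a rough sense of scale, the headline figures can be turned into two derived numbers. This is back-of-the-envelope arithmetic on the rounded totals quoted above, not a measured statistic, and the variable names are illustrative.

```python
TEXT_TOKENS = 1_000_000_000_000  # 1 trillion text tokens (rounded headline figure)
IMAGES = 3_400_000_000           # 3.4 billion images (rounded headline figure)
SCALE_FACTOR = 10                # stated increase over previous open-source datasets

# Average amount of text accompanying each image across the whole collection.
tokens_per_image = TEXT_TOKENS / IMAGES

# Token count of earlier datasets implied by the stated 10x factor.
implied_prior_tokens = TEXT_TOKENS // SCALE_FACTOR

print(f"{tokens_per_image:.0f} text tokens per image on average")
print(f"{implied_prior_tokens:,} tokens implied for previous datasets")
```

The average hides large variation between sources, so it should be read as an order-of-magnitude guide only.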

Limitations

Specific data composition, class balance, and potential biases within the 3.4 billion images are not detailed.
The dataset's large size may require significant computational resources for downloading and processing.
The quality and consistency of image-text alignments across the diverse sources are not verified.
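
One way to reduce the download cost noted above is to stream a few samples and inspect them before committing to a full copy. The sketch below uses the streaming mode of the Hugging Face `datasets` library. The repository id `mlfoundations/MINT-1T-ArXiv` and the `train` split name are assumptions; check the dataset page for the actual identifiers and any access terms.

```python
from itertools import islice
from typing import Any, Dict, List


def peek_samples(
    repo_id: str = "mlfoundations/MINT-1T-ArXiv",  # assumed repository id
    n: int = 3,
) -> List[Dict[str, Any]]:
    """Return the first n samples without downloading the whole dataset."""
    # Imported lazily so this module loads even where `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True yields samples on demand instead of fetching every shard up front.
    stream = load_dataset(repo_id, split="train", streaming=True)  # split name assumed
    return list(islice(stream, n))


# Example usage (requires network access and the `datasets` package):
#     for sample in peek_samples(n=2):
#         print(sorted(sample.keys()))
```

Printing the keys of a few samples is also a cheap first check on the image-text alignment concern, since it shows which fields each record actually carries.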

Provenance

Source: mlfoundations (Machine Learning Foundations team, University of Washington).
Collection Method: Aggregated and scaled from open-source multimodal sources, including ArXiv papers and PDFs.
Freshness: Last updated in September 2024.

Dataset is hosted on Hugging Face; users should check the specific page for license details, access terms, and download requirements for the large-scale files.

Tags: Multimodal; WebDataset; Task categories: text-generation, image-to-text; Size category: 1M<n<10M; arXiv: 2406.11271; Language: en; Libraries: webdataset, mlcroissant, datasets; Modalities: text, image; License: CC BY 4.0; Region: US

Multimodal ArXiv Papers with One Trillion Text Tokens
