Description

MINT-1T is an open-source interleaved multimodal dataset containing 1 trillion text tokens and 3.4 billion images, roughly ten times the scale of earlier open-source datasets of this kind. It was created by a team from the University of Washington and released in 2024 (arXiv:2406.11271). It draws on sources such as PDFs and arXiv papers to support research in multimodal pretraining.

Use Cases

Pretrain multimodal models on interleaved text-and-image sequences, using the dataset's document structure to preserve the order of text and images.
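Train vision-language models on the 3.4 billion images and their surrounding text to improve image captioning or visual question answering. The images are interleaved within documents rather than stored as standalone image-text pairs.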
Benchmark model scaling laws using the 1 trillion token text corpus from diverse sources like PDFs.
Fine-tune models for scientific document understanding using the included arXiv paper data.
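
For the interleaved pretraining use case, a common preprocessing step is to flatten each document into a single sequence in which every image is replaced by a placeholder token. The following is a minimal sketch, not the dataset's documented schema: it assumes each document exposes parallel `texts` and `images` lists where exactly one entry is non-null at each position, and the `<image>` marker and the `flatten_interleaved` helper are hypothetical names chosen for illustration.

```python
from typing import List, Optional

# Hypothetical marker token; the dataset itself does not define one.
IMAGE_PLACEHOLDER = "<image>"


def flatten_interleaved(
    texts: List[Optional[str]], images: List[Optional[str]]
) -> str:
    """Merge parallel text/image slots into one training string.

    Assumes each position holds either a text segment or an image
    reference, with None in the other list (an assumed layout).
    """
    if len(texts) != len(images):
        raise ValueError("texts and images must have the same length")
    parts = []
    for text, image in zip(texts, images):
        if image is not None:
            # Keep the image's position in the document, drop its content.
            parts.append(IMAGE_PLACEHOLDER)
        elif text is not None:
            parts.append(text.strip())
    return "\n".join(parts)


# Toy document in the assumed layout.
doc = {
    "texts": ["Figure 1 shows the setup.", None, "Results follow."],
    "images": [None, "fig1.png", None],
}
flat = flatten_interleaved(doc["texts"], doc["images"])
print(flat)
```

A real pipeline would pass the image bytes to a vision encoder at the placeholder positions; the sketch only shows how document order is preserved.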

Strengths

Contains 1 trillion text tokens, a 10x scale-up from prior open-source multimodal datasets.
Includes 3.4 billion images, providing a large-scale visual corpus.
Integrates previously untapped sources such as PDFs and arXiv papers.
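
The two headline figures imply a rough text-to-image ratio, which can help when budgeting context length per image. This is back-of-envelope arithmetic on the rounded totals above, so the result is an approximate average and says nothing about the distribution across documents.

```python
TEXT_TOKENS = 1_000_000_000_000  # 1 trillion text tokens (rounded headline figure)
IMAGES = 3_400_000_000           # 3.4 billion images (rounded headline figure)

# Average number of text tokens accompanying each image.
tokens_per_image = TEXT_TOKENS / IMAGES
print(f"about {tokens_per_image:.0f} text tokens per image on average")
```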

Limitations

Specific row counts, column details, and data distributions are not provided in the description.
Potential for class imbalance or geographic bias is unknown without detailed metadata.
The dataset's large size may require significant computational resources for downloading and processing.

Provenance

Source: University of Washington research team (mlfoundations).
Collection Method: Aggregated from open sources including PDFs and arXiv papers.
Time Range: Appears to include data up to 2023, inferred from the dataset title rather than from documented metadata.
Freshness: Last updated on the platform in September 2024.
Geography: Not specified.

Dataset is very large (1T tokens, 3.4B images); ensure sufficient storage and bandwidth. License is indicated as CC BY 4.0 via platform tags.
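
Because of the storage and bandwidth cost, it is usually preferable to stream shards one at a time rather than download everything first. The platform tags indicate the WebDataset format, in which a shard is a tar archive and consecutive files sharing a base name form one sample. The sketch below reads such a shard with only the Python standard library. The file names and extensions in the demo shard (`doc0.json`, `doc0.png`) are assumptions for illustration, not the dataset's documented layout, and `iter_samples` is a hypothetical helper rather than part of the `webdataset` library.

```python
import io
import json
import os
import tarfile
from typing import Dict, Iterator


def iter_samples(fileobj) -> Iterator[Dict[str, object]]:
    """Yield one dict per sample from a WebDataset-style tar shard.

    Consecutive members whose names match up to the first dot of the
    base name are grouped into a single sample keyed by extension.
    """
    current_key, sample = None, {}
    # "r|*" reads the tar as a forward-only stream, so nothing is unpacked to disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, basename = os.path.split(member.name)
            stem, _, ext = basename.partition(".")
            key = os.path.join(dirname, stem)
            if key != current_key:
                if sample:
                    yield sample
                current_key, sample = key, {"__key__": key}
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


def _add(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Write one in-memory file into the demo shard."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


# Build a tiny in-memory shard with an assumed layout, then read it back.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    _add(tar, "doc0.json", json.dumps({"texts": ["hello", None]}).encode())
    _add(tar, "doc0.png", b"\x89PNG")
    _add(tar, "doc1.json", json.dumps({"texts": ["world"]}).encode())
buf.seek(0)

samples = list(iter_samples(buf))
print([s["__key__"] for s in samples])
```

The same generator works on any file-like object, so a shard can be processed as it arrives over the network and discarded afterwards instead of being kept on local storage.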

Tags: text, multimodal, WebDataset; task categories: text-generation, image-to-text; size category: 1M<n<10M; arXiv: 2406.11271; language: en; libraries: webdataset, mlcroissant, datasets; modalities: text, image; license: cc-by-4.0; region: us; interleaved documents, computer vision, large scale, natural language processing

Interleaved PDF and Image Corpus for Multimodal Training
