Description

MINT-1T is an open-source multimodal dataset containing 1 trillion text tokens and 3.4 billion images interleaved with that text, roughly a tenfold scale-up over prior open-source collections. It was created by a team from the University of Washington to support research in multimodal pretraining, and it draws on sources such as PDFs and arXiv papers.

Use Cases

Pretrain multimodal models on 1 trillion text tokens and 3.4 billion interleaved images for tasks like visual question answering.
Analyze the composition and information density of the PDF and arXiv content within the dataset's text corpus.
Study scaling behavior of vision-language models by training on subsets with controlled token and image counts.
Study the interleaved structure of text and image data for next-token or next-image prediction tasks.
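
To make the interleaved use case concrete, the following is a minimal sketch of how one document might be flattened into a single training sequence. It assumes a hypothetical record layout with two parallel lists, `texts` and `images`, where each position holds exactly one of the two; the actual field names and layout should be checked against the dataset page. The `<image>` placeholder token and the function name are also illustrative assumptions.

```python
IMAGE_TOKEN = "<image>"  # hypothetical placeholder, not taken from the dataset


def flatten_interleaved(texts, images, image_token=IMAGE_TOKEN):
    """Merge parallel text/image slots into one sequence, in document order.

    Assumes each slot holds exactly one of a text string or an image reference.
    Returns the flattened text and the image references in order of appearance.
    """
    if len(texts) != len(images):
        raise ValueError("texts and images must have the same number of slots")
    pieces, image_refs = [], []
    for text, image in zip(texts, images):
        if (text is None) == (image is None):
            raise ValueError("each slot must hold exactly one of text or image")
        if image is not None:
            pieces.append(image_token)  # mark where the image sits in the text
            image_refs.append(image)
        else:
            pieces.append(text)
    return " ".join(pieces), image_refs


# Stand-in document; real records may be structured differently.
doc = {
    "texts": ["Figure 1 shows the setup.", None, "Results follow."],
    "images": [None, "fig1.png", None],
}
sequence, refs = flatten_interleaved(doc["texts"], doc["images"])
print(sequence)  # Figure 1 shows the setup. <image> Results follow.
print(refs)      # ['fig1.png']
```

Keeping the image references in a separate list preserves their order, so a model's image encoder output can later be substituted at each placeholder position.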

Strengths

Contains 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source multimodal datasets.
Incorporates previously untapped document sources such as PDFs and arXiv papers.
Specifically designed and released by an academic team to facilitate research in multimodal pretraining.
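
The headline figures above imply a few simple ratios that help when sizing experiments. The sketch below only does arithmetic on the numbers stated in this description (1 trillion tokens, 3.4 billion images, a 10x scale-up); the function names are illustrative, and the "implied prior scale" is derived from the 10x claim rather than being a measured figure.

```python
TEXT_TOKENS = 1_000_000_000_000  # 1 trillion text tokens, as stated above
IMAGES = 3_400_000_000           # 3.4 billion images, as stated above
SCALE_UP = 10                    # claimed scale-up over prior open collections


def tokens_per_image(tokens, images):
    """Average number of text tokens accompanying each image."""
    return tokens / images


def implied_prior_tokens(tokens, scale_up):
    """Token count of a prior collection implied by the scale-up factor."""
    return tokens / scale_up


print(round(tokens_per_image(TEXT_TOKENS, IMAGES)))         # 294
print(f"{implied_prior_tokens(TEXT_TOKENS, SCALE_UP):.0e}")  # 1e+11
```

On average there are roughly 294 text tokens per image, and the 10x claim implies earlier open collections were on the order of 100 billion tokens.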

Limitations

This summary does not detail the dataset's structure, fields, or file layout (the page tags indicate the WebDataset format), which complicates initial data exploration.
Including academic papers (arXiv) and PDF documents may introduce source-specific biases.
At this scale, downloading and processing the full dataset requires substantial storage, bandwidth, and compute.
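
One way to work around both the undocumented structure and the download cost is to inspect only the first few records of a stream. The sketch below is a generic helper, not part of any dataset tooling: it accepts any iterable of dictionaries and reports which fields appear and what Python types they hold. The sample records are stand-ins invented for illustration and are not real MINT-1T rows.

```python
from collections import defaultdict
from itertools import islice


def summarize_fields(samples, limit=100):
    """Map each field name to the value types seen in the first `limit` samples."""
    seen = defaultdict(set)
    for sample in islice(samples, limit):  # stop early; never exhaust the stream
        for key, value in sample.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Stand-in stream with hypothetical fields; real records may differ.
fake_stream = iter([
    {"texts": ["a", None], "images": [None, "img0"], "url": "https://example.com/a"},
    {"texts": ["b"], "images": [None], "url": None},
])
print(summarize_fields(fake_stream, limit=2))
# {'texts': ['list'], 'images': ['list'], 'url': ['NoneType', 'str']}
```

With the Hugging Face `datasets` library, an iterator obtained from `load_dataset(repo_id, split="train", streaming=True)` could be passed in place of `fake_stream`, so that only the inspected records are fetched; the repository id and split name must be taken from the dataset page.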

Provenance

Source: University of Washington research team (mlfoundations).
Collection Method: Aggregated from open-source multimodal sources, including PDFs and arXiv papers.
Freshness: Last updated in September 2024.

Consult the dataset page on Hugging Face for the full description, access details, and licensing terms; the page tags list the license as CC BY 4.0.

Tags: multimodal, webdataset
Task categories: text-generation, image-to-text
Size category: 1M<n<10M
arXiv: 2406.11271
Language: en
Libraries: webdataset, mlcroissant, datasets
Modalities: text, image
License: cc-by-4.0
Region: us

Multimodal Dataset with One Trillion Text Tokens and 3.4 Billion Images
