Description

MINT-1T is an open-source multimodal interleaved dataset designed for pretraining research. It contains one trillion text tokens and 3.4 billion images, a 10x scale-up from prior open-source collections, and it draws on sources such as PDFs and arXiv papers. The dataset was created by a team from the University of Washington and was last updated on the platform in September 2024.

Use Cases

Pretraining multimodal models on interleaved text and image data.
Scaling up training data for vision-language models, given the one trillion text tokens and 3.4 billion images available.
Research on incorporating academic and document sources, enabled by the inclusion of PDFs and arXiv papers.

Strengths

Contains one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets.
Includes previously untapped sources such as PDFs and arXiv papers.
Explicitly designed for multimodal pretraining research.
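
For a rough sense of scale, the two headline figures imply an average amount of text per image. The sketch below uses only the numbers stated on this page; the resulting ratio is a back-of-envelope figure derived here, not a statistic published with the dataset.

```python
# Headline figures as stated on this page.
TEXT_TOKENS = 1_000_000_000_000   # one trillion text tokens
IMAGES = 3_400_000_000            # 3.4 billion images


def tokens_per_image(tokens: int, images: int) -> float:
    """Average number of text tokens per image across the whole dataset."""
    return tokens / images


ratio = tokens_per_image(TEXT_TOKENS, IMAGES)
print(f"~{ratio:.0f} text tokens per image on average")
```

The average says nothing about how text and images are distributed across individual documents, which would need to be measured on the downloaded data.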

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and file formats are not stated, and the license terms have not been verified beyond the platform tag, which may limit suitability assessment.
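
Because field semantics have to be inferred after download, one practical first step is to tabulate which fields appear in a sample of records and what types they hold. The following is a minimal sketch under stated assumptions: the helper `summarize_fields` and the field names in `sample` (`texts`, `images`, `url`) are hypothetical placeholders for illustration and are not taken from the dataset's actual schema.

```python
from collections import defaultdict
from typing import Any, Iterable


def summarize_fields(records: Iterable[dict[str, Any]]) -> dict[str, set[str]]:
    """Map each field name to the set of Python type names observed for it."""
    seen: dict[str, set[str]] = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return dict(seen)


# Hypothetical records for illustration only; the real field names are undocumented.
sample = [
    {"texts": ["A caption.", None], "images": [None, "img_0.jpg"], "url": "https://example.com/a"},
    {"texts": ["Body text."], "images": [None], "url": None},
]

for field, types in sorted(summarize_fields(sample).items()):
    print(field, sorted(types))
```

Fields that show more than one type (for example a string in some records and a null in others) are the ones most worth checking by hand before building a loading pipeline.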

Provenance

Source: University of Washington team (mlfoundations)
Collection Method: Aggregated from multiple open data sources, including PDFs and arXiv papers.
Time Range: Likely includes content up to 2024, but specific temporal coverage is not stated.
Freshness: Last updated 2024-09-19 21:02:55; freshness should be verified.
Geography: Region tag indicates 'us', but specific spatial coverage is not detailed.

License: Listed as CC BY 4.0 on the platform; the specific terms should be verified on the dataset page.

Tags: multimodal; multimodal pretraining; pretraining; text generation; image to text; large scale; open source
Task categories: text generation, image to text
arXiv: 2406.11271
Language: en
License: CC BY 4.0
Region: us
Size category: 100B < n < 1T

MINT-1T: A Multimodal Dataset with One Trillion Text Tokens and 3.4 Billion Images
