PIN-200M: A Knowledge-Intensive Dataset of Paired and Interleaved Multimodal Documents

Name: PIN-200M: A Knowledge-Intensive Dataset of Paired and Interleaved Multimodal Documents
Creator: m-a-p
Published: 2024-05-25T04:58:09
Keywords: Paired Documents, Knowledge Intensive, Interleaved Documents, Multimodal Documents, Multimodal

Description

PIN-200M contains approximately 200 million samples of paired and interleaved multimodal documents and requires around 312 terabytes of storage. It is a smaller version of the full PIN dataset, which was introduced in a paper from June 2024. The dataset is published on Hugging Face by m-a-p and was last updated there in April 2026.

Use Cases

Training multimodal large language models based on knowledge-intensive document pairs.
Benchmarking model performance on interleaved text and image understanding tasks.
Developing retrieval-augmented generation systems based on paired document structures.
Conducting research on cross-modal alignment and reasoning using paired samples.

Strengths

Large scale: approximately 200 million samples.
Around 312 terabytes of data in total, which suggests large or high-resolution samples.
Includes quality signals for quick sample assessment, according to the source description.
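
As a rough check on the scale figures above, the average storage per sample can be estimated from the two numbers the listing reports. This is only a back-of-the-envelope sketch: it assumes decimal terabytes (10^12 bytes) and treats both figures as exact, although the listing gives them as approximations.

```python
# Back-of-the-envelope estimate of average storage per sample.
# Assumptions (not stated in the listing): 1 TB = 10**12 bytes, and the
# approximate figures are treated as exact values.

TOTAL_TERABYTES = 312          # reported storage requirement
TOTAL_SAMPLES = 200_000_000    # reported approximate sample count


def average_bytes_per_sample(total_terabytes: float, total_samples: int) -> float:
    """Return the mean number of bytes stored per sample."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return total_terabytes * 10**12 / total_samples


avg_bytes = average_bytes_per_sample(TOTAL_TERABYTES, TOTAL_SAMPLES)
print(f"{avg_bytes / 10**6:.2f} MB per sample")
```

Under these assumptions the average comes to about 1.56 MB per sample, which is consistent with samples that carry images rather than text alone.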

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The exact row count is not reported; only the approximate figure of 200 million samples is available, which may limit suitability assessment.
License information is unavailable, which restricts clarity on permissible use.
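
Because field semantics have to be inferred after download, one possible first step is to tabulate which fields appear in a handful of records and which value types they hold. The sketch below works on plain Python dictionaries; the function name `summarize_fields` and the sample records are hypothetical and do not reflect the real PIN-200M schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def summarize_fields(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names observed for it."""
    observed: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for name, value in record.items():
            observed[name].add(type(value).__name__)
    return dict(observed)


# Hypothetical records for illustration only; NOT the real PIN-200M fields.
sample_records = [
    {"id": 1, "text": "first document", "images": ["a.png"]},
    {"id": 2, "text": None, "images": []},
]

# Print one line per field with the types seen across the sample.
for field, types in sorted(summarize_fields(sample_records).items()):
    print(field, sorted(types))
```

A field that shows more than one type (here `text` holds both `str` and `NoneType`) is a hint that it is optional or inconsistently populated and deserves a closer look before training.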

Provenance

Source: huggingface
Collection Method: Not specified.
Freshness: Last updated 2026-04-15 07:30:56; verify against the source before use.
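
A freshness check can be as simple as comparing the last-updated timestamp with a reference date. The sketch below uses the timestamp given above and the "Last synced" date from the Dataset Info section of this listing; the 90-day threshold and the function names are hypothetical choices, not part of the listing.

```python
from datetime import date, datetime

# Values taken from this listing.
LAST_UPDATED = datetime.strptime("2026-04-15 07:30:56", "%Y-%m-%d %H:%M:%S")
LAST_SYNCED = date(2026, 5, 15)

# Hypothetical threshold: treat anything older than this as stale.
MAX_AGE_DAYS = 90


def days_since_update(updated: datetime, reference: date) -> int:
    """Whole calendar days between the update date and the reference date."""
    return (reference - updated.date()).days


def is_stale(updated: datetime, reference: date, max_age_days: int) -> bool:
    """Return True when the update is older than the allowed age."""
    return days_since_update(updated, reference) > max_age_days


age = days_since_update(LAST_UPDATED, LAST_SYNCED)
print(f"{age} days between last update and last sync")
print("stale" if is_stale(LAST_UPDATED, LAST_SYNCED, MAX_AGE_DAYS) else "fresh")
```

In practice the reference date would be today's date rather than the sync date, and the update timestamp should be re-read from the source, since the listing itself may lag behind.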

Because the license is unknown, commercial use and redistribution may be restricted.

Related Topics

Multimodal, Paired Documents, Knowledge Intensive, Interleaved Documents, Multimodal Documents

Quality Score

Overall: C45
Description: 51
Source: 41
Reputation: 56
Access: 26

Community

168.4K downloads
23 likes
0 views

Dataset Info

Author: m-a-p
Created: May 25, 2024
Updated: Apr 15, 2026
Last synced: May 15, 2026
