Name: NarraDolma: Narrative Feature Vectors for the 3-Trillion-Token Dolma Corpus
Creator: teagrjohnson
Published: 2026-06-17T04:37:47
Keywords: Text, LLM Pretraining, Dolma, Large Scale, Natural Language Processing, Narrative Analysis, Text Corpus

Description

NarraDolma provides a large-scale narrative characterization of the Dolma pretraining corpus. It contains approximately 3 million passages drawn from about 785,000 unique documents across all 12 Dolma sub-corpora, each labeled with a fine-grained narrative feature vector. The dataset was created by teagrjohnson and is intended as a resource for studying how narrative qualities are distributed in web-scale data.

Use Cases

Analyze how narrative qualities are distributed across the 12 Dolma sub-corpora and the web domains they cover.
Study the relationship between narrative features and model performance on downstream tasks.
Train or evaluate models for narrative understanding or generation based on the fine-grained feature vectors.
Conduct comparative studies of narrative content across different data sources within the Dolma corpus.
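
As a rough illustration of the first and last use cases, the sketch below averages feature vectors per sub-corpus. The field names `subcorpus` and `features`, the sub-corpus labels, and the toy records are all assumptions made for the example. The real column names are not documented and must be checked after download.

```python
from collections import defaultdict


def mean_features_by_subcorpus(records, source_key="subcorpus", vector_key="features"):
    """Average feature vectors per sub-corpus.

    `source_key` and `vector_key` are hypothetical field names, not the
    dataset's documented schema.
    """
    sums = {}
    counts = defaultdict(int)
    for rec in records:
        source = rec[source_key]
        vec = rec[vector_key]
        if source not in sums:
            # First vector seen for this sub-corpus fixes the expected length.
            sums[source] = [0.0] * len(vec)
        elif len(vec) != len(sums[source]):
            raise ValueError(f"inconsistent vector length for {source!r}")
        for i, value in enumerate(vec):
            sums[source][i] += value
        counts[source] += 1
    # Divide each accumulated component by the number of passages seen.
    return {s: [t / counts[s] for t in total] for s, total in sums.items()}


# Toy records standing in for real rows (values are invented).
toy_records = [
    {"subcorpus": "a", "features": [0.2, 0.8]},
    {"subcorpus": "a", "features": [0.4, 0.6]},
    {"subcorpus": "b", "features": [1.0, 0.0]},
]
means = mean_features_by_subcorpus(toy_records)
```

The function accepts any iterable of dict-like rows, so it can be fed from a streaming reader without loading all ~3 million passages into memory.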

Strengths

Large scale with ~3 million passages labeled with narrative features.
Broad coverage across ~785,000 unique documents from all 12 Dolma sub-corpora.
Provides fine-grained narrative feature vectors produced by NarraBert.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Exact row count is not documented; only the approximate figure of ~3 million passages is available, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
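
Because field semantics have to be inferred after download, a first inspection pass might look like the sketch below. It records which Python types appear under each field in a sample of rows. The sample rows and the field names `doc_id` and `features` are invented for illustration and do not reflect the actual schema.

```python
def infer_schema(records, max_rows=1000):
    """Map each field name to the sorted list of Python type names observed.

    Only the first `max_rows` records are examined.
    """
    seen = {}
    for i, rec in enumerate(records):
        if i >= max_rows:
            break
        for key, value in rec.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Hypothetical sample rows; real field names are undocumented.
sample = [
    {"doc_id": "d1", "features": [0.1, 0.9]},
    {"doc_id": "d2", "features": None},
]
schema = infer_schema(sample)
```

A field that shows more than one type, such as a vector column that is sometimes `None`, is a hint that missing values need handling before any aggregation.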

Provenance

Source: huggingface
Collection Method: Passages drawn from the Dolma corpus and labeled with narrative feature vectors by NarraBert.
Freshness: Last updated 2026-06-19 19:02:17; verify against the source before use.

License is unknown, which may restrict usage.
