Name: DCLM Crossover Source: A Subset for Format-Aware Synthetic Augmentation
Creator: essobi
Published: 2026-04-12T16:12:39
Keywords: Language Modeling, Document Formats, Benchmark, Text, Synthetic Augmentation, Text Corpus, Synthetic

Description

This subset of the DCLM-Baseline dataset was created by essobi for synthetic augmentation with format-aware prompt routing, and was last updated on April 13, 2026. It contains 251,661 documents selected from a scan of 255,841 source documents (about 98.4% of the scan), totaling 196,694,035 words with an average of roughly 781 words per document. Selection took every third shard, applied a word count filter of 50-8000 words, and skipped prompts that duplicate a document's native format.
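The three selection steps named above can be sketched as follows. This is a minimal illustration, not the creator's actual pipeline: the function names, the starting shard offset, whitespace-based word counting, inclusive bounds, and the example format labels are all assumptions.

```python
from typing import Iterable

# Bounds stated in the description; treating them as inclusive is an assumption.
MIN_WORDS, MAX_WORDS = 50, 8000


def pick_shards(shards: list[str], step: int = 3) -> list[str]:
    """Keep every `step`-th shard, starting with the first (offset is assumed)."""
    return shards[::step]


def within_length(text: str) -> bool:
    """True if the whitespace-delimited word count is within 50-8000."""
    return MIN_WORDS <= len(text.split()) <= MAX_WORDS


def eligible_prompts(native_format: str, prompt_formats: Iterable[str]) -> list[str]:
    """Drop any prompt whose target format matches the document's native format."""
    return [p for p in prompt_formats if p != native_format]


# Hypothetical usage: a document detected as "faq" is never routed to an "faq" prompt.
print(pick_shards(["s0", "s1", "s2", "s3", "s4", "s5", "s6"]))
print(eligible_prompts("faq", ["faq", "tutorial", "dialogue"]))
```

How the native format is detected is not documented here, so the sketch takes it as an input rather than guessing at a detector.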

Use Cases

Training language models on format-diverse text, relying on the subset's format detection and filtering.
Generating synthetic training data for instruction-following models with the format-aware prompt routing methodology.
Benchmarking text generation quality across document formats, using the documented selection process as a fixed reference.
Studying the impact of document length on model performance within the 50-8000 word range left by the length filter.
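For the document-length use case, one simple approach is to group documents into length buckets before comparing model performance per bucket. The sketch below assumes whitespace word counts and uses hypothetical bucket edges; only the outer 50 and 8000 limits come from the dataset description.

```python
from collections import Counter

# Hypothetical bucket edges; only the outer limits (50, 8000) come from the card.
BUCKET_EDGES = [50, 200, 800, 2000, 8000]


def length_bucket(word_count: int) -> str:
    """Return a label such as '200-799' for the bucket containing word_count."""
    for lo, hi in zip(BUCKET_EDGES, BUCKET_EDGES[1:]):
        last = hi == BUCKET_EDGES[-1]
        # The final bucket includes its upper edge so 8000-word documents fit.
        if lo <= word_count < hi or (last and word_count == hi):
            return f"{lo}-{hi if last else hi - 1}"
    raise ValueError(f"word count {word_count} is outside the filtered range")


def bucket_counts(texts):
    """Count documents per length bucket."""
    return Counter(length_bucket(len(t.split())) for t in texts)
```

Evaluation metrics can then be aggregated per bucket label instead of over the whole subset.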

Strengths

Large scale with 251,661 selected documents and nearly 200 million total words.
Explicit length filtering (50-8000 words) and format-aware selection, which likely improve data consistency.
A per-site cap was applied, with zero capped instances reported, suggesting that no single source dominates the subset.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The row count is not listed in the dataset metadata, although the description reports 251,661 selected documents; verify the actual count after download before planning batch processing.
The license is unknown, so permissible uses are unclear.

Provenance

Source: Subset of DCLM-Baseline dataset, hosted on Hugging Face by author essobi.
Collection Method: Selected from 255,841 scanned source documents by taking every third shard, applying word count and format filters, and enforcing a per-site cap.
Time Range: Not specified
Freshness: Last updated 2026-04-13 14:23:02; freshness should be verified.
Geography: Not specified
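The per-site cap mentioned under Collection Method could look roughly like the sketch below. The cap value, the `url` field name, and the use of the hostname as the site key are assumptions, since none of them is documented here. Under this reading, a dropped count of zero would mean that no site exceeded the cap.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cap_per_site(docs, max_per_site):
    """Keep at most max_per_site documents per hostname, preserving input order.

    Each doc is a dict with a 'url' key (an assumed field name).
    Returns (kept_docs, dropped_count).
    """
    seen = defaultdict(int)
    kept, dropped = [], 0
    for doc in docs:
        site = urlparse(doc["url"]).netloc.lower()
        if seen[site] < max_per_site:
            seen[site] += 1
            kept.append(doc)
        else:
            dropped += 1
    return kept, dropped
```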

License is unknown, which is a critical consideration before use.
