Description
251,661 documents were selected from a scan of 255,841 source documents, totaling 196,694,035 words with an average of 781 words per document. This subset of the DCLM-Baseline dataset was created by author essobi for synthetic augmentation with format-aware prompt routing, and was last updated on April 13, 2026. Selection criteria included picking every third shard, applying a word count filter of 50-8000 words, and skipping prompts that duplicate a document's native format.
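The rule of skipping prompts that duplicate a document's native format can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the prompt style names, the `detect_format` heuristics, and the `route_prompts` helper are hypothetical and are not taken from the dataset author's code.

```python
from typing import List

# Hypothetical prompt styles; the dataset's real prompt set is not documented.
PROMPT_STYLES: List[str] = ["qa", "tutorial", "dialogue", "summary"]


def detect_format(text: str) -> str:
    """Rough, illustrative format detector (an assumption, not the author's)."""
    lowered = text.lower()
    if lowered.count("q:") >= 2 and lowered.count("a:") >= 2:
        return "qa"
    if "step 1" in lowered or "how to" in lowered:
        return "tutorial"
    lines = text.splitlines()
    # Treat lines with an early colon as "Speaker: utterance" lines.
    speaker_lines = [ln for ln in lines if ":" in ln[:20]]
    if lines and len(speaker_lines) / len(lines) > 0.5:
        return "dialogue"
    return "prose"


def route_prompts(text: str) -> List[str]:
    """Return the prompt styles to apply, skipping the document's native format."""
    native = detect_format(text)
    return [style for style in PROMPT_STYLES if style != native]


print(route_prompts("Q: What is a shard?\nA: A file chunk.\nQ: Why?\nA: Scale."))
print(route_prompts("Plain expository text with no special structure."))
```

A document already written as questions and answers is not sent to the question-and-answer prompt, while plain prose remains eligible for every style.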
Use Cases
Training language models on format-diverse text, drawing on the dataset's format detection and filtering.
Generating synthetic training data for instruction-following models with the format-aware prompt routing methodology.
Benchmarking text generation quality across the document formats covered by the selection process.
Studying how document length affects model performance within the 50-8000 word range that the filter allows.
Strengths
Large scale with 251,661 selected documents and nearly 200 million total words.
Explicit length filtering (50-8000 words) and format-aware selection, both of which likely improve data consistency.
Per-site capping was applied, and zero capped instances were reported, suggesting that no single source dominates the selection.
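The scale figures quoted in this card can be cross-checked with simple arithmetic. The short sketch below uses only the three numbers stated in the description; the variable names are illustrative.

```python
# Figures as stated in the description.
SCANNED_DOCS = 255_841
SELECTED_DOCS = 251_661
TOTAL_WORDS = 196_694_035

dropped_docs = SCANNED_DOCS - SELECTED_DOCS      # documents removed by the filters
retention_rate = SELECTED_DOCS / SCANNED_DOCS    # share of scanned documents kept
mean_words = TOTAL_WORDS / SELECTED_DOCS         # average words per selected document

print(f"dropped: {dropped_docs:,}")
print(f"retained: {retention_rate:.2%}")
print(f"mean words per document: {mean_words:.1f}")
```

The filters removed 4,180 of the scanned documents, and the mean works out to slightly under 781.6 words, so the quoted average of 781 appears to be truncated rather than rounded.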
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported in the dataset metadata (the description cites 251,661 selected documents), which may limit suitability assessment for specific batch processing needs.
The license is unknown, which restricts clarity on permissible use cases.
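Because column-level documentation is absent, one way to infer field semantics after download is to profile a small sample of records. The sketch below assumes the data can be read as JSON Lines, which this card does not confirm; the `profile_fields` helper and the sample records, including their field names, are made up for illustration.

```python
import json
from collections import Counter, defaultdict
from typing import Dict, Iterable


def profile_fields(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count the Python type names seen for each top-level field."""
    profile: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the sample
        record = json.loads(line)
        for field, value in record.items():
            profile[field][type(value).__name__] += 1
    return dict(profile)


# Made-up sample records; the real column names are undocumented.
sample = [
    '{"text": "hello world", "url": "https://example.com/a", "n_words": 2}',
    '{"text": "another doc", "url": null, "n_words": 2}',
]
for field, types in profile_fields(sample).items():
    print(field, dict(types))
```

Fields that show more than one type, such as a string that is sometimes null, are the ones most worth checking by hand before building a batch pipeline.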
Provenance
Source
Subset of DCLM-Baseline dataset, hosted on Hugging Face by author essobi.
Collection Method
Selected from 255,841 scanned source docs by picking every third shard, applying word count and format filters, and implementing a per-site cap.
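The collection steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's pipeline: the starting shard offset, whitespace-based word counting, inclusive bounds, the `text` and `url` field names, and the cap value are all hypothetical, and the format filter is not modelled.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List
from urllib.parse import urlparse

MIN_WORDS, MAX_WORDS = 50, 8000  # bounds stated in the card (assumed inclusive)
SHARD_STRIDE = 3                 # "every third shard"; starting offset 0 is assumed
PER_SITE_CAP = 1000              # hypothetical value; the card does not state the cap


def pick_shards(shards: List[str], stride: int = SHARD_STRIDE) -> List[str]:
    """Keep every `stride`-th shard, starting from the first."""
    return shards[::stride]


def select_documents(
    docs: Iterable[Dict[str, str]], per_site_cap: int = PER_SITE_CAP
) -> Iterator[Dict[str, str]]:
    """Yield documents that pass the word count filter and the per-site cap."""
    kept_per_site: Counter = Counter()
    for doc in docs:
        n_words = len(doc["text"].split())  # whitespace tokens as a word proxy
        if not MIN_WORDS <= n_words <= MAX_WORDS:
            continue
        site = urlparse(doc["url"]).netloc
        if kept_per_site[site] >= per_site_cap:
            continue  # this site has already reached its cap
        kept_per_site[site] += 1
        yield doc


print(pick_shards([f"shard_{i:02d}" for i in range(7)]))
```

Counting kept documents per site inside the generator keeps the pass single-scan, which matters when the input is streamed shard by shard.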
Time Range
Not specified.
Freshness
Last updated 2026-04-13 14:23:02; freshness should be verified.
Geography
Not specified.
License
Unknown, which is a critical consideration before use.