Dclm Baseline 8B Docpack Ctx2048: Tokenized Training Data Shards

Name: Dclm Baseline 8B Docpack Ctx2048: Tokenized Training Data Shards
Creator: marissatech
Published: 2026-06-12T19:17:00
Keywords: Machine Learning, Training Data, Benchmark, Tokenized Text, Text, Language Model

by marissatech, updated 21d ago

Description

This dataset contains 201 tokenized data shards for training language models, stored in the .tokbin format. It was uploaded by marissatech and last updated on June 12, 2026. Each shard contains sequences of token IDs drawn from a vocabulary of 24,256 tokens.

Use Cases

Train decoder-only language models on the provided tokenized text sequences.
Benchmark model performance using the standardized baseline data shards.
Supply pre-tokenized input to large-scale language model training pipelines.
Analyze token distribution and sequence structure within the training corpus.
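
As an illustration of the last use case, here is a minimal sketch of a token distribution and sequence length summary. It assumes the shards have already been decoded into in-memory lists of integer token IDs; the function name `summarize` and the toy input are hypothetical and not part of the dataset. Only the vocabulary size of 24,256 comes from the description.

```python
from collections import Counter
from typing import Iterable, Sequence

VOCAB_SIZE = 24_256  # vocabulary size stated in the dataset description


def summarize(sequences: Iterable[Sequence[int]], vocab_size: int = VOCAB_SIZE) -> dict:
    """Collect basic token-frequency and sequence-length statistics.

    `sequences` is assumed to be an iterable of already-decoded token-ID lists.
    """
    counts = Counter()
    lengths = []
    for seq in sequences:
        counts.update(seq)        # per-token frequency
        lengths.append(len(seq))  # per-sequence length
    total = sum(lengths)
    return {
        "num_sequences": len(lengths),
        "total_tokens": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
        # fraction of the stated vocabulary that actually occurs
        "vocab_coverage": len(counts) / vocab_size,
        "top_tokens": counts.most_common(3),
    }


# Toy input for illustration only; these are not real token IDs from the shards.
toy = [[5, 5, 7], [7, 5], [9]]
stats = summarize(toy)
print(stats["total_tokens"], stats["top_tokens"])  # 6 [(5, 3), (7, 2), (9, 1)]
```

On real shards, streaming the sequences through a generator rather than a list keeps memory use bounded by the size of the counter.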

Strengths

Contains 201 data shards, providing a substantial volume of training material.
Uses a defined vocabulary size of 24,256 tokens, offering a consistent tokenization scheme.
Includes a dataset manifest (dataset_manifest.json) for managing the shard collection.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.

Provenance

Source: marissatech on Hugging Face
Collection Method: Tokenized text data prepared as training shards for language models.
Freshness: Last updated 2026-06-12 23:19:13; freshness should be verified.

The dataset is intended to be private. Data is stored in a custom .tokbin binary format with uint16 little-endian length headers.
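
The exact .tokbin layout is not documented on this page, so the following is only a sketch under stated assumptions: each record is taken to be one uint16 little-endian length header N, followed by N token IDs that are also stored as uint16 little-endian values (plausible because 24,256 fits in 16 bits, but not confirmed by the source). The function name `iter_sequences` is hypothetical, and the actual format may include file headers or padding that this reader does not handle.

```python
import io
import struct
from typing import BinaryIO, Iterator, List

VOCAB_SIZE = 24_256  # vocabulary size stated in the dataset description


def iter_sequences(stream: BinaryIO) -> Iterator[List[int]]:
    """Yield token-ID sequences from a stream in the ASSUMED record layout.

    Assumed layout (not confirmed): [uint16 LE length N][N x uint16 LE token IDs]...
    """
    while True:
        header = stream.read(2)
        if not header:
            return  # clean end of stream
        if len(header) < 2:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<H", header)
        payload = stream.read(2 * length)
        if len(payload) != 2 * length:
            raise ValueError("truncated token payload")
        tokens = list(struct.unpack(f"<{length}H", payload))
        # Sanity check against the stated vocabulary size.
        if any(t >= VOCAB_SIZE for t in tokens):
            raise ValueError("token ID outside the stated vocabulary")
        yield tokens


# Synthetic in-memory bytes for illustration; no real shard is read here.
demo = struct.pack("<H3H", 3, 10, 20, 30) + struct.pack("<H1H", 1, 24255)
print(list(iter_sequences(io.BytesIO(demo))))  # [[10, 20, 30], [24255]]
```

If the vocabulary check or the truncation checks fail on a real shard, that is a sign the assumed layout is wrong rather than that the data is corrupt; consulting dataset_manifest.json would be the next step.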

Quality Score

Overall: D37
Description: 42
Source: 36
Reputation: 35
Access: 26

Community

Likes: 1
Views: 0

Dataset Info

Author: marissatech
Created: Jun 12, 2026
Updated: Jun 12, 2026
Last synced: Jun 14, 2026
