Naime Corpus V1: Multilingual Pre-training Text with 28.1 Billion Tokens

Creator: Leonharper
Published: 2026-05-30T10:12:58
Keywords: English, Pre Training Corpus, Chinese, Text, Language Model, Natural Language Processing, Multilingual Text

Description

Leonharper's Naime Corpus V1 is a multilingual text dataset for language model pre-training, containing approximately 28.1 billion tokens across more than 38 million documents. The data is tokenized with the Qwen3-8B tokenizer and formatted into sequences of length 4096. It was last updated on Hugging Face in May 2026.
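The headline figures above allow a rough sanity check of the corpus size. The sketch below is only back-of-the-envelope arithmetic on the listing's approximate totals; it assumes tokens are packed into full 4096-token sequences with no padding, which the listing does not confirm, and the function names are illustrative.

```python
# Rough size estimates from the listing's approximate totals.
TOTAL_TOKENS = 28.1e9     # "approximately 28.1 billion tokens"
TOTAL_DOCUMENTS = 38e6    # "over 38 million documents" (treated as a lower bound)
SEQUENCE_LENGTH = 4096    # stated sequence length


def packed_sequences(total_tokens: float, seq_len: int) -> int:
    """Full sequences available, ASSUMING dense packing with no padding."""
    return int(total_tokens // seq_len)


def mean_tokens_per_document(total_tokens: float, documents: float) -> float:
    """Average document length in tokens (an upper estimate, since the
    document count is given only as a lower bound)."""
    return total_tokens / documents


n_sequences = packed_sequences(TOTAL_TOKENS, SEQUENCE_LENGTH)
avg_doc_tokens = mean_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)

print(f"approx. sequences of length {SEQUENCE_LENGTH}: {n_sequences:,}")
print(f"approx. mean tokens per document: {avg_doc_tokens:.0f}")
```

Under these assumptions the corpus works out to roughly 6.9 million sequences and an average of about 740 tokens per document; the actual row count should be checked after download.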

Use Cases

Pre-training language models on a large-scale multilingual text corpus.
Benchmarking tokenizer performance against the corpus's Qwen3-8B tokenization.
Analyzing domain distribution for model training using the provided general, wiki, math, and code categories.
Conducting comparative NLP research across the English and Chinese language splits.

Strengths

Contains approximately 28.1 billion tokens, providing substantial volume for training.
Includes over 38 million documents with a defined domain distribution (e.g., 56.6% general English, 28.2% general Chinese).
Uses a standardized sequence length of 4096, which is compatible with many modern transformer architectures.
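The stated domain shares can be turned into approximate per-domain token counts. This is a minimal sketch that assumes the percentages are measured over tokens (the listing does not say whether they refer to tokens or documents) and that everything not listed falls into the remaining categories; the dictionary keys are illustrative labels, not field names from the dataset.

```python
# Approximate per-domain token counts from the two shares the listing states.
TOTAL_TOKENS = 28.1e9  # approximate total from the listing

# Only these two shares are given; ASSUMED to be fractions of tokens.
STATED_SHARES = {
    "general_english": 0.566,
    "general_chinese": 0.282,
}


def remaining_share(shares: dict[str, float]) -> float:
    """Share left over for all domains the listing does not quantify."""
    return 1.0 - sum(shares.values())


def tokens_for(share: float, total_tokens: float) -> float:
    """Token count implied by a fractional share of the corpus."""
    return share * total_tokens


other_share = remaining_share(STATED_SHARES)
token_estimates = {
    name: tokens_for(share, TOTAL_TOKENS) for name, share in STATED_SHARES.items()
}
token_estimates["other_domains"] = tokens_for(other_share, TOTAL_TOKENS)

for name, tokens in token_estimates.items():
    print(f"{name}: {tokens / 1e9:.2f}B tokens")
```

On these assumptions, general English and general Chinese together account for 84.8% of the corpus, leaving about 15.2% for the remaining categories.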

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
Freshness should be verified: the listed last-update date (2026-05-31) is one day after publication and a week before the last sync (Jun 7, 2026).

Provenance

Source: Hugging Face
Freshness: Last updated 2026-05-31 17:59:32; freshness should be verified.

License is unknown; users should verify permissions before use.

Quality Score

Overall: C43
Description: 51
Source: 41
Reputation: 42
Access: 26

Community

29 downloads
2 likes
0 views

Dataset Info

Author: Leonharper
Created: May 30, 2026
Updated: May 31, 2026
Last synced: Jun 7, 2026
