Leonharper's Naime Corpus V1 is a multilingual text dataset for language model pre-training, containing approximately 28.1 billion tokens across over 38 million documents. The data is tokenized using the Qwen3-8B tokenizer and formatted into sequences of length 4096. It was last updated on Hugging Face in May 2026.
License is unknown; users should verify permissions before use.
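
The headline figures in the description imply some rough scale estimates, which the sketch below works out. It assumes the token and document counts are exact (they are stated as approximate, and the document count as a lower bound), and the `chunk` helper is a hypothetical illustration of fixed-length formatting, not the dataset's confirmed preprocessing. In particular, dropping the trailing partial sequence is an assumption.

```python
# Figures quoted in the description (approximate values).
TOTAL_TOKENS = 28_100_000_000  # ~28.1 billion tokens
TOTAL_DOCS = 38_000_000        # "over 38 million" documents, so a lower bound
SEQ_LEN = 4096                 # sequence length stated in the description


def summarize(total_tokens: int, total_docs: int, seq_len: int) -> dict:
    """Derive rough size estimates from the headline numbers."""
    return {
        # Number of full fixed-length sequences the token budget can fill.
        "sequences": total_tokens // seq_len,
        # Mean document length in tokens (an upper bound if docs > total_docs).
        "avg_tokens_per_doc": total_tokens / total_docs,
    }


def chunk(token_ids: list[int], seq_len: int) -> list[list[int]]:
    """Split a flat token stream into full sequences of seq_len tokens.

    Assumption: trailing tokens that do not fill a sequence are dropped.
    """
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


if __name__ == "__main__":
    stats = summarize(TOTAL_TOKENS, TOTAL_DOCS, SEQ_LEN)
    print(f"sequences: {stats['sequences']:,}")
    print(f"avg tokens per document: {stats['avg_tokens_per_doc']:.0f}")
```

Under these assumptions the corpus corresponds to about 6.86 million sequences of 4096 tokens, and the average document is at most roughly 740 tokens long, so a typical sequence would span several documents.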