This corpus contains 228 billion BPE tokens aggregated from permissively licensed sources, including Case Law and the public domain subset of Pile of Law. It is tokenized with the GPT-NeoX tokenizer and categorized into domains such as Legal to support the development of open-source language models.
Use Cases
- Pre-train large language models on the 228B-token corpus so that all of the training data comes from permissively licensed sources.
- Fine-tune models on legal reasoning tasks using the Case Law and Pile of Law (PD subset) components.
- Analyze tokenization density across different domains using the GPT-NeoX tokenizer counts provided in the source metadata.
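As a rough illustration of the domain-level analysis mentioned above, the sketch below sums token counts per domain and reports each domain's share of the total. The row layout `(domain, source, token_count)`, the "Other" domain, the "Example Source" entry, and every number are placeholders assumed for illustration; they are not figures from the corpus metadata.

```python
from collections import defaultdict

# Hypothetical metadata rows: (domain, source, token_count).
# All counts are made-up placeholders, NOT real corpus statistics.
ROWS = [
    ("Legal", "Case Law", 300),
    ("Legal", "Pile of Law (PD subset)", 100),
    ("Other", "Example Source", 600),  # assumed domain/source for illustration
]


def token_share_by_domain(rows):
    """Return {domain: fraction of all tokens} from (domain, source, tokens) rows."""
    totals = defaultdict(int)
    for domain, _source, tokens in rows:
        totals[domain] += tokens
    grand_total = sum(totals.values())
    return {domain: count / grand_total for domain, count in totals.items()}


shares = token_share_by_domain(ROWS)
for domain, share in sorted(shares.items()):
    print(f"{domain}: {share:.1%}")
```

Swapping the placeholder rows for the real per-source GPT-NeoX token counts would give the actual distribution across domains.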
Strengths
- Contains 228 billion BPE tokens as measured by the GPT-NeoX tokenizer.
- Includes a dedicated Legal domain featuring Case Law and the public domain subset of Pile of Law.
- Categorized by Domain, Source, and Specific License to ensure transparency in data provenance.
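One way the Domain, Source, and Specific License categorization could be used for a provenance check is sketched below. The `SourceRecord` type, the license strings, and the allowlist are all hypothetical stand-ins; the actual license attached to each source should be read from the corpus metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical record mirroring the Domain / Source / Specific License columns."""
    domain: str
    source: str
    license: str


# Placeholder license labels for illustration only; not the sources' real licenses.
RECORDS = [
    SourceRecord("Legal", "Case Law", "placeholder-permissive"),
    SourceRecord("Legal", "Pile of Law (PD subset)", "placeholder-permissive"),
    SourceRecord("Other", "Example Source", "placeholder-unreviewed"),
]

# Assumed allowlist of licenses a downstream project is willing to accept.
ALLOWED_LICENSES = {"placeholder-permissive"}


def unapproved_sources(records, allowed):
    """Return the records whose license is not in the allowlist."""
    return [r for r in records if r.license not in allowed]


flagged = unapproved_sources(RECORDS, ALLOWED_LICENSES)
for record in flagged:
    print(f"Review needed: {record.domain} / {record.source} ({record.license})")
```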