Open License Corpus

Name: Open License Corpus
Creator: kernelmachine
Published: 2023-08-08T23:21:52
Keywords: Task Categories: text-generation, Language: en, Size Categories: 10M<n<100M, Modality: text, Library: mlcroissant, Library: datasets, Region: us, License: apache-2.0

Description

The Open License Corpus contains 228 billion BPE tokens aggregated from permissively licensed sources, including Case Law and the public domain subset of Pile of Law. The corpus is tokenized using the GPT-NeoX tokenizer and categorized into domains such as Legal to support the development of open-source language models.

Use Cases

Pre-train large language models on the 228B-token corpus so that all of the training data comes from permissively licensed sources.
Fine-tune models on legal reasoning tasks using the Case Law and Pile of Law (PD subset) components.
Analyze tokenization density across different domains using the GPT-NeoX tokenizer counts provided in the source metadata.
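
As a rough illustration of the tokenization-density use case, the sketch below computes tokens per character for each domain. It is a minimal example under stated assumptions: the `(domain, text)` record shape, the function names, and the sample texts are hypothetical, and the whitespace tokenizer is only a stand-in so the example runs offline. For real measurements you would pass in the GPT-NeoX tokenizer instead.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace, NOT GPT-NeoX BPE."""
    return text.split()


def tokens_per_char(
    records: Iterable[Tuple[str, str]],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> Dict[str, float]:
    """Return tokens-per-character for each domain in (domain, text) records."""
    token_totals: Dict[str, int] = defaultdict(int)
    char_totals: Dict[str, int] = defaultdict(int)
    for domain, text in records:
        token_totals[domain] += len(tokenize(text))
        char_totals[domain] += len(text)
    # Skip domains with no characters to avoid dividing by zero.
    return {
        domain: token_totals[domain] / char_totals[domain]
        for domain in token_totals
        if char_totals[domain] > 0
    }


# Hypothetical sample records; the real corpus fields may be named differently.
sample_records = [
    ("Legal", "The court held that the statute applies."),
    ("Example", "A short placeholder sentence."),
]
density = tokens_per_char(sample_records)
```

Taking the tokenizer as a parameter keeps the density logic independent of any particular library, so the same function works with any callable that maps a string to a list of tokens.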

Strengths

Contains 228 billion BPE tokens as measured by the GPT-NeoX tokenizer.
Includes a dedicated Legal domain featuring Case Law and the public domain subset of Pile of Law.
Categorized by Domain, Source, and Specific License to ensure transparency in data provenance.
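
The Domain, Source, and Specific License categorization lends itself to simple provenance checks. The following sketch assumes a hypothetical record layout (`domain`, `source`, `license`, `text`) and illustrative license labels; neither is taken from the dataset's actual schema. It shows one way to keep only records under an allowed set of licenses and then count what remains per provenance triple.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Record:
    # Field names are assumptions for illustration, not the dataset schema.
    domain: str
    source: str
    license: str
    text: str


def filter_by_license(records: Iterable[Record], allowed: Set[str]) -> List[Record]:
    """Keep records whose license label matches an allowed one (case-insensitive)."""
    allowed_lower = {name.lower() for name in allowed}
    return [r for r in records if r.license.lower() in allowed_lower]


def count_by_provenance(records: Iterable[Record]) -> Counter:
    """Count records per (domain, source, license) triple."""
    return Counter((r.domain, r.source, r.license) for r in records)


# Placeholder rows: the license labels here are illustrative only.
sample_rows = [
    Record("Legal", "Case Law", "Public Domain", "..."),
    Record("Legal", "Pile of Law (PD subset)", "Public Domain", "..."),
    Record("Example", "Hypothetical Source", "Hypothetical-License-1.0", "..."),
]
kept = filter_by_license(sample_rows, {"public domain"})
counts = count_by_provenance(kept)
```

Counting by the full triple rather than by license alone makes it easy to see which sources contribute to each license bucket.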

Quality Score

Overall: D (36)
Description: 46
Source: 36
Reputation: 28
Access: 22

Community

1.1K downloads

17 likes

0 views

Dataset Info

Author: kernelmachine
Created: Aug 8, 2023
Updated: Aug 9, 2023
Last synced: Jul 5, 2026

