A 2023 deduplicated version of the OSCAR web text corpus, produced with a suffix array method. The process removed documents with overlapping text spans, leaving 136 million documents, or 31% of the original dataset. The dataset was created by datablations.
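The two figures above imply a rough size for the original corpus. The following is a small back-of-the-envelope sketch, assuming both numbers are rounded as quoted and that the 31% refers to document count; the variable names are illustrative only.

```python
# Figures quoted in the description above.
retained_docs = 136_000_000   # documents kept after deduplication
retained_share = 0.31         # share of the original corpus that was kept

# Implied size of the original corpus and of the removed portion.
# Both are estimates, since the inputs are rounded.
original_docs = retained_docs / retained_share
removed_docs = original_docs - retained_docs

print(f"implied original size: ~{original_docs / 1e6:.0f} million documents")
print(f"implied removed:       ~{removed_docs / 1e6:.0f} million documents")
```

Under these assumptions the original corpus would hold roughly 439 million documents, with about 69% of them removed.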
Use Cases
- Training large language models on deduplicated web text to reduce redundancy.
- Benchmarking deduplication algorithms based on the described suffix array method.
- Analyzing the distribution of web text content after removing pervasive duplicates.
- Studying the characteristics of the 31% subset retained from the original OSCAR corpus.
Strengths
- Contains 136 million documents, a substantial corpus size.
- Removes pervasive duplicates via suffix array deduplication.
- Represents a 31% subset of the original OSCAR dataset.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Exact row count is not reported beyond the approximate 136 million documents figure, which may limit suitability assessment.
- Last updated 2023-05-10 06:57:52; freshness should be verified.
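Because column documentation is absent, one way to infer field semantics is to summarize a small sample of records after download. The sketch below assumes the records are available as Python dictionaries (for example, parsed from JSON lines); `summarize_fields` and the sample rows are hypothetical and do not reflect the dataset's actual columns.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the set of Python type names seen for it."""
    summary = defaultdict(set)
    for record in records:
        for field, value in record.items():
            summary[field].add(type(value).__name__)
    return dict(summary)


# Hypothetical sample rows; the real column names are not documented.
sample = [
    {"id": 0, "text": "first document"},
    {"id": 1, "text": "second document", "meta": None},
]

field_summary = summarize_fields(sample)
for field, types in sorted(field_summary.items()):
    print(field, sorted(types))
```

Fields that appear in only some records, or with more than one type, are the ones most worth checking by hand.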
Provenance
- Source
  - OSCAR corpus via Hugging Face.
- Collection Method
  - Suffix array deduplication at a 25% setting, removing documents with overlapping text spans.
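The idea behind suffix array deduplication can be illustrated on a toy corpus. This is a minimal sketch, not the pipeline used to build this dataset: it uses a naive suffix array construction, and the 20-character minimum span, the rule of keeping the earliest document, and all function names are assumptions made for illustration.

```python
SEP = "\x00"  # separator, assumed not to occur inside any document


def shared_prefix_len(text, i, j, limit):
    """Common prefix length of text[i:] and text[j:], capped at `limit`
    and never extending across a document separator."""
    n = 0
    while (n < limit and i + n < len(text) and j + n < len(text)
           and text[i + n] == text[j + n] and text[i + n] != SEP):
        n += 1
    return n


def find_duplicate_docs(docs, min_len=20):
    """Return indices of documents that share a span of at least
    `min_len` characters with an earlier document."""
    text = "".join(doc + SEP for doc in docs)
    # owner[p] is the index of the document containing position p (-1 for separators).
    owner = []
    for idx, doc in enumerate(docs):
        owner.extend([idx] * len(doc))
        owner.append(-1)

    # Naive suffix array: fine for a toy corpus, far too slow for web-scale text.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    if not suffix_array:
        return set()

    flagged = set()

    def flag(run):
        # All suffixes in a run start with the same min_len characters.
        owners = {owner[p] for p in run}
        if len(owners) > 1:
            flagged.update(owners - {min(owners)})  # keep the earliest document

    run = [suffix_array[0]]
    for prev, cur in zip(suffix_array, suffix_array[1:]):
        if shared_prefix_len(text, prev, cur, min_len) >= min_len:
            run.append(cur)
        else:
            flag(run)
            run = [cur]
    flag(run)
    return flagged


docs = [
    "the quick brown fox jumps over the lazy dog",
    "intro: the quick brown fox jumps over the lazy dog again",
    "completely different sentence here",
]
duplicates = find_duplicate_docs(docs, min_len=20)
kept = [doc for i, doc in enumerate(docs) if i not in duplicates]
print(sorted(duplicates), len(kept))
```

Sorting suffixes places identical spans next to each other, so repeated text can be found by comparing neighbors rather than every pair of documents. Repetition inside a single document is ignored in this sketch.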