DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Government & Legal

Oscar Dedup Expanded: Deduplicated Web Text Corpus

Name: Oscar Dedup Expanded: Deduplicated Web Text Corpus
Creator: datablations
Published: 2023-02-10T18:42:08
Keywords: Text Deduplication, Text, Large Scale, Natural Language Processing, Oscar Corpus

Description

A 2023 deduplication of the OSCAR web text corpus, produced by datablations using a suffix-array method. Documents with overlapping text spans were removed, leaving 136 million documents, or 31% of the original dataset.
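
The two figures in the description imply an approximate size for the original corpus. The short calculation below is only a back-of-the-envelope check; both inputs are rounded in the listing, so the results are estimates rather than reported counts.

```python
# Figures quoted in the description (both rounded in the listing).
retained_docs = 136_000_000
retained_fraction = 0.31

# Implied size of the original corpus and number of documents removed.
original_docs = retained_docs / retained_fraction
removed_docs = original_docs - retained_docs

print(f"implied original size: ~{original_docs / 1e6:.0f}M documents")
print(f"implied removed:       ~{removed_docs / 1e6:.0f}M documents")
```

Under these assumptions the original corpus would be on the order of 440 million documents, with roughly 300 million removed.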

Use Cases

Training large language models on deduplicated web text to reduce redundancy.
Benchmarking deduplication algorithms against the described suffix array method.
Analyzing the distribution of web text content after removing pervasive duplicates.
Studying the characteristics of the 31% subset retained from the original OSCAR corpus.

Strengths

Contains 136 million documents, a substantial corpus size.
Removes pervasive duplicates via suffix-array deduplication.
Represents a 31% subset of the original OSCAR dataset.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported in the listing metadata; the 136 million figure comes from the description and should be confirmed after download.
Last updated 2023-05-10 06:57:52; freshness should be verified.
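
Because column-level documentation is missing, one practical first step is to sample a few records and tabulate the field names and value types. The sketch below assumes the dataset is reachable through the Hugging Face `datasets` library; the repository id `datablations/oscar-dedup-expanded` and the `train` split are assumptions inferred from the listing, not confirmed by it, and `summarize_fields` and `stream_records` are hypothetical helper names.

```python
from collections import Counter
from itertools import islice


def summarize_fields(records, sample_size=100):
    """Return {field_name: Counter of Python type names} over a sample."""
    summary = {}
    for record in islice(records, sample_size):
        for name, value in record.items():
            summary.setdefault(name, Counter())[type(value).__name__] += 1
    return summary


def stream_records(repo_id, split="train"):
    """Stream records without downloading the full corpus."""
    # Lazy import so summarize_fields works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


# Usage (requires network access and the `datasets` package):
# records = stream_records("datablations/oscar-dedup-expanded")  # assumed repo id
# print(summarize_fields(records, sample_size=20))
```

Streaming avoids pulling the whole corpus just to learn its schema, which matters at this scale.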

Provenance

Source: OSCAR corpus via HuggingFace.
Collection Method: Deduplication using a suffix array, described upstream as a "25% suffix array" (presumably one built over a 25% sample of the corpus), to remove documents whose text spans overlap.
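
To make the collection method more concrete, here is a toy sketch of span-overlap deduplication with a suffix array. It is not the creators' pipeline: the naive construction, the NUL separator, the helper names, and the `min_span` threshold are all illustrative assumptions, since the listing does not state the span length or tooling actually used.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; suitable for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text, suffix_array, pattern):
    # Binary search for the first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo])


def dedup_against_reference(reference_docs, candidate_docs, min_span=10):
    # Join with NUL so matches cannot straddle two reference documents
    # (assumes documents contain no NUL characters).
    text = "\x00".join(reference_docs)
    suffix_array = build_suffix_array(text)
    kept = []
    for doc in candidate_docs:
        windows = (doc[i:i + min_span] for i in range(len(doc) - min_span + 1))
        # Drop the document if any min_span-character window occurs in the reference.
        if not any(contains(text, suffix_array, w) for w in windows):
            kept.append(doc)
    return kept


reference = ["the quick brown fox jumps over the lazy dog"]
candidates = [
    "a quick brown fox appeared",
    "completely unrelated sentence",
    "short",
]
kept = dedup_against_reference(reference, candidates, min_span=10)
print(kept)
```

In this example the first candidate is dropped because it shares a span of at least 10 characters with the reference text, while documents shorter than `min_span` are always kept. Production pipelines use linear-time suffix array construction rather than sorting suffixes directly.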

Quality Score

D32

Community

154 downloads

1 like

0 views

Dataset Info

Author: datablations
Created: Feb 10, 2023
Updated: May 10, 2023
Last synced: May 11, 2026
