Name: Toxicity-Balanced Text Corpus for Language Model Safety
Creator: tomekkorbak
Published: 2022-04-07T13:24:34
Keywords: Balanced Dataset, Text Toxicity, AI Safety, Language Model Training, Text, Parquet, modality:text, size_categories:100K<n<1M, library:datasets, library:polars, library:dask, library:mlcroissant, region:us

Description

Pile Toxicity Balanced2 is a text dataset designed for training and evaluating language models on toxic content. The dataset, created by researcher tomekkorbak, was uploaded to Hugging Face in April 2022. It is part of a series of datasets derived from The Pile, a large-scale text corpus used for AI development.

Use Cases

Training binary classifiers to detect toxic vs. non-toxic text segments.
Evaluating a language model's propensity to generate harmful completions, using text samples from the dataset.
Mitigating toxicity in model outputs by fine-tuning on this balanced subset of The Pile corpus.
Studying the distribution of specific toxic language features (e.g., hate speech, threats) within a large text collection.
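
For the classifier use case, a useful first sanity check is whether the labels really are balanced. The sketch below is a minimal illustration, assuming each example carries a numeric toxicity score between 0 and 1. The card does not document the column structure, so the score input, the 0.5 threshold, and both function names are hypothetical.

```python
from collections import Counter


def label_by_threshold(scores, threshold=0.5):
    """Turn numeric toxicity scores into binary labels (1 = toxic, 0 = non-toxic).

    The threshold is an illustrative assumption, not a documented property
    of the dataset.
    """
    return [1 if score >= threshold else 0 for score in scores]


def class_fractions(labels):
    """Return the share of each label, e.g. {0: 0.5, 1: 0.5} for a balanced set."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy scores standing in for real data.
toy_scores = [0.1, 0.9, 0.5, 0.2]
toy_labels = label_by_threshold(toy_scores)
print(class_fractions(toy_labels))
```

If the fractions are far from equal on the real data, the balancing may have been done on a different criterion than the one assumed here.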

Strengths

Derived from The Pile, a foundational corpus containing over 800 GB of diverse text data.
Balanced with respect to toxicity, which helps avoid class-imbalance bias in safety evaluations.
Hosted on Hugging Face with supporting libraries like `datasets` and `polars` for efficient access.
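
As a rough sketch of access through the `datasets` library, the snippet below loads one split and summarizes the value types per column, which is helpful because the schema is not documented here. The repository ID is not stated on this card and must be supplied by the caller; the `train` split name and the helper names `load_split` and `summarize_columns` are assumptions for illustration.

```python
from collections import Counter


def load_split(repo_id, split="train"):
    """Load one split from the Hugging Face Hub (requires network access).

    `repo_id` must be looked up on the Hub; the "train" split name is an
    assumption and may not match the actual dataset.
    """
    # Imported lazily so summarize_columns works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


def summarize_columns(rows):
    """Count the Python type names seen in each column of dict-like rows."""
    summary = {}
    for row in rows:
        for name, value in row.items():
            summary.setdefault(name, Counter())[type(value).__name__] += 1
    return summary


# Toy rows standing in for real records; the column names are hypothetical.
toy_rows = [{"text": "a", "score": 0.1}, {"text": "b", "score": None}]
print(summarize_columns(toy_rows))
```

Running `summarize_columns` on the first few hundred rows of the loaded split is a quick way to discover the actual column names and spot missing values before building on them.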

Limitations

The exact row count (the size tag indicates between 100K and 1M rows), the column structure, and the balancing methodology are not documented.
The underlying data from The Pile has a cutoff around 2020, limiting coverage of recent language and events.
Toxicity labels may contain noise and reflect subjective judgments, as is inherent in toxicity classification.

Provenance

Source: Subset of The Pile, a large-scale open-source language modeling dataset.
Collection Method: Curated and balanced from The Pile based on toxicity labels or scores.
Time Range: Reflects the temporal coverage of The Pile, primarily pre-2020.
Freshness: Last updated on the platform in April 2022; the source data is older.
Geography: Likely reflects the English-language and web-crawl origins of The Pile, with a bias towards US-centric sources.

License terms for derived use are not explicitly stated and depend on the original licenses within The Pile. Users must verify compliance for their intended application.
