Name: Balanced Toxicity Text Subset for Language Models
Creator: tomekkorbak
Published: 2022-06-01T15:17:03
Keywords: library:polars, library:dask, Text Toxicity, AI Safety, size_categories:100K<n<1M, library:mlcroissant, Language Model Training, library:datasets, Text, Parquet, region:us, Natural Language Processing

Description

A curated subset of The Pile focused on toxic text examples, balanced for training and evaluation. It was created by tomekkorbak and published on Hugging Face in June 2022, and it is one of a series of balanced subsets derived from the larger 825GB Pile corpus.

Use Cases

Train a binary toxicity classifier on the 'text' field. Toxicity labels have to be inferred, because the available metadata does not document a label column.
Evaluate language model outputs for safety by comparing generated text against toxic examples in the dataset.
Analyze linguistic patterns and markers associated with toxic language using the curated text samples.
Balance training data for language models to reduce bias by incorporating this controlled subset.
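
The classifier and balancing use cases above both depend on knowing the class proportions and on keeping them intact when splitting the data. The following is a minimal sketch of that check, not the author's method. It assumes each row is a dict with a 'text' field and a hypothetical binary 'label' field; the available metadata does not confirm a label column, and the toy rows below stand in for the real data.

```python
import random
from collections import Counter, defaultdict


def class_balance(rows, label_key="label"):
    """Return the fraction of rows carrying each label value."""
    counts = Counter(row[label_key] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def stratified_split(rows, label_key="label", eval_fraction=0.2, seed=0):
    """Split rows into train and eval lists with the same label proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, evaluation = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(round(len(group) * eval_fraction))
        evaluation.extend(group[:cut])
        train.extend(group[cut:])
    return train, evaluation


# Toy rows standing in for the real data; the "label" column is hypothetical.
rows = [{"text": f"example {i}", "label": i % 2} for i in range(10)]
balance = class_balance(rows)
train_rows, eval_rows = stratified_split(rows)
```

Splitting per label rather than over the whole list keeps the evaluation set as balanced as the source, which matters when the subset's main selling point is its balance.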

Strengths

Derived from The Pile, a large 825GB text corpus used for training state-of-the-art language models.
Specifically balanced for toxicity, addressing class imbalance common in raw text data.
Hosted on Hugging Face with available libraries like datasets, dask, and polars for efficient loading.
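
Because the column structure is not documented, a first step with any of these libraries is to look at a few rows. The sketch below uses the `datasets` library's streaming mode so that no Parquet shard has to be downloaded in full. The repository ID is a placeholder, the split name "train" is an assumption, and only the 'text' column is named in this card.

```python
from itertools import islice

# Placeholder, not a real repository ID: substitute the dataset's Hub ID.
REPO_ID = "<user>/<dataset-name>"


def stream_rows(repo_id, split="train"):
    """Stream rows from the Hugging Face Hub; the split name is an assumption."""
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(repo_id, split=split, streaming=True)


def summarize_schema(rows, sample_size=100):
    """Infer column names and Python value types from the first few rows."""
    schema = {}
    for row in islice(rows, sample_size):
        for column, value in row.items():
            schema.setdefault(column, set()).add(type(value).__name__)
    return {column: sorted(types) for column, types in schema.items()}


# Offline demonstration on stand-in rows; real columns may differ.
sample = [{"text": "first example"}, {"text": "second example"}]
schema = summarize_schema(sample)
```

With network access and the real repository ID, `summarize_schema(stream_rows(REPO_ID))` would report the actual columns.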

Limitations

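The size category tag indicates between 100K and 1M rows, but the exact row count and column structure are not given in the available metadata.
The dataset was published in 2022 and draws on The Pile, which predates it, so it may not reflect current language or online toxicity trends.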
The balancing methodology and specific toxicity criteria are not detailed in the provided information.

Provenance

Source: Subset of The Pile dataset.
Collection Method: Curated and balanced selection by the author.
Freshness: Last updated in June 2022.
Geography: Region tag suggests 'us', but specific coverage is unknown.

License is unknown; users should verify terms before commercial use. The dataset is stored in Parquet format, requiring compatible libraries for access.
